Narrow Down - Efficient near-duplicate search#
Narrow Down offers a flexible but easy-to-use Python API for finding duplicates or similar documents, even in very large datasets. It reduces the O(n²) problem of comparing all strings with each other to roughly linear scale by using approximation algorithms such as Locality-Sensitive Hashing (LSH).
GitHub repo: https://github.com/chr1st1ank/narrow-down.git
Documentation: https://chr1st1ank.github.io/narrow-down
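To make the idea concrete, the following is a minimal, pure-Python sketch of MinHash LSH. It is an illustration of the general technique only, not Narrow Down's implementation or API: the names `shingles`, `minhash` and `LshIndex`, the character 3-gram tokenization, and the choice of 64 hash functions in 16 bands are all assumptions made for this example.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of lowercase character k-grams of a string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash(tokens, num_perm=64):
    """Compute a MinHash signature: the minimum hash value per seeded hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


class LshIndex:
    """Hypothetical in-memory index that buckets signatures band by band."""

    def __init__(self, num_perm=64, bands=16):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)

    def _keys(self, signature):
        # Each band of `rows` consecutive values becomes one bucket key.
        for start in range(0, len(signature), self.rows):
            yield (start, tuple(signature[start:start + self.rows]))

    def insert(self, doc_id, text):
        for key in self._keys(minhash(shingles(text), self.num_perm)):
            self.buckets[key].add(doc_id)

    def query(self, text):
        """Return ids of documents sharing at least one band with the query."""
        candidates = set()
        for key in self._keys(minhash(shingles(text), self.num_perm)):
            candidates |= self.buckets.get(key, set())
        return candidates


if __name__ == "__main__":
    index = LshIndex()
    index.insert("a", "The quick brown fox jumps over the lazy dog")
    index.insert("b", "zzzzzz qqqqqq wwwwww")
    print(index.query("The quick brown fox jumped over the lazy dog"))
```

A query only touches the buckets its own bands hash to, so the cost per lookup does not grow with the number of pairwise comparisons. The price is that results are approximate: similar documents are found with high probability, not with certainty.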
Features#
Document indexing and search based on the MinHash LSH algorithm
High performance thanks to a native extension module in Rust
Easy-to-use API with automated parameter tuning
Works with exchangeable storage backends. Currently implemented:
In-Memory
Cassandra / ScyllaDB
SQLite
User defined backends (by implementing a small interface)
Native asyncio interface
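As background on what parameter tuning for MinHash LSH involves, here is a sketch of the textbook heuristic for picking the number of bands and rows for a target similarity threshold. This is not necessarily the procedure Narrow Down uses internally; the function names `candidate_probability` and `choose_bands` are hypothetical and exist only for this example.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity
    share at least one band, i.e. 1 - (1 - s^r)^b."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def choose_bands(num_perm, threshold):
    """Pick (bands, rows) whose S-curve turning point, approximately
    (1/bands)^(1/rows), lies closest to the target threshold."""
    best = None
    for rows in range(1, num_perm + 1):
        bands = num_perm // rows  # use as many full bands as fit
        turning_point = (1.0 / bands) ** (1.0 / rows)
        error = abs(turning_point - threshold)
        if best is None or error < best[0]:
            best = (error, bands, rows)
    return best[1], best[2]


if __name__ == "__main__":
    bands, rows = choose_bands(num_perm=64, threshold=0.5)
    for s in (0.2, 0.5, 0.8):
        print(s, round(candidate_probability(s, bands, rows), 3))
```

More bands with fewer rows lower the effective threshold and increase recall at the cost of more false candidates. Fewer bands with more rows do the opposite.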
Installation#
The Python package can be installed with pip:
pip install narrow-down
Extras#
Some of the heavier functionality is available as an extra:
pip install narrow-down[scylladb] # Cassandra / ScyllaDB storage backend
Similar projects#
pylsh offers a good implementation of the classic MinHash LSH scheme in Python and Cython. If that is all you need and you don't need a database backend, it can be a good choice.
Datasketch implements an interesting collection of data sketching algorithms for similarity matching, cardinality estimation and k-nearest-neighbour search. The implementation is not highly optimized but is very usable, the documentation is rich, and multiple database backends can be used for some of the sketches.
Milvus offers a full database stack for vector search, which is a different approach to fast searching. It can also be applied to text search when an embedding model such as Word2Vec or BERT is used to vectorize the text.
Credits#
This package was created with Cookiecutter and the fedejaure/cookiecutter-modern-pypackage project template.