Narrow Down - Efficient near-duplicate search#

Narrow Down offers a flexible but easy-to-use Python API for finding duplicates or similar documents, even in very large datasets. It reduces the O(n²) problem of comparing all strings with each other to linear scale by using approximation algorithms such as Locality Sensitive Hashing (LSH).
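To illustrate the idea behind Minhash LSH, here is a minimal pure-Python sketch. It is not Narrow Down's implementation or API: the names `shingles`, `minhash_signature` and `TinyLshIndex`, the character 3-gram tokenization and the parameter values are all assumptions chosen for illustration. Each document is reduced to a short signature, the signature is cut into bands, and only documents that share at least one band are treated as candidates, so a query avoids comparing against every stored document.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of lowercase character k-grams of a string."""
    text = text.lower()
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(tokens, num_perm=64):
    """Keep the minimum hash value per seeded hash function.

    Two sets agree in a given position with probability equal to their
    Jaccard similarity, so similar sets share many signature entries.
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return signature


class TinyLshIndex:
    """Hypothetical in-memory index using the classic banding scheme."""

    def __init__(self, num_perm=64, bands=16):
        if num_perm % bands:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.rows = num_perm // bands
        self.buckets = defaultdict(set)  # (band offset, band values) -> document ids

    def _band_keys(self, text):
        signature = minhash_signature(shingles(text), self.num_perm)
        for start in range(0, self.num_perm, self.rows):
            yield (start, tuple(signature[start:start + self.rows]))

    def insert(self, doc_id, text):
        for key in self._band_keys(text):
            self.buckets[key].add(doc_id)

    def query(self, text):
        """Return ids of documents sharing at least one band with the query."""
        candidates = set()
        for key in self._band_keys(text):
            candidates |= self.buckets.get(key, set())
        return candidates


index = TinyLshIndex()
index.insert("a", "The quick brown fox jumps over the lazy dog")
index.insert("b", "0123456789 9876543210")
print(index.query("The quick brown fox jumps over the lazy dog!"))
```

The cost of a query depends on the number of bands and the size of the matching buckets rather than on the total number of stored documents. The result is approximate: near-duplicates are found with high probability, not with certainty.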

GitHub repo: https://github.com/chr1st1ank/narrow-down.git
Documentation: https://chr1st1ank.github.io/narrow-down

Features#

Document indexing and search based on the Minhash LSH algorithm
High performance thanks to a native extension module in Rust
Easy-to-use API with automated parameter tuning
Works with exchangeable storage backends. Currently implemented:
- In-Memory
- Cassandra / ScyllaDB
- SQLite
- User-defined backends (by implementing a small interface)
Native asyncio interface
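As background on what parameter tuning means for Minhash LSH, the following sketch shows the textbook trade-off between the number of bands and the rows per band. It is a generic illustration and not necessarily how Narrow Down tunes its parameters: the function names are hypothetical, and the selection rule (match the approximate S-curve midpoint `(1/bands)^(1/rows)` to the desired similarity threshold) is an assumption.

```python
def candidate_probability(similarity, bands, rows):
    """Probability that two documents with the given Jaccard similarity
    share at least one identical band and therefore become candidates."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


def choose_bands_and_rows(threshold, num_perm=128):
    """Pick the (bands, rows) split of num_perm hash values whose
    approximate S-curve midpoint lies closest to the threshold."""
    best = None
    for rows in range(1, num_perm + 1):
        if num_perm % rows:
            continue  # only consider splits that use every hash value
        bands = num_perm // rows
        midpoint = (1.0 / bands) ** (1.0 / rows)
        error = abs(midpoint - threshold)
        if best is None or error < best[0]:
            best = (error, bands, rows)
    return best[1], best[2]


bands, rows = choose_bands_and_rows(0.5)
for s in (0.2, 0.5, 0.8):
    print(s, round(candidate_probability(s, bands, rows), 3))
```

More rows per band make matches stricter and reduce false positives. More bands give each pair more chances to match and reduce false negatives. Because only exact divisors of `num_perm` are considered here, the achievable midpoints are coarse; a real tuner could weigh false-positive and false-negative rates instead.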

Installation#

The Python package can be installed with pip:

pip install narrow-down

Extras#

Some of the heavier functionality is available as an optional extra:

pip install narrow-down[scylladb]   # Cassandra / ScyllaDB storage backend

Similar projects#

pylsh offers a good implementation of the classic Minhash LSH scheme in Python and Cython. If this is all you need and you don’t need a database backend, it can be a good choice.
Datasketch implements an interesting collection of data sketching algorithms for similarity matching, cardinality estimation and k-nearest-neighbour search. The implementation is not highly optimized but is very usable, the documentation is rich, and multiple database backends can be used for some of the sketches.
Milvus offers a full database stack for vector search, a different approach to fast search. It can also be applied to text search when an embedding model such as Word2Vec or BERT is used to vectorize the text.

Credits#

This package was created with Cookiecutter and the fedejaure/cookiecutter-modern-pypackage project template.