narrow-down 1.1.0

Narrow Down - Efficient near-duplicate search

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

Narrow Down offers a flexible but easy-to-use Python API for finding duplicates or similar documents, even in very large datasets. It reduces the O(n²) problem of comparing all strings with each other to linear scale by using approximation algorithms such as Locality Sensitive Hashing (LSH).

  • GitHub repo: https://github.com/chr1st1ank/narrow-down.git

  • Documentation: https://chr1st1ank.github.io/narrow-down
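
To illustrate the approximation idea behind the description above, here is a minimal pure-Python sketch of Minhash signatures. It is not Narrow Down's implementation (which lives in a native Rust extension) and does not use its API; the function names `shingles`, `minhash_signature`, `estimated_jaccard` and `exact_jaccard`, the use of character 3-grams and the choice of 128 hash functions are all assumptions made for this example. Each document is reduced to a fixed-length signature, and the fraction of matching signature positions approximates the Jaccard similarity of the two token sets.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set:
    """Return the set of character k-grams of a string (assumed tokenization)."""
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def _hash(token: str, seed: int) -> int:
    """64-bit hash of a token; the seed selects one of many hash functions."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens: set, num_perm: int = 128) -> list:
    """For every seeded hash function, keep the minimum hash over all tokens."""
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of signature positions on which both documents agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, for comparison with the estimate."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = shingles("the quick brown fox jumps over the lazy dog")
    b = shingles("the quick brown fox jumped over the lazy dog")
    print("exact:    ", exact_jaccard(a, b))
    print("estimated:", estimated_jaccard(minhash_signature(a), minhash_signature(b)))
```

Because a signature has a fixed length regardless of document size, comparing two documents costs the same for short and long texts. The signatures alone do not yet avoid pairwise comparison; that is the job of the LSH bucketing step.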

Features

  • Document indexing and search based on the Minhash LSH algorithm

  • High performance thanks to a native extension module in Rust

  • Easy-to-use API with automated parameter tuning

  • Works with exchangeable storage backends. Currently implemented:

    • In-Memory

    • Cassandra / ScyllaDB

    • SQLite

    • User defined backends (by implementing a small interface)

  • Native asyncio interface
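
As a rough picture of what "indexing and search based on Minhash LSH" with an in-memory backend means, the following sketch splits each Minhash signature into bands and uses every band as a bucket key in a dictionary. It is a simplified illustration under stated assumptions, not the library's code or interface: the class `LshIndex`, its `insert` and `query` methods, and the parameters `bands=16` and `rows=2` are hypothetical names and values chosen for this example.

```python
import hashlib
from collections import defaultdict


def signature(text: str, num_perm: int, k: int = 3) -> list:
    """Minhash signature over character k-grams (assumed tokenization)."""
    tokens = {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(t.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for t in tokens
        ))
    return sig


class LshIndex:
    """Hypothetical in-memory LSH index: documents sharing a band share a bucket."""

    def __init__(self, bands: int = 16, rows: int = 2) -> None:
        self.bands = bands
        self.rows = rows
        self.buckets = defaultdict(set)  # (band number, band values) -> document ids

    def _keys(self, text: str):
        sig = signature(text, self.bands * self.rows)
        for band in range(self.bands):
            start = band * self.rows
            yield band, tuple(sig[start:start + self.rows])

    def insert(self, doc_id: int, text: str) -> None:
        """Store the document id in one bucket per band."""
        for key in self._keys(text):
            self.buckets[key].add(doc_id)

    def query(self, text: str) -> set:
        """Return ids of candidates that collide with the query in any band."""
        candidates = set()
        for key in self._keys(text):
            candidates |= self.buckets.get(key, set())
        return candidates


if __name__ == "__main__":
    index = LshIndex()
    index.insert(1, "the quick brown fox jumps over the lazy dog")
    index.insert(2, "an entirely different sentence about databases")
    print(index.query("the quick brown fox jumped over the lazy dog"))
```

A query touches only a fixed number of buckets per document instead of every stored document, which is where the near-linear scaling comes from. The result is a candidate set: unrelated documents can occasionally collide, so candidates are normally verified with an exact or estimated similarity afterwards.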

Installation

The Python package can be installed with pip:

pip install narrow-down

Extras

Some of the heavier functionality is available as optional extras:

pip install "narrow-down[scylladb]"   # Cassandra / ScyllaDB storage backend

Similar projects

  • pylsh offers a good implementation of the classic Minhash LSH scheme in Python and Cython. If that is all you need and you don’t need a database backend, it can be a good choice.

  • Datasketch implements an interesting collection of different data sketching algorithms for similarity matching, cardinality estimation and k-nearest-neighbour search. The implementation is not highly optimized but perfectly usable, the documentation is rich, and multiple database backends can be used for some of the sketches.

  • Milvus offers a full database stack for vector search, a different approach to fast searching. It can also be applied to text search when an embedding model such as Word2Vec or BERT is used to vectorize the text.

Credits

This package was created with Cookiecutter and the fedejaure/cookiecutter-modern-pypackage project template.

Copyright © 2022, Christian Krudewig