narrow_down.similarity_store module#

High-level API for indexing and retrieval of documents.

class narrow_down.similarity_store.SimilarityStore[source]#

Bases: object

Storage class for indexing and fuzzy search of documents.

async classmethod create(*, storage=None, storage_level=StorageLevel.Minimal, tokenize=None, max_false_negative_proba=0.05, max_false_positive_proba=0.05, similarity_threshold=0.75)[source]#

Create a new SimilarityStore object.

Parameters:
  • storage (StorageBackend | None) – Storage backend to use for persisting the data. By default an in-memory backend is used.

  • storage_level (StorageLevel) – The granularity of the internal storage mechanism. By default nothing more than the document IDs is stored.

  • tokenize (str | Callable[[str], Collection[str]] | None) –

    The tokenization function to use to split the documents into smaller parts. For example, a document may be split into words or into character n-grams. By default word 3-grams are used.

    Built-in tokenizers can be used by passing their name and parameters as string. Options:

    • "word_ngrams(n)" enables the word-ngram tokenizer narrow_down._tokenize.word_ngrams()

    • "char_ngrams(n)" or "char_ngrams(n,c)" enables the character-ngram tokenizer narrow_down._tokenize.char_ngrams().

    It is also possible to pass a custom function (not as a string in this case, but the function itself). In that case, care must be taken to specify the same function again when saving and re-creating the SimilarityStore object (see the example below).

  • max_false_negative_proba (float) – The target probability for false negatives. Setting this lower decreases the risk of missing a similar document, but it leads to slower processing and more storage consumption.

  • max_false_positive_proba (float) – The target probability for false positives. Setting this lower decreases the risk of retrieving documents which are in reality not similar, but it leads to slower processing and more storage consumption.

  • similarity_threshold (float) – The minimum Jaccard similarity threshold used to identify two documents as being similar.

Raises:

ValueError – If the function specified with tokenize cannot be found.

Returns:

A new SimilarityStore object with already initialized storage.

Return type:

SimilarityStore
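
Example

A minimal usage sketch built only from the parameters documented above; the default in-memory backend is assumed, and only the tokenizer and threshold are overridden:

    import asyncio

    from narrow_down.similarity_store import SimilarityStore


    async def build_store():
        # Use the built-in character-ngram tokenizer and a stricter threshold.
        # All other parameters keep their documented defaults (in-memory
        # storage, StorageLevel.Minimal, 5% false-negative and false-positive
        # targets).
        return await SimilarityStore.create(
            tokenize="char_ngrams(3)",
            similarity_threshold=0.8,
        )


    store = asyncio.run(build_store())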

async classmethod load_from_storage(storage, tokenize=None)[source]#

Load a SimilarityStore object from already initialized storage.

Parameters:
  • storage (StorageBackend) – A StorageBackend object which has already been initialized by a SimilarityStore object.

  • tokenize (str | Callable[[str], Collection[str]] | None) – The tokenization function originally specified when the SimilarityStore was created. See narrow_down.similarity_store.SimilarityStore.create().

Returns:

A SimilarityStore object using the given storage backend and with the settings stored in the storage.

Raises:
  • TypeError – If settings in the storage are missing, corrupt or cannot be deserialized.

  • ValueError – If the function specified with tokenize cannot be found.

Return type:

SimilarityStore
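
Example

A sketch of re-opening an existing index; existing_backend is a placeholder for any StorageBackend instance that was previously initialized by a SimilarityStore:

    from narrow_down.similarity_store import SimilarityStore


    async def reopen(existing_backend):
        # All settings are read back from the storage. Only a custom tokenize
        # callable (if one was used at creation time) has to be passed again,
        # as described in the note on tokenize in create().
        return await SimilarityStore.load_from_storage(existing_backend)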

async insert(document, *, document_id=None, exact_part=None, data=None)[source]#

Index a new document.

Parameters:
  • document (str) – A document (as string) to index.

  • document_id (int | None) – Optional ID to assign to the document.

  • exact_part (str | None) – Optional exact string to match when searching for the document.

  • data (str | None) – Optional additional payload to save together with the document.

Returns:

The ID under which the document was indexed.

Return type:

int
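
Example

A sketch showing the optional arguments; the exact part and payload values are arbitrary illustrations:

    import asyncio

    from narrow_down.similarity_store import SimilarityStore


    async def index_one():
        store = await SimilarityStore.create()
        # Index a document together with an exact-match part and a payload.
        doc_id = await store.insert(
            "The quick brown fox jumps over the lazy dog",
            exact_part="en",   # exact string that must also match at query time
            data="doc-0001",   # arbitrary payload saved with the document
        )
        return doc_id


    print(asyncio.run(index_one()))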

async remove_by_id(document_id, check_if_exists=False)[source]#

Remove the document with the given ID from the internal data structures.

Parameters:
  • document_id (int) – ID of the document to remove.

  • check_if_exists (bool) – Raise a KeyError if the document does not exist.

Raises:
  • KeyError – If no document with the given ID is stored.

  • TooLowStorageLevel – If the storage level is too low and fingerprints are not available.

Return type:

None

Notes

This method is only usable with StorageLevel ‘Fingerprint’ or higher.
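
Example

A sketch of inserting and removing a document; the import path narrow_down.storage for StorageLevel is an assumption, as it is not shown on this page:

    import asyncio

    from narrow_down.similarity_store import SimilarityStore
    from narrow_down.storage import StorageLevel  # assumed import path


    async def insert_and_remove():
        # Fingerprints must be stored for removal to work, hence the higher
        # storage level.
        store = await SimilarityStore.create(
            storage_level=StorageLevel.Fingerprint
        )
        doc_id = await store.insert("some document")
        await store.remove_by_id(doc_id, check_if_exists=True)


    asyncio.run(insert_and_remove())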

async query(document, *, exact_part=None, validate=None)[source]#

Query all similar documents.

Parameters:
  • document (str) – A document for which to search for similar items.

  • exact_part (str | None) – Part that should be exactly matched.

  • validate (bool | None) – Whether to validate that the results are really above the similarity threshold. This is only possible if the storage level is at least “Document”. By default, validation is done if the data is available, otherwise not.

Returns:

A collection of StoredDocument objects with all elements which are estimated to be above the similarity threshold.

Return type:

Collection[StoredDocument]
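
Example

A sketch with the default settings; at StorageLevel.Minimal the document text is not available, so the returned candidates are not validated:

    import asyncio

    from narrow_down.similarity_store import SimilarityStore


    async def find_similar():
        store = await SimilarityStore.create()
        await store.insert("the quick brown fox jumps over the lazy dog")
        await store.insert("an entirely different sentence about databases")
        # Returns the StoredDocument entries estimated to be above the
        # similarity threshold for the query document.
        return await store.query("the quick brown fox jumped over a lazy dog")


    for match in asyncio.run(find_similar()):
        print(match)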

async query_top_n(n, document, *, exact_part=None, validate=None)[source]#

Query the top n similar documents.

Parameters:
  • n (int) – The number of similar documents to retrieve.

  • document (str) – A document for which to search for similar items.

  • exact_part (str | None) – Part that should be exactly matched.

  • validate (bool | None) – Whether to validate that the results are really above the similarity threshold. This is only possible if the storage level is at least “Document”. By default, validation is done if the data is available, otherwise not.

Returns:

A collection of StoredDocument objects with the n elements which are most likely above the similarity threshold.

Return type:

Collection[StoredDocument]

Note that the results are probabilistic. The documents with the most similar fingerprints are assumed to be the most likely candidates, but the actual similarity of the documents themselves might differ. However, if validate is True the ordering of the results is correct, because the actual documents are compared with each other.
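
Example

A sketch using the “Document” storage level so that the results can be validated and ordered by actual similarity; the import path narrow_down.storage for StorageLevel is an assumption:

    import asyncio

    from narrow_down.similarity_store import SimilarityStore
    from narrow_down.storage import StorageLevel  # assumed import path


    async def best_matches():
        store = await SimilarityStore.create(storage_level=StorageLevel.Document)
        for text in (
            "the quick brown fox jumps over the lazy dog",
            "the quick brown fox jumped over the lazy dog",
            "a completely unrelated sentence",
        ):
            await store.insert(text)
        # At most the 2 best candidates for the query document are returned.
        return await store.query_top_n(
            2, "the quick brown fox jumps over a lazy dog"
        )


    for match in asyncio.run(best_matches()):
        print(match)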