narrow_down.similarity_store module
High-level API for indexing and retrieval of documents.
- class narrow_down.similarity_store.SimilarityStore
Bases: object
Storage class for indexing and fuzzy search of documents.
- async classmethod create(*, storage=None, storage_level=StorageLevel.Minimal, tokenize=None, max_false_negative_proba=0.05, max_false_positive_proba=0.05, similarity_threshold=0.75)
Create a new SimilarityStore object.
- Parameters:
storage (StorageBackend | None) – Storage backend to use for persisting the data. By default, this is an in-memory backend.
storage_level (StorageLevel) – The granularity of the internal storage mechanism. By default, nothing but the document IDs is stored.
tokenize (str | Callable[[str], Collection[str]] | None) –
The tokenization function used to split the documents into smaller parts. For example, a document may be split into words or into character n-grams. By default, word 3-grams are used.
Built-in tokenizers can be used by passing their name and parameters as a string. Options:
"word_ngrams(n)" enables the word n-gram tokenizer narrow_down._tokenize.word_ngrams().
"char_ngrams(n)" or "char_ngrams(n,c)" enables the character n-gram tokenizer narrow_down._tokenize.char_ngrams().
It is also possible to pass a custom function (the function itself, not a string). In that case, take care to specify the same function again when saving and re-creating the SimilarityStore object.
max_false_negative_proba (float) – The target probability for false negatives. Setting this lower decreases the risk of not finding a similar document, but it leads to slower processing and more storage consumption.
max_false_positive_proba (float) – The target probability for false positives. Setting this lower decreases the risk of finding documents which are in reality not similar, but it leads to slower processing and more storage consumption.
similarity_threshold (float) – The minimum Jaccard similarity threshold used to identify two documents as being similar.
- Raises:
ValueError – If the function specified with tokenize cannot be found.
- Returns:
A new SimilarityStore object with already initialized storage.
- Return type:
SimilarityStore
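To make the tokenize and similarity_threshold parameters more concrete, the following minimal sketch shows word and character n-gram tokenization and the Jaccard similarity of the resulting token sets. The functions are illustrative stand-ins written for this example, not the library's narrow_down._tokenize implementations, and their handling of short inputs is an assumption.

```python
from typing import Set


def word_ngrams(text: str, n: int = 3) -> Set[str]:
    """Overlapping word n-grams (illustrative, not the library's code)."""
    words = text.split()
    if len(words) <= n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Overlapping character n-grams (illustrative, not the library's code)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "the quick brown fox jumps over the lazy cat"

# The same pair of documents gets a different score depending on the tokenizer.
word_sim = jaccard(word_ngrams(doc_a), word_ngrams(doc_b))
char_sim = jaccard(char_ngrams(doc_a), char_ngrams(doc_b))
```

Under word 3-grams the two sentences share 6 of 8 distinct tokens, so word_sim equals the default threshold of 0.75, while character 3-grams rate the same pair as more similar. The choice of tokenizer therefore changes which documents count as similar for a given similarity_threshold.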
- async classmethod load_from_storage(storage, tokenize=None)
Load a SimilarityStore object from already initialized storage.
- Parameters:
storage (StorageBackend) – A StorageBackend object which must already have been initialized by a SimilarityStore object before.
tokenize (str | Callable[[str], Collection[str]] | None) – The tokenization function originally specified when the SimilarityStore was initialized. See narrow_down.SimilarityStore.__init__().
- Returns:
A SimilarityStore object using the given storage backend and with the settings stored in the storage.
- Raises:
TypeError – If settings in the storage are missing, corrupt or cannot be deserialized.
ValueError – If the function specified with tokenize cannot be found.
- Return type:
SimilarityStore
- async insert(document, *, document_id=None, exact_part=None, data=None)
Index a new document.
- Parameters:
- Returns:
The ID under which the document was indexed.
- Return type:
- async remove_by_id(document_id, check_if_exists=False)
Remove the document with the given ID from the internal data structures.
- Parameters:
- Raises:
KeyError – If no document with the given ID is stored.
TooLowStorageLevel – If the storage level is too low and fingerprints are not available.
- Return type:
None
Notes
This method is only usable with StorageLevel ‘Fingerprint’ or higher.
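The following toy index sketches why removal by ID may depend on stored fingerprints: a document is filed into buckets derived from its fingerprint, so the fingerprint is needed to find those buckets again. Everything here (ToyIndex, keep_fingerprints, the bucket layout, and the local TooLowStorageLevel class) is hypothetical and written for illustration; it is not the library's internal data structure, and unlike the documented method it always checks whether the ID exists.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Fingerprint = Tuple[int, ...]


class TooLowStorageLevel(Exception):
    """Local stand-in for the library's exception of the same name."""


class ToyIndex:
    """Hypothetical bucket index keyed by (position, fingerprint value)."""

    def __init__(self, keep_fingerprints: bool) -> None:
        self.keep_fingerprints = keep_fingerprints
        self.buckets: Dict[Tuple[int, int], Set[int]] = defaultdict(set)
        self.fingerprints: Dict[int, Fingerprint] = {}

    def insert(self, document_id: int, fingerprint: Fingerprint) -> None:
        # File the ID under every bucket derived from its fingerprint.
        for position, value in enumerate(fingerprint):
            self.buckets[(position, value)].add(document_id)
        if self.keep_fingerprints:
            self.fingerprints[document_id] = fingerprint

    def remove_by_id(self, document_id: int) -> None:
        if not self.keep_fingerprints:
            # Without the fingerprint there is no way to locate the buckets.
            raise TooLowStorageLevel("fingerprints were not stored")
        if document_id not in self.fingerprints:
            raise KeyError(document_id)
        fingerprint = self.fingerprints.pop(document_id)
        for position, value in enumerate(fingerprint):
            self.buckets[(position, value)].discard(document_id)


index = ToyIndex(keep_fingerprints=True)
index.insert(1, (10, 20))
index.insert(2, (10, 30))
index.remove_by_id(1)
remaining = index.buckets[(0, 10)]  # only document 2 is left in the shared bucket
```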
- async query(document, *, exact_part=None, validate=None)
Query all similar documents.
- Parameters:
document (str) – A document for which to search similar items.
exact_part (str | None) – Part that should be exactly matched.
validate (bool | None) – Whether to validate if the results are really above the similarity threshold. This is only possible if the storage level is at least “Document”. By default, validation is done if the data is available, otherwise not.
- Returns:
A list of StoredDocument objects with all elements which are estimated to be above the similarity threshold.
- Return type:
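Because query results are estimates, it can help to see how error probabilities arise in a typical locality-sensitive hashing scheme. The sketch below assumes MinHash with banding (b bands of r rows each), which this page does not confirm as the store's internal mechanism; the configurations and similarity values are made up for illustration. Under that assumption, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two documents share at least one identical band."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Two hypothetical configurations; the number of hash values is bands * rows.
small = {"bands": 8, "rows": 8}     # 64 hash values per document
large = {"bands": 64, "rows": 16}   # 1024 hash values per document

for name, cfg in (("small", small), ("large", large)):
    # Chance of missing a genuinely similar pair (similarity 0.9).
    miss = 1.0 - candidate_probability(0.9, **cfg)
    # Chance of reporting a dissimilar pair (similarity 0.5) as a candidate.
    false_hit = candidate_probability(0.5, **cfg)
    hashes = cfg["bands"] * cfg["rows"]
    print(f"{name}: hashes={hashes} miss={miss:.2e} false_hit={false_hit:.2e}")
```

In this model the larger configuration lowers both error rates at once, at the price of sixteen times as many hash values per document. This is one way to read the trade-off between tighter values of max_false_negative_proba and max_false_positive_proba and higher processing and storage cost.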
- async query_top_n(n, document, *, exact_part=None, validate=None)
Query the top n similar documents.
- Parameters:
n (int) – The number of similar documents to retrieve.
document (str) – A document for which to search similar items.
exact_part (str | None) – Part that should be exactly matched.
validate (bool | None) – Whether to validate if the results are really above the similarity threshold. This is only possible if the storage level is at least “Document”. By default, validation is done if the data is available, otherwise not.
- Returns:
A list of StoredDocument objects with the n elements which are most likely above the similarity threshold.
- Return type:
Note that the results are probabilistic. Documents are treated as the most likely candidates when their fingerprints match the query best, but the actual similarity of the documents themselves might differ. However, if validate is True, the ordering of the results is correct, because the actual documents are compared with each other.
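The difference between fingerprint-based and validated ranking can be sketched as follows. This is a toy model that assumes the fingerprint is a MinHash signature over character 3-grams; the helper names (shingles, minhash, top_n) and the hashing scheme are hypothetical and not taken from the library.

```python
import hashlib
from typing import List, Set, Tuple


def shingles(text: str, n: int = 3) -> Set[str]:
    """Character n-grams of the text (at least one token for short inputs)."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def _hash(seed: int, token: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{token}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(tokens: Set[str], num_hashes: int = 64) -> List[int]:
    """Toy fingerprint: the minimum hash value per seeded hash function."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_similarity(fp_a: List[int], fp_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(fp_a, fp_b)) / len(fp_a)


def jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


def top_n(n: int, query: str, corpus: List[str], validate: bool) -> List[Tuple[float, str]]:
    """Rank by exact Jaccard if validate is set, else by the fingerprint estimate."""
    q_tokens = shingles(query)
    q_fp = minhash(q_tokens)
    scored = []
    for doc in corpus:
        tokens = shingles(doc)
        if validate:
            score = jaccard(q_tokens, tokens)
        else:
            score = estimated_similarity(q_fp, minhash(tokens))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


corpus = ["the quick brown fox", "the quick brown cat", "a completely different sentence"]
validated = top_n(2, "the quick brown fox", corpus, validate=True)
estimated = top_n(2, "the quick brown fox", corpus, validate=False)
```

With validation, the scores are exact Jaccard values, so the order is reliable. Without it, the scores are estimates whose accuracy depends on the number of hash functions, so near-ties between documents may be ordered differently from their true similarity.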