Basic Usage

The class SimilarityStore allows you to incrementally index and search documents. Both operations are demonstrated in the sections below.

The API is fully asynchronous. That means all relevant methods can be called directly with await from coroutine functions. They can also be called from synchronous code with asyncio.run(), but this creates a small overhead for establishing an event loop. It is therefore better not to call run() too often, but rather once at a higher level in the call chain.
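
Below is a minimal sketch of this pattern from synchronous code. The build_index helper and its document list are made up for illustration, and it assumes that SimilarityStore.create() is usable with its default settings; only the create() and insert() calls themselves are the library API shown later in this section.

import asyncio

import narrow_down as nd


async def build_index(documents):
    # Create the store and insert all documents inside one coroutine,
    # so a single event loop covers the whole batch of awaits.
    store = await nd.similarity_store.SimilarityStore.create()
    for i, doc in enumerate(documents):
        await store.insert(doc, document_id=i)
    return store


# From synchronous code, enter the event loop once, high up in the call chain:
store = asyncio.run(build_index(["first document", "second document"]))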

Indexing

The code block below shows how to create and configure a SimilarityStore object.

Here we choose the StorageLevel Document, so that the whole document is stored in the SimilarityStore and can be returned from it. A target similarity threshold of 75% is defined, which means that we want to search for documents which have a Jaccard similarity of at least 75% with the input document. To calculate the similarity, each document is first preprocessed by a tokenizer function; here we choose the character 3-grams of the document as tokens. A small illustration of 3-grams and Jaccard similarity follows the configuration code below.

import narrow_down as nd
from narrow_down.storage import StorageLevel, StoredDocument

similarity_store = await nd.similarity_store.SimilarityStore.create(
    storage_level=StorageLevel.Document,
    similarity_threshold=0.75,
    tokenize="char_ngrams(3)",
)
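
To build an intuition for these settings, the following sketch shows what character 3-grams and Jaccard similarity look like. The char_ngrams and jaccard functions are plain-Python illustrations, not the library's internal tokenizer, which may handle casing and padding differently.

def char_ngrams(s, n=3):
    # Set of all overlapping character n-grams of a string
    return {s[i : i + n] for i in range(len(s) - n + 1)}


def jaccard(a, b):
    # Jaccard similarity: shared tokens divided by all distinct tokens
    return len(a & b) / len(a | b)


left = char_ngrams("very good cookie")
right = char_ngrams("very, very good cookie")
print(jaccard(left, right))  # 0.777..., just above the 0.75 threshold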

Now the object can be filled with documents. As an example, reviews of a popular oatmeal cookie are used:

strings_to_index = [
    "Delicious!",
    "Great Anytime of Day!",
    "Very good!",
    "Quick, simple HEALTHY snack for the kiddos!!!",
    "Quaker Soft Baked Oatmeal Cookies",
    "Yummy",
    "Wow!!!!!",
    "soft, chewy, yummy!",
    "so soft and good",
    "Chewy deliciousness",
    "the bomb",
    "Deliciousness",
    "Yummy",
    "awesome cookies",
    "Home-baked taste without the fuss",
    "Yummy Whole Grain Goodness!!!",
    "Yummy",
    "Amazing Cookies!",
    "Good, but not homemade.",
    "Very Good Oatmeal Cookie",
    "Very good cookie",
    "Love these cookies especially for the kids",
    "My kids loved them.",
    "Lunchbox or Work Staple",
    "So Delious as no other",
    "Over-Packaged Product",
    "yum",
    "Great taste",
    "Yummy!!",
    "Well, the foil packet is handy...",
    "TOTALLY DIFFERENT!",
]


# Index each review in lowercase (the queries below are lowercased as well),
# using its position in the list as the document ID
for i, doc in enumerate(strings_to_index):
    await similarity_store.insert(doc.lower(), document_id=i)

Querying

Now that some data is indexed, the SimilarityStore is ready to execute searches:

search_result = await similarity_store.query("Awesome cookies".lower())
search_result == [StoredDocument(id_=13, document="awesome cookies")]
True
search_result = await similarity_store.query("So Delicious as no other".lower())
search_result == [StoredDocument(id_=24, document="so delious as no other")]
True
search_result = await similarity_store.query("Very, very good cookie".lower())
search_result == [StoredDocument(id_=20, document="very good cookie")]
True
search_result = await similarity_store.query("Loving every bit of it!".lower())
search_result == []
True

Adding more documents

There is no split between a training and a prediction phase. More documents can be added at any time:

await similarity_store.insert("Good cookie", document_id=42)
42
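
As a brief follow-up (not part of the original example): the newly added document is immediately available to queries. Since the query text below is identical to the stored document, the result should include it, possibly alongside other sufficiently similar reviews.

# Querying with the same text should return the new document (id_=42),
# since an identical document has a Jaccard similarity of 1.0:
search_result = await similarity_store.query("Good cookie")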