Storage Backends#
Narrow-down is based on a flexible storage abstraction. The common interface is the abstract class StorageBackend. By default, a new SimilarityStore object starts with an empty index and uses in-memory storage. In this case, the lifetime of the index is bound to the lifetime of the SimilarityStore object. This can be changed by explicitly specifying a storage backend.
The following backends are built in:
InMemoryStore (in-memory storage, with optional serialization to a file)
SQLiteStore (for a local SQLite database)
ScyllaDBStore (for ScyllaDB or Cassandra)
Using storage backends#
Explicitly specifying a storage backend#
A storage backend can be explicitly defined and handed over to a SimilarityStore object. To do so, pass it as the storage argument to the create method, which takes care of initializing the backend, e.g. by creating the necessary database tables:
from narrow_down.similarity_store import SimilarityStore
from narrow_down.storage import InMemoryStore
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(storage=storage_backend)
Loading SimilarityStore from storage#
All the settings of a SimilarityStore object are also persisted in the storage backend. Therefore one can re-create a SimilarityStore from storage:
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
StoredDocument#
The documents are represented in storage as StoredDocument objects. This is a simple dataclass with the following attributes:

Attribute | Description
---|---
id_ | Unique identifier
document | The actual text to use for fuzzy matching
exact_part | An optional string which should be matched exactly
data | Payload to persist together with the document
fingerprint | A fuzzy fingerprint of the document, e.g. a Minhash
A StoredDocument is also the type of object returned as a search result.
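To illustrate the shape of such objects, the sketch below defines a hypothetical stand-in dataclass with the same attributes. The real class is provided by narrow_down.storage; this stand-in only mirrors its fields for demonstration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in mirroring the attributes of narrow_down.storage.StoredDocument.
# The real class comes from the library; this sketch only illustrates its shape.
@dataclass(frozen=True)
class StoredDocumentSketch:
    id_: Optional[int] = None
    document: Optional[str] = None
    exact_part: Optional[str] = None
    data: Optional[str] = None
    fingerprint: Optional[Tuple[int, ...]] = None

# A query result at storage level "Minimal" would look roughly like this:
result = StoredDocumentSketch(id_=1, data="additional data")
print(result.id_, result.document, result.data)  # → 1 None additional data
```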
Storage levels#
Depending on the use case, different levels of persistence may be preferable. Sometimes it is enough to store just enough data to retrieve the IDs of matching documents. In other cases, it can be better to store the whole documents or some additional data in the index, so that no second database is needed for this data.
The available storage levels as defined in the enum StorageLevel are:
Storage level | Effect
---|---
Minimal | Minimal storage level: only the id_ and the data payload are stored.
Fingerprint | Store in addition to “Minimal” also the “fingerprint” attribute.
Document | Store in addition to “Minimal” also the “document” and the “exact_part” attributes.
Full | Stores all attributes of the StoredDocument.
The example below shows how to set and use the storage levels with a SimilarityStore.
With “Minimal” (the default value), we only get the id_ and data attributes out as query result. All other attributes are None:
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(storage=storage_backend)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document=None, exact_part=None, fingerprint=None, data='additional data')
If we use the storage level “Document” instead, the document and exact_part attributes are also stored and retrieved:
from narrow_down.storage import StorageLevel
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
Note: The storage level needs to be defined when creating a SimilarityStore and cannot be changed later.
InMemoryStore#
The simplest backend, and also the fastest for both indexing and querying, is InMemoryStore. It stores all data in in-memory data structures, so the data is only accessible within the process which holds the memory. However, it also offers a way to persist the data to disk: it can be serialized to a file in the efficient binary MessagePack format.
Advantages:
Fastest backend
Easy setup
Disadvantages:
Data size is limited by the physical memory
Only one process can access the data for writing at the same time
# Initialize and use:
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
# Store to a file:
storage_backend.to_file("/tmp/storage-backend.msgpack")
# Load again:
storage_backend = InMemoryStore.from_file("/tmp/storage-backend.msgpack")
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
ScyllaDB or Cassandra#
With access to Apache Cassandra or ScyllaDB (a reimplementation of Cassandra in C++), it is possible to use Narrow-down in a distributed system and beyond the boundaries of a single system’s memory.
See the API documentation of ScyllaDBStore for more details.
Advantages:
Can be used across multiple processes or services
Low memory footprint for the application
Asynchronous implementation allows concurrent usage
Disadvantages:
Database server needed
Slower than in-memory storage
import cassandra.cluster
from narrow_down.scylladb import ScyllaDBStore
cassandra_cluster = cassandra.cluster.Cluster(contact_points=["localhost"], port=9042)
session = cassandra_cluster.connect()
session.execute(
"CREATE KEYSPACE IF NOT EXISTS test_ks "
"WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1} "
"AND durable_writes = False"
)
cassandra_storage = ScyllaDBStore(session, keyspace="test_ks")
similarity_store = await SimilarityStore.create(storage=cassandra_storage)
So the actual connection is created and managed outside of narrow_down and passed to it.
After the first initialization of the database, one should no longer use the create() method, but rather the load_from_storage() method:
similarity_store = await SimilarityStore.load_from_storage(storage=cassandra_storage)
SQLite#
Narrow-down also supports using a local SQLite database as storage backend. This offers a simple setup for amounts of data which exceed the memory limit. It is fairly fast on Linux, where it can leverage the file system’s write cache. On Windows, indexing is very slow, because every commit operation is directly flushed to disk.
Configuration options are documented in the API documentation of SQLiteStore.
Advantages:
Low memory footprint for the application
Easy setup without external dependencies
Disadvantages:
Slow indexing, especially on Windows
Usage example:
from narrow_down.sqlite import SQLiteStore
# Initialize and use:
storage_backend = SQLiteStore("/tmp/storage-backend.sqlite")
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
# Reopen and continue later:
storage_backend = SQLiteStore("/tmp/storage-backend.sqlite")
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
Custom backend#
Narrow-down is designed to make it easy to implement a custom storage backend. This allows plugging in a backend which, for example, uses a database that is not supported out of the box.
In order to do so, create a class which inherits from the abstract StorageBackend class and implement its methods. The implementations of the built-in backends can serve as examples.
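As an illustration, the sketch below shows the general shape of a dictionary-backed backend. The method names and signatures are assumptions for demonstration purposes only; consult the API documentation of StorageBackend for the actual abstract methods to override.

```python
import asyncio
from typing import Iterable, Optional

# Hypothetical dict-backed backend sketch. The method names and signatures are
# assumptions about what such an interface could look like; check the
# StorageBackend API documentation for the real abstract interface.
class DictStore:
    def __init__(self) -> None:
        self._documents: dict = {}
        self._buckets: dict = {}
        self._next_id: int = 0

    async def insert_document(self, document: bytes, document_id: Optional[int] = None) -> int:
        # Assign a new ID unless the caller provides one
        if document_id is None:
            self._next_id += 1
            document_id = self._next_id
        self._documents[document_id] = document
        return document_id

    async def query_document(self, document_id: int) -> bytes:
        return self._documents[document_id]

    async def add_document_to_bucket(self, bucket_id: int, document_hash: int, document_id: int) -> None:
        # LSH-style buckets map (bucket, hash) pairs to sets of document IDs
        self._buckets.setdefault((bucket_id, document_hash), set()).add(document_id)

    async def query_ids_from_bucket(self, bucket_id: int, document_hash: int) -> Iterable[int]:
        return self._buckets.get((bucket_id, document_hash), set())


async def demo() -> list:
    store = DictStore()
    doc_id = await store.insert_document(b"serialized document")
    await store.add_document_to_bucket(bucket_id=0, document_hash=42, document_id=doc_id)
    return [doc_id, await store.query_document(doc_id), doc_id in await store.query_ids_from_bucket(0, 42)]


print(asyncio.run(demo()))  # → [1, b'serialized document', True]
```

All methods are coroutines, matching the asynchronous style of the built-in backends shown above.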
Advantages:
Databases without built-in support can be used
Disadvantages:
Some implementation effort (typically 100-200 lines of code)