Storage Backends#
Narrow-down is based on a flexible storage abstraction. The common interface is the abstract class StorageBackend. By default, a new SimilarityStore object starts with an empty index and uses in-memory storage. In this case, the lifetime of the index is bound to the lifetime of the SimilarityStore object. This can be changed by explicitly specifying a storage backend.
The following backends are built in:
InMemoryStore (in-memory storage, with optional serialization to a file)
SQLiteStore (for a local SQLite database)
ScyllaDBStore (for ScyllaDB or Cassandra)
Using storage backends#
Explicitly specifying a storage backend#
A storage backend can be explicitly defined and handed over to a SimilarityStore object. To do so, pass it as the storage argument to the create method, which takes care of initializing the backend, e.g. by creating the necessary database tables:
from narrow_down.similarity_store import SimilarityStore
from narrow_down.storage import InMemoryStore
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(storage=storage_backend)
Loading SimilarityStore from storage#
All the settings of a SimilarityStore object are also persisted in the storage backend. Therefore one can re-create a SimilarityStore from storage:
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
StoredDocument#
The documents are represented in storage as StoredDocument objects. This is a simple dataclass with the following attributes:

Attribute | Description
---|---
id_ | Unique identifier
document | The actual text to use for fuzzy matching
exact_part | An optional string which should be matched exactly
data | Payload to persist together with the document
fingerprint | A fuzzy fingerprint of the document, e.g. a Minhash
A StoredDocument is also the type of object returned as a search result.
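To illustrate the shape of such objects, the sketch below defines a hypothetical stand-in dataclass with the same attributes. The real class is provided by narrow_down.storage; this stand-in only mirrors its fields for demonstration:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical stand-in mirroring the attributes of narrow_down.storage.StoredDocument.
# The real class comes from the library; this sketch only illustrates its shape.
@dataclass(frozen=True)
class StoredDocumentSketch:
    id_: Optional[int] = None
    document: Optional[str] = None
    exact_part: Optional[str] = None
    data: Optional[str] = None
    fingerprint: Optional[Tuple[int, ...]] = None

# A query result at storage level "Minimal" would look roughly like this:
result = StoredDocumentSketch(id_=1, data="additional data")
print(result.id_, result.document, result.data)  # → 1 None additional data
```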
Storage levels#
Depending on the use case, different levels of persistence may be preferable. Sometimes it is enough to store just enough data to retrieve the IDs of matching documents. In other cases, it can be better to store the whole documents or some additional data in the index, so that no second database is needed for this data.
The available storage levels as defined in the enum StorageLevel are:
Storage level | Effect
---|---
Minimal | Minimal storage level: only the id_ and the data payload are stored.
Fingerprint | Store in addition to “Minimal” also the “fingerprint” attribute.
Document | Store in addition to “Minimal” also the “document” and the “exact_part” attributes.
Full | Stores all attributes of the StoredDocument.
The example below shows how to set and use the storage levels with a SimilarityStore.
With “Minimal” (the default value), we only get the id_ and data attributes out as query result. All other attributes are None:
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(storage=storage_backend)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document=None, exact_part=None, fingerprint=None, data='additional data')
If we use the storage level “Document” instead, the document and exact_part attributes are also stored and retrieved:
from narrow_down.storage import StorageLevel
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
Note: The storage level needs to be defined when creating a SimilarityStore and cannot be changed later.
InMemoryStore#
The simplest backend, and also the fastest for both indexing and querying, is InMemoryStore. It stores all data in in-memory data structures, so the data is only accessible within the process which holds the memory. However, it also offers a way to persist the data to disk: it can be serialized to a file in the efficient binary MessagePack format.
Advantages:
Fastest backend
Easy setup
Disadvantages:
Data size is limited by the physical memory
Only one process can access the data for writing at the same time
# Initialize and use:
storage_backend = InMemoryStore()
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
# Store to a file:
storage_backend.to_file("/tmp/storage-backend.msgpack")
# Load again:
storage_backend = InMemoryStore.from_file("/tmp/storage-backend.msgpack")
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
ScyllaDB or Cassandra#
With access to Apache Cassandra or ScyllaDB (a reimplementation of Cassandra in C++), it is possible to use Narrow-down in a distributed system and beyond the boundaries of a single system’s memory.
See the API documentation of ScyllaDBStore for more details.
Advantages:
Can be used across multiple processes or services
Low memory footprint for the application
Asynchronous implementation allows concurrent usage
Disadvantages:
Database server needed
Slower than in-memory storage
import cassandra.cluster
from narrow_down.scylladb import ScyllaDBStore
cassandra_cluster = cassandra.cluster.Cluster(contact_points=["localhost"], port=9042)
session = cassandra_cluster.connect()
session.execute(
"CREATE KEYSPACE IF NOT EXISTS test_ks "
"WITH replication = {'class': 'SimpleStrategy', 'replication_factor' : 1} "
"AND durable_writes = False"
)
cassandra_storage = ScyllaDBStore(session, keyspace="test_ks")
similarity_store = await SimilarityStore.create(storage=cassandra_storage)
So the actual connection is created and managed outside of narrow_down and passed to it.
After the first initialization of the database, one should no longer use the create() method, but rather the load_from_storage() method:
similarity_store = await SimilarityStore.load_from_storage(storage=cassandra_storage)
SQLite#
Narrow-down also supports using a local SQLite database as storage backend. This offers a simple setup for amounts of data which exceed the memory limit. It is fairly fast on Linux, where it can leverage the file system’s write cache. On Windows, indexing is very slow, because every commit operation is directly flushed to disk.
Configuration options are documented in the API documentation of SQLiteStore.
Advantages:
Low memory footprint for the application
Easy setup without external dependencies
Disadvantages:
Slow indexing, especially on Windows
Usage example:
from narrow_down.sqlite import SQLiteStore
# Initialize and use:
storage_backend = SQLiteStore("/tmp/storage-backend.sqlite")
similarity_store = await SimilarityStore.create(
storage=storage_backend, storage_level=StorageLevel.Document
)
await similarity_store.insert(
document="the document text", exact_part="the exact part", data="additional data"
)
# Reopen and continue later:
storage_backend = SQLiteStore("/tmp/storage-backend.sqlite")
similarity_store = await SimilarityStore.load_from_storage(storage=storage_backend)
result = await similarity_store.query(document="the document text", exact_part="the exact part")
result[0]
StoredDocument(id_=1, document='the document text', exact_part='the exact part', fingerprint=None, data='additional data')
Custom backend#
Narrow-down is designed to make it easy to implement a custom storage backend. This allows plugging in a backend which, for example, uses a database that is not supported out of the box.
In order to do so, create a class which inherits from the abstract StorageBackend class and implement its methods. The implementations of the built-in backends can serve as examples.
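As an illustration, the sketch below shows the general shape of a dictionary-backed backend. The method names and signatures are assumptions for demonstration purposes only; consult the API documentation of StorageBackend for the actual abstract methods to override.

```python
import asyncio
from typing import Iterable, Optional

# Hypothetical dict-backed backend sketch. The method names and signatures are
# assumptions about what such an interface could look like; check the
# StorageBackend API documentation for the real abstract interface.
class DictStore:
    def __init__(self) -> None:
        self._documents: dict = {}
        self._buckets: dict = {}
        self._next_id: int = 0

    async def insert_document(self, document: bytes, document_id: Optional[int] = None) -> int:
        # Assign a new ID unless the caller provides one
        if document_id is None:
            self._next_id += 1
            document_id = self._next_id
        self._documents[document_id] = document
        return document_id

    async def query_document(self, document_id: int) -> bytes:
        return self._documents[document_id]

    async def add_document_to_bucket(self, bucket_id: int, document_hash: int, document_id: int) -> None:
        # LSH-style buckets map (bucket, hash) pairs to sets of document IDs
        self._buckets.setdefault((bucket_id, document_hash), set()).add(document_id)

    async def query_ids_from_bucket(self, bucket_id: int, document_hash: int) -> Iterable[int]:
        return self._buckets.get((bucket_id, document_hash), set())


async def demo() -> list:
    store = DictStore()
    doc_id = await store.insert_document(b"serialized document")
    await store.add_document_to_bucket(bucket_id=0, document_hash=42, document_id=doc_id)
    return [doc_id, await store.query_document(doc_id), doc_id in await store.query_ids_from_bucket(0, 42)]


print(asyncio.run(demo()))  # → [1, b'serialized document', True]
```

All methods are coroutines, matching the asynchronous style of the built-in backends shown above.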
Advantages:
Databases without built-in support can be used
Disadvantages:
Some implementation effort (typically 100-200 lines of code)