Changelog#
All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased#
1.1.0 - 2023-05-01#
Changed#
Remove dependency on the protobuf package by using a Rust implementation for serialization
Fixed#
Tests were failing because of a breaking change in Nox
1.0.1 - 2023-04-30#
Changed#
Update pre-commit hooks and dependencies
Allow protobuf 4 to be used as well
1.0.0 - 2022-05-17#
Added#
The storage backends now have a method query_documents() to leverage economies of scale when querying multiple documents at once.
Changed#
The public interface of the library is declared stable, hence it is ready for version 1.0.
char_ngrams() is now fully implemented in Rust, giving a speedup of 2x.
minhash LSH uses the new query_documents() of the storage backends instead of running concurrent queries.
Fixed#
Wrong operator precedence in minhash implementation, which led to incorrect results.
Incorrect parsing of tokenize argument for SimilarityStore for char_ngrams without padding.
0.10.0 - 2022-05-08#
Changed#
Improved performance of SimilarityStore.query_top_n().
0.9.3 - 2022-04-05#
Fixed#
Fixes #63 which led to Exceptions in case of empty documents.
0.9.2 - 2022-03-29#
Fixed#
Fixes #62 which led to TypeErrors in case of multiple identical results.
0.9.1 - 2022-03-25#
Changed#
Minimum number of hash permutations for Minhash LSH set to 16 to avoid artifacts as described in #61.
0.9.0 - 2022-03-13#
Added#
ScyllaDBStore now accepts a table_prefix setting.
Changed#
The classes in narrow_down.data_types were moved to narrow_down.storage.
The initialize() method of the storage backends can now be called multiple times without issues.
Fixed#
A use of collections.Counter as typehint broke mypy checks.
0.8.0 - 2022-02-23#
Added#
Direct InMemoryStore file serialization in the Rust backend. This avoids a memory peak and also improves the performance of the operation compared to (de-)serialization via the detour of a Python bytes object.
0.7.0 - 2022-02-06#
Added#
InMemoryStore can be serialized to and deserialized from MessagePack.
SimilarityStore.top_n_query() now allows finding a limited number of the most similar documents.
SimilarityStore offers the option to validate the similarity score if the document is available to avoid false positives.
Changed#
SimilarityStore objects can now be created by a factory coroutine create() instead of first calling __init__() and then initialize(). This makes the usage of the class more straightforward.
The exact_part of a document is now also stored in storage level “Document”.
The InMemoryStore no longer uses Python dictionaries as storage, but rather a class in the Rust extension to reduce the memory footprint by a lot.
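The factory coroutine mentioned in this list follows a common asyncio pattern: because __init__() cannot be awaited, asynchronous setup is moved into a classmethod that constructs and initializes the object in one step. The following is a minimal sketch of that pattern only. ExampleStore is a hypothetical stand-in, not the real SimilarityStore, and its attributes are assumptions made for illustration.

```python
import asyncio


class ExampleStore:
    """Hypothetical stand-in used to illustrate the pattern (not SimilarityStore)."""

    def __init__(self) -> None:
        self.initialized = False

    async def initialize(self) -> None:
        # Placeholder for asynchronous setup, e.g. preparing a storage backend.
        await asyncio.sleep(0)
        self.initialized = True

    @classmethod
    async def create(cls) -> "ExampleStore":
        # Construct and initialize in a single awaitable call.
        store = cls()
        await store.initialize()
        return store


async def main() -> ExampleStore:
    return await ExampleStore.create()


store = asyncio.run(main())
```

With this shape, callers cannot forget the second step, because an object obtained from create() is always initialized.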
Fixed#
The number of partitions is now stored in the database for the SQLite backend. This way the DB is self-contained and the user doesn’t have to keep the number elsewhere.
0.6.0 - 2022-01-29#
Added#
Storage backend for ScyllaDB, a Cassandra-like distributed key-value store.
Changed#
StoredDocument objects are now serialized with protobuf to increase speed and reduce storage consumption.
Storage queries are done concurrently where possible
ScyllaDB sessions are now reused, which gives a great performance benefit
Fixed#
Integer overflows in the minhash calculation which reduced the quality of the permutations (hash functions). Depending on the input, max_uint32 was effectively used instead of a prime number in the modulo calculation.
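To illustrate how an overflow can make a prime modulus ineffective, here is a minimal Python sketch. It is not the library's code: it assumes the common universal-hashing form ((a * x + b) mod p) truncated to 32 bits, with the Mersenne prime p = 2^61 - 1, and it simulates the overflow with a 32-bit wrap. The function names, the constants, and the exact form of the wrap are assumptions, so the effective modulus in the original bug may differ from the one shown here.

```python
MERSENNE_PRIME = (1 << 61) - 1  # assumed prime modulus (illustrative choice)
MAX_UINT32 = (1 << 32) - 1


def permute(x: int, a: int, b: int) -> int:
    """Reduce modulo the prime first, then truncate to 32 bits."""
    return ((a * x + b) % MERSENNE_PRIME) & MAX_UINT32


def permute_overflowing(x: int, a: int, b: int) -> int:
    """Same formula, but the intermediate value wraps at 32 bits first.

    After the wrap the value is already smaller than the prime, so the
    modulo by MERSENNE_PRIME no longer changes anything.
    """
    wrapped = (a * x + b) & MAX_UINT32  # simulated overflow
    return (wrapped % MERSENNE_PRIME) & MAX_UINT32


x, a, b = 1 << 30, 1 << 31, 0  # a * x + b == 2**61
print(permute(x, a, b))              # 1
print(permute_overflowing(x, a, b))  # 0
```

In the overflowing variant the prime never takes part in the result, which is the kind of degradation the entry describes.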
Removed#
The backend AsyncSQLiteStore is removed, because it turned out that aiosqlite relies on the user to guarantee that only one coroutine at a time tries writing. Otherwise, a “Database locked” exception is thrown. As the performance was also worse than expected, the backend was removed.
0.5.0 - 2022-01-17#
Changed#
The SQLite backends now take an init parameter “partitions”, which leads to internally partitioned tables. This reduces query time by a lot.
Parameters for SQLite were optimized in order to increase insertion speed and reduce the number of disk write operations.
0.4.0 - 2022-01-16#
Added#
Synchronous and asynchronous SQLite storage backends
Settings of SimilarityStore objects are now saved in the storage backend
Deserialization of SimilarityStore from an existing storage backend is now possible
0.3.1 - 2022-01-14#
Fixed#
Wrong order of parameters for Minhash LSH in the SimilarityStore class
0.3.0 - 2022-01-09#
Added#
Documents can now be removed from the index again with SimilarityStore.remove_by_id()
0.2.1 - 2022-01-09#
Fixed#
CI only: Wrong URL of test-pypi
0.2.0 - 2022-01-09#
Added#
A Rust extension was added, and the build system was therefore moved to Maturin.
A Minhash-LSH data structure was implemented.
Different tokenizers for character- and word-n-grams were implemented.
0.1.1 - 2021-12-30#
0.1.0 - 2021-12-30#
Added#
First release on PyPI.