Configure an Encrypted Index
Cyborg Vector Search offers three index types, all of which offer varying characteristics:
Index Type | Speed | Recall | Index Size |
---|---|---|---|
IVFFlat | Fast | Highest | Biggest |
IVFPQ | Fast | High | Medium |
IVF | Fastest | Lowest | Smallest |
Generally-speaking, we recommend that you start with IVFFlat
, which provides the highest recall (accuracy) with high indexing and retrieval speeds. For most use-cases, this index type will scale well. If minimizing index size is important, IVFPQ
takes IVFFlat
and applies Product-Quantization (PQ, a form of lossy compression) to the vector embeddings which can greatly reduce index size.
IVFFlat
Index Type
The IVFFlat
index type improves IVF
significantly by storing encrypted vector embeddings in the index. In addition to selecting the closest clusters for a query vector, the exact distance can be computed between each candidate vector and the query, yielding very high recall rates (up to >99%). This comes at the cost of index size and some search speed.
We recommend IVFFlat
indexes for most applications, as it provides the highest recall rates, and it’s possible to mitigate index size constraints later via IVFPQ
.
To create an IVFFlat
index, you can use its configuration constructor:
n_lists
is the number of clusters into which each vector in the index can be categorized. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a good rule of thumbs, n_lists
should be:
- A base-2 number (e.g.,
2,048
,4,096
). Not a requirement, but yields performance optimizations. - Each cluster should have between
100
-10,000
vectors; son_lists
should be roughly between1/100
-1/10,000
of the total number of items which will be indexed.
IVFPQ
Index Type
The IVFPQ
index type builds upon IVFFlat
by applying Product Quantization (PQ) - a form of lossy compression - to reduce the index size. When applied correctly, IVFPQ
indexes can maintain high recall (>95%) while reducing index size significantly (2-4x).
We recommend IVFPQ
indexes for mature applications, where the dataset and query distributions are well-established. This is because IVFPQ
requires the most tuning to yield an ideal balance between recall and index size. It is possible to go from IVFFlat
to IVFPQ
on the same index, but not vice-versa.
To create an IVFPQ
index, you can use its configuration constructor:
n_lists
is the number of clusters into which each vector in the index can be categorized. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a good rule of thumbs, n_lists
should be:
- A base-2 number (e.g.,
2,048
,4,096
). Not a requirement, but yields performance optimizations. - Each cluster should have between
100
-10,000
vectors; son_lists
should be roughly between1/100
-1/10,000
of the total number of items which will be indexed.
pq_dim
is the number of dimensionality for each vector after product-quantization. It must be between 1
and dimension
, and dimension
must be cleanly divisible by pq_dim
. Lower pq_dim
will yield smaller index sizes but lower recall.
pq_bits
is the number of bits that will be used to represent each dimension of the product-quantized vector embeddings. It must be between 1
and 16
, with lower values yielding smaller index sizes but lower recall.
IVF
Index Type
The IVF
index type (Inverted File Index) is the simplest offered by Cyborg Vector Search. It takes a single parameter, n_lists
, which defines the number of lists, or clusters, into which the index is segmented. In other words, vectors that are close together will be clustered together into a single list. At retrieval time, vectors in the clusters closest to the query vector will be returned. This yields a very fast, but rather inaccurate, search.
We recommend IVF
indexes for applications which require high-speed, low-latency search with low recall requirements (or where top_k
is rather large, i.e. >500).
To create an IVF
index, you can use its configuration constructor:
n_lists
is the number of clusters into which each vector in the index can be categorized. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a good rule of thumbs, n_lists
should be:
- A base-2 number (e.g.,
2,048
,4,096
). Not a requirement, but yields performance optimizations. - Each cluster should have between
100
-10,000
vectors; son_lists
should be roughly between1/100
-1/10,000
of the total number of items which will be indexed.
Customizing Distance Metrics
By default, Cyborg Vector Search uses euclidean
distance as its metric for all index types. You can override this default by providing a distance_metric
parameter to any of the index constructors. For example:
The currently supported distance metrics are:
"cosine"
: Cosine similarity."euclidean"
: Euclidean distance."squared_euclidean"
: Squared Euclidean distance.
API Reference
For more information on the IndexConfig
classes, refer to the API Reference:
Was this page helpful?