CyborgDB uses IVF* index types, which leverage clustering algorithms to segment the index into smaller sections for efficient querying. These clustering algorithms must be trained on the specific data being indexed in order to adequately represent that data. In the embedded library version of CyborgDB, this training must be explicitly called once enough vectors have been added:
# Train the encrypted index
index.train()

# Train the index with a specific number of clusters
index.train(n_lists=1024)

# Train the index with specific configuration
index.train(
    n_lists=1024, 
    batch_size=64, 
    max_iters=100
)
Before you can call train, the index must contain at least 2 * n_lists or 10,000 vectors (ingested via upsert), whichever is greater.
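The minimum-vector rule above can be checked before calling train. The following is a minimal sketch in plain Python: `min_vectors_for_training` and `can_train` are hypothetical helper names, not part of the CyborgDB API, and the sketch assumes you pass an explicit, positive n_lists (it does not cover the auto-selection case where n_lists is 0).

```python
def min_vectors_for_training(n_lists: int) -> int:
    """Return the minimum vector count from the rule above:
    the greater of 2 * n_lists and 10,000."""
    if n_lists < 1:
        # Assumption: auto-selection (n_lists=0) is out of scope for this helper.
        raise ValueError("n_lists must be a positive integer")
    return max(2 * n_lists, 10_000)


def can_train(n_vectors: int, n_lists: int) -> bool:
    """True if an index holding n_vectors meets the minimum for n_lists."""
    return n_vectors >= min_vectors_for_training(n_lists)
```

For example, with n_lists=1024 the 10,000 floor applies, while with n_lists=8192 the 2 * n_lists term dominates and 16,384 vectors are needed.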

Training Parameters

Parameters are available to customize the training process:
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_lists | int | 0 | (Optional) Number of inverted index lists to create in the index. If 0, it will auto-determine based on the number of vectors in the index. |
| batch_size | int | 2048 | (Optional) Size of each batch for training. |
| max_iters | int | 0 | (Optional) Maximum number of iterations for training. 0 auto-selects the iteration count. |
| tolerance | float | 1e-6 | (Optional) Convergence tolerance for training. |
| max_memory | int | 0 | (Optional) Maximum memory to use for training. 0 sets no limit. |

n_lists is the number of clusters into which the vectors in the index are partitioned. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a good rule of thumb, n_lists should be:
  • A power of 2 (e.g., 2,048 or 4,096). Not a requirement, but yields performance optimizations.
  • Roughly between 1/10,000 and 1/100 of the total number of items that will be indexed, so that each cluster holds between 100 and 10,000 vectors.
If not provided, n_lists will be auto-selected based on the number of vectors in the index.
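To make the rule of thumb above concrete, here is a small sketch that lists the powers of 2 keeping the average cluster size between 100 and 10,000 vectors. `suggest_n_lists` is a hypothetical helper written for illustration; it is not part of CyborgDB and does not reproduce how the library auto-selects n_lists.

```python
def suggest_n_lists(total_vectors: int) -> list[int]:
    """Return powers of 2 for which total_vectors / n_lists
    falls between 100 and 10,000 vectors per cluster."""
    if total_vectors < 100:
        return []  # too few vectors for even one 100-vector cluster
    low = max(1, -(-total_vectors // 10_000))  # ceil(total / 10,000)
    high = total_vectors // 100                # floor(total / 100)
    candidates = []
    power = 1
    while power <= high:
        if power >= low:
            candidates.append(power)
        power *= 2
    return candidates
```

For 1,000,000 vectors this yields the candidates 128 through 8,192. Values toward the upper end follow the "higher recall, slower indexing" trade-off described above.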

Warnings with Large Untrained Queries

While training is technically optional (you can use CyborgDB without ever calling train), it is recommended that you do so once you have a large number of vectors in the index (e.g., > 50,000). If you don’t, and you call query, you will see a warning in the console, stating:
Warning: querying untrained index with more than 50000 indexed vectors.
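If you would rather train proactively than wait for this warning, an application can track the condition itself. The sketch below is an assumption-laden illustration: `should_train_before_query` is a hypothetical helper, and both inputs (the vector count and the trained flag) are values your own application would need to track; no CyborgDB call is used to obtain them here.

```python
# Threshold taken from the warning message above.
UNTRAINED_WARNING_THRESHOLD = 50_000


def should_train_before_query(n_vectors: int, is_trained: bool) -> bool:
    """True when a query would hit an untrained index holding
    more than 50,000 vectors, i.e. when the warning would appear."""
    return (not is_trained) and n_vectors > UNTRAINED_WARNING_THRESHOLD
```

When this returns True, calling index.train() before the next query avoids the warning.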

API Reference

For more information on training an encrypted index, refer to the API reference: