IVF* index types leverage clustering algorithms to segment the index into smaller sections for efficient querying. These clustering algorithms must be trained on the specific data being indexed in order to represent that data adequately.
Training Parameters
Parameters are available to customize the training process:

| Parameter | Type | Default | Description |
|---|---|---|---|
| n_lists | int | 0 | (Optional) Number of inverted index lists to create in the index. If 0, the value is auto-determined based on the number of vectors in the index. |
| batch_size | int | 0 | (Optional) Size of each batch for training. 0 auto-selects the batch size. |
| max_iters | int | 0 | (Optional) Maximum number of iterations for training. 0 auto-selects the iteration count. |
| tolerance | float | 1e-6 | (Optional) Convergence tolerance for training. |
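To make the defaults in the table concrete, here is a minimal Python sketch that collects the four parameters in one place and rejects obviously invalid values. `TrainParams` and `as_kwargs` are hypothetical names introduced for illustration and are not part of the CyborgDB SDK; the requirement that tolerance be positive is likewise an assumption.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainParams:
    """Hypothetical container mirroring the training parameters table."""
    n_lists: int = 0          # 0 = auto-determine from the number of vectors
    batch_size: int = 0       # 0 = auto-select the batch size
    max_iters: int = 0        # 0 = auto-select the iteration count
    tolerance: float = 1e-6   # convergence tolerance

    def __post_init__(self):
        # The integer parameters use 0 as "auto", so negatives are rejected.
        for name in ("n_lists", "batch_size", "max_iters"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 0:
                raise ValueError(f"{name} must be a non-negative int, got {value!r}")
        # Assumption: a convergence tolerance only makes sense if positive.
        if self.tolerance <= 0:
            raise ValueError(f"tolerance must be positive, got {self.tolerance!r}")

    def as_kwargs(self) -> dict:
        """Return the parameters as a plain dict, in declaration order."""
        return asdict(self)


# Override two parameters and leave the others on their defaults.
params = TrainParams(n_lists=2048, max_iters=50)
print(params.as_kwargs())
```

If the SDK's train() accepts these names as keyword arguments, a call such as `index.train(**params.as_kwargs())` would pass them through; check the Python SDK reference for the actual signature before relying on this.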
n_lists is the number of clusters into which each vector in the index can be categorized. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a good rule of thumb, n_lists should be:
- A power of 2 (e.g., 2,048 or 4,096). Not a requirement, but yields performance optimizations.
- Sized so that each cluster holds between 100 and 10,000 vectors; n_lists should therefore be roughly between 1/10,000 and 1/100 of the total number of items which will be indexed.
If n_lists is left at its default of 0, the n_lists value is auto-determined based on the number of vectors in the index.
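The rule of thumb for n_lists can be sketched as a small Python helper. `suggest_n_lists` and its `target_cluster_size` parameter are hypothetical, introduced here for illustration only; this is not how CyborgDB auto-determines the value, just one way to turn the two guidelines into a number.

```python
import math


def suggest_n_lists(num_vectors: int, target_cluster_size: int = 1_000) -> int:
    """Suggest a power-of-2 n_lists so clusters hold roughly 100-10,000 vectors.

    Hypothetical helper: target_cluster_size is an illustrative choice,
    not a CyborgDB default.
    """
    if num_vectors < 100:
        # Too few vectors to fill even one cluster at the minimum size.
        return 1
    # Bounds from the rule of thumb: 1/10,000 to 1/100 of the vector count.
    lower = max(1, math.ceil(num_vectors / 10_000))
    upper = max(1, num_vectors // 100)
    # Ideal number of clusters for the requested cluster size.
    ideal = max(1, num_vectors / target_cluster_size)
    # Nearest power of 2 (on a log scale) to the ideal value.
    candidate = 2 ** round(math.log2(ideal))
    # Nudge the candidate back inside the bounds, staying on powers of 2.
    while candidate > upper and candidate > 1:
        candidate //= 2
    while candidate < lower:
        candidate *= 2
    return candidate


print(suggest_n_lists(1_000_000))  # 1024, about 977 vectors per cluster
print(suggest_n_lists(50_000))     # 64, about 781 vectors per cluster
```

Aiming for the middle of the 100 to 10,000 range leaves headroom in both directions as the index grows; pick a smaller `target_cluster_size` if you prefer more, smaller clusters.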
Warnings with Large Untrained Queries
While training is technically optional (you can use CyborgDB without ever calling train), it is recommended once the index holds a large number of vectors (e.g., > 50,000). If you don't train and then call query, you will see a warning in the console.
API Reference
For more information on training an encrypted index, refer to the API reference:
- REST API Reference: REST API reference for /v1/indexes/train
- Python SDK Reference: API reference for train() in Python
- JS/TS SDK Reference: API reference for train() in JavaScript/TypeScript
- Go SDK Reference: API reference for Train() in Go