Training Parameters
Parameters are available to customize the training process:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `n_lists` | int | None | (Optional) Number of inverted index lists to create in the index. When None, auto-determines based on the number of vectors in the index. |
| `batch_size` | int | None | (Optional) Size of each batch for training. When None, defaults to 0 (auto-selected). |
| `max_iters` | int | None | (Optional) Maximum number of iterations for training. When None, defaults to 100. |
| `tolerance` | float | None | (Optional) Convergence tolerance for training. When None, defaults to 1e-6. |
| `max_memory` | int | None | (Optional) Maximum memory to use for training. When None, defaults to 0 (no limit). |
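As a rough illustration of how these parameters might be passed, the sketch below collects only the values you set explicitly, so that anything left as None falls back to the defaults listed in the table. `build_train_kwargs` is a hypothetical helper, not part of CyborgDB, and the commented-out call assumes that `train()` accepts these names as keyword arguments on an index object created elsewhere.

```python
from typing import Any, Dict

# Parameter names taken from the table above.
TRAIN_PARAMS = ("n_lists", "batch_size", "max_iters", "tolerance", "max_memory")


def build_train_kwargs(**overrides: Any) -> Dict[str, Any]:
    """Return only the training parameters that were explicitly set.

    Hypothetical helper (not part of CyborgDB): parameters left as None
    are omitted so that the library applies its own defaults.
    """
    unknown = set(overrides) - set(TRAIN_PARAMS)
    if unknown:
        raise TypeError(f"unknown training parameter(s): {sorted(unknown)}")
    return {name: value for name, value in overrides.items() if value is not None}


kwargs = build_train_kwargs(n_lists=4096, max_iters=50, tolerance=None)
print(kwargs)  # {'n_lists': 4096, 'max_iters': 50}

# Assumption: `index` is an encrypted index object created elsewhere,
# and train() accepts the table's parameter names as keywords.
# index.train(**kwargs)
```

Rejecting unknown names early catches typos such as `max_iter` before they reach the training call.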
`n_lists` is the number of clusters into which the vectors in the index are partitioned. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a rule of thumb:
- `n_lists` should be a power of two (e.g., 2,048 or 4,096). This is not a requirement, but it yields performance optimizations.
- Each cluster should hold between 100 and 10,000 vectors, so `n_lists` should be roughly between 1/10,000 and 1/100 of the total number of items that will be indexed.

If `n_lists` is left as None, it is auto-selected based on the number of vectors in the index.
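The rule of thumb above can be sketched as a small calculation. `suggest_n_lists` and its default of about 1,000 vectors per cluster are assumptions made for illustration; this is not how CyborgDB auto-selects `n_lists`.

```python
import math


def suggest_n_lists(num_vectors: int, target_cluster_size: int = 1_000) -> int:
    """Suggest a power-of-two n_lists for a given index size.

    Hypothetical helper: aims for roughly `target_cluster_size` vectors
    per cluster, then rounds to the nearest power of two in log space.
    """
    if num_vectors < 1:
        raise ValueError("num_vectors must be positive")
    if target_cluster_size < 1:
        raise ValueError("target_cluster_size must be positive")
    # Never go below a single cluster, even for very small indexes.
    ideal = max(1.0, num_vectors / target_cluster_size)
    return 2 ** round(math.log2(ideal))


print(suggest_n_lists(1_000_000))  # 1024
print(suggest_n_lists(50_000))     # 64
```

With the default target, rounding to a power of two shifts the cluster size by at most a factor of about 1.4, so it stays well inside the 100 to 10,000 range.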
Avoid the large-untrained-query warning
While training is technically optional (you can use CyborgDB without ever calling `train`), it is recommended once the index holds a large number of vectors (e.g., more than 50,000). If you call `query` on a large untrained index, a warning is printed to the console.
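One way to act on this guidance in application code is a simple threshold check, sketched below. `should_train` is a hypothetical helper: it assumes your application tracks the vector count and whether the index has already been trained, and it uses the 50,000 figure from the text as an example threshold rather than a library constant.

```python
TRAIN_THRESHOLD = 50_000  # example threshold from the text, not a library constant


def should_train(num_vectors: int, already_trained: bool,
                 threshold: int = TRAIN_THRESHOLD) -> bool:
    """Return True when an untrained index has grown past the threshold."""
    return not already_trained and num_vectors > threshold


print(should_train(10_000, already_trained=False))  # False
print(should_train(75_000, already_trained=False))  # True
print(should_train(75_000, already_trained=True))   # False

# Assumption: `index` is an encrypted index object created elsewhere.
# if should_train(num_vectors, already_trained):
#     index.train()
```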
API Reference
For more information on training an encrypted index, refer to the API reference:
- Python API Reference: API reference for `train()` in Python
- C++ API Reference: API reference for `TrainIndex()` in C++