Train an Encrypted Index

CyborgDB uses IVF* index types, which leverage clustering algorithms to segment the index into smaller sections for efficient querying. These clustering algorithms must be trained on the specific data being indexed in order to adequately represent that data. In CyborgDB Service, training is handled automatically after 10,000 vectors have been upserted. However, you can explicitly trigger training once enough vectors have been added, if you wish to specify training parameters.

You can adjust the number of vectors that will trigger automatic training by setting the RETRAIN_THRESHOLD environment variable. See more in the Environment Variables guide.

# Train the encrypted index
index.train()

# Or train with a specific number of clusters
index.train(n_lists=128)

// Train the encrypted index
await index.train();

// Or train with a specific number of clusters
await index.train({ nLists: 128 });

// Train the encrypted index
err := index.Train(context.Background(), cyborgdb.TrainParams{})

// Or train with a specific number of clusters
nLists := int32(128)
err = index.Train(context.Background(), cyborgdb.TrainParams{
    NLists: &nLists,
})

curl -X POST "http://localhost:8000/v1/indexes/train" \
     -H "X-API-Key: your-api-key" \
     -H "Content-Type: application/json" \
     -d '{
       "index_name": "my_index",
       "index_key": "your_64_character_hex_key_here"
     }'

Before you can call train, the index must contain at least 10,000 vectors, or at least 2 * n_lists vectors, ingested via upsert.
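The requirement above can be checked on the client before calling train. The following is a minimal sketch, assuming you track the number of upserted vectors yourself. `has_enough_vectors` is a hypothetical helper, not part of the CyborgDB SDK. It takes the stricter reading of the rule (both 10,000 and 2 * n_lists must be met), which may be more conservative than the check the service actually performs.

```python
from typing import Optional

MIN_TRAINING_VECTORS = 10_000  # figure quoted in this guide


def has_enough_vectors(num_vectors: int, n_lists: Optional[int] = None) -> bool:
    """Hypothetical pre-check, not an SDK call.

    Assumption: uses the stricter reading of the requirement, so the index
    must hold at least 10,000 vectors and at least 2 * n_lists vectors.
    """
    required = MIN_TRAINING_VECTORS
    if n_lists is not None:
        required = max(required, 2 * n_lists)
    return num_vectors >= required
```

A caller could guard the explicit training call with it, for example `if has_enough_vectors(count, n_lists=128): index.train(n_lists=128)`, where `count` is your own running total.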

Training Parameters

The following parameters customize the training process:

Parameter	Type	Default	Description
`n_lists`	`int`	`None` (auto)	(Optional) Number of inverted index lists to create in the index. When `None` or omitted, auto-determines based on the number of vectors in the index.
`batch_size`	`int`	`None`	(Optional) Number of vectors to process per training batch. When `None`, the server uses 2048.
`max_iters`	`int`	`None`	(Optional) Maximum number of training iterations. When `None`, the server uses 100.
`tolerance`	`float`	`None`	(Optional) Convergence tolerance for training completion. When `None`, the server uses 1e-6.
`max_memory`	`int`	`None` (0)	(Optional) Maximum memory usage in MB. When `None` or 0, no memory limit is applied.
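As a sketch of how these parameters might be combined in Python, the helper below collects only the values you set and leaves the rest out, so the defaults described in the table still apply. `build_train_kwargs` is a hypothetical helper, and the sketch assumes the Python SDK's `train()` accepts every parameter in the table as a keyword argument, as it does for `n_lists` in the example earlier in this guide.

```python
from typing import Any, Dict, Optional


def build_train_kwargs(
    n_lists: Optional[int] = None,
    batch_size: Optional[int] = None,
    max_iters: Optional[int] = None,
    tolerance: Optional[float] = None,
    max_memory: Optional[int] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: keep only the training parameters that were set."""
    candidates = {
        "n_lists": n_lists,
        "batch_size": batch_size,
        "max_iters": max_iters,
        "tolerance": tolerance,
        "max_memory": max_memory,
    }
    # Omitted (None) values are dropped so the documented defaults are used.
    return {name: value for name, value in candidates.items() if value is not None}


# Usage, assuming `index` is an existing encrypted index object:
# index.train(**build_train_kwargs(n_lists=128, max_iters=50))
```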

n_lists is the number of clusters into which the vectors in the index are partitioned. Typically, a higher value yields higher recall but a slower indexing process. As a rule of thumb, n_lists should be:

A power of 2 (e.g., 2,048 or 4,096). This is not a requirement, but it enables performance optimizations.
Sized so that each cluster holds between 100 and 10,000 vectors, which puts n_lists roughly between 1/10,000 and 1/100 of the total number of vectors to be indexed.

If not specified, CyborgDB will auto-determine the best n_lists value based on the number of vectors in the index.
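If you prefer to choose n_lists yourself, the rule of thumb above can be turned into a small calculation. This is an illustrative heuristic only, not the logic CyborgDB uses when it auto-determines the value. `suggest_n_lists` and its `vectors_per_cluster` parameter are hypothetical names, and the default of 1,000 vectors per cluster is an assumed mid-range target.

```python
import math


def suggest_n_lists(total_vectors: int, vectors_per_cluster: int = 1_000) -> int:
    """Illustrative heuristic, not CyborgDB's auto-selection logic.

    Aims for roughly `vectors_per_cluster` vectors in each cluster, then
    rounds the resulting cluster count to the nearest power of 2.
    """
    if total_vectors <= 0:
        raise ValueError("total_vectors must be positive")
    if not 100 <= vectors_per_cluster <= 10_000:
        raise ValueError("vectors_per_cluster should be between 100 and 10,000")
    target = max(1, total_vectors / vectors_per_cluster)
    # Round in log space so the result is the nearest power of 2.
    return 2 ** round(math.log2(target))
```

For an index expected to hold 100,000 vectors, this suggests 128 lists (about 780 vectors per cluster), which could then be passed as `index.train(n_lists=128)`.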

Warnings with Large Untrained Queries

Training is technically optional (you can use CyborgDB without ever calling train), but it is recommended once the index holds a large number of vectors (e.g., more than 50,000). If you call query on an untrained index of that size, the console shows this warning:

Warning: querying untrained index with more than 50000 indexed vectors.
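To avoid the warning, an application can decide to train before it queries. The following is a minimal sketch, assuming the application tracks the vector count and whether training has run. `should_train_before_query` is a hypothetical helper, not an SDK call, and the exact boundary is inferred from the wording of the warning ("more than 50000").

```python
UNTRAINED_WARNING_THRESHOLD = 50_000  # taken from the warning message


def should_train_before_query(num_vectors: int, is_trained: bool) -> bool:
    """Hypothetical check: True when a query would hit the untrained-index warning."""
    return (not is_trained) and num_vectors > UNTRAINED_WARNING_THRESHOLD
```

When it returns `True`, calling `index.train()` before the next query should keep the index out of the situation the warning describes.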

API Reference

For more information on training an encrypted index, refer to the API reference:

REST API Reference

REST API reference for /v1/indexes/train

Python SDK Reference

API reference for train() in Python

JS/TS SDK Reference

API reference for train() in JavaScript/TypeScript

Go SDK Reference

API reference for Train() in Go