# Train an Encrypted Index

CyborgDB uses a DiskIVF index, which leverages clustering algorithms to segment the index into smaller sections for efficient querying. These clustering algorithms must be trained on the specific data being indexed in order to adequately represent that data.

In the embedded library version of CyborgDB, training does not run automatically. You must call it explicitly once enough vectors have been added:

<CodeGroup>
  ```python Python icon="python" theme={null}
  # Train the encrypted index
  index.train()

  # Train the index with a specific number of clusters
  index.train(n_lists=1024)

  # Train the index with specific configuration
  index.train(
      n_lists=1024, 
      batch_size=64, 
      max_iters=100
  )
  ```

  ```cpp C++ icon="brackets-curly" theme={null}
  // Train the encrypted index with default configuration
  index->TrainIndex(cyborg::TrainingConfig{}, index_key);

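  // Train the index with specific configuration
  // (arguments assumed to follow the parameter table order:
  //  n_lists, batch_size, max_iters, tolerance, max_memory)
  cyborg::TrainingConfig config(1024, 0, 100, 1e-6, 0);
  index->TrainIndex(config, index_key);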
  ```
</CodeGroup>

<Tip>Before you can call `train`, the index must contain at least `2 * n_lists` or `10,000` vectors, whichever is greater. Vectors are ingested via `upsert`.</Tip>
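
The minimum described in the tip above can be checked before calling `train`. The following is a minimal sketch in plain Python; `min_vectors_for_training` and `can_train` are hypothetical helpers (not part of CyborgDB), and it assumes you track the number of upserted vectors yourself:

```python
def min_vectors_for_training(n_lists: int, floor: int = 10_000) -> int:
    """Return the smallest vector count that satisfies the rule
    'at least 2 * n_lists or 10,000, whichever is greater'."""
    if n_lists <= 0:
        raise ValueError("n_lists must be a positive integer")
    return max(2 * n_lists, floor)


def can_train(num_vectors: int, n_lists: int) -> bool:
    """True if num_vectors meets the minimum for the given n_lists."""
    return num_vectors >= min_vectors_for_training(n_lists)
```

For example, with `n_lists=1024` the `10,000` floor dominates, while with `n_lists=8192` the `2 * n_lists` term does.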

## Training Parameters

The following parameters customize the training process:

| Parameter    | Type    | Default | Description                                                                                                                                   |
| ------------ | ------- | ------- | --------------------------------------------------------------------------------------------------------------------------------------------- |
| `n_lists`    | `int`   | `None`  | *(Optional)* Number of inverted index lists to create in the index. When `None`, auto-determines based on the number of vectors in the index. |
| `batch_size` | `int`   | `None`  | *(Optional)* Size of each batch for training. When `None`, defaults to `0` (auto-selected).                                                   |
| `max_iters`  | `int`   | `None`  | *(Optional)* Maximum number of iterations for training. When `None`, defaults to `100`.                                                       |
| `tolerance`  | `float` | `None`  | *(Optional)* Convergence tolerance for training. When `None`, defaults to `1e-6`.                                                             |
| `max_memory` | `int`   | `None`  | *(Optional)* Maximum memory to use for training. When `None`, defaults to `0` (no limit).                                                     |

`n_lists` is the number of clusters into which the vectors in the index are partitioned. Typically, the higher the value, the higher the recall (but also the slower the indexing process). As a rule of thumb:

* Choose a power of 2 (e.g., `2,048`, `4,096`). This is not a requirement, but it enables performance optimizations.
* Aim for between `100` and `10,000` vectors per cluster, so `n_lists` should fall roughly between `1/10,000` and `1/100` of the total number of vectors to be indexed.

If not provided, `n_lists` will be auto-selected based on the number of vectors in the index.
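
If you prefer to pick `n_lists` yourself, the rule of thumb above can be turned into a small calculation. This is a sketch only, not CyborgDB's auto-selection logic: `suggest_n_lists` is a hypothetical helper, and the default target of `1,000` vectors per cluster is an assumption chosen from inside the `100` to `10,000` range:

```python
import math


def suggest_n_lists(num_vectors: int, target_cluster_size: int = 1_000) -> int:
    """Pick a power of 2 that keeps clusters between 100 and 10,000 vectors.

    target_cluster_size is an illustrative assumption, not a CyborgDB default.
    """
    if num_vectors < 100:
        raise ValueError("need at least 100 vectors for a 100-vector cluster")

    # Bounds implied by the rule of thumb: 1/10,000 to 1/100 of the total.
    lowest = max(1, math.ceil(num_vectors / 10_000))
    highest = max(1, num_vectors // 100)

    # Nearest power of 2 to the ideal list count for the target cluster size.
    ideal = max(1.0, num_vectors / target_cluster_size)
    n_lists = 2 ** round(math.log2(ideal))

    # Nudge back into the allowed range if the target pushed it outside.
    while n_lists > highest and n_lists > 1:
        n_lists //= 2
    while n_lists < lowest:
        n_lists *= 2
    return n_lists
```

The result can then be passed as `index.train(n_lists=...)`. Remember that the index must also hold at least `2 * n_lists` vectors (and no fewer than `10,000`) before training.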

## Avoid the large-untrained-query warning

Training is technically optional: you can use CyborgDB without ever calling `train`. However, we recommend training once the index holds a large number of vectors (e.g., `> 50,000`). If you call `query` on an untrained index above that size, you will see this warning in the console:

```
Warning: querying untrained index with more than 50000 indexed vectors.
```
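
One way to stay clear of this warning is to train as soon as the vector count crosses the threshold. The sketch below assumes you track the vector count and trained state in your own application; `maybe_train` is a hypothetical helper, and the only CyborgDB call it uses is `index.train()` as shown on this page:

```python
# Threshold taken from the warning message above.
UNTRAINED_WARNING_THRESHOLD = 50_000


def maybe_train(index, num_vectors: int, already_trained: bool) -> bool:
    """Train the index once it is large enough to trigger the warning.

    Returns True if the index is trained after this call.
    num_vectors and already_trained are tracked by the caller (assumption).
    """
    if already_trained:
        return True
    if num_vectors > UNTRAINED_WARNING_THRESHOLD:
        index.train()  # default configuration; n_lists is auto-selected
        return True
    return False
```

Calling this after each batch of `upsert` operations keeps the check cheap, since it only compares two values until the threshold is crossed.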

## API Reference

For more information on training an encrypted index, refer to the API reference:

<CardGroup cols={2}>
  <Card title="Python API Reference" href="../../python/encrypted-index/train" icon="python">
    API reference for `train()` in Python
  </Card>

  <Card title="C++ API Reference" href="../../cpp/encrypted-index/train" icon="brackets-curly">
    API reference for `TrainIndex()` in C++
  </Card>
</CardGroup>
