Index configuration is handled automatically by default. This guide shows how to override those defaults to customize index behavior and performance characteristics.
CyborgDB uses a single index type: DiskIVF, a disk-backed inverted-file index. Rather than choosing between several index variants, you tune one index with a small set of knobs. There are no longer separate IVFFlat, IVFPQ, or IVFSQ variants to choose between; a single DiskIVF index covers all of these use cases.
DiskIVF retrieves results in two stages: a fast first stage narrows down candidates using a compact Product-Quantized (PQ) representation, then a rerank stage recomputes exact distances against the stored vectors (in float32 or float16) to deliver high recall. This gives you the speed of a quantized index with the accuracy of an exact rerank, all within a single index that scales to disk.
You configure DiskIVF at creation time (dimension, storage precision, metric), at training time (clustering parameters), and at query time (n_probes, rerank_mult).
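To make the two-stage idea concrete, here is a minimal, self-contained sketch of retrieve-then-rerank in plain Python. It is not CyborgDB's implementation: the `compress` function is a hypothetical stand-in that snaps values to a coarse grid instead of using real Product Quantization, and every function name here is invented for illustration.

```python
import math
import random


def exact_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def compress(vec, step=0.5):
    """Crude lossy code: snap each value to a coarse grid (stand-in for PQ)."""
    return [round(x / step) * step for x in vec]


def two_stage_search(query, vectors, top_k, rerank_mult):
    # A real index would build the compressed codes once, not per query.
    codes = [compress(v) for v in vectors]

    # Stage 1: rank by approximate distance computed on the compressed codes
    # and keep rerank_mult * top_k candidates.
    n_candidates = min(len(vectors), rerank_mult * top_k)
    by_approx = sorted(range(len(vectors)),
                       key=lambda i: exact_distance(query, codes[i]))
    candidates = by_approx[:n_candidates]

    # Stage 2: rerank only those candidates using the stored exact vectors.
    reranked = sorted(candidates,
                      key=lambda i: exact_distance(query, vectors[i]))
    return reranked[:top_k]


random.seed(0)
vectors = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(200)]
query = [random.uniform(-1, 1) for _ in range(8)]

result = two_stage_search(query, vectors, top_k=5, rerank_mult=10)
brute_force = sorted(range(len(vectors)),
                     key=lambda i: exact_distance(query, vectors[i]))[:5]
print("two-stage:  ", result)
print("brute force:", brute_force)
```

The larger the stage-1 candidate pool, the more likely the true nearest neighbors survive to the rerank stage. In this sketch, once the pool covers every vector, the result matches brute force exactly.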

Creating a DiskIVF Index

The most common path is to let CyborgDB choose sensible defaults. You only need an index name and a 32-byte key; the dimension is auto-detected from your first upsert (or derived from embedding_model if you provide one).
import cyborgdb_core as cyborgdb
import secrets

api_key = "your_api_key_here"  # Replace with your CyborgDB API key

client = cyborgdb.Client(api_key, cyborgdb.StorageConfig.memory())

index_key = secrets.token_bytes(32)  # 32-byte index KEK

# Create a DiskIVF index with defaults (dimension auto-detected on first upsert)
index = client.create_index("test_index", index_key)

Creation Parameters

You can override the defaults at creation time:
  • dimension: vector dimensionality. Optional — auto-detected from the first upsert, or derived from embedding_model if provided.
  • storage_precision: the on-disk dtype used for the rerank vectors. float32 (default) gives the highest recall; float16 roughly halves disk footprint with a slight precision loss. Acceptable values are numpy.float32 / numpy.float16 (or the strings "float32" / "float16") in Python, and StoragePrecision::Float32 / StoragePrecision::Float16 in C++.
  • embedding_model: an optional sentence-transformers model name (Python only) that enables automatic embedding generation and fixes the dimension.
  • metric: the distance metric — "euclidean" (default), "cosine", or "squared_euclidean".
import cyborgdb_core as cyborgdb
import numpy as np
import secrets

index_key = secrets.token_bytes(32)

# Create a DiskIVF index with explicit configuration
# (assumes `client` was created as in the previous example)
index = client.create_index(
    "test_index",
    index_key,
    dimension=768,
    storage_precision=np.float16,   # halve disk footprint vs. float32
    metric="cosine"
)
Use float16 storage precision when disk footprint matters more than the last fraction of a percent of recall. For most workloads the recall difference is negligible.
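As a rough back-of-the-envelope check on the footprint claim, the sketch below estimates the raw size of the rerank vectors at each precision and shows the rounding that half precision introduces. It counts only the vector payload (4 bytes per float32 value, 2 per float16) and ignores PQ codes, metadata, and any encryption overhead, so treat the numbers as an approximation. The helper names are hypothetical and not part of CyborgDB.

```python
import struct

BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def rerank_storage_bytes(n_vectors, dimension, storage_precision):
    """Raw payload size of the stored rerank vectors, in bytes."""
    return n_vectors * dimension * BYTES_PER_VALUE[storage_precision]


def round_trip_float16(value):
    """Pack a value to IEEE 754 half precision and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


n_vectors, dimension = 1_000_000, 768
size_f32 = rerank_storage_bytes(n_vectors, dimension, "float32")
size_f16 = rerank_storage_bytes(n_vectors, dimension, "float16")
print(f"float32: {size_f32 / 1e9:.2f} GB")
print(f"float16: {size_f16 / 1e9:.2f} GB")

original = 0.1234567
stored = round_trip_float16(original)
print(f"{original} stored as float16 -> {stored} "
      f"(abs error {abs(original - stored):.2e})")
```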

Training Parameters

For datasets larger than ~50,000 vectors, you should train the index to build the IVF clustering. Training accepts several optional tuning parameters:
  • n_lists: the number of clusters (inverted lists). 0 (default) auto-selects a value based on the dataset size. More lists make each list smaller (faster, finer-grained search) but require more n_probes at query time to maintain recall.
  • max_iters: maximum k-means iterations (default 100).
  • tolerance: convergence tolerance for k-means (default 1e-6).
  • max_memory: a soft cap (in MB) on memory used during training; 0 (default) means no limit.
  • batch_size: training batch size; 0 (default) lets CyborgDB choose automatically.
# Train the index with custom clustering parameters
index.train(
    n_lists=4096,
    max_iters=100,
    tolerance=1e-6,
    max_memory=0      # no limit
)
For more on the training lifecycle, see Training an Encrypted Index.
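To show what max_iters and tolerance control, here is a minimal k-means (Lloyd's algorithm) sketch in plain Python. It is an illustration only: CyborgDB's training routine is not shown in this guide, and the stopping rule used here (stop once the largest centroid movement is at or below tolerance) is an assumption about how such a parameter typically behaves.

```python
import math
import random


def kmeans(points, n_lists, max_iters=100, tolerance=1e-6, seed=0):
    """Cluster points into n_lists groups; returns (centroids, iterations used)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, n_lists)]
    iteration = 0
    for iteration in range(1, max_iters + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(n_lists)]
        for p in points:
            nearest = min(range(n_lists),
                          key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)

        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    [sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])

        # Convergence check: stop when no centroid moved more than tolerance.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift <= tolerance:
            break
    return centroids, iteration


rng = random.Random(1)
points = [[rng.gauss(0, 0.1), rng.gauss(0, 0.1)] for _ in range(50)]
points += [[rng.gauss(5, 0.1), rng.gauss(5, 0.1)] for _ in range(50)]

centroids, iterations_used = kmeans(points, n_lists=2)
print(f"stopped after {iterations_used} iteration(s)")
```

A looser tolerance or a lower max_iters ends training sooner at the cost of less settled centroids; max_iters acts as a hard upper bound either way.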

Query-Time Parameters

DiskIVF exposes two knobs at query time that trade recall against latency:
  • n_probes: how many clusters to search per query. Higher values increase recall at some latency cost. 0 (default) auto-selects based on n_lists.
  • rerank_mult: the stage-1 retrieval multiplier. CyborgDB first retrieves rerank_mult * top_k candidates using the compact PQ representation, then reranks them against the stored float32/float16 vectors. Higher values improve recall at some latency cost (default 10).
# Tune recall vs. latency at query time
# (the query vector must match the index dimension; 4 values are shown for brevity)
results = index.query(
    query_vectors=[0.5, 0.9, 0.2, 0.7],
    top_k=10,
    n_probes=32,
    rerank_mult=10
)
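The arithmetic behind these two knobs can be sketched as follows. This is an estimate under a simplifying assumption that vectors are spread evenly across lists, which real clusterings only approximate; the function and its field names are hypothetical and not part of the CyborgDB API.

```python
def query_cost_estimate(n_vectors, n_lists, n_probes, top_k, rerank_mult):
    """Rough per-query work, assuming evenly sized inverted lists."""
    avg_list_size = n_vectors / n_lists
    # Stage 1 scans the compact codes in the probed lists only.
    scanned = min(n_vectors, int(n_probes * avg_list_size))
    # Stage 2 reranks at most rerank_mult * top_k candidates exactly.
    stage1_candidates = min(scanned, rerank_mult * top_k)
    return {
        "avg_list_size": avg_list_size,
        "fraction_of_lists_probed": n_probes / n_lists,
        "vectors_scanned": scanned,
        "stage1_candidates": stage1_candidates,
    }


estimate = query_cost_estimate(
    n_vectors=1_000_000, n_lists=4096, n_probes=32, top_k=10, rerank_mult=10
)
for name, value in estimate.items():
    print(f"{name}: {value}")
```

Raising n_probes grows the scanned set, while raising rerank_mult grows the number of exact distance computations in the rerank stage.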

Customizing Distance Metrics

By default, CyborgDB uses euclidean distance. You can override this by providing a metric parameter at index creation:
# Existing setup ...

index = client.create_index(
    "index_name",
    index_key,
    metric="cosine"
)
The currently supported distance metrics are:
  • "cosine": Cosine similarity.
  • "euclidean": Euclidean distance.
  • "squared_euclidean": Squared Euclidean distance.
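For reference, the three metrics can be written out in plain Python as below. These are the textbook definitions, given as a sketch rather than CyborgDB's internal code; in particular, how CyborgDB converts cosine similarity into a ranking score is not specified in this guide.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Straight-line distance: square root of the squared distance."""
    return math.sqrt(squared_euclidean(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different magnitude
print("squared_euclidean:", squared_euclidean(a, b))
print("euclidean:", euclidean(a, b))
print("cosine_similarity:", cosine_similarity(a, b))
```

Squared Euclidean distance skips the square root but orders neighbors the same way as Euclidean distance. Cosine similarity ignores vector magnitude, so it suits embeddings where only direction carries meaning.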

API Reference

For more information on configuring an encrypted index, refer to the API Reference:

Python API Reference

API reference for StoragePrecision and index types in Python

C++ API Reference

API reference for IndexDiskIVF and StoragePrecision in C++