Index configuration is handled automatically by default. This guide shows how to override those defaults to customize index behavior and performance characteristics.
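DiskIVF retrieves candidates using a compact PQ representation, then reranks them against the stored full-precision vectors (float32 or float16) to deliver high recall. This gives you the speed of a quantized index with the accuracy of an exact rerank, all within a single index that scales to disk.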
There are no longer any IVFFlat / IVFPQ / IVFSQ variants to choose between — a single DiskIVF index covers all of these use cases. You configure it at creation time (dimension, storage precision, metric), at training time (clustering parameters), and at query time (n_probes, rerank_mult).
Creating a DiskIVF Index
The most common path is to let CyborgDB choose sensible defaults. You only need an index name and a 32-byte key; the dimension is auto-detected from your first upsert (or derived from embedding_model if you provide one).
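A 32-byte key can be generated with Python's standard library. The sketch below only prepares the two required inputs; the index name is a placeholder, and the create_index call shown in the comment is an assumption about the client API rather than a confirmed signature, so check the API Reference before relying on it.

```python
import secrets

# Generate a cryptographically secure 32-byte index key.
index_key = secrets.token_bytes(32)

# Hypothetical index name, used for illustration only.
index_name = "my_index"

# With only these two values, every other setting falls back to its default
# and the dimension is detected from the first upsert.
# Assumed call shape (not confirmed by this guide):
#   index = client.create_index(index_name, index_key)
print(len(index_key))  # prints 32
```

Store the key somewhere safe; a freshly generated key cannot be regenerated later.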
Creation Parameters
You can override the defaults at creation time:
- dimension: vector dimensionality. Optional; auto-detected from the first upsert, or derived from embedding_model if provided.
- storage_precision: the on-disk dtype used for the rerank vectors. float32 (default) gives the highest recall; float16 roughly halves disk footprint with a slight precision loss. Acceptable values are numpy.float32 / numpy.float16 (or the strings "float32" / "float16") in Python, and StoragePrecision::Float32 / StoragePrecision::Float16 in C++.
- embedding_model: an optional sentence-transformers model name (Python only) that enables automatic embedding generation and fixes the dimension.
- metric: the distance metric, one of "euclidean" (default), "cosine", or "squared_euclidean".
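To get a feel for what storage_precision means on disk, the following back-of-the-envelope sketch estimates the raw size of the rerank vectors alone. The helper name rerank_storage_bytes is hypothetical, and the estimate ignores the compressed codes and any index overhead, so treat it as a rough lower bound rather than CyborgDB's actual footprint.

```python
# Bytes per stored component for each storage precision named above.
BYTES_PER_COMPONENT = {"float32": 4, "float16": 2}


def rerank_storage_bytes(n_vectors: int, dimension: int,
                         storage_precision: str = "float32") -> int:
    """Estimate raw bytes for the rerank vectors (hypothetical helper)."""
    if storage_precision not in BYTES_PER_COMPONENT:
        raise ValueError(f"unsupported storage_precision: {storage_precision!r}")
    return n_vectors * dimension * BYTES_PER_COMPONENT[storage_precision]


# Example: one million 768-dimensional vectors.
full = rerank_storage_bytes(1_000_000, 768, "float32")
half = rerank_storage_bytes(1_000_000, 768, "float16")
print(f"float32: {full / 1e9:.2f} GB, float16: {half / 1e9:.2f} GB")
```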
Training Parameters
For datasets larger than ~50,000 vectors, you should train the index to build the IVF clustering. Training accepts several optional tuning parameters:
- n_lists: the number of clusters (inverted lists). 0 (default) auto-selects a value based on the dataset size. More lists make each list smaller (faster, finer-grained search) but require more n_probes at query time to maintain recall.
- max_iters: maximum k-means iterations (default 100).
- tolerance: convergence tolerance for k-means (default 1e-6).
- max_memory: a soft cap (in MB) on memory used during training; 0 (default) means no limit.
- batch_size: training batch size; 0 (default) lets CyborgDB choose automatically.
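One way to keep these settings together is a small container that mirrors the defaults listed above. The TrainParams class and should_train helper are illustrative names, not part of CyborgDB, and the train call in the final comment is an assumed shape.

```python
from dataclasses import dataclass, asdict

# Threshold from this guide: train once the dataset exceeds roughly 50,000 vectors.
TRAIN_THRESHOLD = 50_000


@dataclass
class TrainParams:
    """Training defaults as listed above; 0 means 'let CyborgDB decide'."""
    n_lists: int = 0
    max_iters: int = 100
    tolerance: float = 1e-6
    max_memory: int = 0   # in MB; 0 means no limit
    batch_size: int = 0


def should_train(n_vectors: int) -> bool:
    """Return True when the dataset is large enough to warrant training."""
    return n_vectors > TRAIN_THRESHOLD


# Override only the values you care about; the rest keep their defaults.
params = TrainParams(n_lists=1024, max_memory=4096)
train_kwargs = asdict(params)
print(train_kwargs)
# Assumed call shape (not confirmed by this guide): index.train(**train_kwargs)
```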
Query-Time Parameters
DiskIVF exposes two knobs at query time that trade recall against latency:
- n_probes: how many clusters to search per query. Higher values increase recall at some latency cost. 0 (default) auto-selects based on n_lists.
- rerank_mult: the stage-1 retrieval multiplier. CyborgDB first retrieves rerank_mult * top_k candidates using the compact PQ representation, then reranks them against the stored float32 / float16 vectors. Higher values improve recall at some latency cost (default 10).
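The effect of rerank_mult can be illustrated with a toy two-stage search in pure Python. This is not CyborgDB's implementation: grid snapping stands in for PQ compression, there is no clustering (so n_probes is not modeled), and all function names are hypothetical.

```python
import heapq
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def coarse(v, step=0.5):
    """Crude stand-in for PQ compression: snap each component to a grid."""
    return [round(x / step) * step for x in v]


def two_stage_search(query, vectors, top_k, rerank_mult=10):
    # Stage 1: rank by approximate distance on the compressed vectors,
    # keeping rerank_mult * top_k candidates.
    n_candidates = min(len(vectors), rerank_mult * top_k)
    compressed = [coarse(v) for v in vectors]
    candidates = heapq.nsmallest(
        n_candidates, range(len(vectors)),
        key=lambda i: sq_dist(query, compressed[i]),
    )
    # Stage 2: rerank those candidates against the full-precision vectors.
    return heapq.nsmallest(
        top_k, candidates, key=lambda i: sq_dist(query, vectors[i])
    )


random.seed(0)
data = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(500)]
query = [random.uniform(-1, 1) for _ in range(8)]

exact = heapq.nsmallest(5, range(len(data)), key=lambda i: sq_dist(query, data[i]))
approx = two_stage_search(query, data, top_k=5, rerank_mult=10)
recall = len(set(exact) & set(approx)) / 5
print(f"recall@5 with rerank_mult=10: {recall:.2f}")
```

A larger rerank_mult widens the stage-1 candidate pool, so a true neighbor that the compressed representation ranks poorly is more likely to reach the exact rerank, at the cost of more full-precision distance computations.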
Customizing Distance Metrics
By default, CyborgDB uses Euclidean distance. You can override this by providing a metric parameter at index creation:
- "cosine": Cosine similarity.
- "euclidean": Euclidean distance.
- "squared_euclidean": Squared Euclidean distance.
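For reference, the three metrics correspond to the standard textbook definitions sketched below. These are plain-Python illustrations, not CyborgDB's internal code, and they do not show how CyborgDB turns cosine similarity into a ranking score.

```python
import math


def squared_euclidean(a, b):
    """Sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def euclidean(a, b):
    """Square root of the squared Euclidean distance."""
    return math.sqrt(squared_euclidean(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


a, b = [0.0, 0.0], [3.0, 4.0]
print(euclidean(a, b))                             # 5.0
print(squared_euclidean(a, b))                     # 25.0
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
```

Because the square root is monotonic, euclidean and squared_euclidean rank neighbors identically; they differ only in the distance values they produce. Cosine similarity ignores vector magnitude and compares direction only.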
API Reference
For more information on configuring an encrypted index, refer to the API Reference:
- Python API Reference: API reference for StoragePrecision and index types in Python
- C++ API Reference: API reference for IndexDiskIVF and StoragePrecision in C++