DBConfig

The DBConfig class specifies the storage location for the index, with options for in-memory storage, databases, or file-based storage.

Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| location | string | - | DB location (redis, postgres, memory, s3, gcs, local) |
| table_name | string | None | (Optional) Table name (postgres-only) |
| connection_string | string | None | (Optional) Connection string to access DB |
| bucket | string | None | (Optional) Bucket name for cloud storage (s3, gcs) |
| access_key | string | None | (Optional) Access key for cloud storage |
| secret_key | string | None | (Optional) Secret key for cloud storage |
| region | string | None | (Optional) Region for cloud storage |
| endpoint | string | None | (Optional) Custom endpoint for S3-compatible storage |
| path | string | None | (Optional) Path for local file storage |

The supported location options are:
  • "redis": Use for high-speed, in-memory storage (recommended for index_location)
  • "postgres": Use for reliable, SQL-based storage (recommended for config_location)
  • "memory": Use for temporary in-memory storage (for benchmarking and evaluation purposes)
  • "s3": Use for Amazon S3 or S3-compatible storage
  • "gcs": Use for Google Cloud Storage
  • "local": Use for local file system storage

Example Usage

from cyborgdb_core import DBConfig

# Redis configuration
index_location = DBConfig(
    location="redis",
    connection_string="redis://localhost:6379"
)

# PostgreSQL configuration
config_location = DBConfig(
    location="postgres",
    table_name="config_table",
    connection_string="host=localhost dbname=vectordb user=postgres"
)

# S3 configuration
s3_location = DBConfig(
    location="s3",
    bucket="my-vector-index",
    access_key="YOUR_ACCESS_KEY",
    secret_key="YOUR_SECRET_KEY",
    region="us-east-1"
)

# Memory configuration (for testing)
memory_location = DBConfig(location="memory")

Embeddings

The LangChain integration supports multiple embedding model types:

Supported Embedding Types

| Type | Description | Example |
| --- | --- | --- |
| str | Model name string for SentenceTransformers | "sentence-transformers/all-MiniLM-L6-v2" |
| SentenceTransformer | SentenceTransformer model instance | SentenceTransformer("all-MiniLM-L6-v2") |
| Embeddings | Any LangChain Embeddings implementation | OpenAIEmbeddings(), HuggingFaceEmbeddings() |

Example Usage

from sentence_transformers import SentenceTransformer
from langchain_openai import OpenAIEmbeddings
from cyborgdb_core.langchain import CyborgVectorStore

# Using model name string
store1 = CyborgVectorStore(
    index_name="docs",
    index_key=key,
    api_key="your-api-key",
    embedding="sentence-transformers/all-MiniLM-L6-v2",  # String model name
    index_location=DBConfig("memory"),
    config_location=DBConfig("memory")
)

# Using SentenceTransformer instance
model = SentenceTransformer("all-mpnet-base-v2")
store2 = CyborgVectorStore(
    index_name="docs",
    index_key=key,
    api_key="your-api-key",
    embedding=model,  # SentenceTransformer instance
    index_location=DBConfig("memory"),
    config_location=DBConfig("memory")
)

# Using LangChain Embeddings
openai_embeddings = OpenAIEmbeddings()
store3 = CyborgVectorStore(
    index_name="docs",
    index_key=key,
    api_key="your-api-key",
    embedding=openai_embeddings,  # LangChain Embeddings
    index_location=DBConfig("memory"),
    config_location=DBConfig("memory")
)

DistanceMetric

DistanceMetric is a string representing the distance metric used for the index. Options include:
  • "cosine": Cosine similarity (recommended for normalized embeddings)
  • "euclidean": Euclidean distance
  • "squared_euclidean": Squared Euclidean distance

Metric Characteristics

| Metric | Range | Best Match | Use Case |
| --- | --- | --- | --- |
| cosine | [0, 2] | 0 | Text embeddings, normalized vectors |
| euclidean | [0, ∞) | 0 | Raw feature vectors |
| squared_euclidean | [0, ∞) | 0 | When avoiding sqrt computation |
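
The ranges above follow directly from the metric definitions. A small plain-Python sketch (illustrative only, not part of the CyborgDB API) shows how the three metrics relate for a pair of orthogonal unit vectors:

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

a = [1.0, 0.0]
b = [0.0, 1.0]

dot = sum(x * y for x, y in zip(a, b))

# Cosine distance: 1 - cos(theta), range [0, 2]; 0 means identical direction
cosine = 1.0 - dot / (norm(a) * norm(b))

# Squared Euclidean skips the square root, so it is cheaper per comparison
squared_euclidean = sum((x - y) ** 2 for x, y in zip(a, b))

# Euclidean is its square root, range [0, inf)
euclidean = math.sqrt(squared_euclidean)

print(cosine, euclidean, squared_euclidean)  # 1.0 1.414... 2.0
```

Note that squared Euclidean ranks neighbors identically to Euclidean, which is why it is a safe substitute when avoiding the sqrt.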

IndexType

The index type determines the algorithm used for approximate nearest neighbor search.

Available Index Types

| Type | Description | Speed | Recall | Index Size |
| --- | --- | --- | --- | --- |
| "ivfflat" | Inverted file with flat storage | Fast | Highest | Biggest |
| "ivf" | Inverted file with compression | Fastest | Lowest | Smallest |
| "ivfpq" | Inverted file with product quantization | Fast | High | Medium |

Note: cyborgdb-lite only supports the "ivfflat" index type.

Example Usage

# IVFFlat index (highest recall)
store = CyborgVectorStore(
    index_name="high_recall_index",
    index_key=key,
    api_key="your-api-key",
    embedding="all-MiniLM-L6-v2",
    index_location=DBConfig("memory"),
    config_location=DBConfig("memory"),
    index_type="ivfflat",
    index_config_params={"n_lists": 1024}
)

# IVFPQ index (balanced performance)
store = CyborgVectorStore(
    index_name="balanced_index",
    index_key=key,
    api_key="your-api-key",
    embedding="all-MiniLM-L6-v2",
    index_location=DBConfig("memory"),
    config_location=DBConfig("memory"),
    index_type="ivfpq",
    index_config_params={
        "n_lists": 1024,
        "pq_dim": 64,
        "pq_bits": 8
    }
)

IndexConfigParams

Optional parameters for configuring the index, passed as a dictionary.

Parameters by Index Type

IVFFlat & IVF

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n_lists | int | 1024 | Number of inverted lists (clusters) |

IVFPQ

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| n_lists | int | 1024 | Number of inverted lists (clusters) |
| pq_dim | int | 8 | Dimensionality after product quantization |
| pq_bits | int | 8 | Bits per quantized dimension (1-16) |

Tuning Guidelines

  • n_lists: Use √n where n is the expected number of vectors. Common values: 256, 512, 1024, 2048
  • pq_dim: Should divide the embedding dimension evenly. Lower values = more compression
  • pq_bits: 8 bits provides good balance. Lower = more compression, higher = better accuracy
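
As a quick sanity check, the guidelines above can be applied numerically. This is plain Python arithmetic, not a CyborgDB call; the corpus size and embedding dimension are made-up example values:

```python
import math

n_vectors = 1_000_000   # expected corpus size (example value)
embedding_dim = 384     # e.g. all-MiniLM-L6-v2 output dimension

# n_lists ~ sqrt(n), rounded to the nearest power of two in the common range
n_lists = 2 ** round(math.log2(math.sqrt(n_vectors)))

# pq_dim must divide the embedding dimension evenly
valid_pq_dims = [d for d in (8, 16, 32, 48, 64, 96) if embedding_dim % d == 0]

print(n_lists)        # 1024
print(valid_pq_dims)  # [8, 16, 32, 48, 64, 96]
```

For a one-million-vector corpus this lands on n_lists=1024, matching the common values listed above.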

Document

LangChain Document object used for storing text with metadata.

Attributes

| Attribute | Type | Description |
| --- | --- | --- |
| page_content | str | The text content of the document |
| metadata | dict | Optional metadata associated with the document |

Example Usage

from langchain_core.documents import Document

# Create a document
doc = Document(
    page_content="This is the content of my document",
    metadata={
        "source": "manual",
        "author": "John Doe",
        "timestamp": "2024-01-01"
    }
)

# Add to vector store
store.add_documents([doc])

Filter Format

Metadata filters use a dictionary format for querying documents.

Simple Filters

# Exact match
filter = {"category": "technology"}

# Multiple conditions (AND)
filter = {
    "category": "technology",
    "year": 2024
}

Advanced Filters

# Range queries
filter = {
    "price": {"$gte": 100, "$lte": 500}
}

# IN queries
filter = {
    "tags": {"$in": ["python", "machine-learning"]}
}

# Nested fields
filter = {
    "metadata.author": "John Doe"
}

Supported Operators

| Operator | Description | Example |
| --- | --- | --- |
| $eq | Equal to | {"age": {"$eq": 25}} |
| $ne | Not equal to | {"status": {"$ne": "archived"}} |
| $gt | Greater than | {"price": {"$gt": 100}} |
| $gte | Greater than or equal | {"score": {"$gte": 0.8}} |
| $lt | Less than | {"quantity": {"$lt": 10}} |
| $lte | Less than or equal | {"rating": {"$lte": 5}} |
| $in | In array | {"tags": {"$in": ["ai", "ml"]}} |
| $nin | Not in array | {"category": {"$nin": ["draft", "deleted"]}} |
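
To make the operator semantics concrete, here is a minimal, illustrative evaluator for this filter format. It is plain Python written for this document, not CyborgDB's actual implementation, but it mirrors the rules above: bare values mean exact match, and multiple conditions combine with AND:

```python
# Map each operator to its comparison (illustrative, not the library's code)
OPS = {
    "$eq": lambda v, t: v == t,
    "$ne": lambda v, t: v != t,
    "$gt": lambda v, t: v > t,
    "$gte": lambda v, t: v >= t,
    "$lt": lambda v, t: v < t,
    "$lte": lambda v, t: v <= t,
    "$in": lambda v, t: v in t,
    "$nin": lambda v, t: v not in t,
}

def matches(metadata: dict, filter: dict) -> bool:
    """Return True if metadata satisfies every condition (implicit AND)."""
    for field, condition in filter.items():
        value = metadata.get(field)
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gte": 100, "$lte": 500}
            if not all(OPS[op](value, target) for op, target in condition.items()):
                return False
        elif value != condition:
            # Bare value means exact match
            return False
    return True

doc = {"category": "technology", "price": 250, "tag": "python"}
print(matches(doc, {"category": "technology"}))             # True
print(matches(doc, {"price": {"$gte": 100, "$lte": 500}}))  # True
print(matches(doc, {"tag": {"$in": ["python", "rust"]}}))   # True
print(matches(doc, {"category": {"$ne": "technology"}}))    # False
```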

Return Types

Query Results

Query operations return documents with optional scores:

# similarity_search returns List[Document]
docs = store.similarity_search("query", k=5)
# Returns: [Document(...), Document(...), ...]

# similarity_search_with_score returns List[Tuple[Document, float]]
results = store.similarity_search_with_score("query", k=5)
# Returns: [(Document(...), 0.95), (Document(...), 0.87), ...]

Score Normalization

Scores are normalized to the [0, 1] range, where:
  • 1.0 = Perfect match
  • 0.0 = Worst match
The normalization depends on the distance metric used.
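
For example, with cosine distance (range [0, 2]), a natural normalization is score = 1 - distance / 2. This is a sketch of the idea under that assumption, not necessarily the library's exact formula:

```python
def normalize_cosine(distance: float) -> float:
    """Map a cosine distance in [0, 2] to a similarity score in [0, 1]."""
    return 1.0 - distance / 2.0

print(normalize_cosine(0.0))  # 1.0  (perfect match)
print(normalize_cosine(2.0))  # 0.0  (worst match)
```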

Async Support

All methods have async variants prefixed with "a":

| Sync Method | Async Method |
| --- | --- |
| add_texts | aadd_texts |
| add_documents | aadd_documents |
| similarity_search | asimilarity_search |
| similarity_search_with_score | asimilarity_search_with_score |
| max_marginal_relevance_search | amax_marginal_relevance_search |
| delete | adelete |

Example Usage

import asyncio

async def main():
    # Async text addition
    ids = await store.aadd_texts(["async text 1", "async text 2"])
    
    # Async search
    docs = await store.asimilarity_search("query", k=5)
    
    # Async deletion
    success = await store.adelete(ids)

asyncio.run(main())
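
Because the async variants are coroutines, independent operations can also run concurrently. A small sketch (assuming `store` is an already-constructed CyborgVectorStore) using asyncio.gather:

```python
import asyncio

async def multi_search(store, queries, k=5):
    # Issue all searches at once and await them together,
    # instead of awaiting each one sequentially
    return await asyncio.gather(
        *(store.asimilarity_search(q, k=k) for q in queries)
    )

# Example call (store must already exist):
# results = asyncio.run(multi_search(store, ["query A", "query B"]))
```

Results come back in the same order as the input queries, one list of documents per query.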