Vector Databases

High-Performance Similarity Search for AI/ML

Overview

Vector databases are specialized databases designed for efficient similarity search on high-dimensional vector embeddings. They enable applications like semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and anomaly detection.

What are Vector Embeddings?

Embedding Concepts

Embedding Characteristics:

High-dimensional: Typically 384-3072 dimensions, depending on the model
Dense: Most values are non-zero
Semantic: Similar concepts are close together in vector space
Fixed-size: All embeddings from a given model have the same number of dimensions
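To make the "semantic" and "fixed-size" properties concrete, here is a minimal sketch using hand-made 4-dimensional vectors. The values in `toy_embeddings` are invented for illustration and do not come from any real embedding model; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Hypothetical toy embeddings (invented values, not model output).
toy_embeddings = {
    "cat":    [0.9, 0.8, 0.1, 0.0],
    "kitten": [0.8, 0.9, 0.2, 0.1],
    "car":    [0.1, 0.0, 0.9, 0.8],
}


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Fixed-size: every vector from the same "model" has the same length.
dimensions = {len(vec) for vec in toy_embeddings.values()}

# Semantic: related concepts should score higher than unrelated ones.
sim_related = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
sim_unrelated = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

print(f"cat vs kitten: {sim_related:.3f}")
print(f"cat vs car:    {sim_unrelated:.3f}")
```

In this toy data the related pair scores much closer to 1.0 than the unrelated pair, which is the behavior a vector database relies on when it ranks neighbors.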

Vector Database Comparison

Feature Comparison

| Feature | Pinecone | Milvus | pgvector | Weaviate |
|---------|----------|--------|----------|----------|
| Type | Managed Service | Open Source | Postgres Extension | Open Source |
| Deployment | Cloud only | Self-hosted/Cloud | Self-hosted or managed Postgres | Self-hosted/Cloud |
| Index Types | HNSW | IVF, HNSW, FLAT | IVFFlat, HNSW | HNSW |
| Max Dimensions | 20,000 | 32,768 | 2,000 (indexed) | Unlimited |
| Metadata | Yes | Dynamic schema | JSONB columns | Schema |
| Knowledge Graph | No | No | No | Yes |
| Scalability | Auto-scaling | Horizontal scaling | Vertical scaling | Horizontal scaling |
| Pricing | Pay-per-use | Free (self-hosted) | Free (self-hosted) | Free tier |
| Best For | RAG, production | On-premises, K8s | PostgreSQL shops | Knowledge graphs |

Use Case Selection

Distance Metrics

Metric Comparison

| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Cosine | 1 - cos(θ) | 0-2 | Normalized vectors (OpenAI) |
| Dot Product | -A·B | -∞ to ∞ | Normalized vectors, faster |
| Euclidean (L2) | √Σ(Ai - Bi)² | 0-∞ | General purpose |
| Manhattan (L1) | Σ\|Ai - Bi\| | 0-∞ | Sparse vectors |

Metric Selection

# Distance metric examples

import numpy as np

# Sample vectors
a = np.array([0.1, 0.2, 0.3, 0.4])
b = np.array([0.2, 0.3, 0.4, 0.5])

# 1. Cosine distance (recommended for OpenAI embeddings)
def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 2. Dot product (faster, requires normalized vectors)
def dot_product(a, b):
    return -np.dot(a, b)  # Negative for ascending sort

# 3. Euclidean distance (magnitude-sensitive; often a weaker choice for high-dim embeddings)
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# 4. Manhattan distance
def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

print(f"Cosine: {cosine_distance(a, b):.4f}")
print(f"Dot: {dot_product(a, b):.4f}")
print(f"Euclidean: {euclidean_distance(a, b):.4f}")
print(f"Manhattan: {manhattan_distance(a, b):.4f}")
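For unit-length vectors the first three metrics are closely related, which is why the choice between them is often about speed and not ranking quality. A short derivation, assuming both vectors are normalized to length 1:

```latex
\begin{aligned}
\lVert a - b \rVert^2 &= \lVert a \rVert^2 + \lVert b \rVert^2 - 2\, a \cdot b \\
                      &= 2 - 2\, a \cdot b \qquad (\lVert a \rVert = \lVert b \rVert = 1) \\
                      &= 2\,(1 - \cos\theta)
\end{aligned}
```

So on normalized vectors the squared Euclidean distance is exactly twice the cosine distance, and the negative dot product is the cosine distance minus 1. All three are monotonic in each other and return neighbors in the same order. The sample vectors `a` and `b` above are not normalized, so their printed values will not satisfy this identity unless you divide each by its norm first.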

ANN (Approximate Nearest Neighbor)

ANN vs. Exact Search

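| Method | Speed | Accuracy | Use Case |
|--------|-------|----------|----------|
| Exact (FLAT) | Slow (O(n) per query) | 100% | Small datasets (< 10K) |
| ANN (IVF) | Fast (scans only the probed clusters, sublinear in n) | 95-99% recall, tunable | Medium datasets (10K-1M) |
| ANN (HNSW) | Very fast (roughly O(log n)) | 99%+ recall, tunable | Large datasets (> 1M) |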
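The speed/accuracy trade-off in the table can be sketched with a toy IVF index in plain Python. This is a simplified illustration, not how any particular database implements IVF: it picks random vectors as centroids where a real index would run k-means, and all names (`N_LISTS`, `ivf_search`, `nprobe`) are chosen for this example.

```python
import math
import random

random.seed(0)
DIM, N, N_LISTS, TOP_K = 8, 2000, 20, 5


def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


vectors = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N)]


def exact_search(query, k):
    """FLAT search: compare the query against every stored vector."""
    return sorted(range(N), key=lambda i: l2(query, vectors[i]))[:k]


# Toy IVF build: random vectors act as centroids (a real index would use
# k-means); each vector goes into the list of its nearest centroid.
centroids = random.sample(vectors, N_LISTS)
inverted_lists = [[] for _ in range(N_LISTS)]
for i, vec in enumerate(vectors):
    nearest = min(range(N_LISTS), key=lambda c: l2(vec, centroids[c]))
    inverted_lists[nearest].append(i)


def ivf_search(query, k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(N_LISTS), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in inverted_lists[c]]
    return sorted(candidates, key=lambda i: l2(query, vectors[i]))[:k]


def recall(approx, exact):
    """Fraction of the true top-k that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)


query = [random.gauss(0, 1) for _ in range(DIM)]
truth = exact_search(query, TOP_K)
for nprobe in (1, 5, N_LISTS):
    found = ivf_search(query, TOP_K, nprobe)
    print(f"nprobe={nprobe:2d} recall@{TOP_K}={recall(found, truth):.2f}")
```

Probing every list degenerates to exact search, so recall reaches 1.0 there. Smaller `nprobe` values scan fewer vectors and may miss some true neighbors, which is the accuracy column of the table in miniature.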

HNSW Algorithm

HNSW Parameters:

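M: Maximum number of bidirectional links per node (typically 16-32)
efConstruction: Size of the dynamic candidate list during index build (typically 64-512); larger values improve recall at the cost of build time
ef: Size of the dynamic candidate list during search (at least top_k, up to about 1000); larger values improve recall at the cost of latency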
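As one way to see where these parameters end up, the sketch below keeps them in a small config object and renders the corresponding pgvector statements. `HnswConfig` and its validation rules are hypothetical helpers written for this example; the table name `documents` and column `embedding` follow the SQL used elsewhere in this module. The rendered SQL uses pgvector's `USING hnsw ... WITH (m, ef_construction)` index options and the `hnsw.ef_search` session setting.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HnswConfig:
    """Hypothetical holder for the three HNSW knobs described above."""
    m: int = 16                # links per node
    ef_construction: int = 64  # candidate list size while building
    ef_search: int = 100       # candidate list size while querying

    def validate(self, top_k: int) -> None:
        # Sanity rules chosen for this sketch, not enforced by any library.
        if self.m < 2:
            raise ValueError("m must be at least 2")
        if self.ef_construction < self.m:
            raise ValueError("ef_construction should not be smaller than m")
        if self.ef_search < top_k:
            raise ValueError("ef_search should be at least top_k")

    def pgvector_statements(self, table: str = "documents",
                            column: str = "embedding") -> list[str]:
        """Render the build-time index DDL and the query-time setting."""
        return [
            f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
            f"WITH (m = {self.m}, ef_construction = {self.ef_construction});",
            f"SET hnsw.ef_search = {self.ef_search};",
        ]


config = HnswConfig(m=16, ef_construction=128, ef_search=200)
config.validate(top_k=10)
for statement in config.pgvector_statements():
    print(statement)
```

M and efConstruction are fixed when the index is built, so changing them means rebuilding. The search-time ef value can be adjusted per session or per query, which makes it the cheapest knob to tune first.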

Vector Database Operations

CRUD Operations

# Common CRUD operations across vector databases

# 1. CREATE (Insert/Update)
# Pinecone: index.upsert()
# Milvus: client.insert()
# pgvector: INSERT INTO
# Weaviate: client.data_object.create()

# 2. READ (Query)
# Pinecone: index.query()
# Milvus: client.search()
# pgvector: SELECT ... ORDER BY embedding <=>
# Weaviate: client.query.get().with_near_vector()

# 3. UPDATE
# Pinecone: index.upsert() (same as create)
# Milvus: client.upsert()
# pgvector: UPDATE ... SET embedding = ...
# Weaviate: client.data_object.update()

# 4. DELETE
# Pinecone: index.delete()
# Milvus: client.delete()
# pgvector: DELETE FROM
# Weaviate: client.data_object.delete()
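The four operations share the same shape in every product. The following toy in-memory store is a sketch of that shape only: `InMemoryVectorStore` is a hypothetical class for illustration, uses brute-force search, and does not mirror any vendor's client API.

```python
import math


class InMemoryVectorStore:
    """Hypothetical toy store: id -> (vector, metadata), brute-force search."""

    def __init__(self, dim):
        self.dim = dim
        self._rows = {}

    def upsert(self, item_id, vector, metadata=None):
        """CREATE and UPDATE share one call: an existing id is overwritten."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._rows[item_id] = (list(vector), dict(metadata or {}))

    def query(self, vector, top_k=3):
        """READ: return (id, cosine distance) pairs, nearest first."""
        scored = sorted(
            (self._cosine_distance(vector, stored), item_id)
            for item_id, (stored, _meta) in self._rows.items()
        )
        return [(item_id, dist) for dist, item_id in scored[:top_k]]

    def delete(self, item_id):
        """DELETE: return True if the id existed."""
        return self._rows.pop(item_id, None) is not None

    @staticmethod
    def _cosine_distance(u, v):
        dot = sum(x * y for x, y in zip(u, v))
        norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return 1.0 - dot / norms  # assumes neither vector is all zeros


store = InMemoryVectorStore(dim=2)
store.upsert("doc1", [1.0, 0.0], {"category": "AI"})
store.upsert("doc2", [0.0, 1.0], {"category": "ML"})
first_hit = store.query([1.0, 0.1], top_k=1)[0][0]

store.upsert("doc2", [1.0, 0.05])  # update by re-upserting the same id
second_hit = store.query([1.0, 0.1], top_k=1)[0][0]

removed = store.delete("doc1")
print(first_hit, second_hit, removed)
```

Modeling update as an upsert on the same id matches the Pinecone entry above, where create and update are the same call.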

Vector Database Patterns

Metadata Filtering

# Metadata filtering across databases

# 1. Pinecone (pre-filter)
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        # Range operators compare numbers, so store dates as Unix timestamps
        "date": {"$gte": 1735689600}  # 2025-01-01 00:00:00 UTC
    }
)

# 2. Milvus (filter expression)
results = client.search(
    collection_name=collection_name,
    data=[query_vector],
    filter="category in ['AI', 'ML'] and date >= '2025-01-01'",
    limit=10
)

-- 3. pgvector (SQL filter; query_vector stands for the bound query embedding)
SELECT id, content
FROM documents
WHERE metadata->>'category' = 'AI'
ORDER BY embedding <=> query_vector
LIMIT 10;

# 4. Weaviate (where filter)
results = client.query.get("Document", ["content", "title"]) \
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "AI"
    }) \
    .with_near_vector({"vector": query_vector}) \
    .with_limit(10) \
    .do()
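Whether a filter runs before or after the vector search changes what comes back. The sketch below contrasts the two orders on four hand-made documents; the data and the function names `pre_filter_search` and `post_filter_search` are hypothetical and only illustrate the idea.

```python
import math

# Hypothetical documents with 2-dimensional vectors for readability.
docs = [
    {"id": "a", "category": "AI",  "vector": [1.0, 0.0]},
    {"id": "b", "category": "Web", "vector": [0.9, 0.1]},
    {"id": "c", "category": "Web", "vector": [0.8, 0.2]},
    {"id": "d", "category": "AI",  "vector": [0.0, 1.0]},
]


def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def nearest(rows, query, k):
    return sorted(rows, key=lambda d: l2(query, d["vector"]))[:k]


def pre_filter_search(query, category, k):
    """Filter first, then rank: always returns up to k matching documents."""
    allowed = [d for d in docs if d["category"] == category]
    return [d["id"] for d in nearest(allowed, query, k)]


def post_filter_search(query, category, k):
    """Rank first, then filter: may return fewer than k documents."""
    hits = nearest(docs, query, k)
    return [d["id"] for d in hits if d["category"] == category]


query = [0.9, 0.12]
pre = pre_filter_search(query, "AI", k=2)
post = post_filter_search(query, "AI", k=2)
print("pre-filter: ", pre)
print("post-filter:", post)
```

Here the two nearest documents are both in the "Web" category, so post-filtering leaves nothing, while pre-filtering still returns the two "AI" documents. Engines that post-filter usually compensate by over-fetching candidates.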

Hybrid Search

# Hybrid search (vector + keyword)

# 1. Weaviate (built-in hybrid)
results = client.query.get("Document", ["content"]) \
    .with_hybrid(
        query="machine learning",  # BM25
        vector=query_vector,       # Vector
        alpha=0.7,                # 0=BM25, 1=vector
        properties=["content"]
    ) \
    .with_limit(10) \
    .do()

-- 2. pgvector (manual hybrid; query_vector stands for the bound query embedding)
WITH vector_search AS (
    SELECT id, content,
           1 - (embedding <=> query_vector) AS vector_score
    FROM documents
    ORDER BY embedding <=> query_vector
    LIMIT 100
),
keyword_search AS (
    SELECT id, content,
           ts_rank(text_search, query) AS keyword_score
    FROM documents, to_tsquery('english', 'machine & learning') query
    WHERE text_search @@ query
)
-- FULL OUTER JOIN keeps documents found by only one of the two searches;
-- a missing score counts as 0
SELECT
    COALESCE(v.id, k.id) AS id,
    COALESCE(v.content, k.content) AS content,
    (COALESCE(v.vector_score, 0) * 0.7
     + COALESCE(k.keyword_score, 0) * 0.3) AS combined_score
FROM vector_search v
FULL OUTER JOIN keyword_search k ON v.id = k.id
ORDER BY combined_score DESC
LIMIT 10;
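The weighted combination can also be done in application code, which makes it easier to normalize the two score scales before mixing them. This is a minimal sketch of one possible fusion scheme (min-max normalization plus a weighted sum); it is not the algorithm any specific database uses, and the scores and the names `min_max` and `fuse` are invented for the example. The `alpha` convention follows the one above: 0 is keyword only, 1 is vector only.

```python
def min_max(scores):
    """Rescale scores to [0, 1] so the two signals are comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / (high - low) for doc_id, s in scores.items()}


def fuse(vector_scores, keyword_scores, alpha=0.7, top_k=10):
    """Weighted sum over the union of both result sets (missing score = 0)."""
    vec, kw = min_max(vector_scores), min_max(keyword_scores)
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical scores: similarities from vector search, ranks from keyword search.
vector_scores = {"d1": 0.92, "d2": 0.85, "d3": 0.40}
keyword_scores = {"d2": 0.30, "d4": 0.10}

ranking = fuse(vector_scores, keyword_scores)
for doc_id, score in ranking:
    print(f"{doc_id}: {score:.3f}")
```

Document `d2` appears in both lists and ends up first even though `d1` has the higher vector score, and `d4` is still ranked despite having no vector hit.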

Vector Database Scaling

Scaling Strategies

Scaling Options:

Vertical: Increase RAM, CPU (single node)
Horizontal: Add nodes (cluster)
Sharding: Split data across nodes
Replication: Copy data for high availability
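Sharding changes the query path: each shard answers with its own top-k and a coordinator merges the partial results. The sketch below simulates that scatter-gather pattern in one process. The shard count, the CRC32-based routing, and all function names are assumptions made for this example; replication is not modeled.

```python
import heapq
import math
import zlib

N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]  # each dict stands in for a node


def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def shard_for(item_id):
    """Stable hash routing: the same id always lands on the same shard."""
    return zlib.crc32(item_id.encode("utf-8")) % N_SHARDS


def insert(item_id, vector):
    shards[shard_for(item_id)][item_id] = vector


def search_shard(shard, query, k):
    """Local top-k on one shard, as (distance, id) pairs."""
    scored = ((l2(query, vec), item_id) for item_id, vec in shard.items())
    return heapq.nsmallest(k, scored)


def search(query, k):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partial = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return heapq.nsmallest(k, partial)


for i in range(10):
    insert(f"doc{i}", [float(i), 0.0])

results = search([0.0, 0.0], k=3)
print(results)
```

Each shard must return k results, not k divided by the number of shards, because all of the true nearest neighbors could live on a single shard.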

Vector Database Performance

Performance Optimization

# Performance optimization strategies

# 1. Choose appropriate index type
# - IVFFlat: < 1M vectors
# - HNSW: > 1M vectors

# 2. Tune index parameters
# HNSW: M=16-32, efConstruction=64-512
# IVFFlat: lists=rows/1000 up to ~1M rows, sqrt(rows) beyond that (pgvector guidance)

# 3. Use appropriate distance metric
# - Cosine: Normalized vectors (OpenAI)
# - Dot product: Faster, requires normalization
# - L2: Not recommended for high-dim

# 4. Use metadata filters
# Reduces search space, improves performance

# 5. Batch operations
# Bulk insert/update for better throughput
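Two of these strategies reduce to small helpers. The first turns the IVFFlat sizing rule into a function, using the starting points suggested in pgvector's documentation (rows / 1000 up to about one million rows, sqrt(rows) above that). The second splits a stream of records into fixed-size batches for bulk inserts. Both function names are hypothetical, and the batch size is something to tune per database.

```python
import math
from itertools import islice


def suggest_ivfflat_lists(rows):
    """Starting point for the IVFFlat `lists` parameter; tune from here."""
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return round(math.sqrt(rows))


def batched(items, batch_size):
    """Yield lists of up to batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


print(suggest_ivfflat_lists(200_000))
print(suggest_ivfflat_lists(4_000_000))

# Each batch would be passed to one bulk insert/upsert call.
records = [(f"doc{i}", [0.0, 0.0]) for i in range(5)]
for batch in batched(records, batch_size=2):
    print(len(batch), "records in this batch")
```

Batching amortizes per-request overhead such as network round trips and transaction commits, which is usually where bulk loading time goes.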

Vector Database Cost

Cost Comparison

| Database | Deployment | Cost | Notes |
|----------|------------|------|-------|
| Pinecone | Cloud | $0.10-1.00/hour | Pay-per-use |
| Milvus | Self-hosted | Free (hardware) | Open source |
| pgvector | Self-hosted | Free (hardware) | Postgres extension |
| Weaviate | Self-hosted | Free (hardware) | Open source |
| Weaviate Cloud | Cloud | $0-25/month | Free tier available |

Cost Optimization

# Cost optimization strategies

# 1. Use self-hosted for cost savings
# Milvus, pgvector, Weaviate have no license fee; you still pay for hardware and operations

# 2. Use appropriate index type
# HNSW for best performance/cost ratio

# 3. Delete old vectors
# Implement retention policies

# 4. Use dimensionality reduction
# PCA, UMAP to reduce dimensions

# 5. Use partitioning
# Partition by date, category for better performance
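Storage is the main cost driver for self-hosted deployments, and raw vector size is simple arithmetic: vectors × dimensions × bytes per value. The sketch below assumes 4-byte float32 values and ignores index structures, metadata, and replication, so real usage will be higher. The function name and the example sizes are chosen for illustration.

```python
def raw_vector_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default)."""
    return num_vectors * dimensions * bytes_per_value


GIB = 1024 ** 3

full = raw_vector_bytes(1_000_000, 1536)
reduced = raw_vector_bytes(1_000_000, 384)

print(f"1M x 1536 dims: {full / GIB:.2f} GiB")
print(f"1M x  384 dims: {reduced / GIB:.2f} GiB")
print(f"savings factor: {full / reduced:.0f}x")
```

Because the cost is linear in both factors, cutting dimensions by 4x (strategy 4) or deleting three quarters of the vectors (strategy 3) saves the same amount of raw storage.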

Vector Database Guides

Pinecone Guide - Managed vector database
Milvus Guide - Open-source vector database
pgvector Guide - PostgreSQL vector extension
Weaviate Guide - Knowledge graph vector database

Key Takeaways

Vector embeddings: High-dimensional representations of data
Similarity search: Find nearest neighbors in vector space
ANN algorithms: HNSW, IVF for fast approximate search
Distance metrics: Cosine, dot product, Euclidean
Metadata filtering: Pre-filter for better performance
Hybrid search: Vector + keyword for best results
Scaling: Vertical (single node) vs. horizontal (cluster)
Use When: Semantic search, RAG, recommendations, anomaly detection
