Vector Databases
High-Performance Similarity Search for AI/ML
Overview
Vector databases are specialized databases designed for efficient similarity search on high-dimensional vector embeddings. They enable applications like semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and anomaly detection.
What are Vector Embeddings?
Embedding Concepts
Embedding Characteristics:
- High-dimensional: Typically 384 to 3,072 dimensions, depending on the model
- Dense: Most values are non-zero
- Semantic: Similar concepts are close in vector space
- Fixed-size: All embeddings from a given model have the same number of dimensions
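A toy illustration of the "semantic" and "fixed-size" properties above. The three 4-dimensional vectors are made-up values chosen for the example, not output from any real embedding model, and `cosine_similarity` is a helper defined here.

```python
import math

# Hypothetical 4-dimensional "embeddings" (made-up values for illustration only).
embeddings = {
    "cat":    [0.90, 0.80, 0.10, 0.05],
    "kitten": [0.85, 0.90, 0.15, 0.10],
    "car":    [0.10, 0.05, 0.90, 0.80],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Fixed-size: every vector from the same "model" has the same length.
dimensions = {len(vec) for vec in embeddings.values()}
print("distinct vector lengths:", dimensions)

# Semantic: related concepts score higher than unrelated ones.
sim_cat_kitten = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_cat_car = cosine_similarity(embeddings["cat"], embeddings["car"])
print(f"cat vs kitten: {sim_cat_kitten:.3f}")
print(f"cat vs car:    {sim_cat_car:.3f}")
```

With real embeddings the same comparison applies, only over hundreds or thousands of dimensions.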
Vector Database Comparison
Feature Comparison
| Feature | Pinecone | Milvus | pgvector | Weaviate |
|---------|----------|--------|----------|----------|
| Type | Managed Service | Open Source | Postgres Extension | Open Source |
| Deployment | Cloud only | Self-hosted/Cloud | Self-hosted/Managed Postgres | Self-hosted/Cloud |
| Index Types | HNSW | IVF, HNSW, FLAT | IVFFlat, HNSW | HNSW |
| Max Dimensions | 20,000 | 32,768 | 2,000 (indexed) | 65,535 |
| Metadata | Yes | Dynamic schema | JSONB columns | Schema |
| Knowledge Graph | No | No | No | Yes |
| Scalability | Auto-scaling | Horizontal scaling | Vertical scaling | Horizontal scaling |
| Pricing | Pay-per-use | Free (self-hosted) | Free (self-hosted) | Free tier |
| Best For | RAG, production | On-premises, K8s | PostgreSQL shops | Knowledge graphs |
Use Case Selection
Distance Metrics
Metric Comparison
| Metric | Formula | Range | Use Case |
|---|---|---|---|
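| Cosine | 1 - cos(θ) | 0-2 | Text embeddings where only direction matters (e.g. OpenAI) |
| Dot Product | -A·B (negated so that smaller means closer) | -∞ to ∞ | Normalized vectors; same ranking as cosine, cheaper to compute |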
| Euclidean (L2) | √Σ(Ai - Bi)² | 0-∞ | General purpose |
| Manhattan (L1) | Σ\|Ai - Bi\| | 0-∞ | Sparse vectors |
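The table treats cosine and dot product as interchangeable for normalized vectors. A minimal sketch of why, using random unit vectors: for unit-length vectors, cosine distance is 1 - A·B and squared Euclidean distance is 2 - 2(A·B), so all three metrics produce the same ranking. The helper names and the random data are assumptions made for this example.

```python
import math
import random

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def cosine_distance(u, v):
    return 1 - dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def euclidean_distance(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

rng = random.Random(0)
query = normalize([rng.gauss(0, 1) for _ in range(8)])
docs = [normalize([rng.gauss(0, 1) for _ in range(8)]) for _ in range(20)]

# Rank the same documents under each metric (closest first).
ids = range(len(docs))
rank_cosine = sorted(ids, key=lambda i: cosine_distance(query, docs[i]))
rank_dot = sorted(ids, key=lambda i: -dot(query, docs[i]))
rank_l2 = sorted(ids, key=lambda i: euclidean_distance(query, docs[i]))
print("same ranking under all three metrics:", rank_cosine == rank_dot == rank_l2)

# Identity for unit vectors: ||a - b||^2 == 2 * cosine_distance(a, b)
gap = max(abs(euclidean_distance(query, d) ** 2 - 2 * cosine_distance(query, d))
          for d in docs)
print(f"largest deviation from the identity: {gap:.2e}")
```

If the vectors are not normalized, the rankings can diverge, which is why the dot-product shortcut requires normalization.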
Metric Selection
# Distance metric examples
import numpy as np

# Sample vectors
a = np.array([0.1, 0.2, 0.3, 0.4])
b = np.array([0.2, 0.3, 0.4, 0.5])

# 1. Cosine distance (recommended for OpenAI embeddings)
def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 2. Dot product (faster, requires normalized vectors)
def dot_product(a, b):
    return -np.dot(a, b)  # Negative for ascending sort

# 3. Euclidean distance (not recommended for high-dim vectors)
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# 4. Manhattan distance
def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

print(f"Cosine: {cosine_distance(a, b):.4f}")
print(f"Dot: {dot_product(a, b):.4f}")
print(f"Euclidean: {euclidean_distance(a, b):.4f}")
print(f"Manhattan: {manhattan_distance(a, b):.4f}")
ANN (Approximate Nearest Neighbor)
ANN vs. Exact Search
| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Exact (FLAT) | Slow (O(n)) | 100% | Small datasets (< 10K) |
| ANN (IVF) | Fast (scans only the probed lists) | 95-99%, depending on how many lists are probed | Medium datasets (10K-1M) |
| ANN (HNSW) | Very fast (roughly O(log n)) | Up to 99%+, depending on ef | Large datasets (> 1M) |
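The speed/accuracy trade-off in the table can be sketched with a toy IVF-style index. This is a simplified illustration, not any database's implementation: centroids are picked as random samples (real IVF indexes train them with k-means), and `build_ivf`, `ivf_search`, and `n_probe` are names chosen for this example.

```python
import math
import random

def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def exact_search(query, vectors, k):
    """FLAT search: compare the query against every vector, O(n)."""
    return sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:k]

def build_ivf(vectors, n_lists, rng):
    """Assign each vector to the inverted list of its nearest centroid."""
    centroids = rng.sample(vectors, n_lists)  # assumption: no k-means training
    lists = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        nearest = min(range(n_lists), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(idx)
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Scan only the n_probe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: l2(query, centroids[c]))[:n_probe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(query, vectors[i]))
    return candidates[:k]

rng = random.Random(42)
vectors = [[rng.random() for _ in range(16)] for _ in range(2000)]
query = [rng.random() for _ in range(16)]
n_lists = int(math.sqrt(len(vectors)))
centroids, lists = build_ivf(vectors, n_lists, rng)

truth = set(exact_search(query, vectors, k=10))
recalls = {}
for n_probe in (1, 4, n_lists):
    found = set(ivf_search(query, vectors, centroids, lists, k=10, n_probe=n_probe))
    recalls[n_probe] = len(found & truth) / len(truth)
    print(f"n_probe={n_probe:>2}  recall@10={recalls[n_probe]:.2f}")
```

Probing every list degenerates into exact search, so recall reaches 100% at the cost of scanning all vectors; smaller `n_probe` values trade recall for speed.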
HNSW Algorithm
HNSW Parameters:
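- M: Maximum number of bidirectional links per node (typically 16-32); higher values improve recall but use more memory
- efConstruction: Size of the dynamic candidate list during index build (typically 32-512); higher values give a better graph but a slower build
- ef: Size of the dynamic candidate list during search (from top_k up to about 1000); higher values improve recall but slow down queries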
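To make the role of `M` and `ef` concrete, here is a deliberately simplified sketch: a single-layer proximity graph searched with the ef-bounded best-first routine that HNSW runs within a layer. It is not a full HNSW implementation. There is no layer hierarchy, and the graph is built by brute force rather than incrementally with efConstruction. The function names and random data are assumptions for this example.

```python
import heapq
import math
import random

def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def build_graph(vectors, M):
    """Link every node to its M nearest neighbors, in both directions."""
    n = len(vectors)
    neighbors = {i: set() for i in range(n)}
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: l2(vectors[i], vectors[j]))[:M]
        for j in nearest:
            neighbors[i].add(j)
            neighbors[j].add(i)
    return neighbors

def search(query, vectors, neighbors, entry, k, ef):
    """Best-first graph search that keeps at most ef results (ef >= k)."""
    d0 = l2(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]   # min-heap: closest unexplored node first
    results = [(-d0, entry)]     # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if len(results) >= ef and dist > -results[0][0]:
            break  # the closest candidate cannot improve the result set
        for nb in neighbors[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, vectors[nb])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(results, (-d, nb))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    best = sorted((-neg_d, node) for neg_d, node in results)
    return [node for _, node in best[:k]], len(visited)

rng = random.Random(7)
vectors = [[rng.random() for _ in range(8)] for _ in range(300)]
query = [rng.random() for _ in range(8)]
graph = build_graph(vectors, M=8)
exact = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:5]

for ef in (5, 20, 80):
    ids, n_visited = search(query, vectors, graph, entry=0, k=5, ef=ef)
    recall = len(set(ids) & set(exact)) / len(exact)
    print(f"ef={ef:>2}  visited={n_visited:>3}  recall@5={recall:.2f}")
```

Raising `ef` lets the search keep more candidates alive before stopping, which typically visits more nodes and improves recall; `M` controls how many links each node has to explore.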
Vector Database Operations
CRUD Operations
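As a database-agnostic illustration of the four operations covered in this section, here is a toy in-memory store. `ToyVectorStore` and its methods are hypothetical names for this sketch and do not correspond to any of the client libraries; it only shows the semantics (upsert doubles as create and update, query ranks by distance).

```python
import math

def cosine_distance(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return 1 - dot / (norm_u * norm_v)

class ToyVectorStore:
    """In-memory stand-in for a vector database (illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.records = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """CREATE or UPDATE: insert a new record or overwrite an existing one."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.records[item_id] = (list(vector), dict(metadata or {}))

    def query(self, vector, top_k=3):
        """READ: return the ids of the top_k closest records."""
        scored = sorted((cosine_distance(vector, vec), item_id)
                        for item_id, (vec, _) in self.records.items())
        return [item_id for _, item_id in scored[:top_k]]

    def delete(self, item_id):
        """DELETE: remove a record if it exists."""
        self.records.pop(item_id, None)

store = ToyVectorStore(dim=3)
store.upsert("a", [1.0, 0.0, 0.0], {"title": "first"})
store.upsert("b", [0.0, 1.0, 0.0], {"title": "second"})
print("nearest before update:", store.query([0.9, 0.1, 0.0], top_k=1))

store.upsert("b", [1.0, 0.1, 0.0], {"title": "second, revised"})
print("nearest after update: ", store.query([0.9, 0.1, 0.0], top_k=1))

store.delete("b")
print("remaining ids:        ", store.query([0.9, 0.1, 0.0], top_k=5))
```

Real databases replace the linear scan in `query` with an ANN index, but the call pattern is the same.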
# Common CRUD operations across vector databases

# 1. CREATE (Insert/Update)
# Pinecone: index.upsert()
# Milvus: client.insert()
# pgvector: INSERT INTO
# Weaviate: client.data_object.create()

# 2. READ (Query)
# Pinecone: index.query()
# Milvus: client.search()
# pgvector: SELECT ... ORDER BY embedding <=>
# Weaviate: client.query.get().with_near_vector()

# 3. UPDATE
# Pinecone: index.upsert() (same as create)
# Milvus: client.upsert()
# pgvector: UPDATE ... SET embedding = ...
# Weaviate: client.data_object.update()

# 4. DELETE
# Pinecone: index.delete()
# Milvus: client.delete()
# pgvector: DELETE FROM
# Weaviate: client.data_object.delete()
Vector Database Patterns
Metadata Filtering
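Before the per-database syntax, it helps to see why the position of the filter matters. The sketch below contrasts pre-filtering (filter first, then rank) with post-filtering (rank first, then filter) on a handful of made-up 2-dimensional documents; the data and function names are assumptions for this example.

```python
import math

def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Made-up documents: id, metadata category, and a 2-dimensional vector.
docs = [
    {"id": 1, "category": "ML", "vector": [1.00, 0.00]},
    {"id": 2, "category": "ML", "vector": [0.90, 0.10]},
    {"id": 3, "category": "AI", "vector": [0.95, 0.05]},
    {"id": 4, "category": "ML", "vector": [0.80, 0.30]},
    {"id": 5, "category": "AI", "vector": [0.50, 0.50]},
    {"id": 6, "category": "AI", "vector": [0.20, 0.90]},
]

def pre_filter_search(query, docs, k, predicate):
    """Apply the metadata filter first, then rank only the matching documents."""
    matching = [d for d in docs if predicate(d)]
    matching.sort(key=lambda d: l2(query, d["vector"]))
    return [d["id"] for d in matching[:k]]

def post_filter_search(query, docs, k, predicate):
    """Rank all documents first, then drop non-matching ones from the top k."""
    top = sorted(docs, key=lambda d: l2(query, d["vector"]))[:k]
    return [d["id"] for d in top if predicate(d)]

query = [1.0, 0.0]
is_ai = lambda d: d["category"] == "AI"

pre = pre_filter_search(query, docs, k=3, predicate=is_ai)
post = post_filter_search(query, docs, k=3, predicate=is_ai)
print("pre-filter: ", pre)
print("post-filter:", post)
```

Pre-filtering always returns up to `k` matching documents, while post-filtering can come back short when most of the nearest neighbors fail the filter.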
# Metadata filtering across databases

# 1. Pinecone (pre-filter)
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        "date": {"$gte": "2025-01-01"}
    }
)

# 2. Milvus (filter expression)
results = client.search(
    collection_name=collection_name,
    data=[query_vector],
    filter="category in ['AI', 'ML'] and date >= '2025-01-01'",
    limit=10
)

# 3. pgvector (SQL filter)
SELECT id, content
FROM documents
WHERE metadata->>'category' = 'AI'
ORDER BY embedding <=> query_vector
LIMIT 10;

# 4. Weaviate (where filter)
results = client.query.get("Document", ["content", "title"]) \
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueString": "AI"
    }) \
    .with_near_vector({"vector": query_vector}) \
    .with_limit(10) \
    .do()
Hybrid Search
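Hybrid search combines a vector-similarity score with a keyword score using a weight, usually called alpha. The sketch below shows the weighted-sum idea on made-up scores. Min-max normalization is an assumption made here so the two score scales are comparable; it is not a claim about how any particular database fuses results, and `hybrid_rank` is a hypothetical helper.

```python
def min_max(scores):
    """Rescale scores to the range [0, 1] so different scales can be mixed."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def hybrid_rank(vector_scores, keyword_scores, alpha):
    """alpha=1.0 uses only the vector score, alpha=0.0 only the keyword score."""
    v = min_max(vector_scores)
    kw = min_max(keyword_scores)
    doc_ids = sorted(set(v) | set(kw))  # sorted for a deterministic tie order
    combined = {d: alpha * v.get(d, 0.0) + (1 - alpha) * kw.get(d, 0.0)
                for d in doc_ids}
    return sorted(doc_ids, key=lambda d: combined[d], reverse=True)

# Made-up scores: cosine similarity (higher is better) and a BM25-like score.
vector_scores = {"doc1": 0.92, "doc2": 0.85, "doc3": 0.40}
keyword_scores = {"doc2": 3.1, "doc3": 7.4, "doc4": 5.0}

for alpha in (1.0, 0.7, 0.0):
    ranking = hybrid_rank(vector_scores, keyword_scores, alpha)
    print(f"alpha={alpha}: {ranking}")
```

A document missing from one result list contributes a score of zero on that side, which is one simple way to handle partial overlap between the two searches.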
# Hybrid search (vector + keyword)

# 1. Weaviate (built-in hybrid)
results = client.query.get("Document", ["content"]) \
    .with_hybrid(
        query="machine learning",  # BM25
        vector=query_vector,       # Vector
        alpha=0.7,                 # 0=BM25, 1=vector
        properties=["content"]
    ) \
    .with_limit(10) \
    .do()

# 2. pgvector (manual hybrid)
WITH vector_search AS (
    SELECT id, content,
           1 - (embedding <=> query_vector) AS vector_score
    FROM documents
    ORDER BY embedding <=> query_vector
    LIMIT 100
),
keyword_search AS (
    SELECT id, content,
           ts_rank(text_search, query) AS keyword_score
    FROM documents, to_tsquery('english', 'machine & learning') query
    WHERE text_search @@ query
)
SELECT v.id, v.content,
       (v.vector_score * 0.7 + k.keyword_score * 0.3) AS combined_score
FROM vector_search v
JOIN keyword_search k ON v.id = k.id
ORDER BY combined_score DESC
LIMIT 10;
Vector Database Scaling
Scaling Strategies
Scaling Options:
- Vertical: Increase RAM, CPU (single node)
- Horizontal: Add nodes (cluster)
- Sharding: Split data across nodes
- Replication: Copy data for high availability
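Sharding and horizontal scaling can be sketched as a scatter-gather query: each shard returns its local top-k, and a coordinator merges the partial results. This is a minimal single-process simulation under assumed names (`search_shard`, `search_cluster`); the modulo placement rule is an arbitrary choice for the example, and replication is not shown.

```python
import heapq
import math
import random

def l2(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def search_shard(shard, query, k):
    """Local top-k on one shard, returned as (distance, id) pairs."""
    return heapq.nsmallest(k, ((l2(query, vec), item_id)
                               for item_id, vec in shard.items()))

def search_cluster(shards, query, k):
    """Scatter the query to every shard, then merge the partial top-k lists."""
    partials = [hit for shard in shards for hit in search_shard(shard, query, k)]
    return [item_id for _, item_id in heapq.nsmallest(k, partials)]

rng = random.Random(1)
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = [rng.random() for _ in range(8)]

# Sharding: place each vector on a shard by id (modulo is an arbitrary choice).
n_shards = 4
shards = [dict() for _ in range(n_shards)]
for item_id, vec in enumerate(vectors):
    shards[item_id % n_shards][item_id] = vec

sharded = search_cluster(shards, query, k=5)
single_node = [i for _, i in heapq.nsmallest(
    5, ((l2(query, v), i) for i, v in enumerate(vectors)))]
print("sharded result:    ", sharded)
print("single-node result:", single_node)
```

Because every global top-k item is also in its own shard's local top-k, the merged result matches a single-node exact search while each shard scans only its portion of the data.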
Vector Database Performance
Performance Optimization
# Performance optimization strategies

# 1. Choose appropriate index type
# - IVFFlat: < 1M vectors
# - HNSW: > 1M vectors

# 2. Tune index parameters
# HNSW: M=16-32, efConstruction=64-512
# IVFFlat: lists=rows/1000 up to 1M rows, sqrt(rows) beyond that

# 3. Use appropriate distance metric
# - Cosine: Normalized vectors (OpenAI)
# - Dot product: Faster, requires normalization
# - L2: Not recommended for high-dim

# 4. Use metadata filters
# Selective filters reduce the search space and can improve performance

# 5. Batch operations
# Bulk insert/update for better throughput
Vector Database Cost
Cost Comparison
| Database | Deployment | Cost | Notes |
|---|---|---|---|
| Pinecone | Cloud | $0.10-1.00/hour | Pay-per-use |
| Milvus | Self-hosted | Free (hardware) | Open source |
| pgvector | Self-hosted | Free (hardware) | Postgres extension |
| Weaviate | Self-hosted | Free (hardware) | Open source |
| Weaviate Cloud | Cloud | $0-25/month | Free tier available |
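For the self-hosted rows, "free" still means paying for hardware, and memory is usually the first constraint. A rough back-of-the-envelope estimate, assuming float32 storage (4 bytes per value) and ignoring index and metadata overhead, which vary by database; `raw_vector_bytes` is a helper defined for this example.

```python
def raw_vector_bytes(n_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone (float32 by default), excluding
    index structures, metadata, and replication."""
    return n_vectors * dims * bytes_per_value

n_vectors = 1_000_000
for dims in (384, 1536, 3072):
    gib = raw_vector_bytes(n_vectors, dims) / 2**30
    print(f"{dims:>4} dims: {gib:.2f} GiB per million vectors")
```

The estimate scales linearly in both the number of vectors and the number of dimensions, so halving either one halves the raw footprint.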
Cost Optimization
# Cost optimization strategies

# 1. Use self-hosted for cost savings
# Milvus, pgvector, Weaviate are free

# 2. Use appropriate index type
# HNSW for best performance/cost ratio

# 3. Delete old vectors
# Implement retention policies

# 4. Use dimensionality reduction
# PCA, UMAP to reduce dimensions

# 5. Use partitioning
# Partition by date, category for better performance
Vector Database Guides
- Pinecone Guide - Managed vector database
- Milvus Guide - Open-source vector database
- pgvector Guide - PostgreSQL vector extension
- Weaviate Guide - Knowledge graph vector database
Key Takeaways
- Vector embeddings: High-dimensional representations of data
- Similarity search: Find nearest neighbors in vector space
- ANN algorithms: HNSW, IVF for fast approximate search
- Distance metrics: Cosine, dot product, Euclidean
- Metadata filtering: Pre-filter for better performance
- Hybrid search: Vector + keyword for best results
- Scaling: Vertical (single node) vs. horizontal (cluster)
- Use When: Semantic search, RAG, recommendations, anomaly detection