
Vector Databases

High-Performance Similarity Search for AI/ML


Overview

Vector databases are specialized databases designed for efficient similarity search on high-dimensional vector embeddings. They enable applications like semantic search, RAG (Retrieval-Augmented Generation), recommendation systems, and anomaly detection.


What are Vector Embeddings?

Embedding Concepts

Embedding Characteristics:

  • High-dimensional: Typically 384 to 3,072 dimensions, depending on the embedding model
  • Dense: Most values are non-zero
  • Semantic: Similar concepts are close in vector space
  • Fixed-size: All embeddings from a model have same dimensions
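
A toy example can make these properties concrete. The three 4-dimensional vectors below are hand-written stand-ins rather than output from a real model, and `cosine_similarity` is a helper defined only for this sketch.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (hand-written for illustration only).
embeddings = {
    "cat":   [0.90, 0.80, 0.10, 0.05],
    "dog":   [0.85, 0.75, 0.20, 0.10],
    "stock": [0.05, 0.10, 0.90, 0.85],
}

# Fixed-size: every vector from the same "model" has the same length.
dims = {len(v) for v in embeddings.values()}

# Semantic: related concepts should score higher than unrelated ones.
sim_cat_dog = cosine_similarity(embeddings["cat"], embeddings["dog"])
sim_cat_stock = cosine_similarity(embeddings["cat"], embeddings["stock"])

print(f"dimensions:   {dims}")
print(f"cat vs dog:   {sim_cat_dog:.3f}")
print(f"cat vs stock: {sim_cat_stock:.3f}")
```

The "cat" and "dog" vectors point in nearly the same direction, so their similarity is close to 1, while "cat" and "stock" score much lower.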

Vector Database Comparison

Feature Comparison

| Feature | Pinecone | Milvus | pgvector | Weaviate |
|---------|----------|--------|----------|----------|
| Type | Managed Service | Open Source | Postgres Extension | Open Source |
| Deployment | Cloud only | Self-hosted/Cloud | Self-hosted or managed Postgres | Self-hosted/Cloud |
| Index Types | HNSW | IVF, HNSW, FLAT | IVFFlat, HNSW | HNSW |
| Max Dimensions | 20,000 | 32,768 | 2,000 (indexed `vector` columns) | 65,535 |
| Metadata | Yes | Dynamic schema | JSONB columns | Schema |
| Knowledge Graph | No | No | No | Yes |
| Scalability | Auto-scaling | Horizontal scaling | Vertical scaling | Horizontal scaling |
| Pricing | Pay-per-use | Free (self-hosted) | Free (self-hosted) | Free tier |
| Best For | RAG, production | On-premises, K8s | PostgreSQL shops | Knowledge graphs |



Distance Metrics

Metric Comparison

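| Metric | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Cosine | 1 - cos(θ) | 0 to 2 | Normalized vectors (OpenAI) |
| Dot Product | -A·B | -∞ to ∞ | Normalized vectors, faster |
| Euclidean (L2) | √(Σ(Ai - Bi)²) | 0 to ∞ | General purpose |
| Manhattan (L1) | Σ\|Ai - Bi\| | 0 to ∞ | Sparse vectors |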

Metric Selection

# Distance metric examples
import numpy as np

# Sample vectors
a = np.array([0.1, 0.2, 0.3, 0.4])
b = np.array([0.2, 0.3, 0.4, 0.5])

# 1. Cosine distance (recommended for OpenAI embeddings)
def cosine_distance(a, b):
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# 2. Dot product (faster; matches the cosine ranking only for normalized vectors)
def dot_product(a, b):
    return -np.dot(a, b)  # Negated so that smaller means more similar (ascending sort)

# 3. Euclidean distance (sensitive to vector magnitude)
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# 4. Manhattan distance
def manhattan_distance(a, b):
    return np.sum(np.abs(a - b))

print(f"Cosine: {cosine_distance(a, b):.4f}")
print(f"Dot: {dot_product(a, b):.4f}")
print(f"Euclidean: {euclidean_distance(a, b):.4f}")
print(f"Manhattan: {manhattan_distance(a, b):.4f}")
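
The claim that dot product needs normalized vectors can be checked directly. For unit-length vectors, cosine distance equals 1 - A·B and squared L2 distance equals 2 - 2(A·B), so all three metrics rank candidates in the same order. The sketch below uses only the standard library; the query and candidate vectors are made up for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical query and candidates, all normalized to unit length.
query = normalize([0.1, 0.2, 0.3, 0.4])
candidates = {
    "a": normalize([0.2, 0.3, 0.4, 0.5]),
    "b": normalize([0.9, 0.1, 0.0, 0.2]),
    "c": normalize([0.1, 0.1, 0.5, 0.3]),
}

# Rank the candidates under each metric (smallest distance first).
by_cosine = sorted(candidates, key=lambda k: 1 - dot(query, candidates[k]))
by_dot = sorted(candidates, key=lambda k: -dot(query, candidates[k]))
by_l2 = sorted(candidates, key=lambda k: l2(query, candidates[k]))

print(by_cosine, by_dot, by_l2)
```

If the vectors were not normalized, the dot product would also reward longer vectors and the three rankings could diverge.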

ANN (Approximate Nearest Neighbor)

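| Method | Speed | Recall (typical) | Use Case |
|--------|-------|------------------|----------|
| Exact (FLAT) | Slow (O(n)) | 100% | Small datasets (< 10K) |
| ANN (IVF) | Fast (scans only the nearest clusters) | 95-99% | Medium datasets (10K-1M) |
| ANN (HNSW) | Very fast (roughly O(log n)) | 99%+, depending on ef | Large datasets (> 1M) |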
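
The trade-off between exact and IVF search can be sketched in a few lines. This is a simplified illustration, not any database's implementation: centroids are sampled from the data instead of trained with k-means, and the names `build_ivf`, `ivf_search`, `nlist`, and `nprobe` are chosen for this example.

```python
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(vectors, query, k):
    """FLAT search: compare the query with every vector, O(n)."""
    order = sorted(range(len(vectors)), key=lambda i: l2(vectors[i], query))
    return order[:k]

def build_ivf(vectors, nlist, rng):
    """Assign each vector to its nearest centroid (centroids sampled from the data)."""
    centroids = rng.sample(vectors, nlist)
    lists = [[] for _ in range(nlist)]
    for i, v in enumerate(vectors):
        nearest = min(range(nlist), key=lambda c: l2(v, centroids[c]))
        lists[nearest].append(i)
    return centroids, lists

def ivf_search(vectors, centroids, lists, query, k, nprobe):
    """Scan only the nprobe inverted lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: l2(vectors[i], query))
    return candidates[:k], len(candidates)

rng = random.Random(0)
vectors = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(2000)]
query = [rng.gauss(0, 1) for _ in range(8)]
k, nlist = 10, 40

truth = exact_search(vectors, query, k)
centroids, lists = build_ivf(vectors, nlist, rng)

for nprobe in (1, 5, nlist):
    found, scanned = ivf_search(vectors, centroids, lists, query, k, nprobe)
    recall = len(set(found) & set(truth)) / k
    print(f"nprobe={nprobe:2d} scanned={scanned:4d} recall={recall:.2f}")
```

Probing more lists scans more vectors and raises recall; probing all of them degenerates into the exact search.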

HNSW Algorithm

HNSW Parameters:

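  • M: Maximum number of bidirectional links per node (typically 16-32); higher values raise recall and memory use
  • efConstruction: Size of the dynamic candidate list during index build (typically 64-512); higher values give a better graph but slower builds
  • ef: Size of the dynamic candidate list during search (at least top_k, up to about 1000); higher values raise recall and latency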
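
To show what M and ef control, here is a deliberately simplified single-layer sketch. It is not the full HNSW algorithm: the graph is built by brute force rather than by incremental insertion guided by efConstruction, and there is no layer hierarchy. The function names `build_graph` and `search` are assumptions made for this example.

```python
import heapq
import math
import random

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_graph(vectors, M):
    """Link each node to its M nearest neighbours and add the reverse links.

    Brute-force construction for illustration only.
    """
    graph = {i: set() for i in range(len(vectors))}
    for i, v in enumerate(vectors):
        others = (j for j in range(len(vectors)) if j != i)
        for j in sorted(others, key=lambda j: l2(v, vectors[j]))[:M]:
            graph[i].add(j)
            graph[j].add(i)  # bidirectional link
    return graph

def search(vectors, graph, query, k, ef, entry=0):
    """Best-first graph search that keeps the ef closest nodes seen so far."""
    d0 = l2(query, vectors[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    best = [(-d0, entry)]       # max-heap via negation: current result list
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -best[0][0]:
            break  # the closest candidate is worse than the worst result
        for nb in graph[node]:
            if nb in visited:
                continue
            visited.add(nb)
            d = l2(query, vectors[nb])
            if len(best) < ef or d < -best[0][0]:
                heapq.heappush(candidates, (d, nb))
                heapq.heappush(best, (-d, nb))
                if len(best) > ef:
                    heapq.heappop(best)  # drop the current worst result
    ranked = sorted((-neg_d, node) for neg_d, node in best)
    return [node for _, node in ranked[:k]], len(visited)

rng = random.Random(1)
vectors = [[rng.random() for _ in range(8)] for _ in range(300)]
query = [rng.random() for _ in range(8)]
graph = build_graph(vectors, M=8)
truth = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))[:10]

for ef in (10, 50, 200):
    found, visited = search(vectors, graph, query, k=10, ef=ef)
    recall = len(set(found) & set(truth)) / 10
    print(f"ef={ef:3d} visited={visited:3d} recall={recall:.2f}")
```

A larger ef makes the search visit more nodes before stopping, which is the recall-versus-latency knob described above; a larger M gives every node more links to follow.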

Vector Database Operations

CRUD Operations

# Common CRUD operations across vector databases
# (Weaviate calls use the v3 Python client API)

# 1. CREATE (Insert/Update)
#    Pinecone: index.upsert()
#    Milvus: client.insert()
#    pgvector: INSERT INTO
#    Weaviate: client.data_object.create()

# 2. READ (Query)
#    Pinecone: index.query()
#    Milvus: client.search()
#    pgvector: SELECT ... ORDER BY embedding <=> ...  (<=> is cosine distance)
#    Weaviate: client.query.get().with_near_vector()

# 3. UPDATE
#    Pinecone: index.upsert() (same as create)
#    Milvus: client.upsert()
#    pgvector: UPDATE ... SET embedding = ...
#    Weaviate: client.data_object.update()

# 4. DELETE
#    Pinecone: index.delete()
#    Milvus: client.delete()
#    pgvector: DELETE FROM
#    Weaviate: client.data_object.delete()
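
The semantics behind these calls are similar across products. The sketch below is a hypothetical in-memory store (`InMemoryVectorStore` is not part of any of the libraries above) that shows upsert, query, and delete with an exact cosine scan.

```python
import math

class InMemoryVectorStore:
    """Hypothetical toy store that mirrors the four operations listed above."""

    def __init__(self, dim):
        self.dim = dim
        self._rows = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata=None):
        """Create the row, or overwrite it if the id already exists."""
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self._rows[item_id] = (list(vector), dict(metadata or {}))

    def query(self, vector, top_k=3):
        """Return (id, cosine_distance) pairs, nearest first (exact scan)."""
        scored = [(item_id, self._cosine_distance(vector, v))
                  for item_id, (v, _) in self._rows.items()]
        return sorted(scored, key=lambda pair: pair[1])[:top_k]

    def delete(self, item_id):
        """Remove a row; return True if it existed."""
        return self._rows.pop(item_id, None) is not None

    @staticmethod
    def _cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1 - dot / norm

store = InMemoryVectorStore(dim=3)
store.upsert("doc-1", [1.0, 0.0, 0.0], {"category": "AI"})  # create
store.upsert("doc-2", [0.0, 1.0, 0.0], {"category": "ML"})
store.upsert("doc-1", [0.9, 0.1, 0.0], {"category": "AI"})  # update via upsert
hits = store.query([1.0, 0.0, 0.0], top_k=1)                # read
removed = store.delete("doc-2")                             # delete
print(hits, removed)
```

Treating create and update as one upsert keyed by id is the pattern Pinecone exposes; the other systems offer separate insert and update calls as listed above.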

Vector Database Patterns

Metadata Filtering

# Metadata filtering across databases

# 1. Pinecone (metadata filter)
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        # Range operators such as $gte compare numbers, so the date is stored
        # as a Unix timestamp (1735689600 = 2025-01-01 UTC)
        "date": {"$gte": 1735689600}
    }
)

# 2. Milvus (boolean filter expression)
results = client.search(
    collection_name=collection_name,
    data=[query_vector],
    filter="category in ['AI', 'ML'] and date >= '2025-01-01'",
    limit=10
)

# 3. pgvector (SQL filter; query_vector stands for a bound parameter)
SELECT id, content
FROM documents
WHERE metadata->>'category' = 'AI'
ORDER BY embedding <=> query_vector
LIMIT 10;

# 4. Weaviate (where filter, v3 Python client)
results = client.query.get("Document", ["content", "title"]) \
    .with_where({
        "path": ["category"],
        "operator": "Equal",
        "valueText": "AI"  # valueString on Weaviate versions before 1.19
    }) \
    .with_near_vector({"vector": query_vector}) \
    .with_limit(10) \
    .do()

# Hybrid search (vector + keyword)

# 1. Weaviate (built-in hybrid)
results = client.query.get("Document", ["content"]) \
    .with_hybrid(
        query="machine learning",  # Keyword (BM25) part
        vector=query_vector,       # Vector part
        alpha=0.7,                 # 0 = pure BM25, 1 = pure vector
        properties=["content"]
    ) \
    .with_limit(10) \
    .do()

# 2. pgvector (manual hybrid)
WITH vector_search AS (
    SELECT id, content,
           1 - (embedding <=> query_vector) AS vector_score
    FROM documents
    ORDER BY embedding <=> query_vector
    LIMIT 100
),
keyword_search AS (
    SELECT id, content,
           ts_rank(text_search, query) AS keyword_score
    FROM documents, to_tsquery('english', 'machine & learning') query
    WHERE text_search @@ query
    ORDER BY keyword_score DESC
    LIMIT 100
)
-- FULL OUTER JOIN keeps rows found by only one of the two searches;
-- a missing score counts as 0
SELECT
    COALESCE(v.id, k.id) AS id,
    COALESCE(v.content, k.content) AS content,
    (COALESCE(v.vector_score, 0) * 0.7 + COALESCE(k.keyword_score, 0) * 0.3) AS combined_score
FROM vector_search v
FULL OUTER JOIN keyword_search k ON v.id = k.id
ORDER BY combined_score DESC
LIMIT 10;
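
Vector similarities and keyword scores live on different scales, so a weighted sum is easier to reason about after rescaling. The sketch below shows one common approach (min-max normalization followed by an alpha-weighted blend); it is an illustration with made-up scores, not the fusion method of any particular database, and `hybrid_rank` is a name chosen for this example.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the 0-1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def hybrid_rank(vector_scores, keyword_scores, alpha=0.7):
    """Blend two score sets: alpha=1 is pure vector, alpha=0 is pure keyword."""
    v = min_max(vector_scores)
    kw = min_max(keyword_scores)
    combined = {
        # A document missing from one result set contributes 0 for that part.
        doc: alpha * v.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
        for doc in set(v) | set(kw)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical scores: cosine similarity and a BM25-style keyword score.
vector_scores = {"d1": 0.92, "d2": 0.85, "d3": 0.40}
keyword_scores = {"d2": 7.5, "d3": 3.1, "d4": 9.0}

ranking = hybrid_rank(vector_scores, keyword_scores, alpha=0.7)
for doc, score in ranking:
    print(doc, round(score, 3))
```

Here "d2" ranks first because it scores well in both searches, while "d1" and "d4" each appear in only one result set.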

Vector Database Scaling

Scaling Strategies

Scaling Options:

  • Vertical: Increase RAM, CPU (single node)
  • Horizontal: Add nodes (cluster)
  • Sharding: Split data across nodes
  • Replication: Copy data for high availability
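
Sharding changes how a query runs: each shard returns its own top-k, and a coordinator merges the partial results. The following sketch assumes hash-based routing by id and an exact scan per shard; `shard_for` and `scatter_gather` are hypothetical helper names, and replication is left out.

```python
import heapq
import math
import zlib

def shard_for(item_id, num_shards):
    """Stable hash routing: the same id always lands on the same shard."""
    return zlib.crc32(item_id.encode("utf-8")) % num_shards

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_top_k(shard, query, k):
    """Each shard scans only its own vectors and returns its k best hits."""
    return heapq.nsmallest(k, ((l2(vec, query), item_id) for item_id, vec in shard.items()))

def scatter_gather(shards, query, k):
    """Query every shard, then merge the partial results into a global top-k."""
    partial = [hit for shard in shards for hit in local_top_k(shard, query, k)]
    return heapq.nsmallest(k, partial)

# Hypothetical data: 100 two-dimensional points spread over 4 shards.
num_shards = 4
shards = [{} for _ in range(num_shards)]
for i in range(100):
    item_id = f"vec-{i}"
    shards[shard_for(item_id, num_shards)][item_id] = [i / 10, (i * 7 % 100) / 10]

query = [5.0, 5.0]
results = scatter_gather(shards, query, k=3)
print([len(s) for s in shards])
print(results)
```

Because every shard returns its own k best hits, the merged result matches what a single node holding all the data would return, at the cost of one request per shard.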

Vector Database Performance

Performance Optimization

# Performance optimization strategies

# 1. Choose an appropriate index type
#    - IVFFlat: < 1M vectors
#    - HNSW: > 1M vectors
# 2. Tune index parameters
#    HNSW: M=16-32, efConstruction=64-512
#    IVFFlat (pgvector): lists = rows/1000 up to 1M rows, sqrt(rows) beyond that
# 3. Use an appropriate distance metric
#    - Cosine: Normalized vectors (OpenAI)
#    - Dot product: Faster, requires normalization
#    - L2: Sensitive to vector magnitude; same ranking as cosine on normalized vectors
# 4. Use metadata filters
#    Reduces the search space when applied before or during the vector search
# 5. Batch operations
#    Bulk insert/update for better throughput
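
Whether a metadata filter helps depends on when it is applied. The sketch below contrasts filtering before ranking with filtering after ranking, using an exact scan over made-up rows; the helper names are chosen for this example and do not correspond to a specific database API.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical corpus: (id, vector, metadata). Only every fifth row is "AI".
rows = [
    (i, [float(i), float(i % 7)], {"category": "AI" if i % 5 == 0 else "other"})
    for i in range(50)
]
query, k = [12.0, 3.0], 3

def is_ai(meta):
    return meta["category"] == "AI"

def pre_filter(rows, query, k):
    """Filter first, then rank only the rows that match."""
    matching = [r for r in rows if is_ai(r[2])]
    return [r[0] for r in sorted(matching, key=lambda r: l2(r[1], query))[:k]]

def post_filter(rows, query, k):
    """Rank everything, keep the top k, then drop rows that fail the filter."""
    top = sorted(rows, key=lambda r: l2(r[1], query))[:k]
    return [r[0] for r in top if is_ai(r[2])]

print("pre-filter: ", pre_filter(rows, query, k))
print("post-filter:", post_filter(rows, query, k))
```

Pre-filtering ranks a fifth of the rows and still returns k results. Post-filtering ranks all rows and may return fewer than k, because matches outside the unfiltered top k are never considered.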

Vector Database Cost

Cost Comparison

| Database | Deployment | Cost | Notes |
|----------|------------|------|-------|
| Pinecone | Cloud | $0.10-1.00/hour | Pay-per-use |
| Milvus | Self-hosted | Free (hardware) | Open source |
| pgvector | Self-hosted | Free (hardware) | Postgres extension |
| Weaviate | Self-hosted | Free (hardware) | Open source |
| Weaviate Cloud | Cloud | $0-25/month | Free tier available |

Cost Optimization

# Cost optimization strategies

# 1. Use self-hosted for cost savings
#    Milvus, pgvector, and Weaviate have no license fee; you pay for hardware and operations
# 2. Use an appropriate index type
#    HNSW for best performance/cost ratio
# 3. Delete old vectors
#    Implement retention policies
# 4. Use dimensionality reduction
#    PCA, UMAP to reduce dimensions
# 5. Use partitioning
#    Partition by date, category for better performance
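
Memory is usually the main cost driver, and it scales linearly with both vector count and dimensions. The estimate below is a back-of-the-envelope sketch: it assumes float32 values and roughly 2*M four-byte links per vector for an HNSW base layer, and it ignores metadata, upper graph layers, and engine-specific overhead.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Storage for the vectors alone (float32 by default)."""
    return num_vectors * dims * bytes_per_value

def hnsw_link_bytes(num_vectors, M=16, bytes_per_link=4):
    """Rough graph overhead, assuming about 2*M links per vector on the base layer."""
    return num_vectors * 2 * M * bytes_per_link

def to_gib(num_bytes):
    return num_bytes / 2**30

num_vectors = 1_000_000
for dims in (3072, 1536, 384):
    total = raw_vector_bytes(num_vectors, dims) + hnsw_link_bytes(num_vectors)
    print(f"{dims:>4} dims: {to_gib(total):6.2f} GiB")
```

Under these assumptions, reducing 1536-dimensional vectors to 384 dimensions cuts raw vector storage by a factor of four, which is why dimensionality reduction appears in the list above.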



Key Takeaways

  1. Vector embeddings: High-dimensional representations of data
  2. Similarity search: Find nearest neighbors in vector space
  3. ANN algorithms: HNSW, IVF for fast approximate search
  4. Distance metrics: Cosine, dot product, Euclidean
  5. Metadata filtering: Pre-filter for better performance
  6. Hybrid search: Vector + keyword for best results
  7. Scaling: Vertical (single node) vs. horizontal (cluster)
  8. Use cases: Semantic search, RAG, recommendations, anomaly detection
