Pinecone Guide
Vector Database for AI/ML Applications
Overview
Pinecone is a managed vector database optimized for machine learning applications. It provides efficient similarity search, filtering, and metadata storage, making it ideal for RAG (Retrieval-Augmented Generation), semantic search, and recommendation systems.
Pinecone Architecture
Vector Search Process
Key Components:
- Index: Vector database storing embeddings
- Embedding Model: Converts text to vectors (OpenAI, Cohere, custom)
- Vector Search: ANN (Approximate Nearest Neighbor) algorithm
- Metadata: Structured data filtering and re-ranking
- API: REST and gRPC interfaces
Pinecone Setup
Index Creation
import pinecone
from pinecone import ServerlessSpec

# Initialize Pinecone client
pc = pinecone.Pinecone(api_key="your-api-key")

# Create index
index_name = "ml-documents"

# Create serverless index (recommended)
pc.create_index(
    name=index_name,
    dimension=1536,    # OpenAI text-embedding-ada-002
    metric="cosine",   # Options: cosine, euclidean, dotproduct
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1",
    ),
)
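Index creation is asynchronous, so reads and writes can fail until the index is live. A minimal polling sketch using describe_index, which exposes a ready flag in its status:

import time

# Poll until the new index reports ready before reading or writing
while not pc.describe_index(index_name).status["ready"]:
    time.sleep(1)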
# Benefits:
# - Serverless: No capacity planning, auto-scaling
# - Pay-per-use: Only pay for queries and storage
# - Managed: No infrastructure to maintain

Pod-Based Index
# Create pod-based index (for high throughput)
pc.create_index(
    name="ml-documents-pod",
    dimension=1536,
    metric="cosine",
    spec=pinecone.PodSpec(
        environment="us-east-1-aws",
        pod_type="p1.x1",   # p1.x1 (fastest) to p2.x8 (largest)
        pods=3,             # Number of pods
        replicas=1,         # Replicas per pod
        shards=1,           # Number of shards
    ),
)
# Pod types:
# p1.x1: Fastest, lowest latency
# p1.x2: Balanced performance
# p2.x1: High throughput
# p2.x8: Maximum throughput

# Use cases:
# - Serverless: Development, variable workloads
# - Pod-based: Production, high throughput, low latency

Pinecone Operations
Upsert Vectors
# Upsert: Insert or update vectors
# Generate embeddings (using OpenAI)
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_text(text):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding
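The embeddings endpoint also accepts a list of inputs, which cuts API round-trips during bulk loads. A small helper sketch (the batch size of 100 is an arbitrary choice, not an API limit):

def embed_texts(texts, batch_size=100):
    """Embed many texts, batching requests to reduce API round-trips."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-ada-002",
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors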
# Prepare data
documents = [
    {
        "id": "doc1",
        "text": "Machine learning is a subset of artificial intelligence.",
        "metadata": {"category": "AI", "author": "John Doe", "date": "2025-01-27"},
    },
    {
        "id": "doc2",
        "text": "Data engineering involves building data pipelines.",
        "metadata": {"category": "DE", "author": "Jane Smith", "date": "2025-01-26"},
    },
    {
        "id": "doc3",
        "text": "Cloud computing provides on-demand resources.",
        "metadata": {"category": "Cloud", "author": "Bob Johnson", "date": "2025-01-25"},
    },
]

# Create embeddings
vectors_to_upsert = []
for doc in documents:
    vector = embed_text(doc["text"])
    vectors_to_upsert.append({
        "id": doc["id"],
        "values": vector,
        "metadata": doc["metadata"],
    })

# Connect to index
index = pc.Index(index_name)

# Upsert vectors
index.upsert(vectors=vectors_to_upsert, namespace="production")

print(f"Upserted {len(vectors_to_upsert)} vectors")

Query with Metadata Filtering
# Query with metadata filter
# Generate query embedding
query = "What is machine learning?"
query_vector = embed_text(query)

# Query Pinecone
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=5,                 # Return top 5 results
    include_metadata=True,
    include_values=False,
    filter={
        "category": {"$eq": "AI"}  # Only AI category
    },
)

# Process results
for match in results["matches"]:
    score = match["score"]        # Similarity score (0-1)
    metadata = match["metadata"]
    text_id = match["id"]
    print(f"Score: {score:.4f}, Category: {metadata['category']}, "
          f"Author: {metadata['author']}, ID: {text_id}")

Advanced Filtering
# Complex metadata filters
# Filter by multiple conditions
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={
        "$and": [
            {"category": {"$in": ["AI", "ML"]}},   # AI or ML category
            {"date": {"$gte": "2025-01-01"}},      # Recent documents
            {"author": {"$ne": "John Doe"}},       # Exclude specific author
        ]
    },
)
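The same operators compose disjunctively with $or. A sketch that matches documents satisfying either of two conditions:

# Match documents that are either Cloud-category or by Jane Smith
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={
        "$or": [
            {"category": {"$eq": "Cloud"}},
            {"author": {"$eq": "Jane Smith"}},
        ]
    },
)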
# Filter operators:
# $eq: Equal to
# $ne: Not equal to
# $gt: Greater than
# $gte: Greater than or equal
# $lt: Less than
# $lte: Less than or equal
# $in: In list
# $nin: Not in list
# $exists: Field exists
# $and: Logical AND
# $or: Logical OR

Pinecone for RAG
RAG Architecture
RAG Implementation
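The pipeline below calls a retrieve_document helper that is not part of Pinecone; it stands in for whatever document store holds the full text. A hypothetical dict-backed stub, reusing the documents list from above:

# Hypothetical document store keyed by ID; a production system would
# query a real database (PostgreSQL, MongoDB, etc.) instead
DOCUMENT_STORE = {doc["id"]: doc for doc in documents}

def retrieve_document(doc_id):
    return DOCUMENT_STORE[doc_id]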
# RAG pipeline with Pinecone (reuses the OpenAI client and embed_text from above)
def rag_pipeline(question: str) -> str:
    # 1. Embed question
    query_vector = embed_text(question)

    # 2. Search Pinecone for relevant documents
    results = index.query(
        namespace="production",
        vector=query_vector,
        top_k=5,
        include_metadata=True,
        include_values=False,
    )

    # 3. Retrieve full documents from database
    context_docs = []
    for match in results["matches"]:
        doc_id = match["id"]
        # Retrieve from database (PostgreSQL, MongoDB, etc.)
        doc = retrieve_document(doc_id)
        context_docs.append(doc["text"])

    # 4. Augment prompt with context
    context = "\n".join(
        f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)
    )

    augmented_prompt = f"""
Context:
{context}

Question: {question}

Answer:
"""

    # 5. Generate answer with LLM
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": augmented_prompt},
        ],
        temperature=0.7,
        max_tokens=500,
    )
    answer = response.choices[0].message.content

    # 6. Add citations
    citations = [match["id"] for match in results["matches"]]

    return f"{answer}\n\nCitations: {citations}"
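A quick end-to-end smoke test of the pipeline:

# Prints the generated answer followed by its citation IDs
print(rag_pipeline("What is machine learning?"))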
Pinecone Performance

Index Configuration
# Optimize index for performance
# 1. Choose appropriate metric
# Cosine: Normalized vectors (recommended for OpenAI embeddings)
# Dot product: Normalized vectors, faster than cosine
# Euclidean: Not recommended for high-dimensional vectors

# 2. Set appropriate top_k
# Higher top_k = Slower queries, more comprehensive results
# Lower top_k = Faster queries, narrower results

# 3. Use namespaces for multi-tenancy
# Separate namespaces per environment (dev, staging, prod)

# 4. Configure metadata filtering
# Filters reduce search space, improve query performance
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        "date": {"$gte": "2025-01-01"},
    },
)

# 5. Use batch operations for bulk upserts
index.upsert(vectors=vectors, batch_size=100)

Query Performance
# Optimize query performance
# 1. Use parallel queries for multiple questions
import asyncio

async def parallel_queries(queries):
    tasks = [asyncio.create_task(query_pinecone(q)) for q in queries]
    return await asyncio.gather(*tasks)

async def query_pinecone(query):
    # index.query is a blocking call; run it in a worker thread so that
    # asyncio.gather can actually overlap the queries
    return await asyncio.to_thread(
        lambda: index.query(vector=embed_text(query), top_k=5)
    )
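Driving the async helpers from a synchronous entry point:

# Run several searches concurrently
all_results = asyncio.run(parallel_queries([
    "What is machine learning?",
    "What is data engineering?",
]))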
# 2. Reuse Pinecone client (connection pooling)
# Initialize once, reuse across requests

# 3. Use appropriate pod type
# p1.x1: Fastest (lowest latency)
# p2.x8: Highest throughput

# 4. Monitor performance metrics
# Query latency, P99 latency, throughput

Pinecone Cost Optimization
Pricing Model
| Component | Pricing | Notes |
|---|---|---|
| Serverless storage | $0.10 per 1M vectors stored | Storage |
| Serverless queries | $0.10 per 1M queries | Queries |
| Pod compute | $70/pod/month (p1.x1) | Compute |
| Pod queries | $0.20 per 1M queries | Queries |
| Pod storage | Included with pod | 1GB/pod |
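A back-of-the-envelope estimate using the illustrative serverless rates above (verify against current Pinecone pricing; the workload numbers are assumptions):

# Hypothetical workload: 5M vectors stored, 20M queries per month
STORED_VECTORS = 5_000_000
MONTHLY_QUERIES = 20_000_000

STORAGE_RATE = 0.10 / 1_000_000   # $ per vector stored (table above)
QUERY_RATE = 0.10 / 1_000_000     # $ per query (table above)

monthly_cost = STORED_VECTORS * STORAGE_RATE + MONTHLY_QUERIES * QUERY_RATE
print(f"Estimated serverless cost: ${monthly_cost:.2f}/month")  # $2.50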
Cost Optimization
# Cost optimization strategies
# 1. Use serverless for development
# Only pay for what you use

# 2. Use pod-based for production
# Better for high throughput, predictable costs

# 3. Optimize vector dimensionality
# Lower dimensions = Lower cost
# Use dimensionality reduction (see the PCA sketch below):
# - PCA: 1536 → 512 dimensions
# - UMAP: 1536 → 256 dimensions
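A minimal sketch of the PCA route with scikit-learn. It assumes an existing corpus_embeddings array of shape (n_docs, 1536) with at least 512 rows (PCA cannot fit more components than samples), and the target index must be created with dimension=512:

import numpy as np
from sklearn.decomposition import PCA

# Assumed: corpus_embeddings is an (n_docs, 1536) array of existing
# OpenAI embeddings, with n_docs >= 512
pca = PCA(n_components=512)                     # 1536 -> 512 dimensions
reduced_embeddings = pca.fit_transform(corpus_embeddings)

# Query vectors must be projected with the SAME fitted PCA before search
query_512 = pca.transform(np.asarray([embed_text("What is ML?")]))[0]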
# 4. Delete unused vectors
# Delete old vectors to reduce storage costs

# 5. Use namespaces for multi-tenancy
# Separate environments, isolated data

# 6. Batch operations for bulk upserts
# Reduce API calls, improve efficiency

Pinecone Security
Access Control
# Pinecone API key management
# 1. Use environment variables
import os

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

# 2. Create API keys with restrictions
# In Pinecone Cloud: Create restricted API keys
# - Read-only keys for queries
# - Write keys for upserts
# - Namespace-specific keys

# 3. Use IAM roles (for AWS-based Pinecone)
# Least privilege access

# 4. Enable audit logging
# Track all API calls and operations

Data Encryption
# Encryption at rest (automatic)
# Pinecone automatically encrypts all data at rest

# Encryption in transit
# The Python client talks to Pinecone over HTTPS by default
pc = pinecone.Pinecone(api_key="your-api-key")

# Data privacy
# - Remove PII before embedding (see the masking sketch below)
# - Anonymize sensitive information
# - Use data masking
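A minimal regex-based masking sketch; the patterns are illustrative and far from exhaustive, so production pipelines typically use a dedicated PII-detection library:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and US-style phone numbers before embedding."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

vector = embed_text(mask_pii("Contact john@example.com or 555-123-4567"))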
Pinecone Monitoring

Metrics
# Key metrics to monitor
pinecone_metrics:
  - name: "Query latency (P50)"
    metric: pinecone_query_latency_p50
    alert: "If > 100ms"
  - name: "Query latency (P99)"
    metric: pinecone_query_latency_p99
    alert: "If > 500ms"
  - name: "Index size"
    metric: pinecone_index_size
    alert: "If > 10M vectors"
  - name: "Query rate"
    metric: pinecone_queries_per_second
    alert: "If > 1000 qps"
  - name: "Error rate"
    metric: pinecone_error_rate
    alert: "If > 1%"
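Pinecone does not push these metrics anywhere by itself; the metric names above are illustrative. A client-side sketch that samples query latency and computes P50/P99:

import statistics
import time

def sample_query_latencies(n=100):
    """Time n identical queries and report P50/P99 latency in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        index.query(vector=query_vector, top_k=5, namespace="production")
        latencies.append((time.perf_counter() - start) * 1000)
    quantiles = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": quantiles[49], "p99": quantiles[98]}

print(sample_query_latencies())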
Pinecone Best Practices

DO
# 1. Use appropriate metric
# Cosine similarity for OpenAI embeddings

# 2. Set top_k appropriately
# Top 5-10 for RAG, top 100 for broader search

# 3. Use metadata filtering
# Pre-filter results for better performance

# 4. Namespace separation
# Separate dev/staging/prod namespaces

# 5. Monitor performance
# Track query latency and throughput

DON’T
# 1. Don't ignore index capacity
# Plan for growth (serverless or pod-based)

# 2. Don't use Euclidean metric
# Not suitable for high-dimensional vectors

# 3. Don't forget to delete old vectors
# Storage costs accumulate

# 4. Don't ignore metadata
# Metadata filtering improves performance

# 5. Don't embed everything
# Only embed searchable content

Pinecone vs. Alternatives
| Feature | Pinecone | Weaviate | Milvus | pgvector |
|---|---|---|---|---|
| Managed | Yes | Yes | Yes (optional) | No |
| Cloud | AWS, GCP, Azure | Any | Any | Self-hosted |
| Pricing | Pay-per-use | Free tier, then usage-based | Free (self-hosted) | Free (Postgres extension) |
| Scaling | Serverless/pod | Auto-scaling | Auto-scaling | Manual |
| Filtering | Metadata filters | Schema filters | Filters | SQL filters |
| Best For | RAG, semantic search | Knowledge graphs | Open-source, on-prem | Small-scale, self-hosted |
Key Takeaways
- Managed service: No infrastructure management
- Serverless: Pay-per-use, auto-scaling
- Vector search: ANN (Approximate Nearest Neighbor)
- Metadata filtering: Pre-filter for better performance
- RAG integration: Ideal for retrieval-augmented generation
- Cost optimization: Serverless for dev, pod-based for prod
- Security: API keys, encryption, audit logging
- Use When: RAG, semantic search, recommendations
Back to Module 5