
Pinecone Guide

Vector Database for AI/ML Applications


Overview

Pinecone is a managed vector database optimized for machine learning applications. It provides efficient similarity search, filtering, and metadata storage, making it ideal for RAG (Retrieval-Augmented Generation), semantic search, and recommendation systems.


Pinecone Architecture

Vector Search Process

Key Components:

  • Index: Vector database storing embeddings
  • Embedding Model: Converts text to vectors (OpenAI, Cohere, custom)
  • Vector Search: ANN (Approximate Nearest Neighbor) algorithm
  • Metadata: Structured data filtering and re-ranking
  • API: REST and gRPC interfaces
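
These components come together in a short loop: embed content, upsert the vectors, then embed the query and run an ANN search with a metadata filter. A minimal sketch of that flow, assuming an index named "ml-documents" already exists and that API keys are placeholders (each step is covered in detail in the sections below):

from pinecone import Pinecone
from openai import OpenAI

pc = Pinecone(api_key="your-pinecone-api-key")   # Pinecone client (API)
oai = OpenAI(api_key="your-openai-api-key")      # Embedding model client
index = pc.Index("ml-documents")                 # Index storing the embeddings

def embed(text: str) -> list[float]:
    # Convert text to a 1536-dimensional vector with OpenAI
    return oai.embeddings.create(model="text-embedding-ada-002",
                                 input=text).data[0].embedding

# Store a document vector with metadata
index.upsert(vectors=[{"id": "doc1",
                       "values": embed("Pinecone stores embeddings."),
                       "metadata": {"category": "AI"}}])

# ANN search, pre-filtered by metadata
matches = index.query(vector=embed("What stores embeddings?"), top_k=3,
                      filter={"category": {"$eq": "AI"}}, include_metadata=True)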

Pinecone Setup

Index Creation

import pinecone
from pinecone import ServerlessSpec

# Initialize Pinecone client
pc = pinecone.Pinecone(api_key="your-api-key")

# Create serverless index (recommended)
index_name = "ml-documents"
pc.create_index(
    name=index_name,
    dimension=1536,          # Matches OpenAI text-embedding-ada-002
    metric="cosine",         # Options: cosine, euclidean, dotproduct
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1"
    )
)

# Benefits:
# - Serverless: no capacity planning, auto-scaling
# - Pay-per-use: only pay for queries and storage
# - Managed: no infrastructure to maintain
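
A newly created index may take a moment to become ready. A common pattern with the v3+ Python client is to skip creation if the index already exists and poll its status before upserting or querying; a small sketch:

import time

# Create the index only if it does not exist yet
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

# Wait until the index reports ready before using it
while not pc.describe_index(index_name).status["ready"]:
    time.sleep(1)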

Pod-Based Index

# Create pod-based index (for high, predictable throughput)
pc.create_index(
    name="ml-documents-pod",
    dimension=1536,
    metric="cosine",
    spec=pinecone.PodSpec(
        environment="us-east-1-aws",
        pod_type="p1.x1",   # Pod type (p1/p2) and size (x1-x8)
        pods=3,             # Total number of pods
        replicas=1          # Replicas for availability and read throughput
    )
)
# Pod types:
# p1: performance-optimized, low query latency
# p2: higher query throughput
# x1-x8: pod size; capacity scales with the size
# Use cases:
# - Serverless: development, variable workloads
# - Pod-based: production, high throughput, low latency
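
Pod-based indexes can be resized after creation, either by adding replicas or by moving to a larger pod size within the same family. A hedged sketch using the client's configure_index call (the replica count and pod size are illustrative):

# Scale read throughput by adding replicas
pc.configure_index("ml-documents-pod", replicas=3)

# Move to a larger pod size for more capacity (vertical scaling)
pc.configure_index("ml-documents-pod", pod_type="p1.x2")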

Pinecone Operations

Upsert Vectors

# Upsert: insert or update vectors
# Generate embeddings (using OpenAI)
from openai import OpenAI

openai_client = OpenAI()

def embed_text(text):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=text
    )
    return response.data[0].embedding

# Prepare data
documents = [
    {
        "id": "doc1",
        "text": "Machine learning is a subset of artificial intelligence.",
        "metadata": {"category": "AI", "author": "John Doe", "date": "2025-01-27"}
    },
    {
        "id": "doc2",
        "text": "Data engineering involves building data pipelines.",
        "metadata": {"category": "DE", "author": "Jane Smith", "date": "2025-01-26"}
    },
    {
        "id": "doc3",
        "text": "Cloud computing provides on-demand resources.",
        "metadata": {"category": "Cloud", "author": "Bob Johnson", "date": "2025-01-25"}
    }
]

# Create embeddings
vectors_to_upsert = []
for doc in documents:
    vector = embed_text(doc['text'])
    vectors_to_upsert.append({
        "id": doc['id'],
        "values": vector,
        "metadata": doc['metadata']
    })

# Connect to index and upsert
index = pc.Index(index_name)
index.upsert(vectors=vectors_to_upsert, namespace="production")
print(f"Upserted {len(vectors_to_upsert)} vectors")

Query with Metadata Filtering

# Query with metadata filter
# Generate query embedding
query = "What is machine learning?"
query_vector = embed_text(query)

# Query Pinecone
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=5,                 # Return the top 5 results
    include_metadata=True,
    include_values=False,
    filter={
        "category": {"$eq": "AI"}   # Only the AI category
    }
)

# Process results
for match in results['matches']:
    score = match['score']          # Similarity score
    metadata = match['metadata']
    text_id = match['id']
    print(f"Score: {score:.4f}, Category: {metadata['category']}, "
          f"Author: {metadata['author']}, ID: {text_id}")

Advanced Filtering

# Complex metadata filters
# Filter by multiple conditions
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={
        "$and": [
            {"category": {"$in": ["AI", "ML"]}},   # AI or ML category
            {"date": {"$gte": "2025-01-01"}},      # Recent documents (see note below)
            {"author": {"$ne": "John Doe"}}        # Exclude a specific author
        ]
    }
)
# Filter operators:
# $eq: equal to
# $ne: not equal to
# $gt / $gte: greater than / greater than or equal
# $lt / $lte: less than / less than or equal
# $in / $nin: in list / not in list
# $exists: field exists
# $and / $or: logical AND / OR
# Note: the range operators ($gt, $gte, $lt, $lte) compare numbers, so dates
# should be stored as numeric values (e.g., Unix timestamps) for range filters.
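
Because range operators compare numbers, one practical pattern is to store a numeric timestamp next to the human-readable date and filter on that field. A minimal sketch; the date_ts field and to_timestamp helper are illustrative, not part of the guide's schema:

from datetime import datetime, timezone

def to_timestamp(date_str: str) -> int:
    """Convert an ISO date string (YYYY-MM-DD) to a Unix timestamp."""
    return int(datetime.strptime(date_str, "%Y-%m-%d")
               .replace(tzinfo=timezone.utc).timestamp())

# Upsert with a numeric date field alongside the human-readable one
index.upsert(
    vectors=[{
        "id": "doc4",
        "values": embed_text("Vector databases index embeddings for similarity search."),
        "metadata": {"category": "AI", "date": "2025-01-20",
                     "date_ts": to_timestamp("2025-01-20")}
    }],
    namespace="production"
)

# Range-filter on the numeric field
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={"date_ts": {"$gte": to_timestamp("2025-01-01")}}
)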

Pinecone for RAG

RAG Architecture

RAG Implementation

# RAG pipeline with Pinecone
# (reuses the OpenAI client and embed_text() defined above)
def rag_pipeline(question: str) -> str:
    # 1. Embed the question
    query_vector = embed_text(question)

    # 2. Search Pinecone for relevant documents
    results = index.query(
        namespace="production",
        vector=query_vector,
        top_k=5,
        include_metadata=True,
        include_values=False
    )

    # 3. Retrieve full documents from the source database
    context_docs = []
    for match in results['matches']:
        doc_id = match['id']
        # retrieve_document() fetches the full text from your document store
        # (PostgreSQL, MongoDB, etc.)
        doc = retrieve_document(doc_id)
        context_docs.append(doc['text'])

    # 4. Augment the prompt with context
    context = "\n".join(f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs))
    augmented_prompt = f"""
Context:
{context}

Question: {question}
Answer:
"""

    # 5. Generate an answer with the LLM
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": augmented_prompt}
        ],
        temperature=0.7,
        max_tokens=500
    )
    answer = response.choices[0].message.content

    # 6. Add citations
    citations = [match['id'] for match in results['matches']]
    return f"{answer}\n\nCitations: {citations}"

Pinecone Performance

Index Configuration

# Optimize index for performance
# 1. Choose appropriate metric
# Cosine: Normalized vectors (recommended for OpenAI embeddings)
# Dot product: Normalized vectors, faster than cosine
# Euclidean: Not recommended for high-dimensional vectors
# 2. Set appropriate top_k
# Higher top_k = Slower queries, more comprehensive results
# Lower top_k = Faster queries, narrower results
# 3. Use namespace for multi-tenancy
# Separate namespaces per environment (dev, staging, prod)
# 4. Configure metadata filtering
# Filters reduce search space, improve query performance
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        "date": {"$gte": "2025-01-01"}
    }
)
# 5. Use batch operations for bulk upserts
index.upsert(vectors=vectors, batch_size=100)

Query Performance

# Optimize query performance

# 1. Run independent queries in parallel
#    index.query() is blocking, so dispatch it to a thread pool with asyncio.to_thread
import asyncio

def query_pinecone(query):
    return index.query(
        vector=embed_text(query),
        top_k=5
    )

async def parallel_queries(queries):
    tasks = [asyncio.to_thread(query_pinecone, q) for q in queries]
    return await asyncio.gather(*tasks)

# 2. Reuse the Pinecone client (connection pooling)
#    Initialize once, reuse across requests
# 3. Use an appropriate pod type
#    p1.x1: fastest (lowest latency); p2.x8: highest throughput
# 4. Monitor performance metrics
#    Query latency (P50/P99), throughput, error rate
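
A short usage sketch for the parallel_queries helper above, fanning several questions out concurrently:

# Example: run several queries concurrently
queries = [
    "What is machine learning?",
    "How do data pipelines work?",
    "What is cloud computing?"
]
all_results = asyncio.run(parallel_queries(queries))
for q, res in zip(queries, all_results):
    print(q, "->", [m['id'] for m in res['matches']])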

Pinecone Cost Optimization

Pricing Model

| Component  | Pricing                     | Notes     |
|------------|-----------------------------|-----------|
| Serverless | $0.10 per 1M vectors stored | Storage   |
| Serverless | $0.10 per 1M queries        | Queries   |
| Pod-based  | $70/pod/month (p1.x1)       | Compute   |
| Pod-based  | $0.20 per 1M queries        | Queries   |
| Storage    | Included with pod           | 1 GB/pod  |

Cost Optimization

# Cost optimization strategies
# 1. Use serverless for development
# Only pay for what you use
# 2. Use pod-based for production
# Better for high throughput, predictable costs
# 3. Optimize vector dimensionality
# Lower dimensions = Lower cost
# Use dimensionality reduction:
# - PCA: 1536 → 512 dimensions
# - UMAP: 1536 → 256 dimensions
# 4. Delete unused vectors
# Delete old vectors to reduce storage costs
# 5. Use namespaces for multi-tenancy
# Separate environments, isolated data
# 6. Batch operations for bulk upserts
# Reduce API calls, improve efficiency
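
One way to act on the dimensionality-reduction point above is to fit a PCA model on an existing embedding corpus and store the reduced vectors in an index created with the smaller dimension. A sketch with scikit-learn; the 512-dimension target and the all_document_embeddings corpus are placeholders:

import numpy as np
from sklearn.decomposition import PCA

# all_document_embeddings: shape (n_docs, 1536); PCA needs n_docs >= target dimension
all_embeddings = np.array(all_document_embeddings)

# Reduce 1536 -> 512 dimensions (target size is illustrative)
pca = PCA(n_components=512)
reduced = pca.fit_transform(all_embeddings)

# Upsert `reduced` into an index created with dimension=512, and transform
# query vectors with the same fitted PCA model:
#   query_512 = pca.transform([embed_text(query)])[0]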

Pinecone Security

Access Control

# Pinecone API key management
# 1. Use environment variables
import os
pinecone_api_key = os.environ.get("PINECONE_API_KEY")
# 2. Create API keys with restrictions
# In Pinecone Cloud: Create restricted API keys
# - Read-only keys for queries
# - Write keys for upserts
# - Namespace-specific keys
# 3. Use IAM roles (for AWS-based Pinecone)
# Least privilege access
# 4. Enable audit logging
# Track all API calls and operations
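
A minimal sketch of loading the key from the environment instead of hard-coding it, failing fast if it is missing:

import os
from pinecone import Pinecone

api_key = os.environ.get("PINECONE_API_KEY")
if not api_key:
    raise RuntimeError("PINECONE_API_KEY is not set")

pc = Pinecone(api_key=api_key)
index = pc.Index("ml-documents")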

Data Encryption

# Encryption at rest (automatic)
# Pinecone automatically encrypts all data at rest
# Encryption in transit
# All API calls go over HTTPS/TLS by default
pc = pinecone.Pinecone(api_key="your-api-key")
# Data privacy
# - Remove PII before embedding
# - Anonymize sensitive information
# - Use data masking
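
For the "remove PII before embedding" point, a deliberately rough regex-based scrubber is sketched below; production systems typically rely on a dedicated PII-detection library, and scrub_pii is a hypothetical helper:

import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub_pii(text: str) -> str:
    # Replace obvious emails and phone numbers before the text is embedded
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

safe_text = scrub_pii("Contact Jane at jane.doe@example.com or 555-123-4567.")
vector = embed_text(safe_text)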

Pinecone Monitoring

Metrics

# Key metrics to monitor
pinecone_metrics:
  - name: "Query latency (P50)"
    metric: pinecone_query_latency_p50
    alert: "If > 100ms"
  - name: "Query latency (P99)"
    metric: pinecone_query_latency_p99
    alert: "If > 500ms"
  - name: "Index size"
    metric: pinecone_index_size
    alert: "If > 10M vectors"
  - name: "Query rate"
    metric: pinecone_queries_per_second
    alert: "If > 1000 qps"
  - name: "Error rate"
    metric: pinecone_error_rate
    alert: "If > 1%"

Pinecone Best Practices

DO

# 1. Use appropriate metric
# Cosine similarity for OpenAI embeddings
# 2. Set top_k appropriately
# Top 5-10 for RAG, Top 100 for broader search
# 3. Use metadata filtering
# Pre-filter results for better performance
# 4. Namespace separation
# Separate dev/staging/prod namespaces
# 5. Monitor performance
# Track query latency and throughput

DON’T

# 1. Don't ignore index capacity
# Plan for growth (serverless or pod-based)
# 2. Don't use Euclidean metric
# Not suitable for high-dimensional vectors
# 3. Don't forget to delete old vectors
# Storage costs accumulate
# 4. Don't ignore metadata
# Metadata filtering improves performance
# 5. Don't embed everything
# Only embed searchable content
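
For the "delete old vectors" point, removal by ID keeps storage in check; deleting by metadata filter is also possible on pod-based indexes. A sketch reusing the illustrative date_ts field and to_timestamp helper from the filtering section (the ID list is illustrative):

# Delete specific vectors by ID
index.delete(ids=["doc1", "doc2"], namespace="production")

# On pod-based indexes, vectors can also be deleted by metadata filter
# (serverless indexes do not support delete-by-filter)
index.delete(filter={"date_ts": {"$lt": to_timestamp("2024-01-01")}},
             namespace="production")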

Pinecone vs. Alternatives

| Feature   | Pinecone             | Weaviate                    | Milvus               | pgvector                  |
|-----------|----------------------|-----------------------------|----------------------|---------------------------|
| Managed   | Yes                  | Yes                         | Yes (optional)       | No                        |
| Cloud     | AWS, GCP, Azure      | Any                         | Any                  | Self-hosted               |
| Pricing   | Pay-per-use          | Free tier, then usage-based | Self-hosted          | Free (Postgres extension) |
| Scaling   | Serverless/pod       | Auto-scaling                | Auto-scaling         | Manual                    |
| Filtering | Metadata filters     | Schema filters              | Filters              | SQL filters               |
| Best For  | RAG, semantic search | Knowledge graphs            | Open-source, on-prem | Small-scale, self-hosted  |

Key Takeaways

  1. Managed service: No infrastructure management
  2. Serverless: Pay-per-use, auto-scaling
  3. Vector search: ANN (Approximate Nearest Neighbor)
  4. Metadata filtering: Pre-filter for better performance
  5. RAG integration: Ideal for retrieval-augmented generation
  6. Cost optimization: Serverless for dev, pod-based for prod
  7. Security: API keys, encryption, audit logging
  8. Use When: RAG, semantic search, recommendations

Back to Module 5