Pinecone Guide
Vector Database for AI/ML Applications
Overview
Pinecone is a managed vector database optimized for machine learning applications. It provides efficient similarity search, filtering, and metadata storage, making it ideal for RAG (Retrieval-Augmented Generation), semantic search, and recommendation systems.
Pinecone Architecture
Vector Search Process
Key Components:
- Index: Vector database storing embeddings
- Embedding Model: Converts text to vectors (OpenAI, Cohere, custom)
- Vector Search: ANN (Approximate Nearest Neighbor) algorithm
- Metadata: Structured data filtering and re-ranking
- API: REST and gRPC interfaces
Pinecone Setup
Index Creation
import pinecone
from pinecone import ServerlessSpec

# Initialize Pinecone client
pc = pinecone.Pinecone(api_key="your-api-key")

# Create index
index_name = "ml-documents"

# Create serverless index (recommended)
pc.create_index(
    name=index_name,
    dimension=1536,    # OpenAI text-embedding-ada-002
    metric="cosine",   # Options: cosine, euclidean, dotproduct
    spec=ServerlessSpec(
        cloud="aws",
        region="us-east-1",
    ),
)
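Index creation is asynchronous, so reads and writes can fail until the index is live. A minimal polling sketch using describe_index, which exposes a ready flag in its status:

import time

# Poll until the new index reports ready before reading or writing
while not pc.describe_index(index_name).status["ready"]:
    time.sleep(1)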
# Benefits:
# - Serverless: No capacity planning, auto-scaling
# - Pay-per-use: Only pay for queries and storage
# - Managed: No infrastructure to maintain

Pod-Based Index
# Create pod-based index (for high throughput)
pc.create_index(
    name="ml-documents-pod",
    dimension=1536,
    metric="cosine",
    spec=pinecone.PodSpec(
        environment="us-east-1-aws",
        pod_type="p1.x1",   # p1.x1 (fastest) to p2.x8 (largest)
        pods=3,             # Number of pods
        replicas=1,         # Replicas per pod
        shards=1,           # Number of shards
    ),
)
# Pod types:
# p1.x1: Fastest, lowest latency
# p1.x2: Balanced performance
# p2.x1: High throughput
# p2.x8: Maximum throughput

# Use cases:
# - Serverless: Development, variable workloads
# - Pod-based: Production, high throughput, low latency

Pinecone Operations
Upsert Vectors
# Upsert: Insert or update vectors
# Generate embeddings (using OpenAI)
from openai import OpenAI

openai_client = OpenAI()  # Reads OPENAI_API_KEY from the environment

def embed_text(text):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding
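The embeddings endpoint also accepts a list of inputs, which cuts API round-trips during bulk loads. A small helper sketch (the batch size of 100 is an arbitrary choice, not an API limit):

def embed_texts(texts, batch_size=100):
    """Embed many texts, batching requests to reduce API round-trips."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        response = openai_client.embeddings.create(
            model="text-embedding-ada-002",
            input=batch,
        )
        vectors.extend(item.embedding for item in response.data)
    return vectors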
# Prepare data
documents = [
    {
        "id": "doc1",
        "text": "Machine learning is a subset of artificial intelligence.",
        "metadata": {"category": "AI", "author": "John Doe", "date": "2025-01-27"},
    },
    {
        "id": "doc2",
        "text": "Data engineering involves building data pipelines.",
        "metadata": {"category": "DE", "author": "Jane Smith", "date": "2025-01-26"},
    },
    {
        "id": "doc3",
        "text": "Cloud computing provides on-demand resources.",
        "metadata": {"category": "Cloud", "author": "Bob Johnson", "date": "2025-01-25"},
    },
]

# Create embeddings
vectors_to_upsert = []
for doc in documents:
    vector = embed_text(doc["text"])
    vectors_to_upsert.append({
        "id": doc["id"],
        "values": vector,
        "metadata": doc["metadata"],
    })

# Connect to index
index = pc.Index(index_name)

# Upsert vectors
index.upsert(vectors=vectors_to_upsert, namespace="production")

print(f"Upserted {len(vectors_to_upsert)} vectors")

Query with Metadata Filtering
# Query with metadata filter
# Generate query embedding
query = "What is machine learning?"
query_vector = embed_text(query)

# Query Pinecone
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=5,                 # Return top 5 results
    include_metadata=True,
    include_values=False,
    filter={
        "category": {"$eq": "AI"}  # Only AI category
    },
)

# Process results
for match in results["matches"]:
    score = match["score"]        # Similarity score (0-1)
    metadata = match["metadata"]
    text_id = match["id"]
    print(f"Score: {score:.4f}, Category: {metadata['category']}, "
          f"Author: {metadata['author']}, ID: {text_id}")

Advanced Filtering
# Complex metadata filters
# Filter by multiple conditions
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={
        "$and": [
            {"category": {"$in": ["AI", "ML"]}},   # AI or ML category
            {"date": {"$gte": "2025-01-01"}},      # Recent documents
            {"author": {"$ne": "John Doe"}},       # Exclude specific author
        ]
    },
)
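The same operators compose disjunctively with $or. A sketch that matches documents satisfying either of two conditions:

# Match documents that are either Cloud-category or by Jane Smith
results = index.query(
    namespace="production",
    vector=query_vector,
    top_k=10,
    include_metadata=True,
    filter={
        "$or": [
            {"category": {"$eq": "Cloud"}},
            {"author": {"$eq": "Jane Smith"}},
        ]
    },
)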
# Filter operators:
# $eq: Equal to
# $ne: Not equal to
# $gt: Greater than
# $gte: Greater than or equal
# $lt: Less than
# $lte: Less than or equal
# $in: In list
# $nin: Not in list
# $exists: Field exists
# $and: Logical AND
# $or: Logical OR

Pinecone for RAG
RAG Architecture
RAG Implementation
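The pipeline below calls a retrieve_document helper that is not part of Pinecone; it stands in for whatever document store holds the full text. A hypothetical dict-backed stub, reusing the documents list from above:

# Hypothetical document store keyed by ID; a production system would
# query a real database (PostgreSQL, MongoDB, etc.) instead
DOCUMENT_STORE = {doc["id"]: doc for doc in documents}

def retrieve_document(doc_id):
    return DOCUMENT_STORE[doc_id]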
# RAG pipeline with Pinecone (reuses the OpenAI client and embed_text from above)
def rag_pipeline(question: str) -> str:
    # 1. Embed question
    query_vector = embed_text(question)

    # 2. Search Pinecone for relevant documents
    results = index.query(
        namespace="production",
        vector=query_vector,
        top_k=5,
        include_metadata=True,
        include_values=False,
    )

    # 3. Retrieve full documents from database
    context_docs = []
    for match in results["matches"]:
        doc_id = match["id"]
        # Retrieve from database (PostgreSQL, MongoDB, etc.)
        doc = retrieve_document(doc_id)
        context_docs.append(doc["text"])

    # 4. Augment prompt with context
    context = "\n".join(
        f"Document {i+1}: {doc}" for i, doc in enumerate(context_docs)
    )

    augmented_prompt = f"""
Context:
{context}

Question: {question}

Answer:
"""

    # 5. Generate answer with LLM
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": augmented_prompt},
        ],
        temperature=0.7,
        max_tokens=500,
    )
    answer = response.choices[0].message.content

    # 6. Add citations
    citations = [match["id"] for match in results["matches"]]

    return f"{answer}\n\nCitations: {citations}"
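A quick end-to-end smoke test of the pipeline:

# Prints the generated answer followed by its citation IDs
print(rag_pipeline("What is machine learning?"))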
Pinecone Performance

Index Configuration
# Optimize index for performance
# 1. Choose appropriate metric
# Cosine: Normalized vectors (recommended for OpenAI embeddings)
# Dot product: Normalized vectors, faster than cosine
# Euclidean: Not recommended for high-dimensional vectors

# 2. Set appropriate top_k
# Higher top_k = Slower queries, more comprehensive results
# Lower top_k = Faster queries, narrower results

# 3. Use namespaces for multi-tenancy
# Separate namespaces per environment (dev, staging, prod)

# 4. Configure metadata filtering
# Filters reduce search space, improve query performance
results = index.query(
    vector=query_vector,
    top_k=10,
    filter={
        "category": {"$eq": "AI"},
        "date": {"$gte": "2025-01-01"},
    },
)

# 5. Use batch operations for bulk upserts
index.upsert(vectors=vectors, batch_size=100)

Query Performance
# Optimize query performance
# 1. Use parallel queries for multiple questions
import asyncio

async def parallel_queries(queries):
    tasks = [asyncio.create_task(query_pinecone(q)) for q in queries]
    return await asyncio.gather(*tasks)

async def query_pinecone(query):
    # index.query is a blocking call; run it in a worker thread so that
    # asyncio.gather can actually overlap the queries
    return await asyncio.to_thread(
        lambda: index.query(vector=embed_text(query), top_k=5)
    )
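Driving the async helpers from a synchronous entry point:

# Run several searches concurrently
all_results = asyncio.run(parallel_queries([
    "What is machine learning?",
    "What is data engineering?",
]))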
# 2. Reuse Pinecone client (connection pooling)
# Initialize once, reuse across requests

# 3. Use appropriate pod type
# p1.x1: Fastest (lowest latency)
# p2.x8: Highest throughput

# 4. Monitor performance metrics
# Query latency, P99 latency, throughput

Pinecone Cost Optimization
Pricing Model
| Component | Pricing | Notes |
|---|---|---|
| Serverless storage | $0.10 per 1M vectors stored | Storage |
| Serverless queries | $0.10 per 1M queries | Queries |
| Pod compute | $70/pod/month (p1.x1) | Compute |
| Pod queries | $0.20 per 1M queries | Queries |
| Pod storage | Included with pod | 1GB/pod |
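A back-of-the-envelope estimate using the illustrative serverless rates above (verify against current Pinecone pricing; the workload numbers are assumptions):

# Hypothetical workload: 5M vectors stored, 20M queries per month
STORED_VECTORS = 5_000_000
MONTHLY_QUERIES = 20_000_000

STORAGE_RATE = 0.10 / 1_000_000   # $ per vector stored (table above)
QUERY_RATE = 0.10 / 1_000_000     # $ per query (table above)

monthly_cost = STORED_VECTORS * STORAGE_RATE + MONTHLY_QUERIES * QUERY_RATE
print(f"Estimated serverless cost: ${monthly_cost:.2f}/month")  # $2.50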
Cost Optimization
# Cost optimization strategies
# 1. Use serverless for development
# Only pay for what you use

# 2. Use pod-based for production
# Better for high throughput, predictable costs

# 3. Optimize vector dimensionality
# Lower dimensions = Lower cost
# Use dimensionality reduction (see the PCA sketch below):
# - PCA: 1536 → 512 dimensions
# - UMAP: 1536 → 256 dimensions
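A minimal sketch of the PCA route with scikit-learn. It assumes an existing corpus_embeddings array of shape (n_docs, 1536) with at least 512 rows (PCA cannot fit more components than samples), and the target index must be created with dimension=512:

import numpy as np
from sklearn.decomposition import PCA

# Assumed: corpus_embeddings is an (n_docs, 1536) array of existing
# OpenAI embeddings, with n_docs >= 512
pca = PCA(n_components=512)                     # 1536 -> 512 dimensions
reduced_embeddings = pca.fit_transform(corpus_embeddings)

# Query vectors must be projected with the SAME fitted PCA before search
query_512 = pca.transform(np.asarray([embed_text("What is ML?")]))[0]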
# 4. Delete unused vectors
# Delete old vectors to reduce storage costs

# 5. Use namespaces for multi-tenancy
# Separate environments, isolated data

# 6. Batch operations for bulk upserts
# Reduce API calls, improve efficiency

Pinecone Security
Access Control
# Pinecone API key management
# 1. Use environment variables
import os

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

# 2. Create API keys with restrictions
# In Pinecone Cloud: Create restricted API keys
# - Read-only keys for queries
# - Write keys for upserts
# - Namespace-specific keys

# 3. Use IAM roles (for AWS-based Pinecone)
# Least privilege access

# 4. Enable audit logging
# Track all API calls and operations

Data Encryption
# Encryption at rest (automatic)
# Pinecone automatically encrypts all data at rest

# Encryption in transit
# The Python client talks to Pinecone over HTTPS by default
pc = pinecone.Pinecone(api_key="your-api-key")

# Data privacy
# - Remove PII before embedding (see the masking sketch below)
# - Anonymize sensitive information
# - Use data masking
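A minimal regex-based masking sketch; the patterns are illustrative and far from exhaustive, so production pipelines typically use a dedicated PII-detection library:

import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def mask_pii(text: str) -> str:
    """Replace emails and US-style phone numbers before embedding."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

vector = embed_text(mask_pii("Contact john@example.com or 555-123-4567"))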
Pinecone Monitoring

Metrics
# Key metrics to monitor
pinecone_metrics:
  - name: "Query latency (P50)"
    metric: pinecone_query_latency_p50
    alert: "If > 100ms"
  - name: "Query latency (P99)"
    metric: pinecone_query_latency_p99
    alert: "If > 500ms"
  - name: "Index size"
    metric: pinecone_index_size
    alert: "If > 10M vectors"
  - name: "Query rate"
    metric: pinecone_queries_per_second
    alert: "If > 1000 qps"
  - name: "Error rate"
    metric: pinecone_error_rate
    alert: "If > 1%"
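Pinecone does not push these metrics anywhere by itself; the metric names above are illustrative. A client-side sketch that samples query latency and computes P50/P99:

import statistics
import time

def sample_query_latencies(n=100):
    """Time n identical queries and report P50/P99 latency in ms."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        index.query(vector=query_vector, top_k=5, namespace="production")
        latencies.append((time.perf_counter() - start) * 1000)
    quantiles = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {"p50": quantiles[49], "p99": quantiles[98]}

print(sample_query_latencies())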
Pinecone Best Practices

DO
# 1. Use appropriate metric
# Cosine similarity for OpenAI embeddings

# 2. Set top_k appropriately
# Top 5-10 for RAG, top 100 for broader search

# 3. Use metadata filtering
# Pre-filter results for better performance

# 4. Namespace separation
# Separate dev/staging/prod namespaces

# 5. Monitor performance
# Track query latency and throughput

DON’T
# 1. Don't ignore index capacity
# Plan for growth (serverless or pod-based)

# 2. Don't use Euclidean metric
# Not suitable for high-dimensional vectors

# 3. Don't forget to delete old vectors
# Storage costs accumulate

# 4. Don't ignore metadata
# Metadata filtering improves performance

# 5. Don't embed everything
# Only embed searchable content

Pinecone vs. Alternatives
| Feature | Pinecone | Weaviate | Milvus | pgvector |
|---|---|---|---|---|
| Managed | Yes | Yes | Yes (optional) | No |
| Cloud | AWS, GCP, Azure | Any | Any | Self-hosted |
| Pricing | Pay-per-use | Free tier, then usage-based | Free (self-hosted) | Free (Postgres extension) |
| Scaling | Serverless/pod | Auto-scaling | Auto-scaling | Manual |
| Filtering | Metadata filters | Schema filters | Filters | SQL filters |
| Best For | RAG, semantic search | Knowledge graphs | Open-source, on-prem | Small-scale, self-hosted |
Key Takeaways
- Managed service: No infrastructure management
- Serverless: Pay-per-use, auto-scaling
- Vector search: ANN (Approximate Nearest Neighbor)
- Metadata filtering: Pre-filter for better performance
- RAG integration: Ideal for retrieval-augmented generation
- Cost optimization: Serverless for dev, pod-based for prod
- Security: API keys, encryption, audit logging
- Use When: RAG, semantic search, recommendations
Back to Module 5