Pandas Optimization
Scaling Pandas from MB to GB
Overview
Pandas is excellent for data analysis but struggles with large datasets (> 10 GB). This document covers optimization techniques for scaling Pandas from megabyte- to gigabyte-sized workloads, and when to switch to alternatives such as Polars or DuckDB.
The Pandas Scaling Problem
Memory usage: Pandas loads the entire dataset into RAM, and many operations create intermediate copies, so a 100 GB dataset typically needs well over 100 GB of memory.
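Before optimizing, it helps to measure how much memory a DataFrame actually uses, including the overhead of object columns. A minimal sketch, assuming a DataFrame `df` is already loaded:

```python
import pandas as pd

# Assumes `df` is an existing DataFrame
# memory_usage='deep' inspects object columns for their true footprint
df.info(memory_usage='deep')

# Per-column usage in bytes
print(df.memory_usage(deep=True))

# Total usage in MB
print(f"{df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
```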
Optimization Techniques
1. Use Optimal Data Types
```python
import pandas as pd
import numpy as np

# Bad: Default types (memory inefficient)
df_bad = pd.DataFrame({
    'id': [1, 2, 3],               # int64 (8 bytes)
    'category': ['A', 'B', 'C'],   # object (high overhead)
    'flag': [True, False, True],   # bool (1 byte, ok)
    'amount': [10.5, 20.3, 30.1]   # float64 (8 bytes)
})

# Good: Optimal types (3-10x memory savings)
df_good = pd.DataFrame({
    'id': pd.array([1, 2, 3], dtype='int32'),                  # 4 bytes
    'category': pd.array(['A', 'B', 'C'], dtype='category'),   # 1-byte codes + shared categories
    'flag': pd.array([True, False, True], dtype='bool'),       # 1 byte
    'amount': pd.array([10.5, 20.3, 30.1], dtype='float32')    # 4 bytes
})

# Downcast an existing DataFrame
df = df.astype({
    'id': 'int32',
    'amount': 'float32',
    'category': 'category'
})

# Automatic downcasting
def downcast(df):
    """Downcast numeric columns to save memory."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df
```

2. Use Categorical Data
```python
# String columns with low cardinality: use categorical
df['category'] = df['category'].astype('category')

# Memory comparison (10M rows):
# object:   ~100 MB
# category: ~10 MB (10x savings)

# Check cardinality before converting
if df['column'].nunique() / len(df) < 0.5:  # < 50% unique
    df['column'] = df['column'].astype('category')
```

3. Read Only Required Columns
```python
# Bad: Read all columns
df = pd.read_csv('large_file.csv')

# Good: Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])

# For Parquet (columnar, even faster)
df = pd.read_parquet('large_file.parquet', columns=['col1', 'col2'])
```

4. Chunked Processing
```python
# Process in chunks
chunk_size = 100000
results = []

for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process each chunk independently
    result = chunk.groupby('category').agg({'amount': 'sum'})
    results.append(result)

# Combine the partial results
final_result = pd.concat(results).groupby(level=0).sum()
```

5. Filter at Read Time
```python
# Bad: Read everything, then filter
df = pd.read_csv('large_file.csv')
df_filtered = df[df['date'] >= '2025-01-01']

# Good: Filter during the read (for supported formats)
# Parquet with predicate pushdown (using PyArrow)
import pyarrow.dataset as ds

dataset = ds.dataset('large_file.parquet', format='parquet')
table = dataset.to_table(filter=ds.field('date') >= '2025-01-01')
df = table.to_pandas()
```

6. Use Inplace Operations
```python
# Bad: Creates a copy and rebinds the name
df = df.drop(columns=['unused1', 'unused2'])

# Good: Modifies the existing object
df.drop(columns=['unused1', 'unused2'], inplace=True)

# Note: inplace=True often still copies data internally, so the memory
# savings are usually small; dropping unused columns early matters more.
```

7. Avoid Iterrows
```python
# Bad: iterrows (often 100-1000x slower than vectorized code)
for index, row in df.iterrows():
    df.at[index, 'new_col'] = row['col1'] * 2

# Good: Vectorized
df['new_col'] = df['col1'] * 2

# OK: apply (still slower than vectorized)
df['new_col'] = df['col1'].apply(lambda x: x * 2)
```

When to Switch to Alternatives
Pandas vs. Polars vs. DuckDB
| Aspect | Pandas | Polars | DuckDB |
|---|---|---|---|
| Read CSV | 1x | 2-5x faster | 5-10x faster |
| GroupBy | 1x | 5-20x faster | 10-50x faster |
| Join | 1x | 3-10x faster | 5-20x faster |
| Memory | High | Lower (60% less) | Lowest (80% less) |
| Ecosystem | Excellent | Growing | Good |
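Switching does not have to be all-or-nothing: Polars and DuckDB both interoperate with Pandas, so a heavy groupby or join can be offloaded and the result returned as a regular DataFrame. A minimal sketch (the small in-memory `df` here is purely illustrative):

```python
import duckdb
import pandas as pd
import polars as pl

df = pd.DataFrame({'category': ['A', 'B', 'A'], 'amount': [10.0, 20.0, 30.0]})

# Pandas -> Polars for a fast aggregation, then back to Pandas
pl_result = (
    pl.from_pandas(df)
    .group_by('category')
    .agg(pl.col('amount').sum().alias('total'))
    .to_pandas()
)

# DuckDB queries the Pandas DataFrame in place and returns a Pandas result
duck_result = duckdb.query(
    "SELECT category, SUM(amount) AS total FROM df GROUP BY category"
).to_df()
```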
Polars: A Fast Pandas Alternative
Polars Syntax Comparison
```python
import polars as pl

# Read (faster than Pandas)
df = pl.read_csv('large_file.csv')

# Filter
df_filtered = df.filter(pl.col('amount') > 100)

# GroupBy (much faster)
result = df.group_by('category').agg([
    pl.col('amount').sum().alias('total'),
    pl.col('amount').mean().alias('average')
])

# Join (faster)
result = df1.join(df2, on='id', how='left')

# Lazy API (optimizes the query plan before executing)
result = (
    pl.scan_csv('large_file.csv')              # Lazy scan
    .filter(pl.col('date') >= '2025-01-01')
    .group_by('category')
    .agg(pl.col('amount').sum())
    .collect()                                 # Execute
)
```

Migration from Pandas to Polars
```python
# Pandas
import pandas as pd

df = pd.read_csv('data.csv')
df_filtered = df[df['amount'] > 100]
result = df_filtered.groupby('category')['amount'].sum()

# Polars (similar syntax)
import polars as pl

df = pl.read_csv('data.csv')
df_filtered = df.filter(pl.col('amount') > 100)
result = df_filtered.group_by('category').agg(pl.col('amount').sum())
```

DuckDB: SQL for Pandas
DuckDB Integration
```python
import duckdb
import pandas as pd

# Query a Pandas DataFrame with SQL
df = pd.read_csv('large_file.csv')

result = duckdb.query("""
    SELECT
        category,
        SUM(amount) AS total_amount,
        AVG(amount) AS avg_amount,
        COUNT(*) AS count
    FROM df
    WHERE date >= '2025-01-01'
    GROUP BY category
    HAVING SUM(amount) > 1000
    ORDER BY total_amount DESC
""").to_df()

# Multiple DataFrames
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

result = duckdb.query("""
    SELECT df1.*, df2.attribute
    FROM df1
    LEFT JOIN df2 ON df1.id = df2.id
""").to_df()
```

DuckDB for Aggregations
```python
# Pandas: Slow for large data
result = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count']
})

# DuckDB: Much faster
result = duckdb.query("""
    SELECT
        category,
        SUM(amount) AS sum_amount,
        AVG(amount) AS avg_amount,
        COUNT(*) AS count_amount
    FROM df
    GROUP BY category
""").to_df()
```

Advanced Optimization
Memory-Mapped Files and Out-of-Core Computation
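For purely numeric data stored as a raw binary file, NumPy's memory-mapped arrays read only the pages you touch instead of loading the whole file. A minimal sketch, assuming a hypothetical raw float32 file named `data.bin`:

```python
import numpy as np

# Memory-map a raw float32 file (hypothetical 'data.bin'); nothing is read yet
arr = np.memmap('data.bin', dtype='float32', mode='r')

# Only the disk pages covered by this slice are actually loaded
print(arr[:1_000_000].mean())
```

For tabular, DataFrame-style work at this scale, Dask (below) is the more common tool.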
```python
# Use Dask for out-of-core computation
import dask.dataframe as dd

# Read a large CSV (doesn't load it all into memory)
df = dd.read_csv('huge_file.csv')

# Operations are lazy until .compute() is called
result = df.groupby('category').amount.sum().compute()
```

Modin: Drop-in Pandas Replacement
```python
# Modin uses Ray/Dask for parallelization
import modin.pandas as pd

# Same syntax, parallel execution
df = pd.read_csv('large_file.csv')
result = df.groupby('category').amount.sum()
```

Performance Benchmarks
Benchmark: CSV Read + GroupBy (10M rows)
| Operation | Pandas | Polars | DuckDB |
|---|---|---|---|
| Read CSV | 10s | 2s | 1s |
| GroupBy sum | 5s | 0.5s | 0.2s |
| GroupBy multiple | 8s | 0.8s | 0.3s |
| Total | 23s | 3.3s | 1.5s |
Result: DuckDB is 15x faster than Pandas for this workload.
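Exact numbers depend heavily on hardware, file layout, and data types. A minimal sketch of how such a comparison could be timed locally (the `sales.csv` file with `category` and `amount` columns is hypothetical):

```python
import time

import duckdb
import pandas as pd
import polars as pl

PATH = 'sales.csv'  # hypothetical benchmark file

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f'{label}: {time.perf_counter() - start:.2f}s')

# Pandas: eager read + groupby
timed('pandas', lambda: pd.read_csv(PATH).groupby('category')['amount'].sum())

# Polars: lazy scan + group_by, executed on collect()
timed('polars', lambda: pl.scan_csv(PATH)
      .group_by('category').agg(pl.col('amount').sum()).collect())

# DuckDB: SQL directly over the CSV file
timed('duckdb', lambda: duckdb.query(
    f"SELECT category, SUM(amount) FROM '{PATH}' GROUP BY category").to_df())
```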
Best Practices
DO
```python
# 1. Use optimal dtypes
df = df.astype({'id': 'int32', 'category': 'category'})

# 2. Read only required columns
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])

# 3. Process large files in chunks
for chunk in pd.read_csv('huge.csv', chunksize=100000):
    process(chunk)

# 4. Use vectorized operations
df['new_col'] = df['col1'] * 2

# 5. Switch to Polars/DuckDB for > 5 GB data
```

DON'T
```python
# 1. Don't use iterrows
for index, row in df.iterrows():  # SLOW
    pass

# 2. Don't read all columns when you only need a few
df = pd.read_csv('file.csv')  # Reads everything

# 3. Don't use apply for simple arithmetic
df['new'] = df['col'].apply(lambda x: x * 2)  # SLOW
df['new'] = df['col'] * 2                     # FAST

# 4. Don't ignore memory usage
# Monitor with df.info(memory_usage='deep')
```

Key Takeaways
- Optimal dtypes: Use int32, float32, category for 3-10x memory savings
- Column pruning: Read only required columns
- Chunked processing: Process large files in chunks
- Vectorize operations: Avoid iterrows, use vectorized operations
- Switch when needed: Use Polars for 1-10GB, DuckDB/Spark for > 10GB
- Lazy evaluation: Use lazy APIs for query optimization
- Monitor memory: Use df.info(memory_usage='deep')
Back to Module 2