
Pandas Optimization

Scaling Pandas from MB to GB


Overview

Pandas is excellent for data analysis but struggles once a dataset approaches available RAM (often around 10 GB and up). This document covers optimization techniques for scaling Pandas from MB- to GB-sized workloads, and when to switch to alternatives such as Polars or DuckDB.


The Pandas Scaling Problem

Memory usage: Pandas loads the entire dataset into memory, and many operations create intermediate copies. A 100GB dataset therefore needs well over 100GB of RAM to load and process.
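Before optimizing, it helps to confirm that memory is actually the bottleneck by inspecting a DataFrame's footprint; a minimal sketch (the file name is a placeholder):

import pandas as pd

df = pd.read_csv('data.csv')                       # placeholder file
df.info(memory_usage='deep')                       # per-column memory, including string overhead
total_mb = df.memory_usage(deep=True).sum() / 1e6
print(f'DataFrame uses ~{total_mb:.1f} MB')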


Optimization Techniques

1. Use Optimal Data Types

import pandas as pd
import numpy as np
# Bad: Default types (memory inefficient)
df_bad = pd.DataFrame({
    'id': [1, 2, 3],                  # int64 (8 bytes)
    'category': ['A', 'B', 'C'],      # object (high overhead)
    'flag': [True, False, True],      # bool (1 byte, ok)
    'amount': [10.5, 20.3, 30.1]      # float64 (8 bytes)
})
# Good: Optimal types (3-10x memory savings)
df_good = pd.DataFrame({
    'id': pd.array([1, 2, 3], dtype='int32'),                   # 4 bytes
    'category': pd.array(['A', 'B', 'C'], dtype='category'),    # ~1 byte per row plus the category table
    'flag': pd.array([True, False, True], dtype='bool'),        # 1 byte
    'amount': pd.array([10.5, 20.3, 30.1], dtype='float32')     # 4 bytes
})
# Downcast existing DataFrame
df = df.astype({
    'id': 'int32',
    'amount': 'float32',
    'category': 'category'
})
# Automatic downcasting
def downcast(df):
    """Downcast numeric columns to save memory."""
    for col in df.select_dtypes(include=['int64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='integer')
    for col in df.select_dtypes(include=['float64']).columns:
        df[col] = pd.to_numeric(df[col], downcast='float')
    return df
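To verify the savings from the downcast helper above, compare the footprint before and after; a short sketch assuming a DataFrame df is already loaded:

before = df.memory_usage(deep=True).sum()
df = downcast(df)
after = df.memory_usage(deep=True).sum()
print(f'Memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB')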

2. Use Categorical Data

# String columns with low cardinality: Use categorical
df['category'] = df['category'].astype('category')
# Memory comparison
# object: 100 MB for 10M rows
# category: 10 MB for 10M rows (10x savings)
# Check cardinality before converting
if df['column'].nunique() / len(df) < 0.5:  # < 50% unique
    df['column'] = df['column'].astype('category')
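The same cardinality check can be applied to every string column at once; a small helper sketch (the 0.5 threshold is an arbitrary default worth tuning for your data):

def categorize_strings(df, max_unique_ratio=0.5):
    """Convert low-cardinality object columns to the 'category' dtype."""
    for col in df.select_dtypes(include=['object']).columns:
        if df[col].nunique() / len(df) < max_unique_ratio:
            df[col] = df[col].astype('category')
    return df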

3. Read Only Required Columns

# Bad: Read all columns
df = pd.read_csv('large_file.csv')
# Good: Read only needed columns
df = pd.read_csv('large_file.csv', usecols=['col1', 'col2', 'col3'])
# For Parquet (columnar, even faster)
df = pd.read_parquet('large_file.parquet', columns=['col1', 'col2'])

4. Chunked Processing

# Process in chunks
chunk_size = 100000
results = []
for chunk in pd.read_csv('huge_file.csv', chunksize=chunk_size):
    # Process chunk
    result = chunk.groupby('category').agg({'amount': 'sum'})
    results.append(result)
# Combine results
final_result = pd.concat(results).groupby(level=0).sum()

5. Filter at Read Time

# Bad: Read all, then filter
df = pd.read_csv('large_file.csv')
df_filtered = df[df['date'] >= '2025-01-01']
# Good: Filter during read (for supported formats)
# Parquet with predicates (using PyArrow)
import pyarrow.dataset as ds
dataset = ds.dataset('large_file.parquet', format='parquet')
table = dataset.to_table(filter=ds.field('date') >= '2025-01-01')
df = table.to_pandas()
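pandas can also push simple row filters down into the Parquet reader itself via the filters argument (pyarrow engine; exact support depends on your pandas/pyarrow versions, and the column list here is illustrative):

df = pd.read_parquet(
    'large_file.parquet',
    columns=['date', 'category', 'amount'],
    filters=[('date', '>=', '2025-01-01')]   # applied during the Parquet scan
)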

6. Use Inplace Operations

# Creates a copy and rebinds df
df = df.drop(columns=['unused1', 'unused2'])
# Drops the extra reference immediately; note that many pandas methods
# still build a new object internally, so inplace=True is a modest win at best
df.drop(columns=['unused1', 'unused2'], inplace=True)

7. Avoid Iterrows

# Bad: iterrows (often 100-1000x slower than vectorized code)
for index, row in df.iterrows():
    df.at[index, 'new_col'] = row['col1'] * 2
# Good: Vectorized
df['new_col'] = df['col1'] * 2
# Apply: better than iterrows, but still slower than vectorized
df['new_col'] = df['col1'].apply(lambda x: x * 2)
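Conditional logic can usually be vectorized as well; a short sketch using numpy.where (column names are illustrative):

import numpy as np

# 10% discount on large amounts, computed without any Python-level loop
df['discounted'] = np.where(df['amount'] > 100, df['amount'] * 0.9, df['amount'])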

When to Switch to Alternatives

Pandas vs. Polars vs. DuckDB

| Aspect    | Pandas    | Polars           | DuckDB            |
|-----------|-----------|------------------|-------------------|
| Read CSV  | 1x        | 2-5x faster      | 5-10x faster      |
| GroupBy   | 1x        | 5-20x faster     | 10-50x faster     |
| Join      | 1x        | 3-10x faster     | 5-20x faster      |
| Memory    | High      | Lower (60% less) | Lowest (80% less) |
| Ecosystem | Excellent | Growing          | Good              |

Polars: A Fast Alternative to Pandas

Polars Syntax Comparison

import polars as pl
# Read (faster than Pandas)
df = pl.read_csv('large_file.csv')
# Filter (lazy evaluation, faster)
df_filtered = df.filter(pl.col('amount') > 100)
# GroupBy (much faster); group_by in current Polars (groupby in older releases)
result = df.group_by('category').agg([
    pl.col('amount').sum().alias('total'),
    pl.col('amount').mean().alias('average')
])
# Join (faster)
result = df1.join(df2, on='id', how='left')
# Lazy API (optimizes the query plan)
result = (
    pl.scan_csv('large_file.csv')              # Lazy: nothing is read yet
    .filter(pl.col('date') >= '2025-01-01')
    .group_by('category')
    .agg(pl.col('amount').sum())
    .collect()                                 # Execute the optimized plan
)

Migration from Pandas to Polars

# Pandas
import pandas as pd
df = pd.read_csv('data.csv')
df_filtered = df[df['amount'] > 100]
result = df_filtered.groupby('category')['amount'].sum()
# Polars (similar syntax)
import polars as pl
df = pl.read_csv('data.csv')
df_filtered = df.filter(pl.col('amount') > 100)
result = df_filtered.group_by('category').agg(pl.col('amount').sum())

DuckDB: SQL for Pandas

DuckDB Integration

import duckdb
import pandas as pd
# Query Pandas DataFrame with SQL
df = pd.read_csv('large_file.csv')
result = duckdb.query("""
    SELECT
        category,
        SUM(amount) AS total_amount,
        AVG(amount) AS avg_amount,
        COUNT(*) AS count
    FROM df
    WHERE date >= '2025-01-01'
    GROUP BY category
    HAVING SUM(amount) > 1000
    ORDER BY total_amount DESC
""").to_df()
# Multiple DataFrames
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
result = duckdb.query("""
    SELECT df1.*, df2.attribute
    FROM df1
    LEFT JOIN df2 ON df1.id = df2.id
""").to_df()

DuckDB for Aggregations

# Pandas: Slow for large data
result = df.groupby('category').agg({
    'amount': ['sum', 'mean', 'count']
})
# DuckDB: Much faster
result = duckdb.query("""
    SELECT
        category,
        SUM(amount) AS sum_amount,
        AVG(amount) AS avg_amount,
        COUNT(*) AS count_amount
    FROM df
    GROUP BY category
""").to_df()

Advanced Optimization

Out-of-Core Processing with Dask

# Use Dask for out-of-core computation
import dask.dataframe as dd
# Read large CSV (doesn't load all into memory)
df = dd.read_csv('huge_file.csv')
# Operations are lazy
result = df.groupby('category').amount.sum().compute()

Modin: Drop-in Pandas Replacement

# Modin uses Ray/Dask for parallelization
import modin.pandas as pd
# Same syntax, parallel execution
df = pd.read_csv('large_file.csv')
result = df.groupby('category').amount.sum()

Performance Benchmarks

GroupBy Performance (10M rows)

| Operation        | Pandas | Polars | DuckDB |
|------------------|--------|--------|--------|
| Read CSV         | 10s    | 2s     | 1s     |
| GroupBy sum      | 5s     | 0.5s   | 0.2s   |
| GroupBy multiple | 8s     | 0.8s   | 0.3s   |
| Total            | 23s    | 3.3s   | 1.5s   |

Result: DuckDB is 15x faster than Pandas for this workload.
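Benchmarks like these vary with hardware, data shape, and library versions, so it is worth re-timing on your own workload; a minimal sketch (file and column names are placeholders):

import time
import pandas as pd
import duckdb

t0 = time.perf_counter()
pdf = pd.read_csv('large_file.csv')
pandas_result = pdf.groupby('category')['amount'].sum()
t1 = time.perf_counter()
duck_result = duckdb.query(
    "SELECT category, SUM(amount) FROM 'large_file.csv' GROUP BY category"
).to_df()
t2 = time.perf_counter()
print(f'pandas: {t1 - t0:.2f}s  duckdb: {t2 - t1:.2f}s')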


Best Practices

DO

# 1. Use optimal dtypes
df = df.astype({'id': 'int32', 'category': 'category'})
# 2. Read only required columns
df = pd.read_csv('file.csv', usecols=['col1', 'col2'])
# 3. Process in chunks for large files
for chunk in pd.read_csv('huge.csv', chunksize=100000):
    process(chunk)
# 4. Use vectorized operations
df['new_col'] = df['col1'] * 2
# 5. Switch to Polars/DuckDB for > 5GB data

DON’T

# 1. Don't use iterrows
for _, row in df.iterrows():  # SLOW
    pass
# 2. Don't read all columns when not needed
df = pd.read_csv('file.csv') # Reads all
# 3. Don't use apply for simple operations
df['new'] = df['col'].apply(lambda x: x * 2) # SLOW
df['new'] = df['col'] * 2 # FAST
# 4. Don't ignore memory usage
# Monitor with df.info(memory_usage='deep')

Key Takeaways

  1. Optimal dtypes: Use int32, float32, category for 3-10x memory savings
  2. Column pruning: Read only required columns
  3. Chunked processing: Process large files in chunks
  4. Vectorize operations: Avoid iterrows, use vectorized operations
  5. Switch when needed: Use Polars for 1-10GB, DuckDB/Spark for > 10GB
  6. Lazy evaluation: Use lazy APIs for query optimization
  7. Monitor memory: Use df.info(memory_usage='deep')
