Rust for Data Engineering
High-Performance Computing for Data Pipelines
Overview
Rust is emerging as a powerful language for high-performance data engineering. With memory safety, zero-cost abstractions, and C-like performance, Rust enables data pipelines that are 10-100x faster than Python. This document covers when and how to use Rust for data engineering.
Why Rust for Data?
Performance Comparison:
| Operation | Python (Pandas) | Rust (Polars) | Improvement |
|---|---|---|---|
| Read 10GB CSV | 120s | 8s | 15x |
| GroupBy 1B rows | 45s | 3s | 15x |
| Join 10M rows | 30s | 2s | 15x |
| Memory usage | 5GB | 500MB | 10x |
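Numbers like these depend heavily on hardware, file layout, and data shape, so it is worth measuring on your own workload. A minimal timing sketch using Polars and `std::time::Instant` (the file and column names are placeholders):

```rust
use std::time::Instant;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let start = Instant::now();

    // Lazily scan the CSV and run a simple group-by aggregation
    let df = LazyCsvReader::new("events.csv")
        .finish()?
        .group_by([col("category")])
        .agg([col("amount").sum()])
        .collect()?;

    println!("{} result rows in {:?}", df.height(), start.elapsed());
    Ok(())
}
```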
Rust Data Ecosystem
Key Libraries
| Library | Purpose | Use Case |
|---|---|---|
| Polars | DataFrame library | Pandas alternative, data processing |
| Arrow | Memory format | Interoperability, zero-copy |
| DataFusion | SQL query engine | Query execution, modular pipelines |
| Serde | Serialization | Type-safe (de)serialization of records |
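To make the Arrow row above concrete, here is a small sketch (column names and values are illustrative) that builds an Arrow `RecordBatch`, the in-memory columnar format that Polars and DataFusion both build on:

```rust
use std::sync::Arc;
use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    // Schema: one integer column and one string column
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("category", DataType::Utf8, false),
    ]));

    // Columns are Arc-ed arrays, so batches can be shared without copying
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )
}
```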
When to Use Rust
Ideal Use Cases
Specific Use Cases:
- Real-time systems: Fraud detection, real-time analytics
- Embedded analytics: In-application analytics
- High-throughput ETL: Moving data between systems
- API services: Data API backends (see the sketch after this list)
- Streaming services: Low-latency stream processing
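As an illustration of the API-services item above, a minimal data API endpoint might look like the following sketch. It assumes the axum web framework, tokio, and serde, none of which are covered elsewhere in this document; the route and struct names are hypothetical:

```rust
use axum::{routing::get, Json, Router};
use serde::Serialize;

// Hypothetical response payload
#[derive(Serialize)]
struct Stats {
    rows_processed: u64,
    avg_amount: f64,
}

// Handler that would normally query a store or an in-memory DataFrame
async fn stats() -> Json<Stats> {
    Json(Stats {
        rows_processed: 1_000_000,
        avg_amount: 42.5,
    })
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/stats", get(stats));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```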
When NOT to Use Rust
Use Python/Scala instead when:
- Team expertise: Team already knows Python/Spark
- Rapid prototyping: Python is faster for development
- Ecosystem needed: Spark ML, scikit-learn, etc.
- Simple workloads: Performance not critical
- Data science: Jupyter notebooks, exploration
Polars: Rust DataFrames
Getting Started
```rust
use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Lazily scan the CSV, filter, aggregate, and sort;
    // nothing is executed until collect()
    let mut result = LazyCsvReader::new("data.csv")
        .finish()?
        // Filter rows where amount > 100
        .filter(col("amount").gt(lit(100)))
        // GroupBy category and sum the amount column
        .group_by([col("category")])
        .agg([col("amount").sum().alias("total_amount")])
        .sort(["category"], SortMultipleOptions::default())
        .collect()?;

    // Show the result
    println!("{}", result);

    // Write to Parquet
    let mut file = std::fs::File::create("output.parquet")?;
    ParquetWriter::new(&mut file).finish(&mut result)?;

    Ok(())
}
```

Advanced Operations
```rust
use polars::prelude::*;

// Window function: per-customer total amount, broadcast to every row
let df = df
    .lazy()
    .select([
        col("*"),
        col("amount")
            .sum()
            .over([col("customer_id")])
            .alias("customer_total"),
    ])
    .collect()?;

// Join strategies: left join on customer_id
let joined = df1
    .lazy()
    .join(
        df2.lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Left),
    )
    .collect()?;

// Pivot operations: use the pivot() function from polars_ops (enable the
// "pivot" feature); its argument order has changed between releases, so
// check the docs for your Polars version.
```

DataFusion: SQL Query Engine
Building a Query Engine
```rust
use datafusion::prelude::*;
use datafusion::error::Result as DFResult;

// DataFusion is async; call this from an async runtime (e.g. #[tokio::main])
async fn query_data() -> DFResult<()> {
    // Create a session context
    let ctx = SessionContext::new();

    // Register a Parquet file as the "sales" table
    ctx.register_parquet("sales", "data.parquet", ParquetReadOptions::default())
        .await?;

    // Execute SQL
    let df = ctx
        .sql(
            "SELECT category,
                    SUM(amount) AS total,
                    AVG(amount) AS average
             FROM sales
             WHERE date >= '2025-01-01'
             GROUP BY category
             ORDER BY total DESC",
        )
        .await?;

    // Collect the results as Arrow RecordBatches and print them
    let results = df.collect().await?;
    datafusion::arrow::util::pretty::print_batches(&results)?;

    Ok(())
}
```

Streaming with Rust
Kafka Consumer
```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::message::Message;
use std::time::Duration;

fn consume_kafka() -> Result<(), Box<dyn std::error::Error>> {
    // Configure and create a consumer
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "rust-consumer")
        .set("auto.offset.reset", "earliest")
        .create()?;

    consumer.subscribe(&["events"])?;

    // Poll messages in a loop
    loop {
        match consumer.poll(Duration::from_millis(100)) {
            Some(Ok(msg)) => {
                if let Some(data) = msg.payload() {
                    process_message(data)?;
                }
            }
            Some(Err(e)) => eprintln!("Kafka error: {}", e),
            None => {} // no message within the timeout
        }
    }
}
```
```rust
fn process_message(data: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    // Parse the message payload as JSON
    let event: Event = serde_json::from_slice(data)?;

    // Transform
    let transformed = transform_event(event)?;

    // Write to the output sink
    write_to_output(transformed)?;

    Ok(())
}
```

Performance Optimization
Memory Efficiency
```rust
// Rust allows fine-grained memory control

// Streaming processing: read the file incrementally instead of loading it all
use std::io::{BufRead, BufReader};

fn process_large_file(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let reader = BufReader::new(file);

    // Process line by line (constant memory, independent of file size)
    for line in reader.lines() {
        let data: Vec<u8> = line?.into_bytes();
        process_chunk(data)?;
    }

    Ok(())
}
```
```rust
// Zero-copy with Arrow
use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;

fn zero_copy_processing(batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    // Get the string column as a typed array (no copy)
    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or("Cannot cast")?;

    for i in 0..strings.len() {
        // Borrow each value without copying it
        let value = strings.value(i);
        let _ = value;
    }

    Ok(())
}
```

Parallel Processing
```rust
use rayon::prelude::*;

fn parallel_process(data: Vec<Data>) -> Vec<Result<Processed, Error>> {
    // Process items in parallel across all cores
    data.par_iter()
        .map(|item| process_item(item))
        .collect()
}

fn process_item(item: &Data) -> Result<Processed, Error> {
    // Complex transformation
    Ok(Processed {
        id: item.id,
        value: item.value * 2,
    })
}
```

Integration Patterns
Rust + Python Interop
Example: Python calls Rust for performance-critical code
```rust
// Rust library (lib.rs)
use pyo3::prelude::*;

#[pyfunction]
fn process_data_fast(data: Vec<f64>) -> PyResult<Vec<f64>> {
    // High-performance processing
    Ok(data.iter().map(|x| x * 2.0).collect())
}

#[pymodule]
fn mylib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(process_data_fast, m)?)?;
    Ok(())
}
```

```python
# Python code
import mylib

# Call Rust for the performance-critical path
data = [1.0, 2.0, 3.0, ...]  # 10M values
result = mylib.process_data_fast(data)  # 100x faster than pure Python
```

Cost Optimization
Performance vs. Cost Trade-off
Cost Analysis:
| Scenario | Python Cost | Rust Cost | Savings |
|---|---|---|---|
| Process 1TB/day | $10K/month | $1K/month | 90% |
| Real-time API (100K QPS) | $50K/month | $5K/month | 90% |
| Batch processing (100TB) | $20K/month | $5K/month | 75% |
Senior Level Considerations
When to Invest in Rust
Team Considerations
| Factor | Rust | Python |
|---|---|---|
| Learning curve | Steep | Shallow |
| Development speed | Slower initially | Faster |
| Debugging | Compile-time errors help | Runtime errors |
| Ecosystem | Growing, smaller | Mature, extensive |
| Hiring | Harder to find | Easier to find |
| Compilation | Required | None |
Migration Strategy
Gradual Adoption
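The sections above suggest a pragmatic path: keep the existing Python orchestration and move only the performance-critical paths to Rust, either through PyO3 (as shown earlier) or by rewriting a single pipeline stage as a standalone Rust binary that Python shells out to. A minimal sketch of such a stage (the filter rule is hypothetical), reading CSV rows from stdin and writing the kept rows to stdout:

```rust
use std::io::{self, BufRead, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());

    for line in stdin.lock().lines() {
        let line = line?;
        // Keep only rows whose last column parses as a number greater than 100
        if let Some(last) = line.rsplit(',').next() {
            if last.trim().parse::<f64>().map(|v| v > 100.0).unwrap_or(false) {
                writeln!(out, "{}", line)?;
            }
        }
    }
    Ok(())
}
```

Because the stage communicates over stdin/stdout, the surrounding Python pipeline does not need to change; each stage can be migrated and benchmarked independently.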
Best Practices
DO
```rust
// 1. Use Polars' lazy API for DataFrame operations
let df = LazyCsvReader::new("data.csv")
    .finish()?
    .filter(col("amount").gt(lit(100)));

// 2. Use Arrow for memory-efficient, zero-copy interchange
let batch = RecordBatch::try_new(schema, columns)?;

// 3. Use rayon for parallel processing
data.par_iter().for_each(|item| process(item));

// 4. Profile with flamegraph to identify bottlenecks

// 5. Measure everything
let start = Instant::now();
// ... code ...
let duration = start.elapsed();
println!("Took: {:?}", duration);
```

DON’T
```rust
// 1. Don't ignore compiler warnings
//    Fix all warnings; many point at real bugs or missed optimizations

// 2. Don't clone unnecessarily
//    Use references (&) instead

// 3. Don't allocate in hot loops
//    Pre-allocate buffers outside the loop

// 4. Don't use mutexes when read-write locks suffice
//    Use RwLock for read-heavy workloads

// 5. Don't optimize prematurely
//    Profile first, then optimize hot paths
```

Key Takeaways
- Performance: 10-100x faster than Python for data processing
- Memory safety: No GC pauses, predictable performance
- Polars: Pandas-like API, excellent performance
- DataFusion: Modular SQL query engine
- Integration: PyO3 enables Python + Rust hybrid
- Cost: Can reduce compute costs by 75-90%
- Investment: High learning curve, justify for performance-critical paths
- Team: Harder to hire Rust developers than Python
Further Learning
Resources:
Projects to Try:
- Convert a Pandas ETL script to Polars
- Build a real-time data API with Rust
- Create a custom data processing library
- Benchmark Python vs. Rust for your workload