Rust for Data Engineering
High-Performance Computing for Data Pipelines
Overview
Rust is emerging as a powerful language for high-performance data engineering. With memory safety, zero-cost abstractions, and C-like performance, Rust enables data pipelines that are 10-100x faster than Python. This document covers when and how to use Rust for data engineering.
Why Rust for Data?
Performance Comparison:
| Operation | Python (Pandas) | Rust (Polars) | Improvement |
|---|---|---|---|
| Read 10GB CSV | 120s | 8s | 15x |
| GroupBy 1B rows | 45s | 3s | 15x |
| Join 10M rows | 30s | 2s | 15x |
| Memory usage | 5GB | 500MB | 10x |
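Numbers like these depend heavily on hardware, file layout, and data shape, so it is worth measuring on your own workload. A minimal timing sketch using Polars and `std::time::Instant` (the file and column names are placeholders):

```rust
use std::time::Instant;
use polars::prelude::*;

fn main() -> PolarsResult<()> {
    let start = Instant::now();

    // Lazily scan the CSV and run a simple group-by aggregation
    let df = LazyCsvReader::new("events.csv")
        .finish()?
        .group_by([col("category")])
        .agg([col("amount").sum()])
        .collect()?;

    println!("{} result rows in {:?}", df.height(), start.elapsed());
    Ok(())
}
```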
Rust Data Ecosystem
Key Libraries
| Library | Purpose | Use Case |
|---|---|---|
| Polars | DataFrame library | Pandas alternative, data processing |
| Arrow | Memory format | Interoperability, zero-copy |
| DataFusion | SQL query engine | Query execution, modular pipelines |
| Serde | Serialization | Type-safe (de)serialization of records |
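To make the Arrow row above concrete, here is a small sketch (column names and values are illustrative) that builds an Arrow `RecordBatch`, the in-memory columnar format that Polars and DataFusion both build on:

```rust
use std::sync::Arc;
use arrow::array::{Int64Array, StringArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;

fn build_batch() -> Result<RecordBatch, arrow::error::ArrowError> {
    // Schema: one integer column and one string column
    let schema = Arc::new(Schema::new(vec![
        Field::new("id", DataType::Int64, false),
        Field::new("category", DataType::Utf8, false),
    ]));

    // Columns are Arc-ed arrays, so batches can be shared without copying
    RecordBatch::try_new(
        schema,
        vec![
            Arc::new(Int64Array::from(vec![1, 2, 3])),
            Arc::new(StringArray::from(vec!["a", "b", "c"])),
        ],
    )
}
```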
When to Use Rust
Ideal Use Cases
Specific Use Cases:
- Real-time systems: Fraud detection, real-time analytics
- Embedded analytics: In-application analytics
- High-throughput ETL: Moving data between systems
- API services: Data API backends (see the sketch after this list)
- Streaming services: Low-latency stream processing
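As an illustration of the API-services item above, a minimal data API endpoint might look like the following sketch. It assumes the axum web framework, tokio, and serde, none of which are covered elsewhere in this document; the route and struct names are hypothetical:

```rust
use axum::{routing::get, Json, Router};
use serde::Serialize;

// Hypothetical response payload
#[derive(Serialize)]
struct Stats {
    rows_processed: u64,
    avg_amount: f64,
}

// Handler that would normally query a store or an in-memory DataFrame
async fn stats() -> Json<Stats> {
    Json(Stats {
        rows_processed: 1_000_000,
        avg_amount: 42.5,
    })
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let app = Router::new().route("/stats", get(stats));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await?;
    axum::serve(listener, app).await?;
    Ok(())
}
```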
When NOT to Use Rust
Use Python/Scala instead when:
- Team expertise: Team already knows Python/Spark
- Rapid prototyping: Python is faster for development
- Ecosystem needed: Spark ML, scikit-learn, etc.
- Simple workloads: Performance not critical
- Data science: Jupyter notebooks, exploration
Polars: Rust DataFrames
Getting Started
```rust
use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Lazily scan the CSV, filter, aggregate, and sort;
    // nothing is executed until collect()
    let mut result = LazyCsvReader::new("data.csv")
        .finish()?
        // Filter rows where amount > 100
        .filter(col("amount").gt(lit(100)))
        // GroupBy category and sum the amount column
        .group_by([col("category")])
        .agg([col("amount").sum().alias("total_amount")])
        .sort(["category"], SortMultipleOptions::default())
        .collect()?;

    // Show the result
    println!("{}", result);

    // Write to Parquet
    let mut file = std::fs::File::create("output.parquet")?;
    ParquetWriter::new(&mut file).finish(&mut result)?;

    Ok(())
}
```

Advanced Operations
```rust
use polars::prelude::*;

// Window function: per-customer total amount, broadcast to every row
let df = df
    .lazy()
    .select([
        col("*"),
        col("amount")
            .sum()
            .over([col("customer_id")])
            .alias("customer_total"),
    ])
    .collect()?;

// Join strategies: left join on customer_id
let joined = df1
    .lazy()
    .join(
        df2.lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Left),
    )
    .collect()?;

// Pivot operations: use the pivot() function from polars_ops (enable the
// "pivot" feature); its argument order has changed between releases, so
// check the docs for your Polars version.
```

DataFusion: SQL Query Engine
Building a Query Engine
```rust
use datafusion::prelude::*;
use datafusion::error::Result as DFResult;

// DataFusion is async; call this from an async runtime (e.g. #[tokio::main])
async fn query_data() -> DFResult<()> {
    // Create a session context
    let ctx = SessionContext::new();

    // Register a Parquet file as the "sales" table
    ctx.register_parquet("sales", "data.parquet", ParquetReadOptions::default())
        .await?;

    // Execute SQL
    let df = ctx
        .sql(
            "SELECT category,
                    SUM(amount) AS total,
                    AVG(amount) AS average
             FROM sales
             WHERE date >= '2025-01-01'
             GROUP BY category
             ORDER BY total DESC",
        )
        .await?;

    // Collect the results as Arrow RecordBatches and print them
    let results = df.collect().await?;
    datafusion::arrow::util::pretty::print_batches(&results)?;

    Ok(())
}
```

Streaming with Rust
Kafka Consumer
```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{BaseConsumer, Consumer};
use rdkafka::message::Message;
use std::time::Duration;

fn consume_kafka() -> Result<(), Box<dyn std::error::Error>> {
    // Configure and create a consumer
    let consumer: BaseConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "rust-consumer")
        .set("auto.offset.reset", "earliest")
        .create()?;

    consumer.subscribe(&["events"])?;

    // Poll messages in a loop
    loop {
        match consumer.poll(Duration::from_millis(100)) {
            Some(Ok(msg)) => {
                if let Some(data) = msg.payload() {
                    process_message(data)?;
                }
            }
            Some(Err(e)) => eprintln!("Kafka error: {}", e),
            None => {} // no message within the timeout
        }
    }
}
```
```rust
fn process_message(data: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    // Parse the message payload as JSON
    let event: Event = serde_json::from_slice(data)?;

    // Transform
    let transformed = transform_event(event)?;

    // Write to the output sink
    write_to_output(transformed)?;

    Ok(())
}
```

Performance Optimization
Memory Efficiency
```rust
// Rust allows fine-grained memory control

// Streaming processing: read the file incrementally instead of loading it all
use std::io::{BufRead, BufReader};

fn process_large_file(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let reader = BufReader::new(file);

    // Process line by line (constant memory, independent of file size)
    for line in reader.lines() {
        let data: Vec<u8> = line?.into_bytes();
        process_chunk(data)?;
    }

    Ok(())
}
```
```rust
// Zero-copy with Arrow
use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;

fn zero_copy_processing(batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    // Get the string column as a typed array (no copy)
    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or("Cannot cast")?;

    for i in 0..strings.len() {
        // Borrow each value without copying it
        let value = strings.value(i);
        let _ = value;
    }

    Ok(())
}
```

Parallel Processing
```rust
use rayon::prelude::*;

fn parallel_process(data: Vec<Data>) -> Vec<Result<Processed, Error>> {
    // Process items in parallel across all cores
    data.par_iter()
        .map(|item| process_item(item))
        .collect()
}

fn process_item(item: &Data) -> Result<Processed, Error> {
    // Complex transformation
    Ok(Processed {
        id: item.id,
        value: item.value * 2,
    })
}
```

Integration Patterns
Rust + Python Interop
Example: Python calls Rust for performance-critical code
```rust
// Rust library (lib.rs)
use pyo3::prelude::*;

#[pyfunction]
fn process_data_fast(data: Vec<f64>) -> PyResult<Vec<f64>> {
    // High-performance processing
    Ok(data.iter().map(|x| x * 2.0).collect())
}

#[pymodule]
fn mylib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(process_data_fast, m)?)?;
    Ok(())
}
```

```python
# Python code
import mylib

# Call Rust for the performance-critical path
data = [1.0, 2.0, 3.0, ...]  # 10M values
result = mylib.process_data_fast(data)  # 100x faster than pure Python
```

Cost Optimization
Performance vs. Cost Trade-off
Cost Analysis:
| Scenario | Python Cost | Rust Cost | Savings |
|---|---|---|---|
| Process 1TB/day | $10K/month | $1K/month | 90% |
| Real-time API (100K QPS) | $50K/month | $5K/month | 90% |
| Batch processing (100TB) | $20K/month | $5K/month | 75% |
Senior Level Considerations
When to Invest in Rust
Team Considerations
| Factor | Rust | Python |
|---|---|---|
| Learning curve | Steep | Shallow |
| Development speed | Slower initially | Faster |
| Debugging | Compile-time errors help | Runtime errors |
| Ecosystem | Growing, smaller | Mature, extensive |
| Hiring | Harder to find | Easier to find |
| Compilation | Required | None |
Migration Strategy
Gradual Adoption
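The sections above suggest a pragmatic path: keep the existing Python orchestration and move only the performance-critical paths to Rust, either through PyO3 (as shown earlier) or by rewriting a single pipeline stage as a standalone Rust binary that Python shells out to. A minimal sketch of such a stage (the filter rule is hypothetical), reading CSV rows from stdin and writing the kept rows to stdout:

```rust
use std::io::{self, BufRead, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());

    for line in stdin.lock().lines() {
        let line = line?;
        // Keep only rows whose last column parses as a number greater than 100
        if let Some(last) = line.rsplit(',').next() {
            if last.trim().parse::<f64>().map(|v| v > 100.0).unwrap_or(false) {
                writeln!(out, "{}", line)?;
            }
        }
    }
    Ok(())
}
```

Because the stage communicates over stdin/stdout, the surrounding Python pipeline does not need to change; each stage can be migrated and benchmarked independently.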
Best Practices
DO
```rust
// 1. Use Polars' lazy API for DataFrame operations
let df = LazyCsvReader::new("data.csv")
    .finish()?
    .filter(col("amount").gt(lit(100)));

// 2. Use Arrow for memory-efficient, zero-copy interchange
let batch = RecordBatch::try_new(schema, columns)?;

// 3. Use rayon for parallel processing
data.par_iter().for_each(|item| process(item));

// 4. Profile with flamegraph to identify bottlenecks

// 5. Measure everything
let start = Instant::now();
// ... code ...
let duration = start.elapsed();
println!("Took: {:?}", duration);
```

DON’T
```rust
// 1. Don't ignore compiler warnings
//    Fix all warnings; many point at real bugs or missed optimizations

// 2. Don't clone unnecessarily
//    Use references (&) instead

// 3. Don't allocate in hot loops
//    Pre-allocate buffers outside the loop

// 4. Don't use mutexes when read-write locks suffice
//    Use RwLock for read-heavy workloads

// 5. Don't optimize prematurely
//    Profile first, then optimize hot paths
```

Key Takeaways
- Performance: 10-100x faster than Python for data processing
- Memory safety: No GC pauses, predictable performance
- Polars: Pandas-like API, excellent performance
- DataFusion: Modular SQL query engine
- Integration: PyO3 enables Python + Rust hybrid
- Cost: Can reduce compute costs by 75-90%
- Investment: High learning curve, justify for performance-critical paths
- Team: Harder to hire Rust developers than Python
Further Learning
Resources:
Projects to Try:
- Convert a Pandas ETL script to Polars
- Build a real-time data API with Rust
- Create a custom data processing library
- Benchmark Python vs. Rust for your workload