
Rust for Data Engineering

High-Performance Computing for Data Pipelines


Overview

Rust is emerging as a powerful language for high-performance data engineering. With memory safety, zero-cost abstractions, and C-like performance, Rust enables data pipelines that often run 10-100x faster than equivalent Python. This document covers when and how to use Rust for data engineering.


Why Rust for Data?

Performance Comparison:

Operation          Python (Pandas)   Rust (Polars)   Speedup
Read 10GB CSV      120s              8s              15x
GroupBy 1B rows    45s               3s              15x
Join 10M rows      30s               2s              15x
Memory usage       5GB               500MB           10x
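
Figures like these are workload-dependent, so it is worth reproducing them on your own data. A minimal timing harness in Polars might look like the sketch below (the file and column names are placeholders, not from any real benchmark):

use polars::prelude::*;
use std::time::Instant;

fn main() -> PolarsResult<()> {
    let start = Instant::now();
    // Placeholder file/column names; substitute your own workload
    let out = LazyCsvReader::new("data.csv")
        .finish()?
        .group_by([col("category")])
        .agg([col("amount").sum()])
        .collect()?;
    println!("{} result rows in {:?}", out.height(), start.elapsed());
    Ok(())
}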

Rust Data Ecosystem

Key Libraries

Library      Purpose              Use Case
Polars       DataFrame library    Pandas alternative, data processing
Arrow        Memory format        Interoperability, zero-copy
DataFusion   SQL query engine     Query execution, modular pipelines
Serde        Serialization        Type-safe (de)serialization of data structures

When to Use Rust

Ideal Use Cases

Specific Use Cases:

  • Real-time systems: Fraud detection, real-time analytics
  • Embedded analytics: In-application analytics
  • High-throughput ETL: Moving data between systems
  • API services: Data API backends (sketched after this list)
  • Streaming services: Low-latency stream processing
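
As a sketch of the "API services" use case, here is a minimal data API built with the axum web framework; the crate choice, route, and response shape are all assumptions for illustration, not something this document prescribes:

use axum::{routing::get, Json, Router};
use serde::Serialize;

// Hypothetical response shape for a data API endpoint
#[derive(Serialize)]
struct Stats {
    rows_processed: u64,
    total_amount: f64,
}

async fn stats() -> Json<Stats> {
    // In a real service this would come from Polars/DataFusion state
    Json(Stats { rows_processed: 1_000_000, total_amount: 123_456.78 })
}

#[tokio::main]
async fn main() {
    // Serve GET /stats on port 3000 (axum 0.7-style setup)
    let app = Router::new().route("/stats", get(stats));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:3000").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}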

When NOT to Use Rust

Use Python/Scala instead when:

  • Team expertise: Team already knows Python/Spark
  • Rapid prototyping: Python is faster for development
  • Ecosystem needed: Spark ML, scikit-learn, etc.
  • Simple workloads: Performance not critical
  • Data science: Jupyter notebooks, exploration

Polars: Rust DataFrames

Getting Started

use polars::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Scan the CSV lazily, filter, group, aggregate, sort, then execute.
    // (API names target recent Polars releases; older versions use
    // `groupby` instead of `group_by`.)
    let mut result = LazyCsvReader::new("data.csv")
        .finish()?
        // Keep rows where amount > 100
        .filter(col("amount").gt(lit(100)))
        // Sum amount per category
        .group_by([col("category")])
        .agg([col("amount").sum()])
        .sort(["category"], SortMultipleOptions::default())
        .collect()?;

    // Show
    println!("{}", result);

    // Write to Parquet
    let mut file = std::fs::File::create("output.parquet")?;
    ParquetWriter::new(&mut file).finish(&mut result)?;
    Ok(())
}
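
A design note: the lazy API builds a query plan and executes nothing until collect() is called, which lets Polars push the filter down into the CSV scan and parallelize the whole pipeline rather than materializing intermediate DataFrames.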

Advanced Operations

use polars::prelude::*;
use polars::prelude::pivot::pivot; // behind the "pivot" feature

// Window function: running total of amount per customer.
// Sort by date first so the cumulative sum follows time order.
let df = df
    .lazy()
    .sort(["date"], SortMultipleOptions::default())
    .with_column(
        col("amount")
            .cum_sum(false)
            .over([col("customer_id")])
            .alias("running_total"),
    )
    .collect()?;

// Left join on customer_id, coalescing the duplicated key columns
let joined = df1
    .lazy()
    .join(
        df2.lazy(),
        [col("customer_id")],
        [col("customer_id")],
        JoinArgs::new(JoinType::Left).with_coalesce(JoinCoalesce::CoalesceColumns),
    )
    .collect()?;

// Pivot: category values become columns, one row per date.
// (The exact signature varies across Polars versions.)
let pivoted = pivot(
    &df,
    ["category"],     // on: values that become new column names
    Some(["date"]),   // index: one output row per date
    Some(["amount"]), // values to fill the new columns
    false,            // don't sort the new columns
    None,             // no aggregation (assumes unique date/category pairs)
    None,             // default column-name separator
)?;

DataFusion: SQL Query Engine

Building a Query Engine

use datafusion::arrow::array::{Float64Array, StringArray};
use datafusion::error::Result as DFResult;
use datafusion::prelude::*;

// DataFusion's API is async, so it runs under an async runtime like tokio
#[tokio::main]
async fn main() -> DFResult<()> {
    // Create a session and register the Parquet file as a table
    let ctx = SessionContext::new();
    ctx.register_parquet("sales", "data.parquet", ParquetReadOptions::default())
        .await?;

    // Execute SQL
    let df = ctx
        .sql(
            "SELECT category,
                    SUM(amount) AS total,
                    AVG(amount) AS average
             FROM sales
             WHERE date >= '2025-01-01'
             GROUP BY category
             ORDER BY total DESC",
        )
        .await?;

    // Collect the results as Arrow record batches
    let batches = df.collect().await?;

    // Arrow is columnar: downcast columns, then iterate rows by index
    for batch in batches {
        let categories = batch
            .column(0)
            .as_any()
            .downcast_ref::<StringArray>()
            .expect("category column");
        let totals = batch
            .column(1)
            .as_any()
            .downcast_ref::<Float64Array>()
            .expect("total column");
        for i in 0..batch.num_rows() {
            println!("{}: {}", categories.value(i), totals.value(i));
        }
    }
    Ok(())
}
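
SQL is not the only interface: the same engine exposes a typed DataFrame API, which is useful for building pipelines programmatically. A minimal sketch against the same hypothetical Parquet file:

use datafusion::error::Result as DFResult;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> DFResult<()> {
    let ctx = SessionContext::new();
    // Build the query with the DataFrame API instead of SQL
    let df = ctx
        .read_parquet("data.parquet", ParquetReadOptions::default())
        .await?
        .filter(col("amount").gt(lit(100)))?
        .select(vec![col("category"), col("amount")])?
        .limit(0, Some(10))?;
    df.show().await?; // print the first batches to stdout
    Ok(())
}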

Streaming with Rust

Kafka Consumer

use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::message::Message;
use serde::Deserialize;

// Minimal event shape for this example; real fields depend on your topic
#[derive(Deserialize)]
struct Event {
    id: u64,
    amount: f64,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // rdkafka configures consumers through ClientConfig
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "rust-consumer")
        .set("auto.offset.reset", "earliest")
        .create()?;

    consumer.subscribe(&["events"])?;

    // StreamConsumer is async: receive messages in a loop
    loop {
        match consumer.recv().await {
            Ok(msg) => {
                if let Some(data) = msg.payload() {
                    process_message(data)?;
                }
            }
            Err(e) => eprintln!("Kafka error: {}", e),
        }
    }
}

fn process_message(data: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    // Parse the JSON payload
    let event: Event = serde_json::from_slice(data)?;
    // Transform and write out (helpers defined elsewhere)
    let transformed = transform_event(event)?;
    write_to_output(transformed)?;
    Ok(())
}
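
The write_to_output helper above is left undefined; one way to implement the output side is rdkafka's FutureProducer. In this sketch the broker address and output topic are assumptions:

use rdkafka::config::ClientConfig;
use rdkafka::producer::{FutureProducer, FutureRecord};
use std::time::Duration;

async fn write_to_kafka(payload: &[u8]) -> Result<(), Box<dyn std::error::Error>> {
    // Broker address and topic name are placeholders
    let producer: FutureProducer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .create()?;

    // send() resolves once the broker acknowledges (or the timeout expires)
    producer
        .send(
            FutureRecord::<(), [u8]>::to("events-out").payload(payload),
            Duration::from_secs(5),
        )
        .await
        .map_err(|(err, _msg)| err)?;
    Ok(())
}

In a real pipeline the producer would be created once and shared across messages, not rebuilt per call.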

Performance Optimization

Memory Efficiency

// Rust allows fine-grained memory control.
// Streaming processing (never loads the whole file):
use std::io::{BufRead, BufReader};

fn process_large_file(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = std::fs::File::open(path)?;
    let reader = BufReader::new(file);
    // Process line by line: memory stays O(1) in the file size
    for line in reader.lines() {
        let data: Vec<u8> = line?.into_bytes();
        process_chunk(data)?; // helper defined elsewhere
    }
    Ok(())
}

// Zero-copy with Arrow
use arrow::array::StringArray;
use arrow::record_batch::RecordBatch;

fn zero_copy_processing(batch: &RecordBatch) -> Result<(), Box<dyn std::error::Error>> {
    // Downcast the first column to a string array (no data is copied)
    let strings = batch
        .column(0)
        .as_any()
        .downcast_ref::<StringArray>()
        .ok_or("Cannot cast")?;
    for i in 0..strings.len() {
        let _value = strings.value(i); // a borrow into the batch, no copy
        // ... process without copying ...
    }
    Ok(())
}

Parallel Processing

use rayon::prelude::*;

struct Data { id: u64, value: i64 }
struct Processed { id: u64, value: i64 }

fn parallel_process(data: Vec<Data>) -> Result<Vec<Processed>, String> {
    // Process in parallel (all cores); collect short-circuits on the first error
    data.into_par_iter()
        .map(process_item)
        .collect()
}

fn process_item(item: Data) -> Result<Processed, String> {
    // Stand-in for a more complex transformation
    Ok(Processed {
        id: item.id,
        value: item.value * 2,
    })
}
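
Rayon sizes its global pool to the number of logical cores by default; if the pipeline shares a host with other services, the pool can be capped explicitly. A minimal sketch:

use rayon::ThreadPoolBuilder;

fn main() {
    // Cap the global rayon pool at 4 worker threads;
    // must run before the first parallel operation
    ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .expect("global pool already initialized");
}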

Integration Patterns

Rust + Python Interop

Example: Python calls Rust for performance-critical code

// Rust library (lib.rs), using PyO3 (signatures shown for pyo3 0.20/0.21)
use pyo3::prelude::*;

#[pyfunction]
fn process_data_fast(data: Vec<f64>) -> PyResult<Vec<f64>> {
    // High-performance processing: doubles every value
    Ok(data.iter().map(|x| x * 2.0).collect())
}

#[pymodule]
fn mylib(_py: Python, m: &PyModule) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(process_data_fast, m)?)?;
    Ok(())
}

# Python code
import mylib

# Call Rust for the hot path
data = [1.0, 2.0, 3.0, ...]  # imagine 10M values
result = mylib.process_data_fast(data)  # far faster than a pure-Python loop
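
To build and install the extension, the usual workflow is the maturin tool (maturin develop compiles the crate into the active virtualenv as a Python module); the crate must be compiled as a cdylib.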

Cost Optimization

Performance vs. Cost Trade-off

Cost analysis (illustrative figures):

Scenario                     Python Cost   Rust Cost    Savings
Process 1TB/day              $10K/month    $1K/month    90%
Real-time API (100K QPS)     $50K/month    $5K/month    90%
Batch processing (100TB)     $20K/month    $5K/month    75%

Senior Level Considerations

When to Invest in Rust

Weigh the team factors below against the performance and cost gains described above.

Team Considerations

Factor              Rust                        Python
Learning curve      Steep                       Shallow
Development speed   Slower initially            Faster
Debugging           Compile-time errors help    Runtime errors
Ecosystem           Growing, smaller            Mature, extensive
Hiring              Harder to find              Easier to find
Compilation         Required                    None

Migration Strategy

Gradual Adoption
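
A pragmatic path, consistent with the integration pattern above:

  1. Profile the existing Python pipeline and identify the hot paths.
  2. Rewrite only those paths in Rust, exposed to Python via PyO3.
  3. Benchmark the hybrid against the original to confirm the win.
  4. Expand the Rust surface only where performance or cost justifies it.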


Best Practices

DO

// 1. Use Polars for DataFrame operations (lazy where possible)
let df = LazyCsvReader::new("data.csv")
    .finish()?
    .filter(col("amount").gt(lit(100)));

// 2. Use Arrow for memory efficiency
let batch = RecordBatch::try_new(schema, columns)?; // schema: SchemaRef, columns: Vec<ArrayRef>

// 3. Use rayon for parallel processing
data.par_iter().for_each(|item| process(item));

// 4. Profile with flamegraph (cargo flamegraph) to find bottlenecks

// 5. Measure everything
use std::time::Instant;
let start = Instant::now();
// ... code ...
println!("Took: {:?}", start.elapsed());

DON’T

// 1. Don't ignore compiler warnings
//    They often point at real bugs or wasted work (unused results, dead code)
// 2. Don't clone unnecessarily
//    Use references (&) instead
// 3. Don't allocate in hot loops
//    Pre-allocate outside loops (e.g. Vec::with_capacity)
// 4. Don't use mutexes when read-write locks suffice
//    Use RwLock for read-heavy workloads
// 5. Don't optimize prematurely
//    Profile first, then optimize hot paths
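
Points 2 and 3 in practice, as a minimal sketch: borrow instead of cloning, and allocate once before the loop:

// Pre-allocate outside the hot loop instead of allocating per iteration
fn join_lines(lines: &[String]) -> String {
    let total: usize = lines.iter().map(|l| l.len() + 1).sum();
    let mut out = String::with_capacity(total); // one allocation up front
    for line in lines {
        out.push_str(line); // borrows the line; no clone needed
        out.push('\n');
    }
    out
}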

Key Takeaways

  1. Performance: often 10-100x faster than Python for data processing
  2. Predictability: no garbage collector, so no GC pauses; memory safety is enforced at compile time
  3. Polars: Pandas-like API, excellent performance
  4. DataFusion: Modular SQL query engine
  5. Integration: PyO3 enables Python + Rust hybrid
  6. Cost: Can reduce compute costs by 75-90%
  7. Investment: High learning curve, justify for performance-critical paths
  8. Team: Harder to hire Rust developers than Python

Further Learning

Projects to Try:

  1. Convert a Pandas ETL script to Polars
  2. Build a real-time data API with Rust
  3. Create a custom data processing library
  4. Benchmark Python vs. Rust for your workload
