Books, Papers, Blogs, and Courses for Data Engineers
Overview
This document collects reading and learning recommendations for data engineers at all levels: books grouped by topic, foundational papers grouped by origin, and blogs, courses, and other resources grouped by type.
Essential Books
Must-Read (All Levels)
| Book | Author | Year | Why Read | Prerequisites |
|---|---|---|---|---|
| Designing Data-Intensive Applications | Martin Kleppmann | 2017 | Best book on distributed systems | Basic programming |
| The Data Warehouse Toolkit | Ralph Kimball | 2013 | Dimensional modeling foundation | Basic SQL |
| Streaming Systems | Tyler Akidau | 2018 | Streaming fundamentals | Distributed systems basics |
Advanced Books
| Book | Author | Topic | Level |
|---|---|---|---|
| Data Mesh: Delivering Data-Driven Value at Scale | Zhamak Dehghani | Data architecture | Advanced |
| Database Internals | Alex Petrov | Storage engines | Advanced |
| Designing Great Data Warehouses | David Crepit | Dimensional modeling | Intermediate |
| Spark: The Definitive Guide | Chambers & Zaharia | Spark | Intermediate |
| Stream Processing with Apache Flink | Fabian Hueske | Flink | Intermediate |
| Fusion Tables for Data Analysis | Not applicable | Lakehouse | Intermediate |
Specialized Topics
| Topic | Book | Author | Best For |
|---|---|---|---|
| Kafka | “Kafka: The Definitive Guide” | Neha Narkhede | Kafka deep dive |
| dbt | “The dbt Book” | dbt Labs | Transformation |
| Airflow | “Data Pipelines with Apache Airflow” | Bas Harenslak & Julian de Ruiter | Orchestration |
| Cost | “Cloud Cost Optimization” | O’Reilly | FinOps |
| Go | “Data Engineering with Go” | Not applicable | High performance |
| Rust | “Programming Rust” | Jim Blandy & Jason Orendorff | Performance critical |
Foundational Papers
Google Papers (Foundational)
| Paper | Year | Topic | Modern Equivalents |
|---|---|---|---|
| The Google File System | 2003 | Distributed storage | HDFS, S3 |
| MapReduce | 2004 | Batch processing | Spark, Hadoop |
| Bigtable | 2006 | NoSQL database | HBase, Cassandra |
| Chubby | 2006 | Distributed lock | ZooKeeper, etcd |
| Dremel | 2010 | Interactive query | Drill, Impala |
| Percolator | 2010 | Incremental processing | Flink, Spark Streaming |
Amazon Papers
| Paper | Year | Topic | Modern Equivalents |
|---|---|---|---|
| Dynamo | 2007 | Distributed KV | Cassandra, Riak |
| Aurora | 2017 | Cloud database | Cloud SQL, RDS |
Database Papers
| Paper | Year | Topic | Importance |
|---|---|---|---|
| A Protocol for Distributed Locks | 1990 | Distributed locks | Paxos foundation |
| The Log | 2013 | Log-based architecture | Kafka, Kinesis |
| How to Beat the CAP Theorem | 2011 | CAP theorem | Understanding trade-offs |
Lakehouse Papers
| Paper | Year | Topic | Link |
|---|---|---|---|
| Delta Lake | 2020 | ACID on data lake | Link |
| Apache Iceberg | 2018 | Table format | Link |
| Dremel | 2010 | Nested data | Link |
| Lakehouse | 2021 | Lakehouse architecture | Link |
Blogs and Newsletters
Must-Read Weekly
| Resource | Focus | Frequency | Why Subscribe |
|---|---|---|---|
| The Morning Paper | Paper summaries | Daily | Stay current with research |
| Data Engineering Weekly | Industry news | Weekly | Curated DE news |
| ByteByteGo | System design | Weekly | Interview prep |
| Scale Fall | Real-world scale | Weekly | Production lessons |
Vendor Blogs (High Quality)
| Vendor | Blog | Topics | Frequency |
|---|---|---|---|
| Databricks | blog | Lakehouse, Spark, AI | Weekly |
| Confluent | blog | Kafka, streaming | Weekly |
| ClickHouse | blog | ClickHouse features | Monthly |
| Snowflake | blog | Snowflake features | Weekly |
| Uber Engineering | blog | Real-world scale | Variable |
| Netflix Tech Blog | blog | Streaming, architecture | Monthly |
| AWS Architecture | blog | AWS patterns | Weekly |
Individual Bloggers
Online Courses
University Courses (Free)
| Course | University | Platform | Topics | Duration |
|---|---|---|---|---|
| CMU Database Systems | CMU | YouTube | DB internals, indexing | 15 weeks |
| MIT 6.824 | MIT | website | Distributed systems | 12 weeks |
| Stanford ML | Stanford | Coursera | Machine learning | 11 weeks |
| Berkeley CS186 | Berkeley | YouTube | Databases | 15 weeks |
Platform Courses
| Course | Platform | Cost | Topics | Worth It? |
|---|---|---|---|---|
| Databricks Academy | Databricks | Free | Spark, Lakehouse | Yes |
| Confluent Training | Confluent | Free/Paid | Kafka | Yes (free) |
| DataCamp DE Track | DataCamp | $25/mo | Basic skills | For beginners |
| Coursera GCP Data Engineer | Coursera | $49/mo | GCP | If GCP focus |
| Udacity Data Engineer Nanodegree | Udacity | $1000/mo | General | Overpriced |
| Udemy Data Engineering | Udemy | $15 | Various | Hit or miss |
Specialized Courses
| Topic | Course | Platform | Level |
|---|---|---|---|
| Streaming | “Kafka Streams” | Confluent | Intermediate |
| Flink | “Apache Flink” | various | Intermediate |
| dbt | “dbt Fundamentals” | dbt Labs | Beginner |
| Airflow | “Apache Airflow” | various | Intermediate |
| Spark | “Spark 3.0” | Databricks | Intermediate |
YouTube Channels
| Channel | Topics | Why Watch |
|---|---|---|
| CMU Database Group | Database internals | Deep technical content |
| MIT OpenCourseWare | Distributed systems | University lectures |
| Databricks | Spark, Lakehouse | Product deep dives |
| Confluent | Kafka, Streaming | Event-driven architecture |
| TechLead | System design interviews | Interview prep |
Podcasts
| Podcast | Hosts | Topics | Length |
|---|---|---|---|
| Data Engineering Podcast | Tobias Macey | General DE | 30-60 min |
| The AI Show | Databricks | AI + Data | 30-60 min |
| Streaming Audio | Confluent | Kafka, streaming | 30-45 min |
| TheIndex | ClickHouse | Databases | 30-60 min |
Practice Resources
| Platform | Focus | Use For |
|---|---|---|
| LeetCode | SQL + algorithms | Interview prep |
| HackerRank | SQL, Python | Skill practice |
| Exercism | General programming | Language learning |
| Codewars | Algorithms | Coding practice |
Design Practice
| Resource | Focus | Use For |
|---|---|---|
| ByteByteGo | System design | Mock interviews |
| System Design Primer (GitHub repo) | System design | Design patterns |
| ThePrimeagen | System design | Video content |
Reading Strategy
By Career Level
Time Allocation
| Activity | Frequency | Time Investment |
|---|---|---|
| Book reading | Daily | 30 minutes |
| Paper reading | Weekly | 2-3 papers |
| Blog/newsletter | Daily | 15 minutes |
| Conference talks | Weekly | 1-2 talks |
| Courses | Quarterly | 1 course |
Key Takeaways
- DDIA is essential: Read “Designing Data-Intensive Applications” first
- Papers over books: Papers are primary sources, books are secondary
- Follow vendor blogs: Databricks, Confluent, dbt have great content
- Balance theory/practice: Papers + blog posts + hands-on
- Stay current: Papers, blogs, conferences
- Re-read classics: DDIA, Kimball worth revisiting annually