Further Reading

Books, Papers, Blogs, and Courses for Data Engineers

Overview

This document provides comprehensive reading recommendations for data engineers at all levels. Books are organized by topic, papers by foundational concepts, and courses by learning path.

Essential Books

Must-Read (All Levels)

Book	Author	Year	Why Read	Prerequisites
Designing Data-Intensive Applications	Martin Kleppmann	2017	Best book on distributed systems	Basic programming
The Data Warehouse Toolkit	Ralph Kimball	2013	Dimensional modeling foundation	Basic SQL
Streaming Systems	Tyler Akidau	2018	Streaming fundamentals	Distributed systems basics

Advanced Books

Book	Author	Topic	Level
Data Mesh: Delivering Data-Driven Value at Scale	Zhamak Dehghani	Data architecture	Advanced
Database Internals	Alex Petrov	Storage engines	Advanced
Designing Great Data Warehouses	David Crepit	Dimensional modeling	Intermediate
Spark: The Definitive Guide	Chambers & Zaharia	Spark	Intermediate
Stream Processing with Apache Flink	Fabian Hueske	Flink	Intermediate
Fusion Tables for Data Analysis	Not applicable	Lakehouse	Intermediate

Specialized Topics

Topic	Book	Author	Best For
Kafka	“Kafka: The Definitive Guide”	Neha Narkhede	Kafka deep dive
dbt	“The dbt Book”	dbt Labs	Transformation
Airflow	“Data Pipelines with Apache Airflow”	Bas Harenslak & Julian de Ruiter	Orchestration
Cost	“Cloud Cost Optimization”	O’Reilly	FinOps
Go	“Data Engineering with Go”	Not applicable	High performance
Rust	“Programming Rust”	Jim Blandy & Jason Orendorff	Performance critical

Foundational Papers

Google Papers (Foundational)

Paper	Year	Topic	Modern Equivalents
The Google File System	2003	Distributed storage	HDFS, S3
MapReduce	2004	Batch processing	Spark, Hadoop
Bigtable	2006	NoSQL database	HBase, Cassandra
Chubby	2006	Distributed lock	ZooKeeper, etcd
Dremel	2010	Interactive query	Drill, Impala
Percolator	2010	Incremental processing	Flink, Spark Streaming

Amazon Papers

Paper	Year	Topic	Modern Equivalents
Dynamo	2007	Distributed KV	Cassandra, Riak
Aurora	2017	Cloud database	Cloud SQL, RDS

Database Papers

Paper	Year	Topic	Importance
The Part-Time Parliament	1998	Distributed consensus	Paxos foundation
The Log	2013	Log-based architecture	Kafka, Kinesis
How to Beat the CAP Theorem	2011	CAP theorem	Understanding trade-offs

Lakehouse Papers

Paper	Year	Topic	Link
Delta Lake	2020	ACID on data lake	Link
Apache Iceberg	2018	Table format	Link
Dremel	2010	Nested data	Link
Lakehouse	2021	Lakehouse architecture	Link

Blogs and Newsletters

Must-Read Weekly

Resource	Focus	Frequency	Why Subscribe
The Morning Paper	Paper summaries	Daily	Stay current with research
Data Engineering Weekly	Industry news	Weekly	Curated DE news
ByteByteGo	System design	Weekly	Interview prep
Scale Fall	Real-world scale	Weekly	Production lessons

Vendor Blogs (High Quality)

Vendor	Blog	Topics	Frequency
Databricks	blog	Lakehouse, Spark, AI	Weekly
Confluent	blog	Kafka, streaming	Weekly
ClickHouse	blog	ClickHouse features	Monthly
Snowflake	blog	Snowflake features	Weekly
Uber Engineering	blog	Real-world scale	Variable
Netflix Tech Blog	blog	Streaming, architecture	Monthly
AWS Architecture	blog	AWS patterns	Weekly

Individual Bloggers

Person	Blog	Topics
Martin Kleppmann	martin.kleppmann.com	Distributed systems
Julia Evans	jvns.ca	Systems, debugging
Maxime Beauchemin	medium	Data engineering
Jay Kreps	blog	Kafka, streaming

Online Courses

University Courses (Free)

Course	University	Platform	Topics	Duration
CMU Database Systems	CMU	YouTube	DB internals, indexing	15 weeks
MIT 6.824	MIT	website	Distributed systems	12 weeks
Stanford ML	Stanford	Coursera	Machine learning	11 weeks
Berkeley CS186	Berkeley	YouTube	Databases	15 weeks

Platform Courses (Free and Paid)

Course	Platform	Cost	Topics	Worth It?
Databricks Academy	Databricks	Free	Spark, Lakehouse	Yes
Confluent Training	Confluent	Free/Paid	Kafka	Yes (free)
DataCamp DE Track	DataCamp	$25/mo	Basic skills	For beginners
Coursera GCP Data Engineer	Coursera	$49/mo	GCP	If GCP focus
Udacity Data Engineer Nanodegree	Udacity	$1000/mo	General	Overpriced
Udemy Data Engineering	Udemy	$15	Various	Hit or miss

Specialized Courses

Topic	Course	Platform	Level
Streaming	“Kafka Streams”	Confluent	Intermediate
Flink	“Apache Flink”	various	Intermediate
dbt	“dbt Fundamentals”	dbt Labs	Beginner
Airflow	“Apache Airflow”	various	Intermediate
Spark	“Spark 3.0”	Databricks	Intermediate

YouTube Channels

Channel	Topics	Why Watch
CMU Database Group	Database internals	Deep technical content
MIT OpenCourseWare	Distributed systems	University lectures
Databricks	Spark, Lakehouse	Product deep dives
Confluent	Kafka, Streaming	Event-driven architecture
TechLead	System design interviews	Interview prep

Podcasts

Podcast	Hosts	Topics	Length
Data Engineering Podcast	Tobias Macey	General DE	30-60 min
The ai Show	Databricks	AI + Data	30-60 min
Streaming Audio	Confluent	Kafka, streaming	30-45 min
TheIndex	ClickHouse	Databases	30-60 min

Practice Resources

Coding Platforms

Platform	Focus	Use For
LeetCode	SQL + algorithms	Interview prep
HackerRank	SQL, Python	Skill practice
Exercism	General programming	Language learning
Codewars	Algorithms	Coding practice

Design Practice

Resource	Focus	Use For
ByteByteGo	System design	Mock interviews
System Design Primer	GitHub repo	Design patterns
The Primeagen	System design	Video content

Reading Strategy

By Career Level

Time Allocation

Activity	Frequency	Time Investment
Book reading	Daily	30 minutes
Paper reading	Weekly	2-3 papers
Blog/newsletter	Daily	15 minutes
Conference talks	Weekly	1-2 talks
Courses	Quarterly	1 course

Key Takeaways

DDIA is essential: Read “Designing Data-Intensive Applications” first
Papers over books: Papers are primary sources, books are secondary
Follow vendor blogs: Databricks, Confluent, and dbt Labs publish strong technical content
Balance theory/practice: Papers + blog posts + hands-on
Stay current: Papers, blogs, conferences
Re-read classics: DDIA and Kimball are worth revisiting annually