
Further Reading

Books, Papers, Blogs, and Courses for Data Engineers


Overview

This document provides comprehensive reading recommendations for data engineers at all levels. Books are organized by topic, papers by foundational concepts, and courses by learning path.


Essential Books

Must-Read (All Levels)

| Book | Author | Year | Why Read | Prerequisites |
| --- | --- | --- | --- | --- |
| Designing Data-Intensive Applications | Martin Kleppmann | 2017 | Best book on distributed systems | Basic programming |
| The Data Warehouse Toolkit | Ralph Kimball | 2013 | Dimensional modeling foundation | Basic SQL |
| Streaming Systems | Tyler Akidau | 2018 | Streaming fundamentals | Distributed systems basics |

Advanced Books

| Book | Author | Topic | Level |
| --- | --- | --- | --- |
| Data Mesh: Delivering Data-Driven Value at Scale | Zhamak Dehghani | Data architecture | Advanced |
| Database Internals | Alex Petrov | Storage engines | Advanced |
| Designing Great Data Warehouses | David Crepit | Dimensional modeling | Intermediate |
| Spark: The Definitive Guide | Chambers & Zaharia | Spark | Intermediate |
| Stream Processing with Apache Flink | Fabian Hueske | Flink | Intermediate |
| Fusion Tables for Data Analysis | Not applicable | Lakehouse | Intermediate |

Specialized Topics

| Topic | Book | Author | Best For |
| --- | --- | --- | --- |
| Kafka | “Kafka: The Definitive Guide” | Neha Narkhede | Kafka deep dive |
| dbt | “The dbt Book” | dbt Labs | Transformation |
| Airflow | “Data Pipelines with Apache Airflow” | Bas Harenslak & Julian de Ruiter | Orchestration |
| Cost | “Cloud Cost Optimization” | O’Reilly | FinOps |
| Go | “Data Engineering with Go” | Not applicable | High performance |
| Rust | “Programming Rust” | Jim Blandy & Jason Orendorff | Performance critical |

Foundational Papers

Google Papers (Foundational)

| Paper | Year | Topic | Modern Equivalents |
| --- | --- | --- | --- |
| The Google File System | 2003 | Distributed storage | HDFS, S3 |
| MapReduce | 2004 | Batch processing | Spark, Hadoop |
| Bigtable | 2006 | NoSQL database | HBase, Cassandra |
| Chubby | 2006 | Distributed lock | ZooKeeper, etcd |
| Dremel | 2010 | Interactive query | Drill, Impala |
| Percolator | 2010 | Incremental processing | Flink, Spark Streaming |

Amazon Papers

| Paper | Year | Topic | Modern Equivalents |
| --- | --- | --- | --- |
| Dynamo | 2007 | Distributed KV | Cassandra, Riak |
| Aurora | 2017 | Cloud database | Cloud SQL, RDS |

Database Papers

| Paper | Year | Topic | Importance |
| --- | --- | --- | --- |
| A Protocol for Distributed Locks | 1990 | Distributed locks | Paxos foundation |
| The Log | 2013 | Log-based architecture | Kafka, Kinesis |
| How to Beat the CAP Theorem | 2011 | CAP theorem | Understanding trade-offs |

Lakehouse Papers

| Paper | Year | Topic | Link |
| --- | --- | --- | --- |
| Delta Lake | 2019 | ACID on data lake | Link |
| Apache Iceberg | 2018 | Table format | Link |
| Dremel | 2010 | Nested data | Link |
| Lakehouse | 2021 | Lakehouse architecture | Link |

Blogs and Newsletters

Must-Read Weekly

| Resource | Focus | Frequency | Why Subscribe |
| --- | --- | --- | --- |
| The Morning Paper | Paper summaries | Daily | Stay current with research |
| Data Engineering Weekly | Industry news | Weekly | Curated DE news |
| ByteByteGo | System design | Weekly | Interview prep |
| Scale Fall | Real-world scale | Weekly | Production lessons |

Vendor Blogs (High Quality)

| Vendor | Blog | Topics | Frequency |
| --- | --- | --- | --- |
| Databricks | blog | Lakehouse, Spark, AI | Weekly |
| Confluent | blog | Kafka, streaming | Weekly |
| ClickHouse | blog | ClickHouse features | Monthly |
| Snowflake | blog | Snowflake features | Weekly |
| Uber Engineering | blog | Real-world scale | Variable |
| Netflix Tech Blog | blog | Streaming, architecture | Monthly |
| AWS Architecture | blog | AWS patterns | Weekly |

Individual Bloggers

| Person | Blog | Topics |
| --- | --- | --- |
| Martin Kleppmann | martin.kleppmann.com | Distributed systems |
| Julia Evans | jvns.ca | Systems, debugging |
| Maxime Beauchemin | Medium | Data engineering |
| Jay Kreps | blog | Kafka, streaming |

Online Courses

University Courses (Free)

| Course | University | Platform | Topics | Duration |
| --- | --- | --- | --- | --- |
| CMU Database Systems | CMU | YouTube | DB internals, indexing | 15 weeks |
| MIT 6.824 | MIT | website | Distributed systems | 12 weeks |
| Stanford ML | Stanford | Coursera | Machine learning | 11 weeks |
| Berkeley CS186 | Berkeley | YouTube | Databases | 15 weeks |

Platform Courses (Paid)

| Course | Platform | Cost | Topics | Worth It? |
| --- | --- | --- | --- | --- |
| Databricks Academy | Databricks | Free | Spark, Lakehouse | Yes |
| Confluent Training | Confluent | Free/Paid | Kafka | Yes (free) |
| DataCamp DE Track | DataCamp | $25/mo | Basic skills | For beginners |
| Coursera GCP Data Engineer | Coursera | $49/mo | GCP | If GCP focus |
| Udacity Data Engineer Nanodegree | Udacity | $1000/mo | General | Overpriced |
| Udemy Data Engineering | Udemy | $15 | Various | Hit or miss |

Specialized Courses

| Topic | Course | Platform | Level |
| --- | --- | --- | --- |
| Streaming | “Kafka Streams” | Confluent | Intermediate |
| Flink | “Apache Flink” | various | Intermediate |
| dbt | “dbt Fundamentals” | dbt Labs | Beginner |
| Airflow | “Apache Airflow” | various | Intermediate |
| Spark | “Spark 3.0” | Databricks | Intermediate |

YouTube Channels

| Channel | Topics | Why Watch |
| --- | --- | --- |
| CMU Database Group | Database internals | Deep technical content |
| MIT OpenCourseWare | Distributed systems | University lectures |
| Databricks | Spark, Lakehouse | Product deep dives |
| Confluent | Kafka, Streaming | Event-driven architecture |
| TechLead | System design interviews | Interview prep |

Podcasts

| Podcast | Hosts | Topics | Length |
| --- | --- | --- | --- |
| Data Engineering Podcast | Tobias Macey | General DE | 30-60 min |
| The ai Show | Databricks | AI + Data | 30-60 min |
| Streaming Audio | Confluent | Kafka, streaming | 30-45 min |
| TheIndex | ClickHouse | Databases | 30-60 min |

Practice Resources

Coding Platforms

| Platform | Focus | Use For |
| --- | --- | --- |
| LeetCode | SQL + algorithms | Interview prep |
| HackerRank | SQL, Python | Skill practice |
| Exercism | General programming | Language learning |
| Codewars | Algorithms | Coding practice |

Design Practice

| Resource | Focus | Use For |
| --- | --- | --- |
| ByteByteGo | System design | Mock interviews |
| System Design Primer | GitHub repo | Design patterns |
| The Primeagen | System design | Video content |

Reading Strategy

By Career Level

Time Allocation

| Activity | Frequency | Time Investment |
| --- | --- | --- |
| Book reading | Daily | 30 minutes |
| Paper reading | Weekly | 2-3 papers |
| Blog/newsletter | Daily | 15 minutes |
| Conference talks | Weekly | 1-2 talks |
| Courses | Quarterly | 1 course |
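
As a rough illustration of what the fixed daily commitments in the table above add up to, the sketch below totals them per week. It assumes "Daily" means seven days a week, and the names `DAILY_MINUTES` and `weekly_minutes` are hypothetical. Papers, talks, and courses are left out because the table gives counts for them rather than durations.

```python
# Fixed-duration activities from the Time Allocation table (minutes per day).
# Only rows with an explicit duration are included.
DAILY_MINUTES = {
    "Book reading": 30,
    "Blog/newsletter": 15,
}


def weekly_minutes(daily_minutes: dict[str, int], days_per_week: int = 7) -> int:
    """Total minutes per week for activities done every day.

    days_per_week=7 is an assumption about what "Daily" means here.
    """
    return sum(daily_minutes.values()) * days_per_week


total = weekly_minutes(DAILY_MINUTES)
print(f"{total} min/week ({total / 60:.2f} h)")  # 315 min/week (5.25 h)
```

Passing `days_per_week=5` gives the weekday-only figure instead.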

Key Takeaways

  1. DDIA is essential: Read “Designing Data-Intensive Applications” first
  2. Papers over books: Papers are primary sources; books are secondary
  3. Follow vendor blogs: Databricks, Confluent, and dbt Labs publish strong technical content
  4. Balance theory/practice: Papers + blog posts + hands-on
  5. Stay current: Papers, blogs, conferences
  6. Re-read classics: DDIA, Kimball worth revisiting annually
