
Module 10: References


Overview

This module provides curated resources for continued learning, including books, courses, certifications, conferences, and community resources. Data engineering is rapidly evolving; staying current requires continuous learning.


Further Reading

Essential Books

| Book | Author | Focus | Level |
| --- | --- | --- | --- |
| Designing Data-Intensive Applications | Martin Kleppmann | Distributed systems | Advanced |
| The Data Warehouse Toolkit | Ralph Kimball | Dimensional modeling | Intermediate |
| Stream Processing with Apache Flink | Fabian Hueske | Streaming | Intermediate |
| Data Mesh | Zhamak Dehghani | Data architecture | Advanced |
| Designing Great Data Warehouses | David Crepit | Dimensional modeling | Intermediate |
| Spark: The Definitive Guide | Bill Chambers & Matei Zaharia | Spark | Intermediate |
Foundational Papers

| Paper | Topic | Link |
| --- | --- | --- |
| The Google File System | Distributed storage | Link |
| MapReduce | Distributed processing | Link |
| Bigtable | NoSQL database | Link |
| Dynamo | Distributed KV store | Link |
| Lakehouse | Lakehouse architecture | Link |
| Delta Lake | ACID transactions | Link |
| Apache Iceberg | Table format | Link |

Blogs & Newsletters

| Resource | Focus | Frequency |
| --- | --- | --- |
| The Morning Paper | Paper summaries | Daily |
| Data Engineering Weekly | Industry news | Weekly |
| ClickHouse Blog | ClickHouse | Variable |
| Databricks Blog | Lakehouse, Spark | Weekly |
| Confluent Blog | Kafka, Streaming | Weekly |
| Uber Engineering Blog | Real-world architecture | Variable |

Certification Paths

Cloud Provider Certifications

| Certification | Provider | Level | Value |
| --- | --- | --- | --- |
| AWS Certified Data Engineer (replaced the retired Data Analytics Specialty) | Amazon | Associate | High |
| Google Cloud Professional Data Engineer | Google | Professional | High |
| Azure Data Engineer Associate | Microsoft | Associate | Medium |
| Databricks Certified Data Engineer | Databricks | Professional | High |
| Snowflake SnowPro | Snowflake | Advanced | Medium |

Open Source Certifications

| Certification | Provider | Level | Value |
| --- | --- | --- | --- |
| Confluent Kafka | Confluent | Various | Medium |
| Apache Spark | Databricks | Professional | High |
| dbt Developer | dbt Labs | Associate | Medium |

Certification Strategy

Recommendation: Focus on 1-2 cloud providers (your primary + one backup) plus core technologies (Spark, dbt, Kafka).


Community Resources

Conferences

| Conference | Focus | When | Where |
| --- | --- | --- | --- |
| Data Council | Data engineering | Spring | Various |
| Strata Data | Data/AI | Spring/Fall | Various |
| Spark Summit (now Data + AI Summit) | Spark | Annual | Various |
| Current | Kafka | Annual | Various |
| KubeCon | Kubernetes | Annual | Various |
| Flink Forward | Flink | Annual | Various |

Meetups

| Meetup | Focus | Format |
| --- | --- | --- |
| Data Engineering Meetup | General | Presentations + networking |
| Apache Kafka Meetup | Kafka | Presentations + tutorials |
| Spark Meetup | Spark | Presentations + networking |
| MLOps Meetup | ML Ops | Presentations + discussions |

Online Communities

| Community | Platform | Focus |
| --- | --- | --- |
| r/dataengineering | Reddit | General discussion |
| Data Engineering Discord | Discord | Real-time chat |
| Slack communities | Slack | Various (dbt, Flink, etc.) |
| LinkedIn Groups | LinkedIn | Professional networking |
| Twitter/X | Social | News, discussions |

Learning Paths

Path 1: Staff Data Engineer

Timeline: 10-12 months

  1. Foundation (2 months)

    • Read “Designing Data-Intensive Applications”
    • Complete cloud provider certification
    • Build foundational projects
  2. Core Skills (3 months)

    • Master Spark or Flink
    • Learn dbt deeply
    • Complete Databricks certification
  3. Architecture (3 months)

    • Study system design patterns
    • Learn cost optimization
    • Practice mock interviews
  4. Interview Prep (2-4 months)

    • Prepare STAR stories
    • Practice system design
    • Mock interviews

Path 2: Principal Data Engineer

Timeline: 18-24 months

Prerequisites: Already at Staff level

  1. Depth in 2-3 areas (6 months)

    • Deep expertise in specialization
    • Public speaking, writing
    • Industry recognition
  2. Breadth across all areas (6 months)

    • Learn adjacent domains
    • Cross-functional projects
    • Mentorship experience
  3. Strategic Impact (6-12 months)

    • Company-wide initiatives
    • Cost optimization at scale
    • Technical vision

Practice Projects

Build These Projects

| Project | Skills | Complexity |
| --- | --- | --- |
| Real-time ETL Pipeline | Kafka, Flink/Spark Structured Streaming, Delta Lake | Medium |
| Data Platform from Scratch | End-to-end architecture | High |
| Cost Optimization Project | FinOps, storage/compute | Medium |
| ML Feature Store | Feature engineering, serving | High |
| Real-time Personalization | Streaming, ML, low latency | High |
| Data Mesh Implementation | Governance, decentralization | Very High |
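Before wiring up Kafka or a streaming engine for the first project, it helps to prototype the core transform in plain Python: consume event records, assign each to a tumbling window, and aggregate per key. The sketch below is a toy illustration only, with assumed field names (`user_id`, `amount`, `ts`) and a 60-second window; a real pipeline would replace the list of dicts with a Kafka consumer and write results to a Delta table.

```python
from collections import defaultdict

# Tumbling-window aggregation: the core logic of a real-time ETL pipeline,
# shown without any streaming infrastructure. Window size is an assumption.
WINDOW_SECONDS = 60

def window_start(ts: float) -> int:
    """Floor a Unix timestamp to the start of its tumbling window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS

def aggregate(events):
    """Return {(window_start, user_id): total_amount} for a batch of events."""
    totals = defaultdict(float)
    for e in events:
        totals[(window_start(e["ts"]), e["user_id"])] += e["amount"]
    return dict(totals)

# Illustrative events, as a Kafka consumer might deliver them (deserialized).
events = [
    {"user_id": "a", "amount": 10.0, "ts": 0.0},
    {"user_id": "a", "amount": 5.0, "ts": 30.0},
    {"user_id": "b", "amount": 7.0, "ts": 61.0},
]
print(aggregate(events))
# {(0, 'a'): 15.0, (60, 'b'): 7.0}
```

Once this logic is clear, the same windowed aggregation maps directly onto Spark Structured Streaming's `groupBy(window(...))` or Flink's windowed streams, where the engine also handles late data and fault tolerance.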

Project Checklist

For each project, ensure:

  • Public GitHub repository
  • Comprehensive README
  • Architecture diagram
  • Cost analysis
  • Deployment guide
  • Tests included
  • Blog post or talk

Staying Current

| Trend | 2024 | 2025 | 2026 |
| --- | --- | --- | --- |
| LLM Ops | Emerging | Mainstream | Standard |
| Vector Databases | New | Growing | Mature |
| Data Contracts | Emerging | Growing | Standard |
| Real-time ML | Growing | Mainstream | Standard |
| FinOps | Important | Critical | Standard |
| Web3 Data | Hype | Declining | Niche |
| Edge Computing | Emerging | Growing | Growing |

Information Diet

Daily:

  • Twitter/X (curated list)
  • Reddit (r/dataengineering)
  • LinkedIn (follow thought leaders)

Weekly:

  • Data Engineering Weekly
  • Vendor blogs (Databricks, Confluent, dbt)
  • One technical paper

Monthly:

  • One book chapter
  • One conference talk video
  • Update certification goals

Quarterly:

  • Attend one conference/meetup
  • Present or write
  • Assess skills gap

Key Takeaways

  1. Continuous learning: Data engineering evolves rapidly
  2. Certifications: Validate skills, but prioritize experience
  3. Community: Engage with meetups, conferences, online
  4. Practice: Build real projects, not just tutorials
  5. Stay current: Follow trends, read papers, attend events
  6. Contribute: Write, speak, mentor, build

Next Steps

  1. Read “Designing Data-Intensive Applications” - Essential foundation
  2. Complete one certification - Validate your skills
  3. Join a community - Local meetup or online
  4. Build a project - End-to-end data platform
  5. Present or write - Share your knowledge

This knowledge base provides the foundation. Continuous learning is required to stay current.


Last Updated: 2025 | Target: 2026 Standards