Image by author
There are many courses and resources available on machine learning and data science, but very few on data engineering. This raises some questions. Is it a difficult field? Does it pay poorly? Is it considered less exciting than other tech roles? In reality, many companies are actively seeking data engineering talent and offering substantial salaries, sometimes exceeding $200,000. Data engineers play a crucial role as data platform architects, designing and building the foundational systems that enable data scientists and machine learning experts to work effectively.
To address this gap in the industry, DataTalksClub has introduced a free, transformative bootcamp, "Data Engineering Zoomcamp". The course is designed to equip beginners and professionals looking to change careers with essential skills and practical experience in data engineering.
This is a 6-week training camp where you will learn through multiple courses, reading materials, workshops, and projects. At the end of each module, you will be assigned homework to practice what you have learned.
- Week 1: Introduction to GCP, Docker, Postgres, Terraform and environment configuration.
- Week 2: Workflow orchestration with Mage.
- Week 3: Data warehousing and machine learning with BigQuery.
- Week 4: Analytics engineering with dbt, Google Data Studio, and Metabase.
- Week 5: Batch processing with Spark.
- Week 6: Streaming with Kafka.
Image from DataTalksClub/data-engineering-zoomcamp
The curriculum contains 6 modules, 2 workshops, and a project that cover everything needed to become a professional data engineer.
Module 1: Master containerization and infrastructure as code
In this module, you will learn about Docker and Postgres, starting with the basics and moving through detailed tutorials on creating data pipelines, running Postgres with Docker, and more.
The module also covers essential tools such as pgAdmin and Docker Compose, includes an SQL refresher, and offers optional content on Docker networking and a walkthrough of the Windows Subsystem for Linux. Finally, it introduces GCP and Terraform, giving you a comprehensive understanding of containerization and infrastructure as code, both essential in modern cloud-based environments.
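To give a flavor of the kind of pipeline you build in this module, here is a minimal sketch that loads a CSV file into a Postgres instance running in Docker. The file name, table name, and credentials are illustrative assumptions, not taken from the course materials.

```python
# Minimal sketch: load a CSV into Postgres (for example, one started with
# `docker run -e POSTGRES_PASSWORD=secret -p 5432:5432 postgres:16`).
# File name, table name, and credentials are illustrative assumptions.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://postgres:secret@localhost:5432/postgres")

# Read the source file in chunks so large files don't exhaust memory.
for chunk in pd.read_csv("trips.csv", chunksize=100_000):
    chunk.to_sql("trips", engine, if_exists="append", index=False)

print("Load complete")
```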
Module 2: Workflow Orchestration Techniques
The module offers an in-depth exploration of Mage, an innovative open-source hybrid framework for data transformation and integration. It starts with the basics of workflow orchestration and continues with hands-on exercises in Mage, including configuring it with Docker and building ETL pipelines from an API to Postgres and Google Cloud Storage (GCS), and from there to BigQuery.
The module's combination of videos, resources and practical tasks ensures a comprehensive learning experience, equipping students with the skills to manage sophisticated data workflows using Mage.
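To give a feel for Mage's building blocks, here is a minimal sketch of a data-loader block in the decorator style Mage uses; the endpoint URL is a hypothetical stand-in, and the template Mage generates for you may differ slightly.

```python
# Sketch of a Mage data-loader block. Mage generates similar templates
# for you; the endpoint URL here is an illustrative assumption.
import io
import pandas as pd
import requests

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader


@data_loader
def load_data_from_api(*args, **kwargs):
    """Fetch a CSV from an API and return it as a DataFrame."""
    url = "https://example.com/data.csv"  # hypothetical endpoint
    response = requests.get(url)
    response.raise_for_status()
    return pd.read_csv(io.StringIO(response.text))
```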
Workshop 1: Data ingestion strategies
In the first workshop, you will master building efficient data ingestion pipelines. The workshop focuses on essential skills such as extracting data from APIs and files, normalizing and loading data, and incremental loading techniques. After completing it, you will be able to build data pipelines like a senior data engineer.
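The snippet below sketches one of the core ideas of the workshop: streaming records from a paginated API with a Python generator so the whole dataset never has to sit in memory. The endpoint and its `page` parameter are hypothetical.

```python
# Sketch of incremental, memory-friendly ingestion from a paginated API.
# The endpoint and its `page` parameter are hypothetical.
import requests


def fetch_records(base_url: str):
    """Yield records one page at a time instead of loading everything."""
    page = 1
    while True:
        resp = requests.get(base_url, params={"page": page})
        resp.raise_for_status()
        rows = resp.json()
        if not rows:  # an empty page signals the end of the data
            return
        yield from rows
        page += 1


for record in fetch_records("https://example.com/api/trips"):
    ...  # normalize and load each record into the destination
```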
Module 3: Data Storage
The module is an in-depth exploration of data storage and analytics, focusing on data warehousing with BigQuery. It covers key concepts such as partitioning and clustering and dives into BigQuery best practices. It then progresses to advanced topics, particularly integrating machine learning (ML) with BigQuery, highlighting the use of SQL for ML and providing resources on hyperparameter tuning, feature preprocessing, and model deployment.
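To make partitioning and clustering concrete, here is a hedged sketch that creates a partitioned, clustered BigQuery table with the Python client; the dataset, table, and column names are assumptions for illustration.

```python
# Sketch: create a partitioned and clustered table in BigQuery.
# Dataset, table, and column names are illustrative assumptions.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials/project

sql = """
CREATE OR REPLACE TABLE my_dataset.trips_partitioned
PARTITION BY DATE(pickup_datetime)
CLUSTER BY vendor_id AS
SELECT * FROM my_dataset.trips_raw
"""
client.query(sql).result()  # .result() waits for the job to finish
```

Partitioning by date and clustering by a frequently filtered column lets BigQuery scan far less data, which is the cost-control practice this module emphasizes.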
Module 4: Analytical Engineering
The analytics engineering module focuses on building a project using dbt (data build tool) on top of an existing data warehouse, either BigQuery or PostgreSQL.
The module covers setting up dbt both locally and in the cloud, and introduces analytics engineering concepts, ETL vs. ELT, and data modeling. It also covers advanced dbt features such as incremental models, tags, hooks, and snapshots.
Finally, the module presents techniques for visualizing the transformed data with tools such as Google Data Studio and Metabase, and provides resources for troubleshooting and efficient data loading.
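The course's dbt models are written in SQL; to keep every example here in a single language, the sketch below instead uses dbt's Python model support (available on adapters such as BigQuery, where Python models run on Spark). The model and upstream names are hypothetical.

```python
# models/fact_trips.py -- sketch of a dbt Python model (dbt models are
# more commonly written in SQL). Model and upstream names are hypothetical.
def model(dbt, session):
    dbt.config(materialized="table")
    trips = dbt.ref("stg_trips")  # upstream staging model
    # On the BigQuery adapter, Python models run on Spark, so `trips`
    # is a Spark DataFrame here.
    return trips.where("fare_amount > 0")
```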
Module 5: Batch Processing Proficiency
This module covers batch processing with Apache Spark, starting with an introduction to batch processing and to Spark itself, along with installation instructions for Windows, Linux, and macOS.
It then explores Spark SQL and DataFrames, covering data preparation, SQL operations, and Spark internals, and concludes by running Spark in the cloud and integrating Spark with BigQuery.
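Here is a small, hedged PySpark sketch in the spirit of this module: read a file into a DataFrame, register it as a temporary view, and query it with Spark SQL. The file and column names are illustrative.

```python
# Sketch: batch processing with PySpark. File and column names are
# illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-demo").getOrCreate()

df = spark.read.parquet("trips.parquet")
df.createOrReplaceTempView("trips")

daily = spark.sql("""
    SELECT to_date(pickup_datetime) AS day, COUNT(*) AS trips
    FROM trips
    GROUP BY 1
    ORDER BY 1
""")
daily.show()
```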
Module 6: The art of streaming data with Kafka
The module begins with an introduction to stream processing concepts, followed by an in-depth exploration of Kafka, including its fundamentals, integration with Confluent Cloud, and practical applications involving producers and consumers.
The module also covers Kafka configuration and Kafka Streams, addressing topics such as stream joins, testing, windowing, and the use of ksqlDB and Kafka Connect. Additionally, it extends to Python and JVM environments, introducing Faust for Python stream processing, PySpark Structured Streaming, and Scala examples for Kafka Streams.
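The sketch below shows the producer/consumer pattern at the heart of this module, using the kafka-python client against a local broker; the topic name, broker address, and message fields are assumptions.

```python
# Sketch: a JSON producer and consumer with kafka-python.
# Topic name, broker address, and message fields are illustrative assumptions.
import json
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("rides", {"ride_id": 1, "distance_km": 3.2})
producer.flush()  # make sure the message actually leaves the client

consumer = KafkaConsumer(
    "rides",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'ride_id': 1, 'distance_km': 3.2}
    break
```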
Workshop 2: Stream processing with SQL
You'll learn how to process and manage streaming data with RisingWave, which provides a cost-effective solution with a PostgreSQL-style experience to power your stream processing applications.
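Because RisingWave speaks the Postgres wire protocol, you can drive it from any Postgres client. Below is a hedged sketch using psycopg2 to define a continuously maintained materialized view; the connection details and table and view names are illustrative.

```python
# Sketch: querying RisingWave through its Postgres-compatible interface.
# Connection details and table/view names are illustrative assumptions.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True

with conn.cursor() as cur:
    # In RisingWave, materialized views are updated incrementally as
    # new events stream in, so this aggregate stays fresh on its own.
    cur.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS trips_per_vendor AS
        SELECT vendor_id, COUNT(*) AS trips
        FROM rides
        GROUP BY vendor_id
    """)
    cur.execute("SELECT * FROM trips_per_vendor")
    print(cur.fetchall())
```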
Project: Real-world data engineering application
The goal of this project is to apply everything you have learned in this course to build an end-to-end data pipeline. You will create a dashboard consisting of two tiles by:
- selecting a dataset,
- building a pipeline that processes the data and stores it in a data lake,
- building a pipeline that moves the processed data from the data lake into a data warehouse,
- transforming the data in the warehouse and preparing it for the dashboard, and
- building a dashboard to present the data visually.
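As a hedged sketch of the lake-to-warehouse step of such a project, the snippet below uploads a Parquet file to GCS and loads it into BigQuery; the bucket, dataset, table, and file names are assumptions.

```python
# Sketch: move a file from a data lake (GCS) into a warehouse (BigQuery).
# Bucket, dataset, table, and file names are illustrative assumptions.
from google.cloud import bigquery, storage

# 1) Upload the processed file to the data lake.
storage.Client().bucket("my-lake-bucket").blob(
    "trips/trips.parquet"
).upload_from_filename("trips.parquet")

# 2) Load it from the lake into the warehouse.
client = bigquery.Client()
job = client.load_table_from_uri(
    "gs://my-lake-bucket/trips/trips.parquet",
    "my_dataset.trips",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET
    ),
)
job.result()  # wait for the load job to finish
```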
2024 Cohort Details
- Registration: Sign up now
- Start date: January 15, 2024, at 17:00 CET
- Self-paced learning with guided support
- Cohort folder with tasks and deadlines
- Interactive Slack community for peer learning
Prerequisites
- Basic coding and command-line skills
- A foundation in SQL
- Python (beneficial but not required)
Expert instructors leading your journey
- Ankush Khanna
- Victoria Perez Mola
- Alexey Grigorev
- Matt Palmer
- Luis Oliveira
- Michael Shoemaker
Join our 2024 cohort and start learning with an amazing data engineering community. With expert-led training, hands-on experience, and a curriculum tailored to industry needs, this bootcamp not only equips you with the necessary skills but also positions you at the forefront of a lucrative and in-demand career path. Sign up today and transform your aspirations into reality!
Abid Ali Awan (@1abidaliawan) is a certified professional data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on data science and machine learning technologies. Abid has a Master's degree in Technology Management and a Bachelor's degree in Telecommunications Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.