Image by storyset on Freepik
It's a good time to get into data engineering. So where do you start?
Learning data engineering can sometimes be overwhelming due to the number of tools you need to know, not to mention the super intimidating job descriptions!
So if you're looking for a beginner-friendly introduction to data engineering, the free Data Engineering Course for Beginners, taught by Justin Chau, a developer advocate at Airbyte, is a good starting point.
In about three hours, you'll learn essential data engineering skills: Docker, SQL, analytics engineering, and more. So if you want to explore data engineering and see if it's for you, this course is a great introduction. Now let's review what the course covers.
Link to the course: Data Engineering Course for Beginners
This course begins with an introduction to why you should consider becoming a data engineer in the first place, which I think is very useful to understand before diving right into the technical stuff.
The instructor, Justin Chau, talks about:
- The need for good quality data and data infrastructure to ensure the success of big data projects
- How data engineering roles are in growing demand and are well paid
- The business value you can add to the organization as a data engineer building and maintaining its data infrastructure
When learning data engineering, Docker is one of the first tools you can add to your toolbox. Docker is a popular containerization tool that allows you to package applications (with dependencies and configuration) into a single artifact called an image. In this way, Docker allows you to create a consistent and reproducible environment to run all your applications within a container.
The Docker module in this course starts with basic concepts such as:
- Dockerfiles
- Docker images
- Docker containers
The instructor then moves on to cover how to containerize an application with Docker: writing the Dockerfile and running the commands to get your container up and running. This section also covers persistent volumes, the fundamentals of Docker networking, and using Docker Compose to manage multiple containers.
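The course works with the Docker CLI and Docker Compose directly. Purely as an illustration of the same ideas (images, containers, volumes, networks), here is a minimal sketch using the Docker SDK for Python; the network, volume, and credential names are placeholders, not taken from the course.

```python
# A minimal sketch using the Docker SDK for Python (pip install docker).
# The network, volume, and password below are illustrative placeholders.
import docker

client = docker.from_env()  # talks to the local Docker daemon

# A user-defined network lets containers reach each other by name.
client.networks.create("elt_network", driver="bridge")

# A named volume persists Postgres data across container restarts.
client.volumes.create("pg_data")

# Run a Postgres container from the official image, attached to the network,
# with the volume mounted at Postgres's data directory.
container = client.containers.run(
    "postgres:16",
    name="source_postgres",
    detach=True,
    environment={"POSTGRES_PASSWORD": "secret"},
    network="elt_network",
    volumes={"pg_data": {"bind": "/var/lib/postgresql/data", "mode": "rw"}},
    ports={"5432/tcp": 5433},  # map host port 5433 to container port 5432
)

print(container.name, container.status)
```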
Overall, this module in itself is a good crash course on Docker if you are new to containerization.
In the following module on SQL, you will learn how to run Postgres in a Docker container and then learn the basics of SQL by creating a sample Postgres database and performing the following operations (sketched in Python after the list):
- CRUD operations
- Aggregate functions
- Using aliases
- Joins
- Union and union all
- Subqueries
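As a rough sketch of the kind of SQL covered, the snippet below runs a few of these statements against a local Postgres instance using psycopg2. The table, columns, and connection details are illustrative placeholders, not taken from the course.

```python
# A rough sketch of the SQL basics covered, executed with psycopg2
# (pip install psycopg2-binary). Connection details and the table are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=5433, dbname="postgres",
    user="postgres", password="secret",
)
cur = conn.cursor()

# CRUD: create a table and insert a couple of rows.
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (
        id SERIAL PRIMARY KEY,
        name TEXT NOT NULL,
        country TEXT
    );
""")
cur.execute(
    "INSERT INTO users (name, country) VALUES (%s, %s), (%s, %s);",
    ("Alice", "IN", "Bob", "US"),
)
conn.commit()

# Aggregate function with aliases: count users per country.
cur.execute("SELECT country AS c, COUNT(*) AS n FROM users GROUP BY country;")
print(cur.fetchall())

# UNION vs UNION ALL: UNION removes duplicate rows, UNION ALL keeps them.
cur.execute("""
    SELECT country FROM users
    UNION ALL
    SELECT country FROM users;
""")
print(cur.fetchall())

cur.close()
conn.close()
```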
With the fundamentals of Docker and SQL, you can now learn how to create a data pipeline from scratch. You will start by building a simple ELT process that you can improve throughout the rest of the course.
Additionally, you'll see how all the SQL, Docker networking, and Docker Compose concepts you've learned so far come together to create this pipeline, which runs Postgres in Docker for both the source and the destination databases.
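To make the idea concrete, here is a stripped-down sketch of what such a Postgres-to-Postgres ELT step might look like in Python: dump the source database with pg_dump and load it into the destination with psql. The host ports, database names, and credentials are placeholders, and the course's actual script may differ.

```python
# A stripped-down ELT sketch: extract with pg_dump, load with psql.
# Ports, database names, and credentials are placeholders.
import os
import subprocess

SOURCE = {"host": "localhost", "port": "5433", "db": "source_db"}
DEST = {"host": "localhost", "port": "5434", "db": "destination_db"}
DUMP_FILE = "data_dump.sql"

# Pass the password via the environment so pg_dump/psql don't prompt for it.
env = {**os.environ, "PGPASSWORD": "secret"}

# Extract: dump the source database to a SQL file.
subprocess.run(
    ["pg_dump", "-h", SOURCE["host"], "-p", SOURCE["port"],
     "-U", "postgres", "-d", SOURCE["db"], "-f", DUMP_FILE],
    env=env, check=True,
)

# Load: replay the dump into the destination database.
subprocess.run(
    ["psql", "-h", DEST["host"], "-p", DEST["port"],
     "-U", "postgres", "-d", DEST["db"], "-f", DUMP_FILE],
    env=env, check=True,
)

print("ELT load complete; transformations come later (e.g. with dbt).")
```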
The course then moves on to the analytics engineering part, where you will learn about dbt (data build tool) and how to organize your SQL queries as custom data transformation models.
The instructor will help you get started with dbt: installing dbt-core and the necessary adapter, and configuring the project. This module focuses specifically on working with dbt models, macros, and Jinja. You will learn how to (see the sketch after this list):
- Define custom dbt models and run them on data in the target database
- Organize SQL queries as dbt macros for reuse
- Use Jinja in dbt to add control structures to SQL queries
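The course drives dbt from the command line; as a hedged illustration, the sketch below invokes dbt programmatically from Python instead (available in dbt-core 1.5+). The project and profiles paths are placeholders, and the model shown in the comment is a hypothetical example of SQL with Jinja, not one of the course's models.

```python
# A minimal sketch of invoking dbt from Python via its programmatic API
# (dbt-core >= 1.5). Project and profiles paths are placeholders.
#
# A dbt "model" is just a SQL file, optionally with Jinja, e.g. models/top_films.sql:
#   SELECT film_id, title, rating
#   FROM {{ source('destination_db', 'films') }}
#   {% if var('only_top_rated', false) %} WHERE rating >= 8 {% endif %}
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Equivalent to running `dbt run` against the project.
res: dbtRunnerResult = dbt.invoke(
    ["run", "--project-dir", "custom_postgres", "--profiles-dir", "."]
)

# Inspect the status of each executed model.
if res.success:
    for r in res.result:
        print(f"{r.node.name}: {r.status}")
```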
So far, you have created an ELT pipeline that you trigger manually. But you certainly need some automation, and the simplest way to get it is to define a cron job that runs the pipeline automatically at a specific time of day.
So this super short section covers cron jobs. But data orchestration tools like Airflow (which you'll learn about in the next module) give you much finer-grained control over the process.
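In practice this boils down to giving the pipeline script a plain entry point and pointing a crontab entry at it. A rough sketch, with the path, log file, and schedule as placeholders:

```python
# A rough sketch: expose the ELT logic as a plain script entry point so cron
# can invoke it on a schedule. Paths and the schedule below are placeholders.
#
# Example crontab entry (runs every day at 03:00):
#   0 3 * * * /usr/bin/python3 /opt/elt/elt_script.py >> /var/log/elt.log 2>&1
def run_pipeline() -> None:
    # ... extract from the source database and load into the destination ...
    print("ELT run finished")


if __name__ == "__main__":
    run_pipeline()
```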
Data pipelines are typically orchestrated with open source tools such as Airflow, Prefect, Dagster, and the like. In this section, you will learn how to use the open source orchestration tool Airflow.
This section is more extensive compared to the previous sections because it covers everything you need to know to get up to speed and write Airflow DAGs for the current project.
You will learn how to configure the Airflow web server and scheduler to schedule jobs. Then you will learn about Airflow operators, namely the Python and Bash operators. Finally, you will define the tasks that make up the DAGs for the example at hand.
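For orientation, here is a minimal Airflow 2.x DAG sketch with one PythonOperator and one BashOperator. The DAG id, task ids, schedule, and commands are placeholders rather than the course's actual DAG.

```python
# A minimal Airflow 2.x DAG sketch; ids, schedule, and commands are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def run_elt() -> None:
    """Placeholder for the extract-and-load step (e.g. the ELT script logic)."""
    print("running ELT...")


with DAG(
    dag_id="elt_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # Airflow 2.4+ style schedule, replacing the cron job
    catchup=False,
) as dag:
    extract_load = PythonOperator(task_id="extract_load", python_callable=run_elt)

    transform = BashOperator(
        task_id="dbt_transform",
        bash_command="echo 'dbt run would go here'",
    )

    # Run the ELT step before the dbt transformation.
    extract_load >> transform
```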
In the last module, you will learn about Airbyte, an open source data movement/integration platform that lets you connect a wide range of data sources and destinations with ease.
You'll learn how to set up your environment and see how you can simplify the ELT process using Airbyte. To do this, you will modify the existing project components, the ELT script and the DAG, to integrate Airbyte into the workflow.
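One common way to wire Airbyte into an Airflow DAG is the Airbyte provider's trigger-sync operator, sketched below. The Airflow connection id and the Airbyte connection UUID are placeholders, and the course's exact integration may differ.

```python
# A sketch of triggering an Airbyte sync from Airflow using the Airbyte provider
# (pip install apache-airflow-providers-airbyte). Connection ids are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.airbyte.operators.airbyte import AirbyteTriggerSyncOperator

with DAG(
    dag_id="airbyte_elt",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    airbyte_sync = AirbyteTriggerSyncOperator(
        task_id="airbyte_sync",
        airbyte_conn_id="airbyte_default",          # Airflow connection to the Airbyte API
        connection_id="<airbyte-connection-uuid>",  # placeholder: copy from the Airbyte UI
        asynchronous=False,                         # wait for the sync to finish
    )
```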
I hope you found this review of the free data engineering course helpful. I enjoyed the course, especially its practical approach of gradually building and improving a data pipeline rather than focusing solely on theory. The code is also available for you to follow along. So, happy data engineering!
Bala Priya C. is a developer and technical writer from India. She enjoys working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She likes to read, write, code, and drink coffee! Currently, she is working on learning and sharing her knowledge with the developer community by creating tutorials, how-to guides, opinion pieces, and more.