Image by author | DALL-E 3 and Canva
Data engineering is growing rapidly and companies are now hiring more data engineers than data scientists. Operational jobs like data engineering, cloud architecture, and MLOps engineering are in high demand.
As a data engineer, you must master containerization, infrastructure as code, workflow orchestration, analytics engineering, batch processing, and streaming tools. In addition to these tools, you need to master cloud infrastructure and managed services like Databricks and Snowflake.
In this blog, we will learn about 10 GitHub repositories that will help you master all the basic tools and concepts. These GitHub repositories contain courses, experiences, roadmaps, a list of essential tools, projects, and a handbook. All you need to do is bookmark them while learning how to become a professional data engineer.
1. Awesome Data Engineering
The Awesome Data Engineering repository contains a list of tools, frameworks, and libraries for data engineering, making it a great starting point for anyone looking to dive into this field.
It covers tools on databases, data ingestion, file system, streaming, batch processing, data lake management, workflow orchestration, monitoring, testing, and charts and dashboards.
Link: igorbarinov/awesome-data-engineering
2. Data Engineering Zoomcamp
Data Engineering Zoomcamp is a comprehensive course that provides a hands-on learning experience in data engineering. You will learn new concepts and tools through video tutorials, quizzes, projects, assignments, and community-driven assessments.
The Data Engineering Zoomcamp covers:
- Containerization and infrastructure as code
- Workflow Orchestration
- Data ingestion
- Data warehouse
- Analytics engineering
- Batch processing
- Streaming
Link: DataTalksClub/data-engineering-zoomcamp
3. The Data Engineering Cookbook
The Data Engineering Cookbook is a collection of articles and tutorials covering various aspects of data engineering, including data ingestion, processing, and storage.
The Data Engineering Cookbook includes:
- Basic engineering skills
- Advanced engineering skills
- Free practical courses/tutorials
- Case studies
- Best Practices Cloud Platforms
- 130+ data sources Data Science
- 1001 interview questions
- Recommended books, courses and podcasts
Link: andkret/cookbook
4. Data Engineer Roadmap
The Data Engineer Roadmap repository provides a step-by-step guide to becoming a data engineer. It covers everything from data engineering basics to advanced topics like infrastructure as code and cloud computing.
The data engineer roadmap includes:
- Computer science fundamentals
- Learning Python
- Testing
- Database
- Data warehouse
- Cluster computing
- Data processing
- Messaging
- Workflow scheduling
- Networking
- Infrastructure as code
- CI/CD
- Data security and privacy
Link: datastacktv/data-engineer-roadmap
5. Data Engineering How To
Data Engineering How To is a beginner's resource for learning data engineering from scratch. It contains a list of tutorials, courses, books, and other resources that will help you build a solid foundation in data engineering concepts and best practices. If you are new to this field, this repository will help you navigate the vast landscape of data engineering.
Data Engineering How To includes:
- Useful articles and blogs
- Talks
- Algorithms and data structures
- SQL
- Programming
- Databases
- Distributed systems
- Books
- Courses
- Tools
- Cloud platforms
- Communities
- Jobs
- Newsletters
Link: adilkhash/Data-Engineering-How-to
6. Awesome Open Source Data Engineering
Awesome Open Source Data Engineering is a list of open source data engineering tools. It is a gold mine for anyone who wants to contribute to these tools or use them to build real-world data engineering projects. It contains a wealth of information on open source tools and frameworks, making it a great resource for anyone looking to explore alternative data engineering solutions.
The repository includes open source tools on:
- Analytics
- Business Intelligence
- Data lakehouse
- Change data capture
- Data warehouses
- Data governance and registries
- Data virtualization
- Data orchestration
- Formats
- Integration
- Messaging infrastructure
- Specifications and standards
- Stream processing
- Testing
- Monitoring and logging
- Versioning
- Workflow management
Link: gunnarmorling/awesome-opensource-data-engineering
7. PySpark Example Project
The PySpark Example Project repository provides a practical example of implementing best practices for PySpark ETL jobs and applications.
PySpark is a popular tool for data processing and this repository will help you master it. You'll learn how to structure your code, handle data transformations, and optimize your PySpark workflows efficiently.
The project covers:
- Structure of an ETL job
- Passing configuration parameters to the ETL job
- Packaging ETL job dependencies
- Running the ETL job
- Debugging Spark jobs
- Automated testing
- Project dependency management
Link: AlexIoannides/pyspark-example-project
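To make the first, second, and sixth items in that list more concrete, here is a minimal sketch of how an ETL job can be split into extract, transform, and load steps, with configuration passed in as JSON. It is not code from the repository and it deliberately uses plain Python lists instead of Spark DataFrames so that it runs anywhere. The function names, the `steps_per_floor` parameter, and the record fields are all hypothetical.

```python
import json


def extract(source_rows):
    """Extract step: a real job would read from a file, table, or API here."""
    return list(source_rows)


def transform(rows, config):
    """Transform step: pure function with no I/O, so it is easy to unit test."""
    factor = config["steps_per_floor"]  # hypothetical config parameter
    return [
        {
            "name": f"{row['first_name']} {row['last_name']}",
            "steps_to_desk": row["floor"] * factor,
        }
        for row in rows
    ]


def load(rows, sink):
    """Load step: a real job would write to storage; here we append to a list."""
    sink.extend(rows)
    return len(rows)


def run_job(source_rows, config_json, sink):
    """Wire the three steps together, parsing the JSON config once up front."""
    config = json.loads(config_json)
    return load(transform(extract(source_rows), config), sink)


if __name__ == "__main__":
    output = []
    people = [{"first_name": "Ada", "last_name": "Lovelace", "floor": 3}]
    run_job(people, '{"steps_per_floor": 21}', output)
    print(output)
```

Keeping `transform` free of I/O is the design choice that matters here: an automated test can call it with a handful of in-memory rows and a small config dictionary, without starting a cluster or touching storage.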
8. Data Engineer Handbook
Data Engineer Handbook is a comprehensive collection of resources covering all aspects of data engineering. It includes tutorials, articles, and books on all topics related to data engineering. Whether you're looking for a quick reference guide or in-depth knowledge, this handbook has something for data engineers of all levels.
The handbook includes:
- Great books
- Communities to follow
- Companies to consider
- Blogs to read
- White papers
- Great YouTube channels
- Great podcasts
- Newsletters
- LinkedIn, Twitter, TikTok, and Instagram influencers to follow
- Courses
- Certifications
- Conferences
Link: DataExpert-io/data-engineer-handbook
9. Data Engineering Wiki
The Data Engineering Wiki repository is a community-driven wiki that provides a comprehensive resource for learning data engineering. It covers a wide range of topics, including data pipelines, data warehousing, and data modeling.
Data Engineering Wiki includes:
- Data engineering concepts
- Data Engineering FAQ
- Guides on how to make data engineering decisions
- Commonly used tools for data engineering
- Step-by-step guides for data engineering tasks
- Learning resources
Link: data-engineering-community/data-engineering-wiki
10. Data Engineering Practice
Data Engineering Practice offers a practical approach to learning data engineering. It provides practice projects and exercises to help you apply your knowledge and skills in real-world scenarios. By working on these projects, you'll gain hands-on experience and build a portfolio that showcases your data engineering capabilities.
Data engineering practice problems include exercises on:
- Downloading files
- Web Scraping + Download + Pandas
- Boto3 AWS + S3 + Python
- Convert JSON to CSV + irregular directories
- Data modeling for Postgres + Python
- Ingestion and aggregation with PySpark
- Using various PySpark functions
- Using DuckDB for analysis and transformations
- Using Polars lazy computation
Link: danielbeach/data-engineering-practice
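To give a feel for this kind of exercise, here is a rough sketch of the "convert JSON to CSV + irregular directories" idea using only the Python standard library. It is not the repository's solution. It assumes each JSON file holds a single object, and the function names `flatten` and `json_tree_to_csv` are made up for illustration.

```python
import csv
import json
from pathlib import Path


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dictionaries, e.g. {"geo": {"lat": 1}} -> {"geo_lat": 1}."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def json_tree_to_csv(root_dir, csv_path):
    """Find every .json file under root_dir, at any depth, and write one CSV."""
    rows = []
    for json_file in sorted(Path(root_dir).rglob("*.json")):
        with json_file.open(encoding="utf-8") as fh:
            rows.append(flatten(json.load(fh)))

    # Build the header from the union of keys, keeping first-seen order.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty cells
    return len(rows)
```

Calling `json_tree_to_csv("data", "out.csv")` would walk a hypothetical `data` folder and return the number of records written. Files whose fields differ still fit in one table because the header is the union of all keys.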
Final Thoughts
Mastering data engineering requires dedication, perseverance, and a passion for learning new concepts and tools. These 10 GitHub repositories provide a wealth of information and resources to help you become a professional data engineer and keep you up to date on current trends.
Whether you're just starting out or are an experienced data engineer, I encourage you to explore these resources, contribute to open source projects, and stay involved with the vibrant data engineering community on GitHub.
Abid Ali Awan (@1abidaliawan) is a certified professional data scientist who loves building machine learning models. Currently, he focuses on content creation and writing technical blogs on data science and machine learning technologies. Abid has a master's degree in technology management and a bachelor's degree in telecommunications engineering. His vision is to build an artificial intelligence product using a graph neural network for students struggling with mental illness.