Image by Author | Midjourney & Canva
Introduction
ETL, or Extract, Transform, Load, is an essential data engineering process: it involves extracting data from various sources, converting it into a workable format, and moving it to a destination such as a database. ETL pipelines automate this process, ensuring that data is handled consistently and efficiently, and they provide a foundation for tasks such as data analysis, reporting, and machine learning by keeping data clean, reliable, and ready to use.
Bash, short for Bourne-Again SHell, is a Unix shell and a powerful tool for building ETL pipelines. Its simplicity, flexibility, and broad availability make it an excellent choice for beginners and experienced professionals alike. Bash scripts can automate tasks, move files, and communicate with other tools on the command line, which makes bash a good fit for ETL work. Additionally, bash is ubiquitous on Unix-like systems (Linux, BSD, macOS, etc.), so it is ready to use on most of these systems without any additional setup on your part.
This article is aimed at beginner and professional data scientists and engineers looking to build their first ETL pipeline. It assumes a basic knowledge of the command line and provides a practical, step-by-step guide to creating an ETL pipeline with bash. By the end of the article, readers will have a working knowledge of how to implement a pipeline that extracts data from a source, transforms it, and loads it into a target database.
Setting up your environment
Before you begin, make sure you have the following:
- A Unix-based system (Linux or macOS)
- Bash shell (usually pre-installed on Unix systems)
- Basic understanding of command line operations
For our ETL pipeline, we'll need these specific command line tools:
- curl: to fetch data from APIs and remote servers
- jq: to parse and transform JSON data
- awk: to process CSV files
- sed: to filter and transform text
- sqlite3: to load data into a lightweight database
You can install them using your system's package manager. On a Debian-based system, you can use apt-get (awk and sed usually ship with the base system; the gawk package provides GNU awk):
sudo apt-get install curl jq gawk sed sqlite3
On macOS, you can use brew (BSD awk and sed come with the system; gawk and gnu-sed provide the GNU versions):
brew install curl jq gawk gnu-sed sqlite
Let's set up a dedicated directory for our ETL project. Open your terminal and run:
mkdir ~/etl_project
cd ~/etl_project
This creates a new directory called etl_project and changes into it.
Data extraction
Data can come from a variety of sources, such as APIs, CSV files, or databases. In this tutorial, we will demonstrate how to extract data from a public API and a CSV file.
We are going to use curl to get data from a public API. For example, we will pull data from a mock API that provides sample data.
# Fetching data from a public API
curl -o data.json "https://api.example.com/data"
This command will download the data and save it as data.json.
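In a script, it also helps to make curl fail loudly when the request does not succeed. Here is a minimal sketch, using the same placeholder URL:
# -f: fail on HTTP errors, -sS: silent except for errors, -L: follow redirects
curl -fsSL -o data.json "https://api.example.com/data" || {
    echo "API request failed" >&2
    exit 1
}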
We can also use curl to download a CSV file from a remote server.
# Downloading a CSV file
curl -o data.csv "https://example.com/data.csv"
This will save the CSV file as data.csv in our working directory.
Transforming data
Data transformation is necessary to convert raw data into a format suitable for analysis or storage. This may involve parsing JSON, filtering CSV files, or cleaning text data.
jq is a powerful tool for working with JSON data. Let's use it to extract specific fields from our JSON file.
# Parsing and extracting specific fields from JSON
jq '.data[] | {id, name, value}' data.json > transformed_data.json
This command extracts the id, name, and value fields from each entry in the JSON data and saves the result in transformed_data.json.
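If you plan to load the JSON records into the SQLite table created later in this tutorial, it can be more convenient to flatten them to CSV instead. Here is a minimal sketch, assuming each entry has id, name, and value fields; the output filename is just illustrative:
# Flatten each JSON entry into a CSV row (id,name,value);
# -r emits raw text instead of JSON-quoted strings
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data_from_json.csv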
awk is a versatile tool for processing CSV files. We will use it to extract specific columns from our CSV file.
# Extracting specific columns from CSV
awk -F',' -v OFS=',' '{print $1, $3}' data.csv > transformed_data.csv
This command extracts the first and third columns from data.csv and saves them, comma-separated, in transformed_data.csv.
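awk can also filter rows while selecting columns. The sketch below assumes data.csv has a header line and a numeric third column; the threshold and output filename are illustrative:
# Skip the header row (NR > 1), keep rows whose third column exceeds 100,
# and write the first and third columns comma-separated
awk -F',' 'NR > 1 && $3 > 100 {print $1 "," $3}' data.csv > filtered_data.csv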
sed is a stream editor for filtering and transforming text. We can use it to perform text replacements and clean our data.
# Replacing text in a file
sed -i 's/old_text/new_text/g' transformed_data.csv
This command replaces every occurrence of old_text with new_text in transformed_data.csv. (macOS's BSD sed expects -i '' rather than -i for in-place edits.)
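sed can also handle light cleanup before loading, such as stripping stray double quotes or trailing whitespace. Here is a minimal sketch that writes to a separate, illustratively named file:
# Remove double quotes and trailing whitespace from each line
sed 's/"//g; s/[[:space:]]*$//' transformed_data.csv > cleaned_data.csv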
Loading data
Common destinations for loading data include databases and files. For this tutorial, we will use SQLite, a commonly used lightweight database.
First, let's create a new SQLite database and a table to hold our data.
# Creating a new SQLite database and table
sqlite3 etl_database.db "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
This command creates a database file called etl_database.db and a table called data with three columns.
Next, we will insert our transformed data into the SQLite database.
# Inserting data into SQLite database
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF
This command block sets the mode to CSV and imports transformed_data.csv into the data table.
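One caveat: if transformed_data.csv still has a header row, .import into the existing data table will try to load that header as a regular row. A simple, portable workaround is to strip it with tail first (the intermediate filename is just illustrative):
# Drop the header line before importing
tail -n +2 transformed_data.csv > transformed_data_noheader.csv
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data_noheader.csv data
EOF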
We can verify that the data has been inserted correctly by querying the database.
# Querying the database
sqlite3 etl_database.db "SELECT * FROM data;"
This command retrieves all rows from the data table and displays them.
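For a quick sanity check, you can also count the loaded rows and preview a few of them; the -header and -column flags simply format the output as a readable table:
# Count the loaded rows and preview the first five
sqlite3 -header -column etl_database.db "SELECT COUNT(*) AS row_count FROM data;"
sqlite3 -header -column etl_database.db "SELECT * FROM data LIMIT 5;"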
Final thoughts
We have covered the following steps while building our ETL pipeline with bash:
- Setting up the environment and installing tools
- Extracting data from a public API and a CSV file with curl
- Transforming data using jq, awk, and sed
- Loading data into a SQLite database with sqlite3
Bash is a good choice for ETL because of its simplicity, flexibility, automation capabilities, and interoperability with other CLI tools.
For further exploration, consider incorporating error handling, scheduling the pipeline using cron, or learning more advanced bash concepts. You can also investigate alternative transformation methods and applications to increase your pipeline skills.
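As a starting point for those extensions, here is a minimal, hedged sketch of how the individual steps might be stitched into a single script with basic error handling, plus a cron entry to run it nightly. The URL, JSON field names, paths, and schedule are all placeholders:
#!/usr/bin/env bash
# run_etl.sh -- a minimal end-to-end sketch of the pipeline above
set -euo pipefail   # stop on errors, unset variables, and failed pipes

WORKDIR="$HOME/etl_project"
cd "$WORKDIR"

# Extract: fetch the raw data (placeholder URL)
curl -fsSL -o data.json "https://api.example.com/data"

# Transform: flatten the JSON records to CSV (assumes id/name/value fields)
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data.csv

# Load: import the CSV into SQLite
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

echo "ETL run completed at $(date)"
A crontab entry such as 0 2 * * * /bin/bash $HOME/etl_project/run_etl.sh >> $HOME/etl_project/etl.log 2>&1 would run the script every night at 2:00 AM and append its output to a log file.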
Try building your own ETL projects, putting what you've learned to the test in more elaborate scenarios. Hopefully, the basics here will be a good starting point for more complex data engineering tasks.
Matthew Mayo (twitter.com/mattmayo13) has a master's degree in computer science and a postgraduate diploma in data mining. As editor-in-chief of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. His mission is to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.