Image by Author | Midjourney & Canva
Introduction
ETL, or Extract, Transform, Load, is an essential data engineering process: it involves extracting data from various sources, converting it into a workable format, and moving it to a destination such as a database. ETL pipelines automate this process, ensuring that data is handled consistently and efficiently, and they provide a foundation for tasks such as data analysis, reporting, and machine learning by keeping data clean, reliable, and ready to use.
Bash, short for Bourne-Again SHell, is a Unix shell and a powerful tool for building ETL pipelines. Its simplicity, flexibility, and broad availability make it an excellent choice for beginners and experienced professionals alike. Bash scripts can automate tasks, move files, and communicate with other tools on the command line, which makes bash a good fit for ETL work. Additionally, bash is ubiquitous on Unix-like systems (Linux, BSD, macOS, etc.), so it is ready to use on most of these systems without any additional setup on your part.
This article is aimed at beginner and professional data scientists and engineers looking to build their first ETL pipeline. It assumes a basic knowledge of the command line and provides a practical, step-by-step guide to creating an ETL pipeline with bash. By the end of the article, readers will have a working knowledge of how to implement a pipeline that extracts data from a source, transforms it, and loads it into a target database.
Setting up your environment
Before you begin, make sure you have the following:
- A Unix-based system (Linux or macOS)
- Bash shell (usually pre-installed on Unix systems)
- Basic understanding of command line operations
For our ETL pipeline, we'll need these specific command line tools:
- curl: to fetch data from APIs and remote servers
- jq: to parse and transform JSON data
- awk: to process CSV files
- sed: to filter and transform text
- sqlite3: to load data into a lightweight database
You can install them using your system's package manager. On a Debian-based system, you can use apt-get (awk and sed usually ship with the base system; the gawk package provides GNU awk):
sudo apt-get install curl jq gawk sed sqlite3
On macOS, you can use brew (BSD awk and sed come with the system; gawk and gnu-sed provide the GNU versions):
brew install curl jq gawk gnu-sed sqlite
Let's set up a dedicated directory for our ETL project. Open your terminal and run:
mkdir ~/etl_project
cd ~/etl_project
This creates a new directory called etl_project and changes into it.
Data extraction
Data can come from a variety of sources, such as APIs, CSV files, or databases. In this tutorial, we will demonstrate how to extract data from a public API and a CSV file.
We are going to use curl to get data from a public API. For example, we will pull data from a mock API that provides sample data.
# Fetching data from a public API
curl -o data.json "https://api.example.com/data"
This command will download the data and save it as data.json.
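In a script, it also helps to make curl fail loudly when the request does not succeed. Here is a minimal sketch, using the same placeholder URL:
# -f: fail on HTTP errors, -sS: silent except for errors, -L: follow redirects
curl -fsSL -o data.json "https://api.example.com/data" || {
    echo "API request failed" >&2
    exit 1
}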
We can also use curl to download a CSV file from a remote server.
# Downloading a CSV file
curl -o data.csv "https://example.com/data.csv"
This will save the CSV file as data.csv in our working directory.
Transforming data
Data transformation is necessary to convert raw data into a format suitable for analysis or storage. This may involve parsing JSON, filtering CSV files, or cleaning text data.
jq is a powerful tool for working with JSON data. Let's use it to extract specific fields from our JSON file.
# Parsing and extracting specific fields from JSON
jq '.data[] | {id, name, value}' data.json > transformed_data.json
This command extracts the id, name, and value fields from each entry in the JSON data and saves the result in transformed_data.json.
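If you plan to load the JSON records into the SQLite table created later in this tutorial, it can be more convenient to flatten them to CSV instead. Here is a minimal sketch, assuming each entry has id, name, and value fields; the output filename is just illustrative:
# Flatten each JSON entry into a CSV row (id,name,value);
# -r emits raw text instead of JSON-quoted strings
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data_from_json.csv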
awk is a versatile tool for processing CSV files. We will use it to extract specific columns from our CSV file.
# Extracting specific columns from CSV
awk -F',' -v OFS=',' '{print $1, $3}' data.csv > transformed_data.csv
This command extracts the first and third columns from data.csv and saves them, comma-separated, in transformed_data.csv.
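awk can also filter rows while selecting columns. The sketch below assumes data.csv has a header line and a numeric third column; the threshold and output filename are illustrative:
# Skip the header row (NR > 1), keep rows whose third column exceeds 100,
# and write the first and third columns comma-separated
awk -F',' 'NR > 1 && $3 > 100 {print $1 "," $3}' data.csv > filtered_data.csv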
sed is a stream editor for filtering and transforming text. We can use it to perform text replacements and clean our data.
# Replacing text in a file
sed -i 's/old_text/new_text/g' transformed_data.csv
This command replaces every occurrence of old_text with new_text in transformed_data.csv. (macOS's BSD sed expects -i '' rather than -i for in-place edits.)
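sed can also handle light cleanup before loading, such as stripping stray double quotes or trailing whitespace. Here is a minimal sketch that writes to a separate, illustratively named file:
# Remove double quotes and trailing whitespace from each line
sed 's/"//g; s/[[:space:]]*$//' transformed_data.csv > cleaned_data.csv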
Loading data
Common destinations for loading data include databases and files. For this tutorial, we will use SQLite, a commonly used lightweight database.
First, let's create a new SQLite database and a table to hold our data.
# Creating a new SQLite database and table
sqlite3 etl_database.db "CREATE TABLE data (id INTEGER PRIMARY KEY, name TEXT, value REAL);"
This command creates a database file called etl_database.db and a table called data with three columns.
Next, we will insert our transformed data into the SQLite database.
# Inserting data into SQLite database
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF
This command block sets the mode to CSV and imports transformed_data.csv into the data table.
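One caveat: if transformed_data.csv still has a header row, .import into the existing data table will try to load that header as a regular row. A simple, portable workaround is to strip it with tail first (the intermediate filename is just illustrative):
# Drop the header line before importing
tail -n +2 transformed_data.csv > transformed_data_noheader.csv
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data_noheader.csv data
EOF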
We can verify that the data has been inserted correctly by querying the database.
# Querying the database
sqlite3 etl_database.db "SELECT * FROM data;"
This command retrieves all rows from the data table and displays them.
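For a quick sanity check, you can also count the loaded rows and preview a few of them; the -header and -column flags simply format the output as a readable table:
# Count the loaded rows and preview the first five
sqlite3 -header -column etl_database.db "SELECT COUNT(*) AS row_count FROM data;"
sqlite3 -header -column etl_database.db "SELECT * FROM data LIMIT 5;"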
Final thoughts
We have covered the following steps while building our ETL pipeline with bash:
- Setting up the environment and installing tools
- Extracting data from a public API and a CSV file with curl
- Transforming data using jq, awk, and sed
- Loading data into a SQLite database with sqlite3
Bash is a good choice for ETL because of its simplicity, flexibility, automation capabilities, and interoperability with other CLI tools.
For further exploration, consider incorporating error handling, scheduling the pipeline using cron, or learning more advanced bash concepts. You can also investigate alternative transformation methods and applications to increase your pipeline skills.
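As a starting point for those extensions, here is a minimal, hedged sketch of how the individual steps might be stitched into a single script with basic error handling, plus a cron entry to run it nightly. The URL, JSON field names, paths, and schedule are all placeholders:
#!/usr/bin/env bash
# run_etl.sh -- a minimal end-to-end sketch of the pipeline above
set -euo pipefail   # stop on errors, unset variables, and failed pipes

WORKDIR="$HOME/etl_project"
cd "$WORKDIR"

# Extract: fetch the raw data (placeholder URL)
curl -fsSL -o data.json "https://api.example.com/data"

# Transform: flatten the JSON records to CSV (assumes id/name/value fields)
jq -r '.data[] | [.id, .name, .value] | @csv' data.json > transformed_data.csv

# Load: import the CSV into SQLite
sqlite3 etl_database.db <<EOF
.mode csv
.import transformed_data.csv data
EOF

echo "ETL run completed at $(date)"
A crontab entry such as 0 2 * * * /bin/bash $HOME/etl_project/run_etl.sh >> $HOME/etl_project/etl.log 2>&1 would run the script every night at 2:00 AM and append its output to a log file.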
Try building your own ETL projects, putting what you've learned to the test in more elaborate scenarios. Hopefully, the basics here will be a good starting point for more complex data engineering tasks.
Matthew Mayo (twitter.com/mattmayo13) has a master's degree in computer science and a postgraduate diploma in data mining. As editor-in-chief of KDnuggets & Statology, and contributing editor at Machine Learning Mastery, Matthew aims to make complex data science concepts accessible. His professional interests include natural language processing, language models, machine learning algorithms, and exploring emerging AI. His mission is to democratize knowledge in the data science community. Matthew has been coding since he was 6 years old.