In this story, I'd like to talk about the things I like about Pandas and frequently use in the ETL applications I write to process data. We'll touch on exploratory data analysis, data cleaning, and data frame transformations. I'll demonstrate some of my favorite techniques for optimizing memory usage and processing large amounts of data efficiently with this library. Working with relatively small data sets in Pandas is rarely a problem. It handles data frames with ease and provides a very convenient set of commands for processing them. When it comes to data transformations on much larger data frames (1 GB and up), you would typically use Spark and distributed compute clusters. Spark can handle terabytes and even petabytes of data, but running all that hardware will probably also cost a lot of money. That's why Pandas might be a better choice when we have to deal with medium-sized data sets in environments with limited memory resources.
Pandas and Python Generators
In one of my previous stories, I wrote about how to process data efficiently using generators in Python (1).
It is a simple trick to optimize memory usage. Let's imagine we have a huge data set somewhere in external storage. It could be a database or just a large CSV file. Suppose we need to process a 2-3 TB file and apply some transformation to each row of data in it, and that the service performing this task has only 32 GB of memory. This limits how much data we can load: we cannot read the entire file into memory and split it line by line with a simple Python split('\n') call. The solution is to process the file row by row and yield each row, freeing memory for the next one. This helps us create a constant flow of ETL data towards the final destination of our data pipeline. That destination can be anything: a cloud storage bucket, another database, a data warehouse (DWH) solution, a streaming topic, or something else.
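To make the idea concrete, here is a minimal sketch. The file name huge_dataset.csv, the transform_row function, and the send_to_destination call are hypothetical placeholders, not part of any real pipeline; the point is simply that the generator yields one processed row at a time instead of materializing the whole file in memory.

```python
import csv

def stream_rows(path: str):
    """Yield transformed rows from a large CSV file one at a time,
    so the whole file never has to fit in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield transform_row(row)

def transform_row(row: dict) -> dict:
    # Hypothetical per-row transformation: strip whitespace from every value.
    return {key: value.strip() for key, value in row.items()}

# Each processed row can be pushed to the destination (a bucket, another
# database, a DWH, a streaming topic, ...) as soon as it is produced.
for processed_row in stream_rows("huge_dataset.csv"):
    ...  # e.g. send_to_destination(processed_row)
```

Pandas offers a similar streaming mechanism out of the box: pandas.read_csv(path, chunksize=...) returns an iterator of DataFrame chunks instead of one giant frame, so the same row-by-row (or chunk-by-chunk) processing pattern works without leaving the library.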