In this story, I'd like to talk about the things I like about Pandas and frequently use in the ETL applications I write to process data. We'll touch on exploratory data analysis, data cleaning, and data frame transformations. I'll demonstrate some of my favorite techniques for optimizing memory usage and processing large amounts of data efficiently with this library. Working with relatively small data sets in Pandas is rarely a problem. It handles data frames with ease and provides a very convenient set of commands for processing them. When it comes to data transformations on much larger data frames (1 GB and up), you would typically use Spark and distributed compute clusters. Spark can handle terabytes and even petabytes of data, but running all that hardware will probably also cost a lot of money. That's why Pandas might be a better choice when we have to deal with medium-sized data sets in environments with limited memory resources.
Pandas and Python Generators
In one of my previous stories, I wrote about how to process data efficiently using generators in Python (1).
It is a simple trick to optimize memory usage. Let's imagine we have a huge data set somewhere in external storage. It could be a database or just a large CSV file. Suppose we need to process a 2-3 TB file and apply some transformation to each row of data in it, and that the service performing this task has only 32 GB of memory. This limits how much data we can load: we cannot read the entire file into memory and split it line by line with a simple Python split('\n') call. The solution is to process the file row by row and yield each row, freeing memory for the next one. This helps us create a constant flow of ETL data towards the final destination of our data pipeline. That destination can be anything: a cloud storage bucket, another database, a data warehouse (DWH) solution, a streaming topic, or something else.
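To make the idea concrete, here is a minimal sketch. The file name huge_dataset.csv, the transform_row function, and the send_to_destination call are hypothetical placeholders, not part of any real pipeline; the point is simply that the generator yields one processed row at a time instead of materializing the whole file in memory.

```python
import csv

def stream_rows(path: str):
    """Yield transformed rows from a large CSV file one at a time,
    so the whole file never has to fit in memory."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        for row in reader:
            yield transform_row(row)

def transform_row(row: dict) -> dict:
    # Hypothetical per-row transformation: strip whitespace from every value.
    return {key: value.strip() for key, value in row.items()}

# Each processed row can be pushed to the destination (a bucket, another
# database, a DWH, a streaming topic, ...) as soon as it is produced.
for processed_row in stream_rows("huge_dataset.csv"):
    ...  # e.g. send_to_destination(processed_row)
```

Pandas offers a similar streaming mechanism out of the box: pandas.read_csv(path, chunksize=...) returns an iterator of DataFrame chunks instead of one giant frame, so the same row-by-row (or chunk-by-chunk) processing pattern works without leaving the library.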