Torch multiprocessing best practices.
However, virtual memory is only one side of the story. What if the problem does not go away after adjusting the swap disk?
The other side of the story lies in underlying problems with the torch.multiprocessing module. The official documentation already lists a number of best-practice recommendations, but beyond those, three more approaches are worth considering, especially when it comes to memory usage.
The first is the shared memory leak. A leak means that memory is not properly freed after each execution of a worker process, and you will observe this when you monitor virtual memory usage at runtime: consumption keeps climbing until it hits “out of memory”. This is a very typical memory leak.
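One way to watch this at runtime is to log the resident memory of the training process every so many steps. A minimal sketch, assuming the psutil package is installed and that dataloader is your existing DataLoader (both are assumptions; tools such as top or free -m work just as well):

import os
import psutil

process = psutil.Process(os.getpid())
for step, batch in enumerate(dataloader):
    # ... your training step ...
    if step % 100 == 0:
        rss_gb = process.memory_info().rss / 1024 ** 3
        print(f"step {step}: resident memory {rss_gb:.2f} GB")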
So what will cause the leak?
Let's take a look at the DataLoader class itself:
https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py
Looking under the hood of DataLoader, we see that when num_workers > 0, _MultiProcessingDataLoaderIter is used. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing creates the worker queues. torch.multiprocessing supports two different strategies for memory sharing and caching: file_descriptor and file_system. While file_system does not require caching file descriptors, it is prone to shared memory leaks.
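For reference, the multiprocessing code path is only taken when num_workers is greater than zero. A minimal sketch (TensorDataset with random data stands in for a real dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16))
# num_workers=0 loads data in the main process;
# num_workers>0 goes through _MultiProcessingDataLoaderIter and starts worker processes.
loader = DataLoader(dataset, batch_size=32, num_workers=4)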
To check which sharing strategy your machine is using, simply run:
torch.multiprocessing.get_sharing_strategy()
To get the file descriptor limit for your system (Linux), run the following command in terminal:
ulimit -n
To change your sharing strategy to file_descriptor:
torch.multiprocessing.set_sharing_strategy('file_descriptor')
To count the number of open file descriptors, run the following command:
ls /proc/self/fd | wc -l
As long as the system allows it, the file_descriptor strategy is recommended.
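Putting the pieces together, a minimal sketch that checks the current strategy, optionally raises the soft file-descriptor limit (Linux only), and switches to file_descriptor might look like this:

import resource
import torch.multiprocessing as mp

print(mp.get_sharing_strategy())  # e.g. 'file_descriptor' or 'file_system'

# Optionally raise the soft file-descriptor limit to the hard limit,
# since the file_descriptor strategy keeps one descriptor per shared tensor.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

mp.set_sharing_strategy('file_descriptor')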
The second is the start method of the multiprocessing workers. Simply put, it is the debate over whether to use fork or spawn as the worker start method. fork is the default start method on Linux; thanks to copy-on-write it avoids copying the parent process's memory up front, which makes it much faster, but you may have trouble handling CUDA tensors and third-party libraries like OpenCV in your DataLoader workers.
To use the spawn method, you can simply pass the argument multiprocessing_context="spawn" to the DataLoader.
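A minimal sketch (again with TensorDataset standing in for your own dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    multiprocessing_context="spawn",  # start workers with spawn instead of fork
)

Note that with spawn, the code that creates the DataLoader should live under an if __name__ == "__main__": guard, because each worker re-imports the main module.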
The third is to make dataset objects picklable/serializable.
There is a very nice post that further discusses the “copy-on-read” effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/
In short, it is no longer a good approach to keep a plain Python list of file names and index into it in the __getitem__ method. Store the list of file names in a numpy array or pandas DataFrame instead, for serialization purposes. And if you're familiar with HuggingFace, using a CSV/DataFrame is the recommended way to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
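To illustrate, a minimal sketch of a dataset that serializes its file list into a numpy array instead of keeping a Python list of strings (the actual sample loading is left as a placeholder):

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    def __init__(self, filenames):
        # A numpy byte-string array avoids the per-object refcounting that
        # turns copy-on-write into copy-on-read in forked workers.
        self.filenames = np.array(filenames).astype(np.bytes_)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = self.filenames[idx].decode("utf-8")
        # Load and return the sample stored at `path` here (placeholder).
        return path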