Torch multiprocessing best practices.
However, virtual memory is only one side of the story. What if the problem does not go away after adjusting the swap disk?
The other side of the story lies in underlying problems with the torch.multiprocessing module. The official documentation already lists a number of best-practice recommendations, but beyond those, three more approaches are worth considering, especially when it comes to memory usage.
The first is the shared memory leak. A leak means that memory is not properly freed after each execution of a worker process, and you will observe this when you monitor virtual memory usage at runtime: consumption keeps climbing until it hits “out of memory”. This is a very typical memory leak.
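One way to watch this at runtime is to log the resident memory of the training process every so many steps. A minimal sketch, assuming the psutil package is installed and that dataloader is your existing DataLoader (both are assumptions; tools such as top or free -m work just as well):

import os
import psutil

process = psutil.Process(os.getpid())
for step, batch in enumerate(dataloader):
    # ... your training step ...
    if step % 100 == 0:
        rss_gb = process.memory_info().rss / 1024 ** 3
        print(f"step {step}: resident memory {rss_gb:.2f} GB")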
So what will cause the leak?
Let's take a look at the DataLoader class itself:
https://github.com/pytorch/pytorch/blob/main/torch/utils/data/dataloader.py
Looking under the hood of DataLoader, we see that when num_workers > 0, _MultiProcessingDataLoaderIter is used. Inside _MultiProcessingDataLoaderIter, torch.multiprocessing creates the worker queues. torch.multiprocessing supports two different strategies for memory sharing and caching: file_descriptor and file_system. While file_system does not require caching file descriptors, it is prone to shared memory leaks.
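For reference, the multiprocessing code path is only taken when num_workers is greater than zero. A minimal sketch (TensorDataset with random data stands in for a real dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16))
# num_workers=0 loads data in the main process;
# num_workers>0 goes through _MultiProcessingDataLoaderIter and starts worker processes.
loader = DataLoader(dataset, batch_size=32, num_workers=4)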
To check which sharing strategy your machine is using, simply run:
torch.multiprocessing.get_sharing_strategy()
To get the file descriptor limit for your system (Linux), run the following command in terminal:
ulimit -n
To change your sharing strategy to file_descriptor:
torch.multiprocessing.set_sharing_strategy('file_descriptor')
To count the number of open file descriptors, run the following command:
ls /proc/self/fd | wc -l
As long as the system allows it, the file_descriptor strategy is recommended.
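Putting the pieces together, a minimal sketch that checks the current strategy, optionally raises the soft file-descriptor limit (Linux only), and switches to file_descriptor might look like this:

import resource
import torch.multiprocessing as mp

print(mp.get_sharing_strategy())  # e.g. 'file_descriptor' or 'file_system'

# Optionally raise the soft file-descriptor limit to the hard limit,
# since the file_descriptor strategy keeps one descriptor per shared tensor.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))

mp.set_sharing_strategy('file_descriptor')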
The second is the start method of the multiprocessing workers. Simply put, it is the debate over whether to use fork or spawn as the worker start method. fork is the default start method on Linux; thanks to copy-on-write it avoids copying the parent process's memory up front, which makes it much faster, but you may have trouble handling CUDA tensors and third-party libraries like OpenCV in your DataLoader workers.
To use the spawn method, you can simply pass the argument multiprocessing_context="spawn" to the DataLoader.
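A minimal sketch (again with TensorDataset standing in for your own dataset):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 16))
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    multiprocessing_context="spawn",  # start workers with spawn instead of fork
)

Note that with spawn, the code that creates the DataLoader should live under an if __name__ == "__main__": guard, because each worker re-imports the main module.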
The third is to make dataset objects picklable/serializable.
There is a very nice post that further discusses the “copy-on-read” effect of process forking: https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/
In short, it is no longer a good approach to keep a plain Python list of file names and index into it in the __getitem__ method. Store the list of file names in a numpy array or pandas DataFrame instead, for serialization purposes. And if you're familiar with HuggingFace, using a CSV/DataFrame is the recommended way to load a local dataset: https://huggingface.co/docs/datasets/v2.19.0/en/package_reference/loading_methods#datasets.load_dataset.example-2
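To illustrate, a minimal sketch of a dataset that serializes its file list into a numpy array instead of keeping a Python list of strings (the actual sample loading is left as a placeholder):

import numpy as np
from torch.utils.data import Dataset

class FileListDataset(Dataset):
    def __init__(self, filenames):
        # A numpy byte-string array avoids the per-object refcounting that
        # turns copy-on-write into copy-on-read in forked workers.
        self.filenames = np.array(filenames).astype(np.bytes_)

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        path = self.filenames[idx].decode("utf-8")
        # Load and return the sample stored at `path` here (placeholder).
        return path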