Before you can create a machine learning model, you must load your data into a dataset. Fortunately, PyTorch has many commands to help with this entire process (if you're not familiar with PyTorch, I recommend brushing up on the basics here).
PyTorch has good documentation to help with this process, but I haven't found any comprehensive documentation or tutorials on custom datasets. I will start by creating basic pre-built datasets and then progress to creating datasets from scratch for different models!
Before we delve into the code for different use cases, let's understand the difference between the two terms. Generally, you first create your dataset and then create a data loader. A dataset contains the features and labels of each data point that will be fed into the model. A data loader is a PyTorch iterable that makes it easy to load the data in batches with additional functionality.
DataLoader(dataset, batch_size=1, shuffle=False, sampler=None,
           batch_sampler=None, num_workers=0, collate_fn=None,
           pin_memory=False, drop_last=False, timeout=0,
           worker_init_fn=None, *, prefetch_factor=2,
           persistent_workers=False)
The most common arguments in the data loader are batch_size, shuffle (usually only for training data), num_workers (to load the data with multiple subprocesses), and pin_memory (to put the fetched data tensors in pinned memory and enable faster data transfer to CUDA-enabled GPUs).
Due to multiprocessing complications with CUDA, it is recommended to set pin_memory=True rather than relying on num_workers for speed when moving data to the GPU.
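To make these arguments concrete, here is a minimal sketch of wrapping a dataset in a data loader; the TensorDataset filled with random tensors is a stand-in for whatever dataset you actually use:
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset: 100 samples with 10 features and one label each
features = torch.randn(100, 10)
labels = torch.randint(0, 2, (100,))
train_data = TensorDataset(features, labels)

# Wrap the dataset in a DataLoader using the common arguments
train_loader = DataLoader(
    train_data,
    batch_size=16,    # number of samples per batch
    shuffle=True,     # reshuffle the data every epoch (training only)
    num_workers=2,    # load data with 2 subprocesses
    pin_memory=True,  # speed up host-to-GPU transfers
)

# Iterate over batches as you would in a training loop
for batch_features, batch_labels in train_loader:
    print(batch_features.shape, batch_labels.shape)  # torch.Size([16, 10]) torch.Size([16])
    break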
If your dataset can be downloaded online or already exists locally, creating the dataset is extremely easy. PyTorch has good documentation on this, so I'll be brief.
If you know the dataset is built into PyTorch or is PyTorch-compatible, simply call the necessary imports and the dataset of your choice:
from torchvision import datasets
from torchvision.transforms import ToTensor

# download the training split of CIFAR10 to 'path' and convert images to tensors
data = datasets.CIFAR10('path', train=True, download=True, transform=ToTensor())
Each dataset has unique arguments to pass to it (found here). In general, these are the path where the dataset is stored, a boolean indicating whether it should be downloaded (conveniently called download), whether it is the training or test split, and whether transforms should be applied.
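For instance, here is a sketch of those arguments using FashionMNIST; any of the built-in torchvision datasets follows the same pattern:
from torchvision import datasets
from torchvision.transforms import ToTensor

# Download the test split to './data' and convert images to tensors
test_data = datasets.FashionMNIST(
    root='./data',         # path where the dataset is stored
    train=False,           # use the test split instead of training
    download=True,         # download the data if it is not at root
    transform=ToTensor(),  # apply a transform to every image
)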
At the end of the last section, I mentioned that transforms can be applied to a dataset, but what exactly is a transform?
A transform is a data manipulation method for preprocessing an image. There are many different facets to transforms. The most common transform, ToTensor(), will convert the dataset into tensors (required to feed into any model). Other transforms built into PyTorch (torchvision.transforms) include flipping, rotating, cropping, normalizing, and shifting images. They are usually used so that the model can generalize better and does not overfit the training data. Data augmentation can also be used to artificially increase the size of the dataset if necessary.
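As a quick illustration, here is a sketch of applying individual built-in transforms to an image; the file name is a placeholder for any image on disk:
from PIL import Image
from torchvision import transforms

img = Image.open('example.jpg')  # placeholder path to any image file

# Flip the image horizontally with probability 1.0 so the effect is always visible
flipped = transforms.RandomHorizontalFlip(p=1.0)(img)

# Convert the image to a tensor of shape (C, H, W) with values in [0, 1]
tensor = transforms.ToTensor()(img)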
Note that most torchvision transforms only accept Pillow or tensor image formats (not NumPy arrays). To convert from NumPy, create a torch tensor or use the following:
from PIL import Image

# assume arr is a numpy array
# you may need to normalize and cast arr to np.uint8 depending on format
img = Image.fromarray(arr)
# alternatively, convert directly to a tensor: torch.from_numpy(arr)
Transforms can be chained together using torchvision.transforms.Compose. You can combine as many transforms as you need for the dataset. Here is an example:
from torchvision import transforms

dataset_transform = transforms.Compose([
    transforms.RandomResizedCrop(256),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
Make sure to pass the saved transform as an argument to the dataset so that it is applied in the data loader.
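Continuing the CIFAR10 example from earlier, passing the composed transform might look like this (assuming dataset_transform from the snippet above):
from torchvision import datasets

# Apply the composed transform to every image as it is loaded
data = datasets.CIFAR10('path', train=True, download=True, transform=dataset_transform)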
In most cases, when developing your own model, you will need a custom dataset. A common use case is transfer learning, where you apply your own dataset to a pre-trained model.
There are three parts required for a PyTorch dataset class: initialization (__init__), length (__len__), and retrieving an item (__getitem__).
__init__: To initialize the dataset, pass in the raw and labeled data. Best practice is to pass the raw image data and the labeled data separately.
__len__: Returns the length of the dataset. Before creating the dataset, verify that the raw and labeled data are the same size.
__getitem__: This is where all the data handling happens to return a given index (idx) of the raw and labeled data. If any transforms need to be applied, the data must be converted to a tensor and transformed. If the initialization contained a path to the dataset, the path must be opened and the data accessed/preprocessed before it can be returned.
Here is an example dataset for a semantic segmentation model:
import torch
from torch.utils.data import Dataset


class ExampleDataset(Dataset):
    """Example dataset for a semantic segmentation model."""

    def __init__(self, raw_img, data_mask, transform=None):
        self.raw_img = raw_img
        self.data_mask = data_mask
        self.transform = transform

    def __len__(self):
        return len(self.raw_img)

    def __getitem__(self, idx):
        if torch.is_tensor(idx):
            idx = idx.tolist()
        # index (not call) the stored data
        image = self.raw_img[idx]
        mask = self.data_mask[idx]
        sample = {'image': image, 'mask': mask}
        # note: the transform must accept the sample dict, so built-in
        # torchvision transforms need a small wrapper here
        if self.transform:
            sample = self.transform(sample)
        return sample
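To tie everything together, here is a hypothetical sketch of feeding this custom dataset to a data loader; the random tensors stand in for real images and segmentation masks:
import torch
from torch.utils.data import DataLoader

# Placeholder data: 10 RGB images and 10 single-channel masks, 64x64 each
raw_img = torch.randn(10, 3, 64, 64)
data_mask = torch.randint(0, 2, (10, 1, 64, 64))

dataset = ExampleDataset(raw_img, data_mask)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

# The default collate function stacks each key of the sample dict into a batch
for batch in loader:
    print(batch['image'].shape, batch['mask'].shape)
    # torch.Size([4, 3, 64, 64]) torch.Size([4, 1, 64, 64])
    break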