2. .dvc files
.dvc files use the YAML 1.2 file format, a human-friendly data serialization format for all programming languages.
As I mentioned earlier, DVC creates a lightweight .dvc file for each file or folder tracked with DVC.
When you look inside images.dvc, you will see the following entries:
The most interesting part is md5. MD5 is a popular hash function. It takes a file of arbitrary size and uses its contents to produce a fixed-length string (32 hexadecimal characters in our case).
These characters may seem random, but they will always be the same no matter how many times you rehash the file. However, if even a single bit in the file changes, the resulting hash will be completely different.
DVC uses these hashes (also called checksums) to determine whether two files are identical, completely different, or different versions of the same file.
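The two properties described above (same content always gives the same hash, and a one-bit change gives a different one) can be checked with Python's standard hashlib module. This is only a small illustration of MD5 itself, not of how DVC computes its checksums internally; the byte string and the md5_hex helper are made up for the example.

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of data as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

# Hypothetical stand-in for the contents of a real file.
original = b"pretend these bytes are the contents of an image"
# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0b00000001]) + original[1:]

digest_a = md5_hex(original)
digest_b = md5_hex(original)   # rehash the same content
digest_c = md5_hex(modified)   # hash the content with one bit flipped

print(len(digest_a))           # 32
print(digest_a == digest_b)    # True: same content, same hash
print(digest_a == digest_c)    # False: one bit changed, different hash
```

For a real file you would pass the bytes read from disk (for example, Path("some_file").read_bytes()) instead of a literal.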
For example, if I add a new fake image to the images folder, the resulting MD5 hash inside images.dvc will be different:
As mentioned above, you need to keep track of all .dvc files with Git so that modifications to large assets become part of your Git commits and history.
$ git add images.dvc
Learn more about how .dvc files work from this page of the DVC User Guide.
3. DVC Cache
When you call dvc add on a large asset, it is copied to a special directory called the DVC cache, located at .dvc/cache.
The cache is where DVC keeps a clean record of your data and models across different versions. The .dvc files in the current working directory may point to the most recent version or some other version of the large assets, but the cache will include all the different states the assets have been in since you started tracking them with DVC.
For example, suppose you added a 1 GB data.csv file to DVC. By default, the file will be both in your workspace and within the .dvc/cache folder, taking up twice the space (2 GB).
Any subsequent changes tracked with dvc add data.csv will create a new version of data.csv with a new hash inside .dvc/cache, taking up another gigabyte of disk space.
So you may already be wondering: isn’t this highly inefficient? And the answer would be yes! At least for individual files, but we’ll cover methods to mitigate this problem in the next section.
As for folders, it’s a bit different.
When you track different versions of folders with dvc add dirname, DVC is smart enough to detect only the files that changed within that directory. This means that unless you update all the files in the directory, DVC will cache only the new versions of the changed files, which won't take up much space.
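To see why storing files by their hash has this effect, here is a toy content-addressed cache written in Python. It is a simplified model for illustration only and not DVC's actual implementation: the add_to_cache function, the toy_cache folder, and the file names are all hypothetical, and real DVC's on-disk layout and directory handling are more involved.

```python
import hashlib
import tempfile
from pathlib import Path

def add_to_cache(workspace_dir: Path, cache_dir: Path) -> int:
    """Copy each file in workspace_dir into cache_dir under its MD5 hash.

    Returns the number of files newly written to the cache.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    newly_cached = 0
    for path in sorted(workspace_dir.iterdir()):
        if not path.is_file():
            continue
        data = path.read_bytes()
        target = cache_dir / hashlib.md5(data).hexdigest()
        if not target.exists():      # unchanged content is already cached
            target.write_bytes(data)
            newly_cached += 1
    return newly_cached

with tempfile.TemporaryDirectory() as tmp:
    images = Path(tmp) / "images"    # hypothetical tracked folder
    cache = Path(tmp) / "toy_cache"  # stand-in for a cache directory
    images.mkdir()
    for i in range(3):
        (images / f"img_{i}.bin").write_bytes(f"image {i}".encode())

    first_add = add_to_cache(images, cache)     # all three files are new
    (images / "img_1.bin").write_bytes(b"image 1, edited")
    second_add = add_to_cache(images, cache)    # only the edited file is new
    cached_total = len(list(cache.iterdir()))

print(first_add, second_add, cached_total)      # 3 1 4
```

The second call stores one new entry instead of three, because the two untouched files hash to names that already exist in the cache. The same model also shows the single-file case from earlier: every edit to a lone large file produces a new hash, so the whole file is stored again.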
In short, think of the DVC cache as a counterpart to Git's staging area.
Learn more about internal DVC files and the cache in this section of the user guide.