DVC - Data Version Control Cheatsheet
The first thing to do in a brand new directory is to initialise Git and DVC
git init
dvc init
Next, we create a data directory and then use dvc get to download data from a data registry to our local machine.
dvc get works like wget or curl, except that it downloads a file tracked in a DVC repository
mkdir data
dvc get https://github.com/iterative/dataset-registry \
    get-started/data.xml -o data/data.xml
Now that we have the file data/data.xml, we can start tracking it with DVC
dvc add data/data.xml
As soon as we run this, DVC will instruct us to add the changes to git. These two files are generated when we run dvc add
git add data/.gitignore data/data.xml.dvc
We will then commit these two files using git
git commit -m "add raw data"
If we take a look at
data/data.xml.dvc, we will see something like the following. This file contains the metadata required to track the data file and will go into your git repository.
outs:
- md5: a304afb96060aad90176268345e10355
  path: data.xml
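To make the md5 field less abstract, here is a minimal sketch of how a hash like the one above could map to a location inside .dvc/cache. It assumes the cache shards files by the first two characters of the hash. cache_path_for is a hypothetical helper written for illustration, not a DVC command, and newer DVC releases may nest the result under an extra files/md5/ prefix.

```shell
# Hypothetical helper: map an md5 from a .dvc file to a relative cache path.
# Assumption: the first two hex characters name a directory and the
# remaining characters name the file inside it.
cache_path_for() {
    md5="$1"
    prefix=$(printf '%s' "$md5" | cut -c1-2)
    rest=$(printf '%s' "$md5" | cut -c3-)
    printf '%s/%s\n' "$prefix" "$rest"
}

cache_path_for a304afb96060aad90176268345e10355
```

Under that assumption, the file would be looked up relative to the cache root, for example with ls .dvc/cache.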
Next, we add a remote storage. In this case, I'm adding an S3 bucket and using the path "test"
dvc remote add -d storage s3://derekchia/test
Then we commit the DVC configuration file, which now contains the settings for our remote storage
git commit .dvc/config -m "Configure remote storage"
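For reference, after the dvc remote add command, .dvc/config would be expected to look roughly like the fragment below. This is a sketch of the usual layout rather than output captured from this project, so the exact whitespace and any extra sections may differ between DVC versions.

```ini
[core]
    remote = storage
['remote "storage"']
    url = s3://derekchia/test
```

The -d flag is what sets storage as the default remote under [core].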
Next, we can push the data file to our remote storage
dvc push
Removing data and pulling it back
We can now try removing the data file and then pulling it back. We also need to remove .dvc/cache, since this is where our data files are actually stored. See https://dvc.org/doc/command-reference/cache for more information
rm -f data/data.xml
rm -rf .dvc/cache
We then pull the data back from our remote storage
dvc pull
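One way to convince ourselves that the pull restored the right file is to compare its md5 with the one recorded in the .dvc file. The sketch below assumes GNU md5sum is available (macOS ships md5 instead) and that the .dvc file uses the simple outs/md5 layout. recorded_md5 and matches_recorded_md5 are hypothetical helpers, not DVC commands, and some DVC versions may hash text files slightly differently, so a mismatch is not always conclusive.

```shell
# Hypothetical helpers for illustration; they are not part of DVC.

# Print the first md5 value recorded in a .dvc file.
recorded_md5() {
    sed -n 's/^[- ]*md5: *//p' "$1" | head -n 1
}

# Succeed if the file's current md5 equals the recorded one.
# $1 is the data file, $2 is its .dvc file.
matches_recorded_md5() {
    actual=$(md5sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$(recorded_md5 "$2")" ]
}

# Usage, after dvc pull:
#   matches_recorded_md5 data/data.xml data/data.xml.dvc && echo "restored"
```

Running dvc status is the built-in way to check whether the workspace and the .dvc files agree.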
Making changes to your data and reverting to previous version
To mimic the change in data, we double the data size using the following command
cp data/data.xml /tmp/data.xml
cat /tmp/data.xml >> data/data.xml
When we change our data file and run dvc add again, the .dvc file is updated as well. This means that we need to commit it with git before pushing the changed file to our remote storage
dvc add data/data.xml
git add data/data.xml.dvc
git commit -m "Dataset update"
dvc push
We can confirm that the updated file was pushed by checking that our remote storage now has two folders, one for each version of the file.
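The reason a second version shows up in the remote is that DVC identifies file contents by hash, so doubling the file yields a different md5. Here is a minimal sketch on a throwaway file. The file contents and the tmpdir variable are made up for illustration, and md5sum is assumed to be available.

```shell
# Work on a temporary copy so the real data/data.xml is left alone.
tmpdir=$(mktemp -d)
printf '<row Id="1" />\n' > "$tmpdir/data.xml"

# Hash before the change.
before=$(md5sum "$tmpdir/data.xml" | cut -d' ' -f1)

# Double the file, mirroring the cp and cat commands used above.
cp "$tmpdir/data.xml" "$tmpdir/copy.xml"
cat "$tmpdir/copy.xml" >> "$tmpdir/data.xml"

# Hash after the change.
after=$(md5sum "$tmpdir/data.xml" | cut -d' ' -f1)

echo "before: $before"
echo "after:  $after"
rm -rf "$tmpdir"
```

The two printed hashes differ, which is why the old and new versions are stored as separate objects.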
If we look at our git commit log, we will see that we have several commits.
$ git log --oneline
b3330e4 (HEAD -> master) Dataset update
b74143a Configure remote storage
b1ef2ae add raw data
To revert data/data.xml.dvc to the version from the previous commit, we run
$ git checkout HEAD^1 data/data.xml.dvc
Updated 1 path from 522ae3f
Next, we run dvc checkout so that the matching data file is restored. The output shows that data/data.xml has been modified
$ dvc checkout
M data/data.xml
To keep this version of data/data.xml.dvc, we can do a git commit. Since this version of the dataset is already stored in DVC, we do not need to do another dvc push
git commit data/data.xml.dvc -m "Revert data update"
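The revert steps above (restore the .dvc file with git, then let DVC restore the data) can be sketched as one small shell function. revert_dvc_file is a hypothetical name used for illustration and is not part of git or DVC. It assumes both tools are on the PATH and that the requested version is still available in the cache or can be pulled.

```shell
# Hypothetical helper: restore a .dvc file from a given git revision,
# then ask DVC to restore the data file it describes.
# $1 is the git revision (for example HEAD^1), $2 is the .dvc file.
revert_dvc_file() {
    rev="$1"
    dvc_file="$2"
    # Only run dvc checkout if git managed to restore the .dvc file.
    git checkout "$rev" -- "$dvc_file" && dvc checkout "$dvc_file"
}

# Usage:
#   revert_dvc_file HEAD^1 data/data.xml.dvc
```

Committing the reverted .dvc file is left as a separate, explicit step so that the revert can be inspected first.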