Tools to support reproducible modeling

Data Version Control (DVC)

DVC stands for Data Version Control, and it does exactly that. It is built on top of Git and uses very similar commands, which makes it easy to pick up but also easy to confuse with Git itself. DVC ensures that your large files are not tracked by Git. Instead, it keeps the large files in its own cache and writes small text files that store a unique "version code" (a hash) each time you commit to DVC. These text files then have to be committed to Git. This may seem a bit cumbersome, as you first commit the data to DVC and then its version code to Git, but it lets you use the full potential of both GitHub and cloud storage (such as Amazon S3, Google Cloud, or Microsoft Azure). The version history is stored in Git and can be pushed to GitHub or GitLab, while the data itself can be stored on platforms that are better suited to handling large files.
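The idea of a cache plus small hash files can be illustrated with a short Python sketch. This is not DVC's implementation or API: the functions `track` and `restore`, the `.pointer` file name and layout, and the choice of SHA-256 are all assumptions made for illustration only.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def track(data_file, cache_dir):
    """Copy data_file into a hash-keyed cache and write a small pointer file.

    Hypothetical helper: the pointer format is invented for this sketch.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copy2(data_file, cached)  # the large file lives in the cache
    pointer = data_file.with_name(data_file.name + ".pointer")
    # Only this small text file would be committed to Git.
    pointer.write_text(f"hash: {digest}\npath: {data_file.name}\n")
    return pointer


def restore(pointer, cache_dir):
    """Recreate the data file that a pointer file refers to."""
    fields = dict(line.split(": ", 1) for line in pointer.read_text().splitlines())
    target = pointer.with_name(fields["path"])
    shutil.copy2(cache_dir / fields["hash"], target)
    return target


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        data, cache = root / "data.csv", root / "cache"
        data.write_text("a,b\n1,2\n")
        pointer = track(data, cache)
        old_pointer_text = pointer.read_text()  # stands in for an older Git commit
        data.write_text("a,b\n3,4\n")
        track(data, cache)
        # Going back in history: restore the old pointer, then fetch from the cache.
        pointer.write_text(old_pointer_text)
        restore(pointer, cache)
        return data.read_text(), len(list(cache.iterdir()))
```

Calling `demo()` tracks two versions of a file and then brings back the first one using only its pointer file, which mirrors why the pointer files are what ends up in Git.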

Find more on how to install and use DVC here

Git

As of March 2023, Git is the most commonly used version control system for code. Code under development is usually shared on platforms such as GitHub or GitLab.

One of the main differences between Git and older systems such as Subversion is that Git is a distributed version control system, whereas Subversion is a centralized one. In Subversion, the history of changes is stored only on the central server, not on the user's machine, so the user has to be connected to the server to do any version control. With Git, a repository on a server (called a remote) is first cloned to the user's local machine. This clone includes the full history of changes. The user can then make changes and commit them locally, and can push them back to the remote afterwards, but does not have to. Because work does not depend on a central server, and because of widely used platforms such as GitHub, the distributed nature of Git allows for smooth collaboration between different organizations as well as open-source development. Furthermore, you can develop code offline.
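The clone-then-commit-locally behaviour can be tried out with a small Python sketch that drives the `git` command line. It assumes `git` is installed and on the PATH; the directory names `remote` and `local`, the file `model.txt`, and the placeholder user name and e-mail are hypothetical. The "remote" here is just a second local directory, which is enough to show that the clone carries the full history.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def git(*args, cwd):
    """Run a git command in cwd and return its standard output as text."""
    result = subprocess.run(
        ["git", "-c", "user.name=Example", "-c", "user.email=example@example.org", *args],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def clone_has_full_history():
    with tempfile.TemporaryDirectory() as tmp:
        remote = Path(tmp) / "remote"
        local = Path(tmp) / "local"
        remote.mkdir()
        git("init", cwd=remote)
        for i in (1, 2):
            (remote / "model.txt").write_text(f"version {i}\n")
            git("add", "model.txt", cwd=remote)
            git("commit", "-m", f"change {i}", cwd=remote)
        # Cloning copies the complete history to the local machine.
        git("clone", str(remote), str(local), cwd=tmp)
        # This commit needs no connection to the remote; pushing is optional.
        (local / "model.txt").write_text("version 3\n")
        git("commit", "-am", "change 3 (offline)", cwd=local)
        return git("log", "--format=%s", cwd=local).splitlines()
```

The returned list contains the two commits made in the remote as well as the one made locally, newest first, without anything having been pushed.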

Find more on how to install and use Git here

Snakemake

Snakemake is a powerful workflow management system that allows you to define and execute complex computational workflows in a reproducible and scalable manner. In this tutorial, we'll walk through the basics of Snakemake and demonstrate how to create and execute a simple workflow.
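To give a feel for what a workflow manager does, the sketch below models the core idea in plain Python: each rule names its input files, its output file, and an action, and a target is built by first building whatever it depends on. This is not Snakemake syntax or its API. The rule table, the file names, and the `build` function are hypothetical, and the sketch only checks whether an output exists, whereas a real workflow manager uses more refined criteria to decide what to rerun.

```python
import tempfile
from pathlib import Path


def make_rules(workdir):
    """Return a hypothetical rule table: output file -> (input files, action)."""
    raw = workdir / "raw.txt"
    clean = workdir / "clean.txt"
    report = workdir / "report.txt"
    return {
        clean: ([raw], lambda: clean.write_text(raw.read_text().strip().lower() + "\n")),
        report: ([clean], lambda: report.write_text(f"{len(clean.read_text().split())} words\n")),
    }


def build(target, rules, log):
    """Create target, building any inputs it depends on first."""
    if target not in rules:
        # A source file: it must already exist.
        if not target.exists():
            raise FileNotFoundError(target)
        return
    inputs, action = rules[target]
    for dep in inputs:
        build(dep, rules, log)
    if not target.exists():
        action()
        log.append(target.name)  # record which outputs were actually produced


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        (workdir / "raw.txt").write_text("  Reproducible Modeling Tools  \n")
        rules = make_rules(workdir)
        log = []
        build(workdir / "report.txt", rules, log)
        build(workdir / "report.txt", rules, log)  # second request: nothing left to do
        return log, (workdir / "report.txt").read_text()
```

Requesting only the final report is enough: the intermediate file is produced automatically, and asking again does no extra work because all outputs already exist.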

Information about Snakemake can be found here
