When coding software, running models, or creating extensive configurations, your code space, project folder, or configuration set will most likely contain a mix of text files, data files and software binaries. All these files need to be placed in a repository under version control and managed as a whole. For text files you will want to compare versions in order to understand what has changed over time. This is not the case for binary files or very large data files, as humans are generally not well equipped to compare bits and bytes.

In the past, SVN was an ideal place to store your whole repository. SVN is currently being phased out, and GITLAB has been introduced as its replacement. The advantages and disadvantages of both systems are not discussed here. Instead we focus on how to set up your GITLAB repository to include both your text-based files and your larger binaries.

The problem with GIT (and therefore also GITLAB) is that it is not designed to handle large and/or binary files. To overcome this problem, GIT Large File Storage (LFS) was introduced. The basic idea behind GIT LFS is that the actual binary file is not stored in your GITLAB repository; only a reference to the file is stored there. The actual binary file is stored in an object storage location (S3 bucket).

GITLAB offers LFS out of the box, using the MinIO server hosted by Deltares.

In short: you will have a GIT repository in the Deltares GITLAB instance that contains only your text-based files and small data files, while your large files and binary files are stored on the Deltares MinIO server.

To manage all your text, large and binary files as a single project, you have three options to connect your GITLAB repository to the Deltares MinIO object store: GIT-LFS, DVC, or custom scripts.


Prerequisite: The guides below assume that the user has a basic understanding of GIT and its related commands.

How to set up GIT-LFS?




Set up your GITLAB repository:

First you must set up your GITLAB repository by cloning it to your computer. If you do not yet have a GITLAB repository, you can request one via the 'Request a repository' page.

Once your GITLAB repository is in place and up-to-date, continue with the GIT-LFS instructions.

Set up GIT-LFS:

Go to the GIT-LFS website and follow the instructions on how to download and install GIT-LFS. A good starting point is the 'Getting Started' section on the home page.

GIT-LFS configuration files:

.gitattributes:    Stored in the root folder of your repository. This file contains patterns of all files that GIT-LFS should track and manage as 'large files'.

Code Block (.gitattributes)
*.bin filter=lfs diff=lfs merge=lfs -text

Useful examples for .gitattributes can be found here
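
To make the role of .gitattributes more concrete, the following minimal Python sketch extracts the patterns marked with filter=lfs and checks which file names they would cover. It is an illustration only: the function names and the *.nc pattern are hypothetical, and fnmatch merely approximates GIT's own pattern rules.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath


def lfs_patterns(gitattributes_text):
    """Return the patterns whose attributes include filter=lfs."""
    patterns = []
    for line in gitattributes_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        pattern, *attributes = line.split()
        if "filter=lfs" in attributes:
            patterns.append(pattern)
    return patterns


def is_lfs_tracked(filename, patterns):
    """Approximate check: match the base name against each pattern."""
    name = PurePosixPath(filename).name
    return any(fnmatch(name, pattern) for pattern in patterns)


# The *.nc line is a hypothetical second pattern, added for illustration.
example = (
    "*.bin filter=lfs diff=lfs merge=lfs -text\n"
    "*.nc filter=lfs diff=lfs merge=lfs -text\n"
)
patterns = lfs_patterns(example)
print(patterns)
print(is_lfs_tracked("results/model.bin", patterns))
print(is_lfs_tracked("README.txt", patterns))
```

For an authoritative answer on a real repository, GIT itself should be asked, since its matching rules are richer than this approximation.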


Info: GIT-LFS API Proxy
Contrary to GIT-LFS in combination with GITHUB, it is not necessary to install the GIT-LFS API Proxy.

In practice this works as follows:

  1. The user adds a file to the repository that is identified as a file that should be tracked (see .gitattributes).
  2. The file is committed in the local repository.
  3. Once the user pushes this file to GITLAB, GIT-LFS intercepts the call and replaces the targeted file with a reference file. This reference file has the same name as the targeted file, but its content has been replaced with reference information, as shown in the example below:

    Code Block
    version https://git-lfs.github.com/spec/v1
    oid sha256:3056ea3aa2461c9f149ff2c6b62ced81bf396e8bdbbdd0ddb6e8180350f7d715
    size 545791164


  4. The actual targeted file is pushed to your configured MinIO bucket with a name that matches the oid sha256 value.
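
The reference file from the steps above can be reproduced with a few lines of Python. This is a minimal sketch, assuming only the three fields shown in the example; the function name lfs_pointer is hypothetical and this is not the GIT-LFS implementation itself.

```python
import hashlib


def lfs_pointer(content: bytes) -> str:
    """Build reference-file text for the given file content."""
    oid = hashlib.sha256(content).hexdigest()  # also the object name in the store
    return (
        "version https://git-lfs.github.com/spec/v1\n"
        f"oid sha256:{oid}\n"
        f"size {len(content)}\n"
    )


pointer = lfs_pointer(b"example binary payload")
print(pointer)
```

Because the oid is derived from the content alone, two identical files produce the same reference and therefore the same object in the bucket.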



Info
Please note: when the targeted file is deleted from your repository, it is only removed from GITLAB. The actual files in MinIO remain untouched.


How to set up DVC?



Set up your GITLAB repository:

First you must set up your GITLAB repository by cloning it to your computer. If you do not yet have a GITLAB repository, you can request one via the 'Request a repository' page.

Once your GITLAB repository is in place and up-to-date, continue with the DVC instructions.

Set up DVC:

Go to the DVC website and follow the instructions on how to download and install DVC. A good starting point is the 'Get Started' page.

DVC configuration files:

.dvcignore:    Stored in the root folder of your DVC project. This file contains patterns of all files that DVC should ignore.

Code Block (.dvcignore)
# Add patterns of files dvc should ignore, which could improve
# the performance. Learn more at
# https://dvc.org/doc/user-guide/dvcignore

# Ignore secrets file
.dvc/config.local

.dvc/.gitignore:    Stored in the .dvc folder. This file is similar to the .dvcignore file, but it contains patterns of all files that GIT should ignore.

Code Block (.gitignore)
/config.local
/tmp
/cache

.dvc/config:    Stored in the .dvc folder. Contains all DVC configuration that can be shared and can be uploaded into your repository.

Code Block (config)
[core]
    remote = miniostorage
    autostage = true
['remote "miniostorage"']
    url = s3://<path to your bucket>
    endpointurl = https://s3.avi.deltares.nl
    ssl_verify = false

.dvc/config.local:    Stored in the .dvc folder. Contains all DVC configuration that must not be shared or uploaded into your repository.

Code Block (config.local)
['remote "miniostorage"']
    access_key_id = 
    secret_access_key = 
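
Instead of editing these files by hand, the same settings can be written with the dvc remote commands. The sketch below only builds the command lines, so that the split between shared settings (.dvc/config) and secrets (.dvc/config.local, via --local) is visible. The function name is hypothetical and the placeholder values must be replaced with your own.

```python
import shlex


def dvc_remote_commands(name, bucket_url, endpoint_url, access_key, secret_key):
    """Return dvc command lines: shared settings first, then local secrets."""
    shared = [  # written to .dvc/config, safe to commit
        ["dvc", "remote", "add", "-d", name, bucket_url],
        ["dvc", "remote", "modify", name, "endpointurl", endpoint_url],
    ]
    local = [  # written to .dvc/config.local, never committed
        ["dvc", "remote", "modify", "--local", name, "access_key_id", access_key],
        ["dvc", "remote", "modify", "--local", name, "secret_access_key", secret_key],
    ]
    return shared + local


commands = dvc_remote_commands(
    "miniostorage",
    "s3://<path to your bucket>",   # placeholder, as in the config above
    "https://s3.avi.deltares.nl",
    "my_access_key",                # placeholder
    "my_secret_key",                # placeholder
)
for command in commands:
    print(shlex.join(command))
```

To execute the commands rather than print them, each list could be passed to subprocess.run(command, check=True) from the root folder of the DVC project.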



How to set up custom scripts?



How you set up your scripting environment will strongly depend on the coding language of the source code in your GITLAB repository. In all cases you can take advantage of the REWIND functionality of MinIO, which allows you to restore your data folder or files to a given point in time.

For Python an example can be found here: https://github.com/robin-deltares/minio-py-rewind/blob/main/minio_rewind.py

Code Block (Rewind example, Python)
from minio import Minio
import minio_rewind

# For access
myMinioServer = 'my.minio.server'
myAccessKey   = 'my_access_key'
mySecretKey   = 'my_secret_key'

# The path that will be recursively downloaded
myBucketName = 'my_bucket_name'
myPathName   = 'my_path_name'
myRewind     = '2023.05.10T16:00' # Notation that mc uses

# Minio client connection
myClient = Minio(myMinioServer,
                 access_key=myAccessKey,
                 secret_key=mySecretKey)

# Prepare the rewind-settings
rewinder = minio_rewind.Rewinder(myClient,myRewind)

# Download the objects
rewinder.download(myBucketName,myPathName)
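
The idea behind a rewind can be illustrated without a server. The sketch below assumes that the bucket keeps multiple versions of an object and picks the version that was current at the requested moment. The ObjectVersion type and the version_at function are hypothetical and are not taken from the minio_rewind module linked above.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class ObjectVersion(NamedTuple):
    version_id: str
    last_modified: datetime
    is_delete_marker: bool = False


def version_at(versions, rewind: str) -> Optional[ObjectVersion]:
    """Return the version that was current at the rewind moment, if any."""
    moment = datetime.strptime(rewind, "%Y.%m.%dT%H:%M")  # same notation as above
    candidates = [v for v in versions if v.last_modified <= moment]
    if not candidates:
        return None  # the object did not exist yet
    latest = max(candidates, key=lambda v: v.last_modified)
    # A delete marker means the object was deleted at that moment.
    return None if latest.is_delete_marker else latest


# Hypothetical version history of a single object
history = [
    ObjectVersion("v1", datetime(2023, 5, 1, 9, 0)),
    ObjectVersion("v2", datetime(2023, 5, 10, 15, 30)),
    ObjectVersion("v3", datetime(2023, 5, 12, 8, 0)),
]
selected = version_at(history, "2023.05.10T16:00")
print(selected.version_id)
```

With the history above, the rewind to 2023.05.10T16:00 selects v2, because v3 was written later.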



Choosing between the above solutions



GIT-LFS

git-lfs is intended to be transparent to git, and therefore requires a customized server. Its learning curve is short: after a few configuration commands it is running, storing large files independently of the git repository. That is its only function, and it does it well. Having an additional server is not a drawback but a requirement for such transparency. Once configured, files are simply handled by git, by means of git hooks (scripts that are triggered by git operations).

Limitations of GIT-LFS are documented here.

DVC

dvc is intended to give the end user independent management of large files. In essence, dvc makes git ignore the files that you wish to control (by adding them to .gitignore) and generates an additional file with the same name and the extension .dvc. To push a commit with its corresponding files, the user must therefore manually "add" (equivalent to git commit, not to git add; dvc has no equivalent of the git staging area) and "push" to both systems. This is not a drawback, but a necessary level of control. In exchange, the remote store for large files can be any remote filesystem, accessible directly by its path, via ssh, or via one of multiple drivers (google drive, amazon, etc.).

Hooks are also available for dvc and can simplify working with large files, provided the additional .dvc files do not bother you. Keep in mind that saving files to the remote still requires additional operations, because the files are .gitignored. If you modify a file stored in dvc, git status will not notice the change, and you might lose it unless you make the additional check with dvc.
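
The "add and push to both systems" sequence can be summarised as a list of commands. The following sketch only builds that list; the function name and the example path are hypothetical, and it assumes that dvc writes the .dvc file and the .gitignore entry next to the data file.

```python
from pathlib import PurePosixPath


def dvc_commit_commands(data_file, message):
    """Return the commands that record one large file in both dvc and git."""
    path = PurePosixPath(data_file)
    dvc_file = f"{path}.dvc"                       # small file tracked by git
    gitignore = str(path.parent / ".gitignore")    # updated by dvc add
    return [
        ["dvc", "add", str(path)],
        ["git", "add", dvc_file, gitignore],
        ["git", "commit", "-m", message],
        ["dvc", "push"],   # large file goes to the dvc remote
        ["git", "push"],   # .dvc file goes to the git remote
    ]


steps = dvc_commit_commands("data/model.bin", "Update model")
for step in steps:
    print(" ".join(step))
```

Running dvc status before committing is the additional check that reveals modified data files which git status cannot see.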

Some comparisons with related technologies can be found here.

Custom scripts

Scripting is the most flexible way to go. It allows you to access all of MinIO's API functionality in the language of your preference. To help you get started, MinIO offers a variety of SDKs. Scripting does imply that you as developer have enough coding skills and are also the maintainer. However, with enough real-world examples this option should not be too difficult.

Note to developers: Please provide your examples to github-support@deltares.nl so we can incorporate them into this manual.