Contents

This tutorial is under construction!


Acknowledgements


The OpenEarth infrastructure supports a bottom-up approach to long-term, project-transcending collaboration, adhering to four basic criteria:

  • open standards & open source - The infrastructure should be based on open standards, should not require non-libre software, and should be transferable
  • complete transparency - The collected data should be reproducible, unambiguous and self-descriptive. Tools and models should be open source, well documented and tested
  • centralized access - The collection and dissemination procedures for data, models and tools should be web-based and centralized, to promote collaboration and impede divergence
  • clear ownership and responsibility - Although data, models and tools are collected and disseminated via the OpenEarth infrastructure, the responsibility for the quality of each product remains with its original owner.

To make the OpenEarth infrastructure work, the following ICT components are required in practice (see figure):

  1. a linux server (either virtual or real),
  2. a subversion server to enable version control, backup and controlled access,
  3. a tomcat6 servlet container running an OPeNDAP service to serve netCDF files,
  4. an apache2 webserver to provide data visualisation files like *.kml for visualisation in Google Earth, and
  5. the OpenSSH protocol to facilitate remote access to the server (for maintenance purposes).
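On a Debian-style Linux server, the five components above roughly map onto the following packages. This is an illustrative provisioning sketch only, not part of the OpenEarth documentation: the exact package names (e.g. tomcat6) depend on your distribution and release, and the OPeNDAP service itself (e.g. THREDDS or Hyrax) is deployed into Tomcat separately.

```shell
# Hypothetical provisioning sketch; package names assume a
# Debian/Ubuntu-era system and may differ on yours.
sudo apt-get update
# 2. Subversion server for version control, backup and controlled access
sudo apt-get install -y subversion
# 3. Tomcat servlet container to host an OPeNDAP service (e.g. THREDDS)
sudo apt-get install -y tomcat6
# 4. Apache webserver to serve *.kml and other visualisation files
sudo apt-get install -y apache2
# 5. OpenSSH for remote maintenance access
sudo apt-get install -y openssh-server
```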

A tutorial to set up this infrastructure can be found here. This tutorial focuses on the actual use of the Subversion repositories once they have been set up.

Structuring your raw data repository

  • use a convenient structure (see also Raw data repository structuring)
    • OpenEarth repositories: structured around institutions
    • Internal repositories: structured around projects
  • try to store your raw data in zipped form as much as possible
    • this will save space
    • this is a convenient way to preserve the save dates of your raw data
    • the transfer speed from the svn repository to your local checkout depends heavily on the average file size. Large numbers of very small files give extremely low transfer rates. Zipping many small files (order of 0-10 KB) into several bigger files (order of 1-100 MB) can improve the transfer rate by a factor of 1000.
    • It is easy to compare data stored as ASCII text with a diff viewer; if you zip the data, you lose this option. However, raw data is usually stored once and never changed afterwards, so this is rarely a problem.
  • work with regenerate scripts as much as possible
    • easier to maintain
    • whole dataset can be re-processed much easier
    • easier to learn from others
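The zipping and regenerate-script advice above can be sketched together in one small shell script. Everything here is illustrative: the file names, the rawdata/ layout and the trivial processing step are assumptions for the example, not an OpenEarth convention, and the svn commands are shown as comments.

```shell
#!/bin/sh
# Illustrative sketch: bundle many small raw-data files into one larger
# archive (for faster svn transfers) and regenerate the processed
# dataset from that archive. All paths and names are hypothetical.
set -e

# 1. Simulate a directory with many small raw files (order 0-10 KB each).
mkdir -p rawdata
for i in 1 2 3 4 5; do
  echo "name,value" >  "rawdata/obs_$i.csv"
  echo "st$i,42"    >> "rawdata/obs_$i.csv"
done

# 2. Zip the small files into one larger archive before adding it to the
#    repository; this preserves the files' dates and greatly improves
#    svn transfer rates.
tar -czf rawdata_bundle.tar.gz rawdata
# svn add rawdata_bundle.tar.gz
# svn commit -m "add zipped raw data"

# 3. Regenerate script: unpack and re-process the whole dataset in one
#    go, so the processed data stays reproducible for others.
mkdir -p processed
tar -xzf rawdata_bundle.tar.gz
grep -h '^st' rawdata/obs_*.csv > processed/all_observations.csv
echo "regenerated processed/all_observations.csv from rawdata_bundle.tar.gz"
```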

Structuring your models repository

Structuring your tools repository

