Proper structuring of a raw data repository concerns two items:

  • The directory and file structure per dataset (universal across all openearth data repositories)
  • The global directory structure of all datasets (different flavors possible depending on application)

 

Language: English

Using English in directory and file names, as well as in log messages, is strongly recommended. This keeps international sharing of the data easy should it become relevant at a later stage, even if it is not foreseen now.

Directory naming

Do not use capitals or spaces in directory names; use lowercase letters, with underscores in place of spaces.
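As an illustration of the naming rule above, here is a minimal Python sketch. The helper names to_directory_name and is_valid_directory_name are hypothetical and not part of any OpenEarth tooling; the sketch only covers the two points named in the rule (capitals and spaces).

```python
import re


def to_directory_name(label: str) -> str:
    """Lowercase a label and replace each run of whitespace with one underscore."""
    return re.sub(r"\s+", "_", label.strip().lower())


def is_valid_directory_name(name: str) -> bool:
    """Return True if the name contains no capitals and no whitespace."""
    return name == name.lower() and re.search(r"\s", name) is None


if __name__ == "__main__":
    print(to_directory_name("Dune Development"))
    print(is_valid_directory_name("monitoring_wells"))
```

A check like this could be run over a working copy before committing, so that non-conforming directory names are caught early.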

Dataset structure

The main structure of a dataset is outlined below. The doc/ directory as such is optional, but providing documentation is strongly recommended.

<dataset-name>/
    raw/
    scripts/
    doc/

The sub-directory structure and file naming are yours to decide, but please note the tips below. They help both you and others to use the data conveniently.
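The dataset layout above can be set up by hand, but a small script avoids typos. The following is a sketch only: the function name create_dataset_skeleton and the with_doc switch are assumptions made for this example, reflecting that doc/ is optional.

```python
from pathlib import Path


def create_dataset_skeleton(root: Path, dataset_name: str, with_doc: bool = True) -> Path:
    """Create <dataset-name>/ with raw/, scripts/ and (optionally) doc/ below root."""
    dataset_dir = root / dataset_name
    subdirs = ["raw", "scripts"] + (["doc"] if with_doc else [])
    for sub in subdirs:
        # exist_ok=True makes the call safe to repeat on an existing dataset
        (dataset_dir / sub).mkdir(parents=True, exist_ok=True)
    return dataset_dir


if __name__ == "__main__":
    # "bathymetry" is only an example dataset name
    print(create_dataset_skeleton(Path("."), "bathymetry"))
```

Note that Subversion does not version empty directories implicitly; each new directory still has to be added to the repository explicitly.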

Dataset naming

Use dataset names that are informative to most users, also in the longer term. A type of data (e.g. bathymetry) or a type of instrument (e.g. adcp) is usually clearer than the name of a measurement campaign (e.g. megapex). The name of a measurement campaign is mainly clear to the small group of people who were involved; in 5 or 10 years' time, potential users of the data might not be familiar with it. Instead, put the name of the measurement campaign in the keywords and summary of the data product (netCDF).

Consistency

Be consistent in directory structure, directory/file naming and file formats. This makes life easier when it comes to scripting and finding your way in the data.

Date format

Use numeric date formats like yyyy-mm-dd (2016-01-12) to facilitate chronological ordering and machine readability.
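A short Python sketch shows why this format helps: yyyy-mm-dd strings sort chronologically with a plain text sort and can be parsed without a custom format string. The file names below are invented for the example.

```python
from datetime import date


def iso_date(d: date) -> str:
    """Format a date as yyyy-mm-dd (ISO 8601 calendar date)."""
    return d.isoformat()


# Hypothetical file names that start with a yyyy-mm-dd date
names = ["2016-01-12_survey", "2015-11-03_survey", "2016-01-02_survey"]

# A plain alphabetical sort is also a chronological sort
ordered = sorted(names)

# The date prefix (first 10 characters) parses directly
parsed = [date.fromisoformat(name[:10]) for name in ordered]

if __name__ == "__main__":
    print(iso_date(date(2016, 1, 12)))
    print(ordered)
```

With formats such as dd-mm-yyyy or month names, the same alphabetical sort would not reproduce the chronological order.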

File names

Keep the file names as provided by the instrument, logger or data supplier as much as possible rather than renaming them, even if they do not comply with the standards mentioned above. This keeps the reference to the original direct and prevents mistakes during manual renaming.

Global structure

For the global structure of a repository that stores a collection of datasets, several flavors are possible, depending on its scope.

The openearthrawdata repository uses the following structure, in order to clearly acknowledge the institutes that bring in the data:

openearthrawdata repository structure
trunk/
	<institute-name-1>/
		<datacategory-name-1>/
			<dataset-name-1>/
			<dataset-name-2>/
		<dataset-name-3>/
		<dataset-name-4>/
	<institute-name-2>/
		<dataset-name-5>/
		<dataset-name-6>/
	...

 

For project-specific repositories, a good alternative is to start with a layer of measurement domains. The zandmotordata.nl repository, for example, is structured as follows:

Project specific repository structure
trunk/
	aeolian/
	dune_development/
	ecology/
		benthos/
		birds/
		ecotopes/
		fishes/
		seals/
	hydrology/
		monitoring_wells/
	meteohydro/
	morphology/
	naturedunes/

 

Contents

Binary files

Only store binary files in a Subversion repository if that is necessary and if they will (almost) never be modified. Be aware that keeping multiple versions of a binary file in the repository history uses storage capacity inefficiently.
