Proper structuring of a raw data repository concerns two items:
- The directory and file structure per dataset (universal across all openearth data repositories)
- The global directory structure of all datasets (different flavors possible depending on application)
Language: English
Using English in directory and file names, as well as in log messages, is strongly recommended. This keeps international sharing of the data easy if it becomes relevant at a later stage, even if it is not foreseen now.
Directory naming
Do not use capitals or spaces in directory names; use lowercase letters and underscores instead.
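The naming rule above can be checked or applied automatically. The sketch below is only an illustration: the function names `normalize_directory_name` and `is_valid_directory_name` are hypothetical helpers and not part of any OpenEarth tooling. It assumes that "spaces" covers any whitespace.

```python
import re


def normalize_directory_name(name: str) -> str:
    """Return `name` in lowercase with each run of whitespace replaced by one underscore."""
    return re.sub(r"\s+", "_", name.strip()).lower()


def is_valid_directory_name(name: str) -> bool:
    """Return True if `name` contains no uppercase letters and no whitespace."""
    return not any(ch.isupper() or ch.isspace() for ch in name)


if __name__ == "__main__":
    # Hypothetical directory names, used for illustration only.
    for candidate in ["Dune Development", "monitoring_wells"]:
        print(candidate, "->", normalize_directory_name(candidate))
```

Such a check could be run before committing new directories, so that naming problems are caught before they end up in the repository history.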
Dataset structure
The main structure of a dataset is outlined below. The doc/ directory as such is optional, but providing documentation is strongly recommended.
<dataset-name>/
    raw/
    scripts/
    doc/
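As a minimal sketch, the dataset layout above could be created with a few lines of Python. The function name `create_dataset_skeleton` and its parameters are assumptions made for this example; only the directory names raw/, scripts/ and doc/ come from the structure described here.

```python
from pathlib import Path


def create_dataset_skeleton(root, dataset_name, with_doc=True):
    """Create <dataset_name>/ with raw/, scripts/ and (optionally) doc/ under `root`."""
    dataset_dir = Path(root) / dataset_name
    subdirs = ["raw", "scripts"]
    if with_doc:
        # doc/ is optional in the structure, but providing documentation is recommended.
        subdirs.append("doc")
    for sub in subdirs:
        # exist_ok=True makes the call safe to repeat on an existing dataset.
        (dataset_dir / sub).mkdir(parents=True, exist_ok=True)
    return dataset_dir


if __name__ == "__main__":
    # "bathymetry" is a hypothetical dataset name, created in the current directory.
    print(create_dataset_skeleton(".", "bathymetry"))
```

Note that Subversion does not track a directory until it has been added to version control, so the created directories still need to be added and committed.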
The sub-directory structure and file naming are yours to decide, but please note the tips below. They help both you and others to use the data conveniently.
Dataset naming
Use dataset names that are informative to most users, also in the longer term. A type of data (e.g. bathymetry) or a type of instrument (e.g. adcp) is usually clearer than the name of a measurement campaign (e.g. megapex). The name of a measurement campaign is mainly clear to the small group of people who were involved; in 5 or 10 years' time, potential users of the data might not be familiar with it. Instead, put the name of the measurement campaign in the keywords and summary of the data product (netCDF).
Consistency
Be consistent in directory structure, directory/file naming and file formats. This makes life easier when it comes to scripting and finding your way in the data.
Date format
Use numeric date formats like yyyy-mm-dd (2016-01-12) to facilitate chronological ordering and machine readability.
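The ordering benefit can be shown with a short example. The dates below are made up for illustration. Because yyyy-mm-dd puts the most significant field first, plain alphabetical sorting is also chronological sorting, which is not the case for a format such as dd-mm-yyyy.

```python
from datetime import date

# Hypothetical measurement dates, deliberately not in chronological order.
dates = [date(2016, 1, 12), date(2015, 12, 3), date(2016, 1, 2)]

# date.isoformat() returns the yyyy-mm-dd form recommended above.
iso_names = [d.isoformat() for d in dates]
# The same dates written as dd-mm-yyyy, for comparison.
dmy_names = [d.strftime("%d-%m-%Y") for d in dates]

# Alphabetical sort of ISO strings matches chronological order.
iso_sorted = sorted(iso_names)
chronological = [d.isoformat() for d in sorted(dates)]

# Alphabetical sort of dd-mm-yyyy strings orders by day of month first.
dmy_sorted = sorted(dmy_names)

if __name__ == "__main__":
    print(iso_sorted)
    print(dmy_sorted)
```

The same string form can be parsed back with `date.fromisoformat`, which keeps the names machine readable as well.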
File names
As much as possible, keep the file names as provided by the instrument, logger or data supplier rather than renaming them, even if they do not comply with the standards mentioned above. This keeps a direct reference to the original and prevents mistakes during manual renaming.
Global structure
For the global structure of a repository that stores a collection of datasets, some flavors are possible, depending on its scope.
The openearthrawdata repository uses the following structure, in order to clearly acknowledge the institutes that bring in the data:
trunk/
    <institute-name-1>/
        <datacategory-name-1>/
            <dataset-name-1>/
            <dataset-name-2>/
            <dataset-name-3>/
            <dataset-name-4>/
    <institute-name-2>/
        <dataset-name-5>/
        <dataset-name-6>/
    ...
For project-specific repositories, a good alternative is to start with a layer of measurement domains. The structure of the zandmotordata.nl repository looks like this:
trunk/
    aeolian/
    dune_development/
    ecology/
        benthos/
        birds/
        ecotopes/
        fishes/
        seals/
    hydrology/
        monitoring_wells/
    meteohydro/
    morphology/
    naturedunes/
Contents
Binary files
Store files in a binary format in a Subversion repository only if that is necessary and if they will (almost) never be modified. Be aware that keeping multiple versions of a binary file in the repository history uses storage capacity inefficiently.