Data Collection Protocol

Acknowledgements

Introduction

The memo presented on this wiki page serves as the primary deliverable of WP2 Data standards of the EU FP7 Project MICORE. The goal of this memo is to define a data standard and archiving protocol to provide end-users with a comprehensive standardised database. The MICORE data standards protocol is adopted by the OpenEarth initiative.

Premises

To set up the data collection procedure for Micore, the following premises are taken into account.

  • The collected data should be reproducible, unambiguous and self-descriptive
  • The data collection procedure will be transferable
  • The data collection procedure will be automated
  • The data collection procedure will be based on open standards
  • The data collection procedure will not require non-libre software.

Data collection in Micore

Several types of data will be collected in the Micore project. Most of the data collected will be geospatial or geospatial-temporal in nature. The data collection procedure mainly focuses on these two types of data. If required, this data standard will be expanded to fit different types of data.

Procedure

The data collection procedure consists of four phases.

  1. Extract. Taking measurements and storing the measured data into files.
  2. Transform. Enriching gathered data with metadata and storing in a standard file format.
  3. Load. Storing the files in a database.
  4. Provide. Giving access to the database.

Architecture

To support the data collection procedure with software tools, the following architecture is used.

Extraction

Gathering measurements

The data collection methods are described in the Site Monitoring work package.

Required information

To make our data collection reproducible, unambiguous and self-descriptive, we have to determine which information is required for each of the following aspects. The geospatial, time and physics aspects of the data are important, as most data collected in this project is related to these aspects.

General

General information on a data set will be filled in using a metadata form (see below). The resulting metadata file contains the following information:

  • title
  • description
  • contact information
  • resource identifier (unique location where data is stored)
  • other aspects such as date, lineage, language, authorization, copyright, etc.

This general information will be collected following the Inspire directive.

Metadata form

To make data understandable for others and to conform to the European Inspire guideline, extra metadata should be provided. The Inspire guideline follows the ISO19115 metadata standard. The metadata is provided by filling in the forms in the Inspire metadata editor. The resulting file should be saved next to the data set.

The inspire metadata editor can be used to describe datasets.

The Inspire metadata editor

The following fields of the metadata editor can be a bit confusing:

  • the resource identifier. We use the location in the OpenEarthRawData repository for this.
  • the resource name. This will be micore followed by a colon followed by the resource identifier.
  • the lineage means the origin of your data. Inspire gives as an example "Dataset has been digitised from the standard 1:5.000 map".

The output of the editor is an xml file which is not intended for human consumption but can be viewed or edited by for example jedit with the xml plugin.

Geospatial

Referencing information on the earth requires the definition of a location. A location is always expressed in a coordinate system. Several types of coordinate systems exist. The most relevant are:

  • Geographical coordinate system. This defines the size and shape of the earth (for example World Geodetic System 1984), the origin (usually the equator and Greenwich) and the units (degrees).
  • Projected coordinate system. This defines a translation from the original geographic coordinate system into another (usually x,y cartesian) coordinate space. For example WGS 84/UTM zone 31N is defined as the transverse mercator projection of the WGS84 spheroid, expressed in meters.
  • Engineering coordinate system. This is a local system often related to a local object. For example in a physical experiment or on a boat.
  • Vertical coordinate system. Several reference levels can be used for vertical coordinate systems. Reference levels can be the spheroid, the geoid, the mean sea level or for example the mean low water level.

Example of vertical coordinate system

For the geospatial information the standards of the Open Geospatial Consortium are used.

Time

If we refer to a certain date or time there can be confusion about many aspects, possibly resulting in misinterpretation of data:

  • Calendar: Gregorian or other
  • Leap years: every 4th year is a leap year, except century years, which are leap years only when divisible by 400.
  • Time zones
  • Daylight saving time

For time information the Climate and Forecasting convention will be used.
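
The Gregorian leap-year rule from the list above can be written out explicitly. The sketch below is only an illustration: the function name `is_gregorian_leap_year` is made up for this example, and Python's standard `calendar` module is used as a cross-check.

```python
import calendar


def is_gregorian_leap_year(year: int) -> bool:
    """Return True if `year` is a leap year in the Gregorian calendar."""
    # Divisible by 4, except century years, unless divisible by 400.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


if __name__ == "__main__":
    for year in (1900, 2000, 2008, 2009):
        # The hand-written rule should agree with the standard library.
        assert is_gregorian_leap_year(year) == calendar.isleap(year)
        print(year, is_gregorian_leap_year(year))
```

In practice it is safer to rely on a tested date library than on hand-written calendar arithmetic; the point here is only that the rule has to be agreed on before dates can be compared.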

Physics

The following information needs to be stored with the measurements:

  • Units of measurement
  • Measurement method
  • Physical phenomena

For physics information the Climate and Forecasting convention will be used.

Data intake

The data extraction covers the procedure from taking measurements to making this information available in digital form. The information collected should be made accessible to others by using central storage or external public sources.

Central storage of raw data

A version control system (Subversion) will be made available to store raw data. This Subversion repository will be made available through the URL https://repos.deltares.nl/repos/OpenEarthRawData/trunk and will provide access to raw measurements. On Windows computers this server can be accessed with the TortoiseSVN client, through the add network wizard (WebDAV) or through a web browser.

Layout of the repository

\_organization (default micore)
   \_your_dataset_title
     \_inspire_description.xml
     \_raw
       \_data file1
       \_data file2
     \_cache (empty)
     \_scripts
     \_processed
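
The folder layout above could be created for a new dataset with a few lines of script. This is a minimal sketch, assuming a local working copy of the repository; the function name `create_dataset_skeleton` is hypothetical, and the `inspire_description.xml` file is not created here because it comes from the metadata editor.

```python
from pathlib import Path

# Sub-folders taken from the repository layout described in this memo.
SUBFOLDERS = ("raw", "cache", "scripts", "processed")


def create_dataset_skeleton(root: Path, dataset_title: str,
                            organization: str = "micore") -> Path:
    """Create <root>/<organization>/<dataset_title>/ with the standard sub-folders."""
    dataset_dir = root / organization / dataset_title
    for name in SUBFOLDERS:
        # exist_ok makes the script safe to run more than once.
        (dataset_dir / name).mkdir(parents=True, exist_ok=True)
    return dataset_dir


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        created = create_dataset_skeleton(Path(tmp), "your_dataset_title")
        print(sorted(p.name for p in created.iterdir()))
```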

External sources

If data is already stored on a publicly available website there is no requirement to store this data again. Only a reference to the dataset is required if the information is compatible with the open standards used in this project (WFS, WCS, opendap). If the dataset uses another protocol or plain files, a script that caches the data locally before transformation should be provided.
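
A local caching script can be very small. The sketch below assumes the external source is a plain file reachable by URL and that the copy should land in the dataset's cache folder; the function name `cache_locally` is hypothetical and only the Python standard library is used.

```python
import shutil
import urllib.request
from pathlib import Path


def cache_locally(url: str, cache_dir: Path) -> Path:
    """Download `url` into `cache_dir` unless a cached copy already exists."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Use the last path component of the URL as the local file name.
    target = cache_dir / url.rstrip("/").rsplit("/", 1)[-1]
    if not target.exists():
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target
```

A call such as `cache_locally("http://example.org/data/survey.csv", Path("cache"))` (a placeholder URL) would fetch the file once and reuse the cached copy on later runs, which keeps the script free of user interaction.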

Step by step

For local data sets

For public data sets

Transformation

The main goal of the transformation step is to transform the raw data (or external source) into one consistent and unambiguous collection of datasets. This is required to make the whole data collection easily accessible. This requires adding extra information to the collected data and the use of existing standards and conventions. Most of these tasks can be accomplished using scripts and command line utilities.
Examples of transformations are:

  • File format conversions
  • Reprojection
  • Statistical aggregation

The script used to transform the data will be stored in https://repos.deltares.nl/repos/OpenEarthRawData/trunk/micore/your_dataset_title/scripts. Scripts should run without any user interaction.

File formats

Supported File formats

To ease the task of making the data available we limit the number of supported file formats to the following two:

  • NetCDF/CF for coverages
  • Shapefiles for features

File conversion

The data which is provided will often be in a format other than netcdf/cf or shapefiles. The raw data should therefore be transformed into the proper file formats. Many types of raster data can be transformed into netcdf using the gdal library or the gdal_translate command line utility. Most feature data can be translated into shapefiles using the ogr library or the ogr2ogr command line utility. The arcgis toolbox can also be used for these tasks.
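
A transformation script could wrap the two command line utilities as follows. This is a sketch under assumptions: the file names are placeholders, the helper names are hypothetical, and the GDAL utilities must be installed and on the path before `run` is called.

```python
import subprocess


def raster_to_netcdf_command(source: str, target: str) -> list[str]:
    """Command line that converts a raster file to netCDF with gdal_translate."""
    return ["gdal_translate", "-of", "netCDF", source, target]


def features_to_shapefile_command(source: str, target: str) -> list[str]:
    """Command line that converts feature data to a shapefile with ogr2ogr."""
    # ogr2ogr expects the destination before the source.
    return ["ogr2ogr", "-f", "ESRI Shapefile", target, source]


def run(command: list[str]) -> None:
    """Run a conversion command and fail loudly if it returns an error."""
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # File names are placeholders; the commands are only printed, not executed.
    print(" ".join(raster_to_netcdf_command("raw/depth.tif", "processed/depth.nc")))
    print(features_to_shapefile_command("raw/coast.gml", "processed/coast.shp"))
```

Building the command as a list avoids shell quoting problems with file names that contain spaces, such as the "ESRI Shapefile" driver name.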

The netcdf files generated should conform to the Climate and Forecasting (CF) convention where possible. You can check conformity on a netcdf checker website.

Geospatial

Collected data is often stored in a geographical or projected coordinate system. For common analysis the coordinate system to use is WGS84. If data is stored in a different coordinate system, that system can be described unambiguously with a "well known text" (WKT) reference. WKT is a text notation for describing coordinate systems; it is defined by the OGC SFA specification (chapter 9). You can look up the WKT of your spatial reference on the website http://www.spatialreference.org.

Spatial reference

Example of coordinate system description in well known text (WKT) format:

COMPD_CS["OSGB36 / British National Grid + ODN",
    PROJCS["OSGB 1936 / British National Grid",
        GEOGCS["OSGB 1936",
            DATUM["OSGB_1936",
                SPHEROID["Airy 1830",6377563.396,299.3249646,AUTHORITY["EPSG","7001"]],
                TOWGS84[375,-111,431,0,0,0,0],
                AUTHORITY["EPSG","6277"]],
            PRIMEM["Greenwich",0,AUTHORITY["EPSG","8901"]],
            UNIT["DMSH",0.0174532925199433,AUTHORITY["EPSG","9108"]],
            AXIS["Lat",NORTH],
            AXIS["Long",EAST],
            AUTHORITY["EPSG","4277"]],
        PROJECTION["Transverse_Mercator"],
        PARAMETER["latitude_of_origin",49],
        PARAMETER["central_meridian",-2],
        PARAMETER["scale_factor",0.999601272],
        PARAMETER["false_easting",400000],
        PARAMETER["false_northing",-100000],
        UNIT["metre",1,AUTHORITY["EPSG","9001"]],
        AXIS["E",EAST],
        AXIS["N",NORTH],
        AUTHORITY["EPSG","27700"]],
    VERT_CS["Newlyn",
        VERT_DATUM["Ordnance Datum Newlyn",2005,AUTHORITY["EPSG","5101"]],
        UNIT["metre",1,AUTHORITY["EPSG","9001"]],
        AXIS["Up",UP],
        AUTHORITY["EPSG","5701"]],
    AUTHORITY["EPSG","7405"]]

The NetCDF/CF convention specifies the use of coordinates. The well known text can be stored as a "spatial_ref" attribute on the variables in the netcdf/cf file that contain the coordinates.
Shapefiles should have a .prj file which contains the projection information.
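
A WKT string that is copied by hand can easily lose or gain a bracket, which makes it unreadable for software. The sketch below is a quick sanity check that could be run before the text is stored with a dataset; it only verifies bracket nesting, not the meaning of the WKT, and the function name `wkt_brackets_balanced` is hypothetical.

```python
def wkt_brackets_balanced(wkt: str) -> bool:
    """Return True if every '[' outside a quoted string has a matching ']'."""
    depth = 0
    in_quotes = False
    for char in wkt:
        if char == '"':
            # Brackets inside quoted names are not part of the structure.
            in_quotes = not in_quotes
        elif not in_quotes:
            if char == "[":
                depth += 1
            elif char == "]":
                depth -= 1
                if depth < 0:
                    return False
    return depth == 0 and not in_quotes


if __name__ == "__main__":
    good = 'UNIT["metre",1,AUTHORITY["EPSG","9001"]]'
    bad = 'UNIT["metre",1,AUTHORITY[["EPSG","9001"]]'
    print(wkt_brackets_balanced(good), wkt_brackets_balanced(bad))
```

A full validation would parse the WKT with a spatial library; this check only catches the most common copy and paste damage.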

Time

When using time or dates it should be clear which calendar, time zone, daylight saving time and am/pm notation are used. This is defined clearly in the NetCDF/CF convention for CF files. If time is referenced in shapefiles, local time is assumed. This does not allow comparison across time zones. Therefore, time series are best stored in NetCDF/CF files.
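
In CF files a time value is stored as a number together with a units string of the form "seconds since <reference time>". The sketch below converts a local timestamp to such a number; the choice of 1970-01-01 as reference time and the names `EPOCH`, `CF_TIME_UNITS` and `to_cf_seconds` are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Reference time of the example and the matching CF units string.
# Without an explicit zone, CF interprets the reference time as UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
CF_TIME_UNITS = "seconds since 1970-01-01 00:00:00"


def to_cf_seconds(moment: datetime) -> float:
    """Convert a timezone-aware datetime to seconds since the epoch in UTC."""
    if moment.tzinfo is None:
        raise ValueError("naive datetimes are ambiguous; attach a time zone first")
    return (moment - EPOCH).total_seconds()


if __name__ == "__main__":
    # 12:00 local time at UTC+2, for example Central European Summer Time.
    local = datetime(2009, 4, 3, 12, 0, tzinfo=timezone(timedelta(hours=2)))
    print(to_cf_seconds(local), CF_TIME_UNITS)
```

Rejecting timestamps without a time zone forces the ambiguity to be resolved at the moment the data is transformed, rather than by whoever reads the file later.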

Physics

To answer the questions of what was measured and how much, there needs to be a clear description of the units of measurement, the measurement method and the physical phenomena that are measured. The units can be described in SI units or SI derived units. The notation should be stored in a udunits compatible way in the netcdf files. The compatible units are listed on the unidata website. The shapefile format doesn't allow for units attributes, so this information should be part of the documentation stored with the dataset.

The physical phenomena which are described and their measurement method can be described in the netcdf/cf files. If a variable is covered by the standard names of the CF convention, that standard name should be used; if not, a compatible name should be chosen.
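
A transformation script could check that each variable carries this descriptive information before a file is written. The sketch below is an illustration only: treating `units` and `standard_name` as mandatory is an assumption of this example, and the variable and helper names are hypothetical.

```python
# Attributes this example treats as mandatory for every measured variable.
REQUIRED_ATTRIBUTES = ("units", "standard_name")


def missing_attributes(attributes: dict) -> list:
    """List the required descriptive attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not attributes.get(name)]


if __name__ == "__main__":
    # Hypothetical bed level variable; "altitude" is a CF standard name
    # and "m" is a udunits compatible unit string.
    bed_level = {
        "standard_name": "altitude",
        "units": "m",
        "long_name": "bed level above vertical datum",
    }
    print(missing_attributes(bed_level))
    print(missing_attributes({"long_name": "depth"}))
```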

Step by step

Loading

Periodically the transformation scripts will be run and the resulting netcdf and shape files will be stored in a database. The netcdf files will be loaded into the micore opendap system (multidimensional database). This system will be available on a public website. The shapefiles will be loaded into a postgis database (relational database). This postgis database will be made available to Micore project members.
The data layer can be enriched by adding a service layer (WCS, WFS) and a presentation layer (WMS) to make data easily presentable. This part will be a cooperation between work packages 2 and 6.

Providing

The goal of data providing is to make the data available to everybody in the project. This will be done by directly allowing access to all the layers of the data store.

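+-------------------+-----------------------+--------------------------+-----------------------+
|Layer              |Software               |Communication protocol    |How to access          |
+-------------------+-----------------------+--------------------------+-----------------------+
|File layer         |Coverages: Netcdf,     |subversion/webdav/http    |matlab/ogr/gdal/arcgis |
|                   |Features: shapefiles   |                          |                       |
+-------------------+-----------------------+--------------------------+-----------------------+
|Data layer         |Coverages: Hyrax,      |opendap, odbc, jdbc, psql |matlab/ogr/gdal/arcgis |
|                   |Features: Postgis      |                          |                       |
+-------------------+-----------------------+--------------------------+-----------------------+
|Service layer      |Thredds, UMN Mapserver |WCS, WFS                  |matlab/ogr/gdal/arcgis |
+-------------------+-----------------------+--------------------------+-----------------------+
|Presentation layer |UMN Mapserver, mapnik  |WMS                       |browser/google earth   |
+-------------------+-----------------------+--------------------------+-----------------------+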

File layer

The files will be made accessible through the subversion repository, which can be accessed through the subversion, http and webdav protocols.

Service layer

Run a web map server on top of the OpenDAP system and the spatial database.

GIS data types

The Open Geospatial Consortium (OGC) specifies services to expose the geospatial related information. This terminology will be used because much of the data collected in the Micore project has a geospatial aspect. The three main services are the Feature, Map and Coverage service.

Features

A feature is a geographic object, usually composed of a geometry (point, line, polygon) combined with properties. For example, a river can be seen as a complex set of lines with properties such as name, depth, length, water level, etc. The definition from the OGC:

The quantum of geographic information is the feature. A feature object (in software) corresponds to a real world or abstract entity. Attributes of (either contained in or associated to) this feature object describe measurable or describable phenomena about this entity. Unlike a data structure description, feature object instances derive their semantics and valid use or analysis from the corresponding real world entities' meaning.
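
In software a feature can be pictured as a geometry with a set of named properties attached. The sketch below is not an OGC data model; the class `Feature`, its field names and the river values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """A geometry plus the attributes that describe the real world entity."""
    geometry_type: str   # for example "Point", "LineString" or "Polygon"
    coordinates: list    # list of (x, y) tuples
    properties: dict = field(default_factory=dict)


if __name__ == "__main__":
    # Hypothetical river section; names and values are invented.
    river = Feature(
        geometry_type="LineString",
        coordinates=[(4.00, 51.90), (4.10, 51.95), (4.25, 51.97)],
        properties={"name": "example river", "depth_m": 6.5, "length_km": 18.0},
    )
    print(river.geometry_type, len(river.coordinates), river.properties["name"])
```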

Coverage

A coverage is also a feature, but instead of single-valued properties its properties are multidimensional functions. Dimensions are often spatial in nature but can also relate to time or frequency. Examples of coverages are bathymetry, wind fields and lidar measurements. The OGC definition:

A coverage is a feature that associates positions within a bounded space (its domain) to feature attribute values (its range). In other words, it is both a feature and a function. Examples include a raster image, a polygon overlay or a digital elevation matrix.
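
The "feature and function" idea can be made concrete with a small regular grid: the grid extent is the domain and the stored values are the range. This is a minimal sketch, assuming a square cell size and nearest cell lookup; the class `GridCoverage` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GridCoverage:
    """A regular grid that maps positions inside its domain to values."""
    x0: float          # x of the first column centre
    y0: float          # y of the first row centre
    cell_size: float   # grid spacing, same unit as x and y
    values: list       # values[row][col], rows ordered by increasing y

    def value_at(self, x: float, y: float) -> float:
        """Return the value of the nearest grid cell, or raise outside the domain."""
        col = round((x - self.x0) / self.cell_size)
        row = round((y - self.y0) / self.cell_size)
        if not (0 <= row < len(self.values) and 0 <= col < len(self.values[0])):
            raise ValueError("position lies outside the coverage domain")
        return self.values[row][col]


if __name__ == "__main__":
    # Invented 2 x 2 bathymetry grid with 10 m cells (depths in metres).
    bathymetry = GridCoverage(0.0, 0.0, 10.0, [[-1.0, -2.0], [-3.0, -4.0]])
    print(bathymetry.value_at(9.0, 1.0))
```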

Maps

While features and coverages relate to the data, they do not imply any representation. When you present your geospatial data you usually render it on a map or a globe. Rendered geospatial information is what is covered by the map services. The OGC definition:

This International Standard defines a "map" to be a portrayal of geographic information as a digital image file suitable for display on a computer screen.

Examples

Feature (geometry with attribute)
Coverage (multidimensional function with bounded space)
Map (rendered features)

Presentation layer

Use a rendering engine in combination with the WFS and WCS services to render images for the WMS services.
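
A client asks a WMS service for a rendered image with a GetMap request. The sketch below builds such a request URL with the standard WMS 1.1.1 parameters; the server address and layer name are placeholders, and the function name `wms_getmap_url` is hypothetical.

```python
from urllib.parse import urlencode


def wms_getmap_url(base_url: str, layer: str, bbox: tuple,
                   width: int = 600, height: int = 400) -> str:
    """Build a WMS 1.1.1 GetMap request URL for one layer in WGS84 (EPSG:4326)."""
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.1.1",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "SRS": "EPSG:4326",
        # minx, miny, maxx, maxy in the units of the SRS
        "BBOX": ",".join(str(value) for value in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder server and layer; no request is sent.
    print(wms_getmap_url("http://example.org/wms", "bathymetry", (3.0, 51.0, 5.0, 53.0)))
```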

Client layer

To present the data on websites several tools can be used. Usually web based mapping clients provide access to one or more of the services in the presentation and service layer.

Alternative Tools

This data collection and dissemination method can also be set up using ESRI software. The following table lists comparable software next to the open source software used in this project.

+-------------+--------------+--------------+
|             |Open          |ESRI          |
|             |Source        |              |
+-------------+--------------+--------------+
|Extract      |Subversion    |Esri GIS      |
|             |wget          |Toolbox       |
+-------------+--------------+--------------+
|Transform    |gdal/ogr,     |Esri GIS      |
|             |proj.4,       |Toolbox       |
|             |nco           |              |
+-------------+--------------+--------------+
|Load         |ogr2ogr,      |ArcGIS server |
|             |loaddap       |              |
|             |              |              |
+-------------+--------------+--------------+
|Provide      |opendap,      |Oracle        |
|data         |postgis       |spatial/MS    |
|layer        |              |SQL+ArcGIS    |
|             |              |server        |
+-------------+--------------+--------------+
|Provide      |UMN           |ArcGIS server |
|Service      |Mapserver     |              |
|Layer        |, Mapnik      |              |
|             |              |              |
+-------------+--------------+--------------+
|Provide      |Openlayers    |ArcWeb        |
|Client Layer |              |Services      |
|             |              |              |
+-------------+--------------+--------------+

Links

OpenGIS
Inspire directive
Climate and Forecasting Metadata Convention
Netcdf
Opendap

Attachments

Name Size Creator Creation Date Comment  
PNG File vertical_sigma.png 92 kB Mark van Koningsveld 03-04-2009 23:42    
PNG File vertical_z.png 73 kB Mark van Koningsveld 03-04-2009 23:42    
PNG File vertical_pressure.png 79 kB Mark van Koningsveld 03-04-2009 23:42    
PNG File vertical_hybrid.png 84 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File spatialreference.png 174 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File inspire.png 224 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File inspire_fields.png 138 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File map.png 136 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File geoid.png 27 kB Mark van Koningsveld 03-04-2009 23:41    
GIF File feature.gif 3 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File architecture.png 347 kB Mark van Koningsveld 03-04-2009 23:41    
File coverage.jpg 27 kB Mark van Koningsveld 03-04-2009 23:41    
PNG File data_workflow.png 241 kB Mark van Koningsveld 03-04-2009 23:41    