OPeNDAP aggregation with Matlab

subsetting and aggregation

OPeNDAP servers are basically a kind of remote file system that allows access via the web.
It allows for subsetting (slicing a salami sausage) and aggregation (making a salami sausage)
Files on a OPeNDAP server can be

subsetted: request only part of a file. In traditional download approaches files were considered indivisible granules nthat could be download as a whole, (ii) or be subsetted by dedicated processing on the server. OPeNDAP allows users to define the subsetting on the client. The data owner only has to provide his netCDF files via OPeNDAP for the whole concept of granularity to become irrelevant. Users can talk to every single number behind a remote OPeNDAP url, just as users were already able to talk to every single number inside a local netCDF file. OPeNDAP is basically web-enabled netCDF.

aggregated: sometimes datasets consist of lots os smaller files.Users often want to access these collections of files as if there were one single, virtual file. Some types can be aggregated by the THREDDS OPeNDAP server to behave as one giant virtual file. This option is available for the temporal chunks produced by forecasting systems, so called Forecast Model Run Collections (FMRC). Please refer to the Unidata FMRC documentation: http://www.unidata.ucar.edu/software/netcdf/ncml/v2.2/FmrcAggregation.html.
For other file collections this so-called server-side aggregation has not been implemented. The THREDDS OPeNDAP server offers xml catalogs http://www.unidata.ucar.edu/Projects/THREDDS/tech/catalog/v1.0.2/Primer.html that contain meta-data enriched list of files on an OPeNDAP server. With this meta-data, user can aggregate (parts of) remote file collections themselves. This is called client-side aggregation. THREDDS always offer the equivalent of a remote ls (linux) or dir (windows) that simply returns the names of the files, their file size and their time stamp.

In summary, the following methods are available for remote file aggregation:

server -side aggregation
- temporal chunks: FMRC
client-side aggregation
- spatial tiles: parse THREDDS catalog.xml
- time series at various locations: parse THREDDS catalog.xml

THREDDS catalogs (for subsequent aggregation)

For Matlab we have made the function opendap_catalog that parses a remote catalog.xml to find the remote files. On any THREDDS OPeNDAP machien you can find this xml by simply replacing the view-for-humans url that ends with catalog.html by catalog.xml.
For instance the RIjkswaterstaat vaklodingen dataset
http://opendap.deltares.nl/thredds/catalog/opendap/rijkswaterstaat/vaklodingen/catalog.html
becomes http://opendap.deltares.nl/thredds/catalog/opendap/rijkswaterstaat/vaklodingen/catalog.xml.

<catalog version="1.0.1"><service name="all" serviceType="Compound" base="">
<service name="odap" serviceType="OPENDAP" base="/thredds/dodsC/"/>
<service name="http" serviceType="HTTPServer" base="/thredds/fileServer/"/>
</service><dataset name="vaklodingen" ID="varopendap/rijkswaterstaat/vaklodingen">
<metadata inherited="true"><serviceName>all</serviceName><dataType>GRID</dataType>
</metadata>

<dataset name="vaklodingenKB140_1716.nc"
ID="varopendap/rijkswaterstaat/vaklodingen/vaklodingenKB140_1716.nc"
urlPath="opendap/rijkswaterstaat/vaklodingen/vaklodingenKB140_1716.nc">
<dataSize units="Mbytes">5.010</dataSize>
<date type="modified">2012-09-11 13:49:19Z</date></dataset>

<dataset name="vaklodingenKB139_2120.nc"
ID="varopendap/rijkswaterstaat/vaklodingen/vaklodingenKB139_2120.nc"
urlPath="opendap/rijkswaterstaat/vaklodingen/vaklodingenKB139_2120.nc">
<dataSize units="Mbytes">5.010</dataSize>
<date type="modified">2012-09-11 13:49:17Z</date></dataset>

...
<dataset name="vaklodingenKB109_4948.nc"
ID="varopendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4948.nc"
urlPath="opendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4948.nc">
<dataSize units="Mbytes">55.01</dataSize>
<date type="modified">2012-09-11 13:51:53Z</date></dataset>

<dataset name="vaklodingenKB109_4746.nc"
ID="varopendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4746.nc"
urlPath="opendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4746.nc">
<dataSize units="Mbytes">5.010</dataSize>
<date type="modified">2012-09-11 13:51:55Z</date></dataset>

<dataset name="catalog.nc"
ID="varopendap/rijkswaterstaat/vaklodingen/catalog.nc"
urlPath="opendap/rijkswaterstaat/vaklodingen/catalog.nc">
<dataSize units="Kbytes">206.8</dataSize>
<date type="modified">2012-09-11 16:46:52Z</date></dataset>

</dataset>
</catalog>

This xml contains a file list with relative local file names and a few services taht are offered, among which is OPeNDAP (next to possibly some WxS). Using the
(i) url root (http://opendap.deltares.nl/thredds/)
(ii) the opendap-service part from catalog.xml: base="/thredds/dodsC/"
(iii) the local reference from catalog.xml: ID="varopendap/rijkswaterstaat/vaklodingen/vaklodingenKB140_1716.nc",
the absolute url can be constructed
http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/vaklodingenKB140_1716.nc.

opendap_catalog.m is an openearthtoosl Matlab function taht does this for the entire catalog.xml, and optionally for linked cataliogs.ml, as well. It returns a cellstr with the file name urls that can readibly be used in the netCDF library. opendap_catalog can also be called on a local directory with netCDF files (for instance a local cache of part of an opendap server), and will a similar list.

example: spatial tiles. For offline use I have downloaded spatial files from the Rijkswaterstaat Vaklodingen offshore bathymetry tiles http://opendap.deltares.nl/thredds/catalog/opendap/rijkswaterstaat/vaklodingen/catalog.html into a local folder D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingen\catalog.html. I chose the local folder to have the same name as the remote url for reasons of simplicity in use (replace http with d) and traceability.

list = opendap_catalog('http://opendap.deltares.nl/thredds/catalog/opendap/rijkswaterstaat/vaklodingen/catalog.html')

% will return a cellstr

'http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/vaklodingenKB140_1716.nc'
'http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/vaklodingenKB139_2120.nc'
...
'http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4948.nc'
'http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/vaklodingenKB109_4746.nc'

list = opendap_catalog('D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingen\catalog.html')

% will return a cellstr

'D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingenvaklodingenKB140_1716.nc'
'D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingenvaklodingenKB139_2120.nc'
...
'D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingenvaklodingenKB109_4948.nc'
'D:\opendap.deltares.nl\thredds\catalog\opendap\rijkswaterstaat\vaklodingenvaklodingenKB109_4746.nc'

This means that with {{opendap_catalog}.m} you can work identically with local and remote
netCDF collections. The THREDDS calalog has web-enabled remote ls (linux) or dir (windows).

OPeNDAP: talk to one remote netCDF file as if it were local
THREDDS catalog: talk to remote netCDF file collections as if they were local.

OpenEarth THREDDS metadata harvesting and dissemination

For aggregation, we also need to know something of the nature of each file. Catalogs can be configured to contain this very meta-data, but often that kind of meta-data is not present. We have to get it ourselves in that case. Fortunately the remote netCDF files are not black-box granules, so we can interrogate them to extract what we want. We made the client_side harvesting function nc_cf_harvest.m (and also dapcrawler.py) to interrogate the remote netCDF-files-formerly-considered-granules to extract the essentuil what-where-when meta-data: spatial extent + resolution (what?), temporal extent + resolution (when?) and standard_names, long_names, variable names and units (what?). (Note with such harvesting we do not need all the INSPIRE/ISO199115/ISO19135 meta-data anymore that's everyone is so concerned with (for decades and it still doesn't work smoothly). We simply made our own search netCDF engine, perhaps Google will also search inside netCDF files some day, next to pdf, doc, xls, html that Google already interogates. OpenEarth actually promotes to store meta-data inside data files-formerly-known-as-granules rather then as ISO xml next to it. The ingredients and expiration date of your peanut butter also stored on the very jar you buy, and not in an xml file next to your jar...).

As OpenEarth approach, we store (cache) this harvested meta-data as {catalog.nc}, and store it between the very remote netCDF data files. The previous function opendap_catalog.m filters out any url ending with {catalog.nc}. You can now simply talk to this remote cached meta-data file (or oldfashionedly download it) to obtain all meta-data needed for aggregation. Meta-data in netCDF format is of course easier to handle for the people who can handle very netCDF data files already (no need to learn netCDF and xml). With the OpenEarth function nc2struct you can now grab all meta-data at once. The field {meta.urlPath} contains the url to be used for accessing the data.

meta = nc2struct('http://opendap.deltares.nl/thredds/dodsC/opendap/rijkswaterstaat/vaklodingen/catalog.nc')
meta =

                          urlPath: {179x1 cell}
             projectionCoverage_x: [179x2 double]
             projectionCoverage_y: [179x2 double]
    geospatialCoverage_northsouth: [179x2 double]
      geospatialCoverage_eastwest: [179x2 double]
                     timeCoverage: [179x2 double]
                          datenum: [179x2 double]
             Metadata_Conventions: {179x1 cell}
                    cdm_data_type: {179x1 cell}
                    creator_email: {179x1 cell}
                     creator_name: {179x1 cell}
                      creator_url: {179x1 cell}
               geospatial_lat_max: [179x1 double]
               geospatial_lat_min: [179x1 double]
             geospatial_lat_units: {179x1 cell}
               geospatial_lon_max: [179x1 double]
               geospatial_lon_min: [179x1 double]
             geospatial_lon_units: {179x1 cell}
          geospatial_vertical_max: [179x1 double]
          geospatial_vertical_min: [179x1 double]
     geospatial_vertical_positive: {179x1 cell}
        geospatial_vertical_units: {179x1 cell}
                          history: {179x1 cell}
                      institution: {179x1 cell}
                         keywords: {179x1 cell}
              keywords_vocabulary: {179x1 cell}
                          license: {179x1 cell}
                 naming_authority: {179x1 cell}
                 processing_level: {179x1 cell}
                  publisher_email: {179x1 cell}
                   publisher_name: {179x1 cell}
                    publisher_url: {179x1 cell}
         standard_name_vocabulary: {179x1 cell}
                          summary: {179x1 cell}
              time_coverage_units: {179x1 cell}
                            title: {179x1 cell}

you can now plot the spatial overview

lon = [meta.geospatial_lon_min meta.geospatial_lon_max meta.geospatial_lon_max  meta.geospatial_lon_min meta.geospatial_lon_min|meta.geospatial_lon_min meta.geospatial_lon_max meta.geospatial_lon_max  meta.geospatial_lon_min meta.geospatial_lon_min]
lat = [meta.geospatial_lat_min meta.geospatial_lat_min meta.geospatial_lat_max  meta.geospatial_lat_max meta.geospatial_lat_min|meta.geospatial_lat_min meta.geospatial_lat_min meta.geospatial_lat_max  meta.geospatial_lat_max meta.geospatial_lat_min]
KMLplot(lat',lon')

Space shortcuts

Child pages

subsetting and aggregation

THREDDS catalogs (for subsequent aggregation)

OpenEarth THREDDS metadata harvesting and dissemination