Attachments: 1207616-000-ZWS-0002-r-Database volume analysis method.pdf; 2015-04-20 Lunchlezing - Database Analysis.ppt

 

Introduction

Many systems expand over time. After the initial setup, new models are added to the system or new (gridded) data feeds are introduced. All these changes have an impact on the size of the databases within the FEWS system. Finally, the introduction of ensemble forecasting can further increase the size and complexity of the system.

While the increase in coverage and complexity of the forecasting service is desirable, the associated increase in database size should be limited where possible to ensure that system performance remains acceptable. Although large databases do not pose technical problems when running the FEWS software, they do result in a loss of performance, in particular when synchronising data from the central server to distributed clients.

There are many factors that influence the size of the operational FEWS databases. These include the amount and resolution (spatial and temporal) of scalar and gridded data imported, the amount of data resulting from data manipulation and model runs within the system and, importantly, the length of time data is kept in the database before it is purged by the rolling barrel process. In some cases data is kept that only forms an intermediate result and is not used for viewing or for producing a forecast product; such data could be marked as temporary to avoid being stored at all. There are several other methods for optimising the size of operational FEWS databases.

Read more in the attached document ...

Database inspection from within client

Please note that via the F12 menu in the Delft-FEWS client you can get detailed information on the database (menu "database").

 

Database sizes and Performance of the system

FEWS stores all timeseries in the database as blobs (binary large objects). The contents of the blobs cannot be read directly with SQL (to do so, use the JDBC server), but the rest of the table can be read directly. The example queries below focus on determining the size of certain groups of timeseries, as the timeseries table is usually by far the largest table in the database. This can help in optimising the configuration of the system by allowing you to focus on the largest chunks of data.

These examples have been made using a PostgreSQL database. When using them on another database some SQL statements may need to be modified to cater for small dialect differences.
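
As a quick first check on a PostgreSQL database, the standard PostgreSQL catalog views (not FEWS specific) can be used to list the largest tables; this normally confirms that the timeseries table dominates. A minimal sketch:

-- list user tables ordered by their total size (including indexes and TOAST data)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;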

Delft-FEWS uses a database to store and retrieve forecast data. Although a central database exists, each client (the Operator Client (OC) and the Forecasting Shell Server (FSS), the calculation node) uses its own local datastore, which is a synchronised mirror of the central database. Which data is synchronised between the clients and the central database is determined by synchronisation profiles and synchronisation levels. To achieve optimal performance it is vital to only synchronise data that really needs to be synchronised.

Tuning the data streams between the different components can be done on a global level (using synchronisation profiles, not discussed here) and on a per-timeseries level by assigning a synchronisation level to a specific timeseries. This document focuses on the amount of data sent by the FSS to the master controller (for later retrieval) and the amount of data that is needed to visualise results in the OC. Apart from synchronisation, Delft-FEWS can also use other techniques to minimise the amount of data: (1) compression of data by specifying a precision and (2) marking parts of the data as temporary. Below is an overview of the synchLevels currently used in Delft-FEWS:

SynchLevels - selected by convention

The latest documentation of synch levels can be found here

The general steps to follow for minimising the amount of data being transferred are:

  1. Determine what the bulk of the data in your system is. Is it forecast data or imported grids? (A sketch query for this step is given after this list.)
  2. Only synchronise data to the OC that is actually used in the OC (e.g. to make graphs). For example, ensemble grids are almost never interesting to view in the OC, so they should be marked with synchLevel 7.
  3. Make sure temporary data (intermediate steps in calculations) gets synchLevel 9 and a short expiry time (it is now also possible to use the temporary series type, which keeps the series in memory until the end of the workflow).
  4. Set the valueResolution parameter in Parameters.xml for the parameters that take up a lot of space. By sacrificing some accuracy you can achieve very high compression ratios for certain parameters. This can be very effective for large grids with a different value in each cell, such as temperature fields. Precipitation usually compresses very well without sacrificing accuracy, because a precipitation field usually contains many repetitive values.
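
A minimal sketch for step 1, assuming a PostgreSQL database and the timeseries table described further below; it shows how much data is stored per synchLevel:

-- total blob size and record count per synchronisation level
SELECT synchLevel, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
GROUP BY synchLevel
ORDER BY size_mb DESC;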

Calculating database size

The size of the database can be estimated using the following calculations:

Forecast data = FS * FF * RB
Historic/state run data = SS * SF * RB
Model state data = MS * SF * RB
Imported data = IS * RB

Where:
FS = Size of a single forecast
SS = Size of a single Historic run
IS = Size of one day of import data
FF = forecast frequency (per day)
SF = Staterun/Historic frequency (per day)
RB = Rolling barrel length in days
MS = Size of the model state for one Historical run

To this, several other (usually small) components need to be added, such as configuration data and log messages, but the items above make up the bulk of the data.
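
As a purely illustrative example (the numbers are hypothetical): with FS = 50 MB, FF = 4 forecasts per day, SS = 20 MB, SF = 1 state run per day, MS = 100 MB, IS = 200 MB per day and RB = 30 days, this gives:

Forecast data = 50 * 4 * 30 = 6000 MB
Historic/state run data = 20 * 1 * 30 = 600 MB
Model state data = 100 * 1 * 30 = 3000 MB
Imported data = 200 * 30 = 6000 MB

or roughly 15 GB in total for the bulk of the data.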

The fields in the timeseries table

The table below lists the fields in the timeseries table in the FEWS database:

Field - Description

localIntId - Only filled in the local datastore, not in the MC database. Used to determine when a blob was created.
creatorTaskRunId - PM
blobId - ID of this blob.
modifierId - PM
moduleInstanceId - ModuleInstance that created this series.
parameterId - Parameter id for this timeseries.
locationId - Location id (or ids). A blob can hold data for multiple locations for optimisation purposes.
beginTime - Date/time of the first value in the blob.
endTime - Date/time of the last value in the blob.
timeSeriesType - Type of the timeseries according to the following enumeration: 0 = external historical, 1 = external forecasting, 2 = simulated historical, 3 = simulated forecasting.
taskRunId - Holds the taskrun that created this series. This field is null in the case of external data.
ensembleMemberId - Ensemble member id. Default is main$0 for non-ensemble data.
blob - The actual blob.
blobSize - Size of the blob in bytes.
creationTime - Time the blob was created.
synchLevel - Level used to determine when and how to synchronise the data.
expiryTime - Expiry time of this record. After this date/time the record will be removed.
localAvailableTime - Time this record was first available in the system.
valueType - Type of the actual timeseries data: 0 = scalar, 1 = grid, 2 = longitudinal profile, 3 = polygon, 4 = sample.
timeStepId - Timestep of this timeseries.
externalForecastingStartTime - Forecast time in the case of an external forecasting timeseries.
constantFlag -
maxValue - Maximum value in the blob.

These fields can be used to query the database. Please note that it can be difficult to use the locationId, as the system optimises the database by storing data (for the same period) for several locations in one record. In this process the locationId field is changed to indicate a list of locations (in a run-length encoded format).

Example Queries

If you query a local database (Firebird) you cannot use the Length(TimeSeries.blob) function. Instead you can use the blobSize field to calculate the size, or the OCTET_LENGTH function: SELECT Sum(OCTET_LENGTH("blob"))/1024.0/1024.0

The query below sums all the blobs in the timeseries table to determine the total size in megabytes (MB):

SELECT Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

 


Same for MS SQL Server:

SELECT Sum(DataLength(TimeSeries.blob))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

 


When querying a Firebird database the OCTET_LENGTH function can be used to establish the size of the blob:

SELECT Sum(OCTET_LENGTH("blob"))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

Note the position of the brackets when calculating MB from bytes: the division should be done after summing, to avoid underestimating the size due to rounding before summation.
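
Because the locationId field may contain a run-length encoded list of locations (see the field table above), it is often more practical to group on other fields. A sketch (PostgreSQL syntax) that shows whether scalar or gridded data dominates:

-- total size per value type (0 = scalar, 1 = grid, 2 = longitudinal profile, 3 = polygon, 4 = sample)
SELECT valueType, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
GROUP BY valueType
ORDER BY size_mb DESC;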

 


The query below produces a table showing the size of all the import data in the database (assuming all the import module instances start with 'Import'):

SELECT moduleInstanceId, parameterId, timeSeriesType, synchLevel, Count(TimeSeries.blob),
Sum(Length(TimeSeries.blob))
FROM TimeSeries
WHERE moduleInstanceId LIKE 'Import%'
GROUP BY moduleInstanceId, timeSeriesType, synchLevel, parameterId;

Result:

To determine the size of a single forecast, one must find all the records belonging to a specific taskRunId:

SELECT TimeSeries.creatorTaskRunId, TimeSeries.parameterId, TimeSeries.moduleInstanceId,
TimeSeries.synchLevel,
Count(TimeSeries.synchLevel) AS NumBlobs, Sum(Length(TimeSeries.blob))
FROM TimeSeries
WHERE TimeSeries.creatorTaskRunId = 'EFASMC00:000024672'
GROUP BY TimeSeries.creatorTaskRunId, TimeSeries.moduleInstanceId, TimeSeries.parameterId,
TimeSeries.synchLevel;
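
If no representative taskRunId is known beforehand, a sketch query (PostgreSQL syntax) can first list the taskruns that produced the most data:

-- the ten taskruns with the largest total blob size
SELECT creatorTaskRunId, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
WHERE creatorTaskRunId IS NOT NULL
GROUP BY creatorTaskRunId
ORDER BY size_mb DESC
LIMIT 10;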

The amount of import data for one day can be determined using the following query:

SELECT
Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM TimeSeries
WHERE creationTime BETWEEN to_date('02-02-2008','MM-DD-YYYY') AND to_date('02-03-2008','MM-DD-YYYY')
AND moduleInstanceId LIKE 'Import%'

Alternatively, we can group this to see which parts of the import data take up the most space:

SELECT moduleInstanceId, parameterId, timeSeriesType, synchLevel, Count(synchLevel),
Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM TimeSeries
WHERE creationTime BETWEEN to_date('02-02-2008','MM-DD-YYYY') AND to_date('02-03-2008','MM-DD-YYYY')
AND moduleInstanceId LIKE 'Import%'
GROUP BY moduleInstanceId, timeSeriesType, synchLevel, parameterId;

Example output:

The amount of data for the module states and logfiles can be determined using:

SELECT moduleInstanceId, Count(moduleInstanceId),
Sum(Length(WarmStates.blob))
FROM WarmStates
GROUP BY moduleInstanceId;

And

SELECT eventCode, Count(eventCode),
Sum(Length(logMessage))
FROM LogEntries
GROUP BY eventCode;
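
A related sketch (PostgreSQL syntax, using the expiryTime field described above) shows how much data has already passed its expiry time but is still present, i.e. is waiting to be removed by the rolling barrel:

-- total size of records that are past their expiry time
SELECT Count(*) AS num_expired_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
WHERE expiryTime < current_timestamp;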

Extension of ID-column width (2017.01)

Combined ID-column background

In the central database of Delft-FEWS all timeseries 'fragments' are stored in so-called BLOBs (Binary Large OBjects). While writing, Delft-FEWS attempts to put as many fragments as possible into one BLOB to keep the number of (new) database records to a minimum. The 'combined' ID, consisting of Location-ID, Parameter-ID, Ensemble-ID and Qualifier-ID, is used as a reference to find data quickly. This smart BLOB-merge activity is done automatically and tries to find an optimal overlap in IDs to make the merge as efficient as possible. It helps if your configuration consists of (preferably short) very consistent IDs for locations, parameters, ensembles and qualifiers. Consistent pre- or postfixes also help to create an overlap in IDs for the most efficient BLOB-merge.

 

Widening the ID-column in Delft-FEWS 2017.01

The width of this combined ID-column was set to a default of 64 characters until Delft-FEWS 2016.02. The ID-column is of type varchar and its maximum has been increased to 2000 characters in version 2017.01 of Delft-FEWS. This does not mean that every record takes up this maximum: it is standard database behaviour to store only the actual length of the value. Oracle allows a varchar maximum of 4000, MS-SQL Server 8000 and PostgreSQL even 1 GB (!). An important remark is that spaces are NOT used to fill up this maximum width; all database types prevent this because space-padding is set to OFF. This also holds for Derby (used for local datastores). Firebird is the only exception: space-padding is set to ON, but since this column is stored in a compressed way the effect is minimal. The bottom line is that, for all database types, only the actual length of the combined ID column is stored.

This new width means that fewer records will be created in workflows/runs in which data is stored. Cache files will be smaller too. In general, fewer records mean that the overall performance of the central database and local datastore(s) improves.

 

What aspects improve the merge of the BLOBs?

As mentioned, a consistent use of IDs is important to improve the effect of this merge.

Delft-FEWS merges the records in the following order (to a maximum of 256kB) when they differ in:

  • Parameter-ID
  • Qualifier-ID
  • Location-ID
  • Ensemble-member-ID 

Merges into one BLOB will NOT take place when timeseries records have different ModuleInstance-ID, TimeStep-ID or ForecastTimes.

 

How do larger BLOBs relate to overall database performance when reading?

All written BLOBs are read-only and will never be edited again. The overall speed of writing will improve in any case because fewer records are written. Larger BLOBs mean that sometimes more data is read than strictly necessary. Read BLOBs are cached in memory and on disk, so that whenever a timeseries from the same BLOB is needed no database access is required because it is already in the cache. In general, fewer records/BLOBs mean less overhead and thus a smaller database.
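
To get an impression of how well this merging works in an existing database, a sketch query (PostgreSQL syntax, using the timeseries table described earlier) can compare the number of records with the average blob size per moduleInstanceId; many small blobs for a single moduleInstanceId can be a sign of inconsistent IDs preventing merges:

-- record count and average blob size per module instance
SELECT moduleInstanceId, Count(*) AS num_blobs,
       Avg(Length(TimeSeries.blob))/1024.0 AS avg_blob_kb
FROM TimeSeries
GROUP BY moduleInstanceId
ORDER BY num_blobs DESC;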

 

DatabaseViewer checks

In the Database Viewer (FEWS-Explorer + <F12> + J), the user can inspect the results of a workflow with respect to the number of BLOBs created. This information is shown in the lower part of the screen after selecting the workflow in the top part. The column 'Weighted Blob Count' displays the number of BLOBs (records) created in that run.

 

Since Delft-FEWS 2017.01 this information has been promoted to the top part of the screen. In the overview of all workflows you can now directly see the BLOBs/records created for all the workflows. When filtering on one specific workflow you immediately get an impression of how the runs relate to each other with respect to the BLOBs created. Outliers can easily be recognised.
