Attachments: 1207616-000-ZWS-0002-r-Database volume analysis method.pdf; 2015-04-20 Lunchlezing - Database Analysis.ppt

 

Introduction

Many systems expand over time. After the initial setup, new models are added to the system or new (gridded) data feeds are introduced. All these changes have an impact on the size of the databases within the FEWS system. Finally, the introduction of ensemble forecasting can further increase the size and complexity of the system.

While the increase in coverage and complexity of the forecasting service is desirable, the associated increase in database size should be limited where possible to ensure that system performance remains acceptable. Although large databases do not pose technical problems when running the FEWS software, they do result in a loss of performance, in particular when synchronising data from the central server to distributed clients.

There are many factors that influence the size of the operational FEWS databases. These include the amount and resolution (spatial and temporal) of scalar and gridded data imported, the amount of data resulting from data manipulation and model runs within the system and, importantly, the length of time data is kept in the database before it is purged by the rolling barrel process. In some cases data is kept that only forms an intermediate result and is not used for viewing or for producing a forecast product; such data could be marked as temporary to avoid being stored at all. There are several other methods for optimising the size of operational FEWS databases.

Read more in the attached document ...

Database inspection from within client

Please note that via the F12 menu in the Delft-FEWS client you can get detailed information on the database (menu "database").

 

Database sizes and Performance of the system

FEWS stores all timeseries in the database as blobs (binary large objects). The contents of the blobs cannot be read directly with SQL (to do so, use the JDBC server), but the rest of the table can be read directly. The example queries below focus on determining the size of certain groups of timeseries, as the timeseries table is usually by far the largest table in the database. This can help in optimising the configuration of the system by allowing you to focus on the largest chunks of data.

These examples have been made using a PostgreSQL database. When using them on another database some SQL statements may need to be modified to cater for small dialect differences.
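
As a quick first check on a PostgreSQL database, the standard PostgreSQL catalog views (not FEWS specific) can be used to list the largest tables; this normally confirms that the timeseries table dominates. A minimal sketch:

-- list user tables ordered by their total size (including indexes and TOAST data)
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;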

Delft-FEWS uses a database to store and retrieve forecast data. Although a central database exists, each client (the Operator Client (OC) and the Forecasting Shell Server (FSS), the calculation node) uses its own local datastore, which is a synchronised mirror of the central database. Which data is synchronised between the clients and the central database is determined by synchronisation profiles and synchronisation levels. To achieve optimal performance it is vital to only synchronise data that really needs to be synchronised.

Tuning the data streams between the different components can be done on a global level (using synchronisation profiles, not discussed here) and on a per-timeseries level by assigning a synchronisation level to a specific timeseries. This document focuses on the amount of data sent by the FSS to the master controller (for later retrieval) and the amount of data that is needed to visualise results in the OC. Apart from synchronisation, Delft-FEWS can also use other techniques to minimise the amount of data: (1) compression of data by specifying a precision and (2) marking parts of the data as temporary. Below is an overview of the synchLevels currently used in Delft-FEWS:

SynchLevels - selected by convention

The latest documentation of synch levels can be found here

The general steps to follow for minimising the amount of data being transferred are:

  1. Determine what the bulk of the data in your system is. Is it forecast data or imported grids? (A sketch query for this step is given after this list.)
  2. Only synchronise data to the OC that is actually used in the OC (e.g. to make graphs). For example, ensemble grids are almost never interesting to view in the OC, so they should be marked with synchLevel 7.
  3. Make sure temporary data (intermediate steps in calculations) gets synchLevel 9 and a short expiry time (it is now also possible to use the temporary series type, which keeps the series in memory until the end of the workflow).
  4. Set the valueResolution parameter in Parameters.xml for the parameters that take up a lot of space. By sacrificing some accuracy you can achieve very high compression ratios for certain parameters. This can be very effective for large grids with a different value in each cell, such as temperature fields. Precipitation usually compresses very well without sacrificing accuracy, because a precipitation field usually contains many repetitive values.
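
A minimal sketch for step 1, assuming a PostgreSQL database and the timeseries table described further below; it shows how much data is stored per synchLevel:

-- total blob size and record count per synchronisation level
SELECT synchLevel, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
GROUP BY synchLevel
ORDER BY size_mb DESC;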

Calculating database size

The size of the database can be estimated using the following calculations:

Forecast data = FS * FF * RB
Historic/state run data = SS * SF * RB
Model state data = MS * SF * RB
Imported data = IS * RB

Where:
FS = Size of a single forecast
SS = Size of a single Historic run
IS = Size of one day of import data
FF = forecast frequency (per day)
SF = Staterun/Historic frequency (per day)
RB = Rolling barrel length in days
MS = Size of the model state for one Historical run

To this, several other (usually small) components need to be added, such as configuration data and log messages, but the items above make up the bulk of the data.
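
As a purely illustrative example (the numbers are hypothetical): with FS = 50 MB, FF = 4 forecasts per day, SS = 20 MB, SF = 1 state run per day, MS = 100 MB, IS = 200 MB per day and RB = 30 days, this gives:

Forecast data = 50 * 4 * 30 = 6000 MB
Historic/state run data = 20 * 1 * 30 = 600 MB
Model state data = 100 * 1 * 30 = 3000 MB
Imported data = 200 * 30 = 6000 MB

or roughly 15 GB in total for the bulk of the data.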

The fields in the timeseries table

The table below lists the fields in the timeseries table in the FEWS database:

Field - Description

localIntId - Only filled in the local datastore, not in the MC database. Used to determine when a blob was created.
creatorTaskRunId - PM
blobId - ID of this blob.
modifierId - PM
moduleInstanceId - ModuleInstance that created this series.
parameterId - Parameter id for this timeseries.
locationId - Location id (or ids). A blob can hold data for multiple locations for optimisation purposes.
beginTime - Date/time of the first value in the blob.
endTime - Date/time of the last value in the blob.
timeSeriesType - Type of the timeseries according to the following enumeration: 0 = external historical, 1 = external forecasting, 2 = simulated historical, 3 = simulated forecasting.
taskRunId - Holds the taskrun that created this series. This field is null in the case of external data.
ensembleMemberId - Ensemble member id. Default is main$0 for non-ensemble data.
blob - The actual blob.
blobSize - Size of the blob in bytes.
creationTime - Time the blob was created.
synchLevel - Level used to determine when and how to synchronise the data.
expiryTime - Expiry time of this record. After this date/time the record will be removed.
localAvailableTime - Time this record was first available in the system.
valueType - Type of the actual timeseries data: 0 = scalar, 1 = grid, 2 = longitudinal profile, 3 = polygon, 4 = sample.
timeStepId - Timestep of this timeseries.
externalForecastingStartTime - Forecast time in the case of an external forecasting timeseries.
constantFlag -
maxValue - Maximum value in the blob.

These fields can be used to query the database. Please note that it can be difficult to use the locationId, as the system optimises the database by storing data (for the same period) for several locations in one record. In this process the locationId field is changed to indicate a list of locations (in a run-length encoded format).

Example Queries

If you query a local database (Firebird) you cannot use the Length(TimeSeries.blob) function. Instead you can use the blobSize field to calculate the size, or the OCTET_LENGTH function: SELECT Sum(OCTET_LENGTH("blob"))/1024.0/1024.0

The query below sums all the blobs in the timeseries table to determine the total size in megabytes (MB):

SELECT Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

 


Same for MS SQL Server:

SELECT Sum(DataLength(TimeSeries.blob))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

 


When querying a Firebird database the OCTET_LENGTH function can be used to establish the size of the blob:

SELECT Sum(OCTET_LENGTH("blob"))/1024.0/1024.0
FROM Timeseries
where moduleinstanceid LIKE '%'

Note the position of the brackets when calculating MB from bytes: the division should be done after summing, to avoid underestimating the size due to rounding before summation.
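
Because the locationId field may contain a run-length encoded list of locations (see the field table above), it is often more practical to group on other fields. A sketch (PostgreSQL syntax) that shows whether scalar or gridded data dominates:

-- total size per value type (0 = scalar, 1 = grid, 2 = longitudinal profile, 3 = polygon, 4 = sample)
SELECT valueType, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
GROUP BY valueType
ORDER BY size_mb DESC;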

 


The query below produces a table showing the size of all the import data in the database (assuming all the import module instances start with 'Import'):

SELECT moduleInstanceId, parameterId, timeSeriesType, synchLevel, Count(TimeSeries.blob),
Sum(Length(TimeSeries.blob))
FROM TimeSeries
WHERE moduleInstanceId LIKE 'Import%'
GROUP BY moduleInstanceId, timeSeriesType, synchLevel, parameterId;

Result:

To determine the size of a single forecast, one must find all the records belonging to a specific taskRunId:

SELECT TimeSeries.creatorTaskRunId, TimeSeries.parameterId, TimeSeries.moduleInstanceId,
TimeSeries.synchLevel,
Count(TimeSeries.synchLevel) AS NumBlobs, Sum(Length(TimeSeries.blob))
FROM TimeSeries
WHERE TimeSeries.creatorTaskRunId = 'EFASMC00:000024672'
GROUP BY TimeSeries.creatorTaskRunId, TimeSeries.moduleInstanceId, TimeSeries.parameterId,
TimeSeries.synchLevel;
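
If no representative taskRunId is known beforehand, a sketch query (PostgreSQL syntax) can first list the taskruns that produced the most data:

-- the ten taskruns with the largest total blob size
SELECT creatorTaskRunId, Count(*) AS num_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
WHERE creatorTaskRunId IS NOT NULL
GROUP BY creatorTaskRunId
ORDER BY size_mb DESC
LIMIT 10;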

The amount of import data for one day can be determined using the following query:

SELECT
Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM TimeSeries
WHERE creationTime BETWEEN to_date('02-02-2008','MM-DD-YYYY') AND to_date('02-03-2008','MM-DD-YYYY')
AND moduleInstanceId LIKE 'Import%'

Alternatively, we can group this to see which parts of the import data take up the most space:

SELECT moduleInstanceId, parameterId, timeSeriesType, synchLevel, Count(synchLevel),
Sum(Length(TimeSeries.blob))/1024.0/1024.0
FROM TimeSeries
WHERE creationTime BETWEEN to_date('02-02-2008','MM-DD-YYYY') AND to_date('02-03-2008','MM-DD-YYYY')
AND moduleInstanceId LIKE 'Import%'
GROUP BY moduleInstanceId, timeSeriesType, synchLevel, parameterId;

Example output:

The amount of data for the module states and logfiles can be determined using:

SELECT moduleInstanceId, Count(moduleInstanceId),
Sum(Length(WarmStates.blob))
FROM WarmStates
GROUP BY moduleInstanceId;

And

SELECT eventCode, Count(eventCode),
Sum(Length(logMessage))
FROM LogEntries
GROUP BY eventCode;
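
A related sketch (PostgreSQL syntax, using the expiryTime field described above) shows how much data has already passed its expiry time but is still present, i.e. is waiting to be removed by the rolling barrel:

-- total size of records that are past their expiry time
SELECT Count(*) AS num_expired_blobs,
       Sum(Length(TimeSeries.blob))/1024.0/1024.0 AS size_mb
FROM TimeSeries
WHERE expiryTime < current_timestamp;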

Extension of ID-column width (2017.01)

Combined ID-column background

In the central database of Delft-FEWS all timeseries 'fragments' are stored in so-called BLOBs (Binary Large OBjects). While writing, Delft-FEWS attempts to put as many fragments as possible into one BLOB to keep the number of (new) database records to a minimum. The 'combined' ID, consisting of Location-ID, Parameter-ID, Ensemble-ID and Qualifier-ID, is used as a reference to find data quickly. This smart BLOB-merge activity is done automatically and tries to find an optimal overlap in IDs to make the merge as efficient as possible. It helps if your configuration consists of (preferably short) very consistent IDs for locations, parameters, ensembles and qualifiers. Consistent pre- or postfixes also help to create an overlap in IDs for the most efficient BLOB-merge.

 

Widening the ID-column in Delft-FEWS 2017.01

The width of this combined ID-column was set to a default of 64 characters until Delft-FEWS 2016.02. The ID-column is of type varchar and its maximum has been increased to 2000 characters in version 2017.01 of Delft-FEWS. This does not mean that every record takes up this maximum: it is standard database behaviour to store only the actual length of the value. Oracle allows a varchar maximum of 4000, MS-SQL Server 8000 and PostgreSQL even 1 GB (!). An important remark is that spaces are NOT used to fill up this maximum width; all database types prevent this because space-padding is set to OFF. This also holds for Derby (used for local datastores). Firebird is the only exception: space-padding is set to ON, but since this column is stored in a compressed way the effect is minimal. The bottom line is that, for all database types, only the actual length of the combined ID column is stored.

This new width means that fewer records will be created in workflows/runs in which data is stored. Cache files will be smaller too. In general, fewer records mean that the overall performance of the central database and local datastore(s) improves.

 

What aspects improve the merge of the BLOBs?

As mentioned, a consistent use of IDs is important to improve the effect of this merge.

Delft-FEWS merges the records in the following order (to a maximum of 256kB) when they differ in:

  • Parameter-ID
  • Qualifier-ID
  • Location-ID
  • Ensemble-member-ID 

Merges into one BLOB will NOT take place when timeseries records have different ModuleInstance-ID, TimeStep-ID or ForecastTimes.

 

How do larger BLOBs relate to overall database performance when reading?

All written BLOBs are read-only and will never be edited again. The overall speed of writing will improve in any case because fewer records are written. Larger BLOBs mean that sometimes more data is read than strictly necessary. Read BLOBs are cached in memory and on disk, so that whenever a timeseries from the same BLOB is needed no database access is required because it is already in the cache. In general, fewer records/BLOBs mean less overhead and thus a smaller database.
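
To get an impression of how well this merging works in an existing database, a sketch query (PostgreSQL syntax, using the timeseries table described earlier) can compare the number of records with the average blob size per moduleInstanceId; many small blobs for a single moduleInstanceId can be a sign of inconsistent IDs preventing merges:

-- record count and average blob size per module instance
SELECT moduleInstanceId, Count(*) AS num_blobs,
       Avg(Length(TimeSeries.blob))/1024.0 AS avg_blob_kb
FROM TimeSeries
GROUP BY moduleInstanceId
ORDER BY num_blobs DESC;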

 

DatabaseViewer checks

In the Database Viewer (FEWS-Explorer + <F12> + J), the user can inspect the results of a workflow with respect to the number of BLOBs created. This information is shown in the lower part of the screen after selecting the workflow in the top part. The column 'Weighted Blob Count' displays the number of BLOBs (records) created in that run.

 

Since Delft-FEWS 2017.01 this information has been promoted to the top part of the screen. In the overview of all workflows you can now directly see the BLOBs/records created for all the workflows. When filtering on one specific workflow you immediately get an impression of how the runs relate to each other with respect to the BLOBs created. Outliers can easily be recognised.
