Description and usage
Principal components analysis (PCA) is used when a data set contains many time series (dimensions), and the dimensions need to be reduced, while retaining the most significant relationships between the time series. Reducing the number of dimensions reduces the data set size and removes unrelated variability. The implementation in FEWS allows you to derive a relation between independent timeseries (e.g. observations) and a dependent timeseries (e.g. a simulated timeseries).
Using historical timeseries, the PCA regression transformation produces a linear regression equation and a root mean square error (RMSE). This equation is applied with the current observations to compute an estimate of the (simulated) parameter, as well as its associated RMSE.
The PCA regression transformation was developed to update basin snow models when they drift away from realistic output. Snow updating uses historic and current snow water equivalence (SWE). Historic and current data come from monitoring stations within or near the basin, and from simulations of SWE in the basins. Historic observed data are used for PCA for a basin, and can potentially include time series from many monitoring stations. Current data are current daily SWE values, and are also either simulated (modelled basin) or observed (from monitoring stations within or near the basin). PCA finds the strongest underlying relationships between the historic observed station time series, and produces a linear equation. Current SWE values can then be input into the equation, and a PCA estimate of current basin SWE is produced.
In this function four nonequidistant input time series must be identified:
In the snow updating use of the PCA regression transformation, these time series are subsamples of a daily time series to produce one data point per month. See the dayMonth sample page for more details.
In this function two output time series must be identified:
1. A time series with the PCA-estimated parameter value calculated by the algorithm
2. A time series with the associated RMSE calculated by the algorithm
Each time series is assigned a variable ID which is used in the actual expression.
PCA and regression transformation
a) Handling of time series gaps and irregular lengths
In order to obtain the longest possible common period of record among the input time series, the gap filling behavior has been changed from the default FEWS behavior
On the left hand side of Table 1, a dataset is shown that consists of one basin and three stations with varying start and end times. Station 3 has a gap.
In the middle of the table, the resulting time series lengths are highlighted in color for various combinations of station and basin pairings.
On the right hand side of the table (light blue highlighting), the default FEWS pairing behavior is shown.
The BPA FEWS gap handling technique uses all of the available data, resulting in a longer dataset (gray highlighting).
Table 1: A graphical demonstration of the BPA FEWS gap handling technique
b) PCA transformation
i) Data Preprocessing
Before PCA calculations take place, the data set may need to be normalized and standardized (ex. where two datasets have very different means, standard deviations, or are not normally distributed) . However, no one type of preprocessing is appropriate for all time series. Therefore, to automatically assess which preprocessing type produces the 'best' results, the FEWS PCA algorithm performs a variety of preprocessing techniques. The user is presented with the result with the lowest RMSE.
Preprocessing techniques include the following attempts to normalize the dataset:
Preprocessing can also standardize the dataset by subtracting the time series mean and dividing by the standard deviation.
Therefore there are eight possible preprocessing types for PCA: square-root and standardizing, square-root and not standardizing, cube-root and standardizing, cube-root and not standardizing...
ii) Derivation of the PCA equation
Details below describe how a linear equation is produced from an eigenvector in FEWS.
Where a PCA equation is constructed from two historical SWE time series, and one historical modeled time series, the eigenvector matrix is:
'm', 'x', and 'y' are eigenvalues from basin 'm', and stations 'x' and 'y' respectively.
'z' is the PCA derived equation
'a' and 'b' are coefficients
'm' is the dimension of the matrix
'c' is a constant offset, derived by determining the mean of the historical SWE time series
c) Linear regression and multiple linear regression:
In addition to PCA analysis, the snow analysis module also attempts to minimize the RMSE by modeling using linear or multiple linear regression.
i) Data Preprocessing
Regression preprocessing and iteration procedures are identical to PCA (square-root, cube-root, log10, or no preprocessing). Data can also be either normalized or not. Therefore, there are eight regression preprocessing types: square-root and standardizing, square-root and not standardizing, cube-root and standardizing, cube-root and not standardizing...
ii) Derivation of the regression equation
The user is informed in the FEWS statistics window if regression produces the lowest RMSE, and has been chosen.
Config example: PCA_RMSE_SnowAnalysis.xml
Config example: TimeSeriesDisplayConfig.xml
The following sample describes how the PCA snow updataing display is configured.
Note the use of the dayMonthSample function (DayMonth Sample)
Produces the following display:
Figure 1. The snow updating GUI