Statistical Modelling: Univ.-Prof. Dr. Habil. Albrecht Gnauck

Brandenburg University of Technology at Cottbus Dept.
of Ecosystems and Environmental Informatics
Statistical Modelling
Univ.-Prof. Dr. habil. Albrecht Gnauck
International Master Course of Study Hydroinformatics EuroAquae
Winter term 2010/2011
Contents
1. Events and data
1.1 Analysis and control of aquatic ecosystems
1.2 Statistical management of ecological data
1.3 Sampling strategies
1.4 Re-sampling and pre-treatment of data
2. Probability functions and statistical measures
2.1 Probability functions of ecological data
2.2 Normal and skewed probability distribution functions
2.3 Comparison of expectations
2.4 Statistical measures
3. Statistical test procedures
3.1 Introduction
3.2 General procedure of hypothesis testing
3.3 Rules of decision
3.4 Selected test procedures
4. Linear regression and correlation analysis
4.1 Steps of linear regression
4.2 Confidence region of regression line
4.3 The power of linear regression
4.4 Empirical covariance and statistical measures of correlation
5. Nonlinear regression analysis
5.1 Polynomial regression
5.2 Periodic regression
5.3 Trend functions
5.4 Comparison of regression functions
6. Time series analysis
6.1 Dynamic behaviour of time series
6.2 Description of time series in the time and frequency domain
6.3 Stationary processes
6.4 Correlation and spectral functions
7. Analysis of cycling processes
7.1 Introduction
7.2 Fourier analysis
7.3 Digital data filter
7.4 Wavelets
Literature
1. Events and data

Statistical modelling of hydrological systems extracts information on former and current states of aquatic ecosystems (aquifers, freshwater ecosystems, marine ecosystems) from water quantity data and water quality data. Holism and reductionism are two different approaches to studying and modelling ecological processes and systems. Both approaches are needed for ecosystem modelling, simulation and management.

Holism
Aquatic ecosystems are complex systems with nonlinear interrelationships. Holism attempts to reveal the properties of ecosystems by studying the system as a whole. The system properties cannot be found by studying the system components separately; the study has to take place at the system level. The components of ecosystems are coordinated to such an extent that ecosystems work as indivisible units. A study at the level of ecosystem components will never reveal the ecosystem properties.

Reductionism
To simplify the ecosystem study and to facilitate the interpretation of ecological processes, the ecosystem components are separated from the system level. This method is useful for finding governing relationships in real systems, but it has obvious shortcomings when the functioning of the entire ecosystem is to be revealed. As an example: a forest is more than the sum of all trees.

The analysis and control of dynamic aquatic ecosystems such as ponds, lakes, reservoirs and river basins is often a complicated task because of the high number of system elements (or components) and of interrelationships between system elements and between system elements and their environments. To solve management problems the system has to be decomposed and nonlinear interrelationships have to be linearised. Furthermore, the controllability of aquatic ecosystems has to refer to different subsystems working in parallel and to different system states.
The quality of aquatic ecosystem analysis depends on the flexibility of the statistical models used. The restricted information structure of complex aquatic ecosystems and the aggregation of information lead to uncertainties in the modelling process and in the resulting models. Dynamic processes within aquatic ecosystems are initiated by switching of input and state variables. They result in rapid changes of system states and output variables (non-autonomous control) or in slow changes (autonomous control). In general, complex dynamic systems such as hydrological systems (aquatic ecosystems) are characterised by three features (table 1).
Table 1: General characterisation of complex dynamic systems
Feature                            Solving procedure
High dimension                     Decomposition of system
Uncertainty                        Analysis of dynamic characteristics (observability, controllability, perturbability, reachability, robustness, stability, sensitivity)
Restricted information structure   Aggregation of information
Statistical modelling of hydrological systems is based on data. Data are observations of characteristics and/or attributes of hydrological input, state and output variables. A group of state variables under study is called a (statistical) population (e.g. data of water flow, salinity data of river water, BOD data of a waste water treatment plant). If the frequency distribution of the attributes of a population is known, it can be described by a probability density function or probability distribution function, which is an analytical function defined by a number of parameters. For the study of aquatic ecosystems a subset of the population, a sample, is used. A population is called univariate if only one variable (or water quality indicator) is considered. Common univariate measures are averages, which measure the location of the centre of a data cloud along an axis, and measures of dispersion such as the variance or the range (spanning width). If more than one variable (or indicator) is considered, the population is called multivariate.

Regression and correlation analysis belong to experimental statistical modelling of hydrological systems, which is based on methods of probability theory. Solving practical problems requires approaches that are compatible with the stochastic nature of the input variables and state equations. Statistical procedures are the adequate mathematical methods as long as the processes within the systems and their describing equations are unknown.

Two groups of methods are distinguished depending on whether the variable time is included: static methods (without time as a variable) and dynamic methods (with time as a variable). The latter are often called time series analysis or dynamic statistics. Simple and multiple linear and non-linear regression and correlation belong to the static methods, as do multivariate statistical procedures. Static procedures answer the question whether there is a relationship between two or more variables of an environmental system. This question can be answered by a regression analysis, which yields the type of relationship between the variables.

Statistical modelling is done for different purposes. Administrations as well as industrial and agricultural companies use statistical data and results to plan their operations and economic developments. Researchers use statistics mainly as a first step to derive new scientific results. Therefore, the topics of statistical modelling can be formulated as follows:
1. Data sampling (Methods: Sampling design, re-sampling, plausibility checks, outlier correction).
2. Data analysis to fulfil the requirements of environmental administrations and associations (Methods: Descriptive statistics, frequency distributions, averages, variances, error correction, significance tests).
3. Data analysis to fulfil the requirements of different professional users (e.g. industry, agriculture, forestry) (Methods: Explanatory statistics, multivariate statistics, geostatistics, time series analysis).
4. Basic research (Methods: Regression and correlation analysis, multivariate statistics, advanced statistical techniques, digital data filtering, frequency analysis).

Disturbances of statistical analysis of hydrological data are given by:
1. Mostly, only small sets of representative, regularly sampled data are available.
2. The power of natural and artificial (man-made) external as well as natural internal driving forces on hydrological indicators influences the quality of the data to be obtained.
3. Mostly, the a-priori process information on water quality indicators is low.
4. Hydrologic processes possess different rate constants.
5. Cycling effects in hydrological data are induced by natural internal or external as well as by man-made external processes.

Classification of hydrological data
Hydrological data may be classified by their origin:
1. Measured and/or observed data of hydrological indicators are obtained by field samples and/or laboratory experiments. They are directly observed (direct observations) or indirectly observed (due to calibration of analytical instruments or sensors).
2. Summary data are derived from statistics or from ecological or water quality indicators that can be observed only to a restricted extent.
3. Simulated data are obtained by simulation models.

1.1 Analysis and control of aquatic ecosystems
An aquatic ecosystem is a biotic and functional system or unit which is able to sustain life and includes all biological and non-biological variables in that unit. Spatial and temporal scales are not specified a-priori, but are entirely based upon the objectives of the ecosystem study. Ecosystems are often called complex systems. Several approaches exist to study the behaviour of ecosystems:
Empirical studies collect bits of information. An attempt is made to integrate and assemble the studies into a complete picture.
Comparative studies compare some structural and functional components for a range of ecosystem types.
Experimental studies use manipulations of a whole ecosystem to identify and elucidate ecological mechanisms.
Modelling and computer simulation studies serve to work out ecosystem management plans and to derive eco-technological tools for goal oriented control actions.
Information systems and decision support systems studies serve to support industrial, agricultural and administrative ecological decisions and to work out medium-term and long-term development plans for ecological management.

Like many words for which people have an intuitive understanding, a system is difficult to define precisely. In relation to the physical and biological sciences, a system is an organised collection of interrelated physical components characterised by a boundary and functional unity. A system is a collection of communicating materials and processes that together perform some set of functions. A system is an interlocking complex of processes characterised by many reciprocal cause-effect pathways. A system is a set of interrelated objects (elements, parts) that have certain general properties:
1. It fulfils a certain function, i.e. it can be defined by a system purpose recognisable by an observer.
2. It has a characteristic constellation of essential system elements and an essential system structure which determine its function, purpose, and identity.
3. It loses its identity if it is destroyed.

Analysis and control of aquatic ecosystems are often complicated because of the high number of system elements and of interrelationships between system elements and between an ecosystem and its environment. Mostly, an ecosystem will be analysed as one unit. Dynamic processes within ecosystems are initiated by switching processes of input and state variables with different transfer time constants (fig. 1). If they are overlaid by external and internal disturbances, it cannot be distinguished which part of the ecosystem response and of its intensity stems from a single ecological element. For ecosystem analysis, the complex structure of an ecosystem requires its decomposition and the linearisation of nonlinear interrelationships. The controllability of ecosystems has to refer to different working elements (or subsystems) and system states. Therefore, the whole ecosystem is divided into several subsystems with internal and external feedbacks. This leads to uncertain statements on the ecosystem behaviour. The quality of ecosystem analysis depends on the flexibility of the mathematical models used for computation. Restricted information structure and aggregation of information lead to model errors.
Figure 1: Switching processes within a freshwater ecosystem
Ecosystems are multidimensional systems with several input and output variables. They can be seen as black box, grey box or white box systems. Depending on the numbers of input and output variables, SIMO, MIMO, SISO and MISO systems are distinguished. Ecosystems can be considered as stochastic transfer systems described by their state variables and parameters. They are characterised by measurable inputs, immeasurable (stochastic) disturbances and measurement errors. In real systems, disturbances, input signals and measurement errors are superimposed and produce disturbed (and uncertain) output signals.

Transfer functions are represented by
1. Pulse function: $x(t) = 0$ for $t < 0$ and $t > T$, $x(t) = x_0$ for $0 \le t \le T$,
2. Jump function: $x(t) = x_0\,\sigma(t)$ with $\sigma(t) = 0$ for $t < 0$ and $\sigma(t) = 1$ for $t \ge 0$,
3. Harmonic function: $x(t) = x_0 \cos(\omega t + \varphi)$ for $-\infty < t < +\infty$, or $x(t) = x_0 e^{j(\omega t + \varphi)} = x_0^{+} e^{j\omega t}$ with $x_0^{+} = x_0 e^{j\varphi}$,
4. White noise function.

Other transfer functions are
1. Exponential function: $x(t) = x_0 e^{-t/T}$ for $0 \le t < +\infty$, or $x(t) = x_0^{+} e^{j\omega t} e^{\lambda t}$ for $0 \le t < +\infty$ with a real constant $\lambda$,
2. Periodic function: $x(t) = a_0/2 + \sum_i a_i \cos(i\omega_0 t) + \sum_i b_i \sin(i\omega_0 t)$, or $x(t) = \sum_i c_i e^{j i \omega_0 t}$,
3. Dirac impulse: $x(t) = 0$ for $t < 0$ and $t > T$, $x(t) = \delta(t)$ with $\delta(t) = 0$ for $t \ne 0$ and $\int_{-\infty}^{+\infty} \delta(t)\,dt = 1$,
4. Ramp function: $x(t) = 0$ for $t < 0$ and $x(t) = a t$ for $t \ge 0$, or
5. Time discrete signal: $\tilde{x}(t) = \sum_k x(kT)\,\delta(t - kT)$ with $k = 0, 1, 2, \dots$ and $T \le 1/(2 f_{\max})$, where $f_{\max}$ is the maximum frequency contained in the data series.

Feedback structures (or couplings) within ecosystems are given by simple feed-forward, feed-back, self-tuning or complicated couplings between the ecosystem elements.

1.2 Statistical management of ecological data
To handle and investigate hydrological data meaningfully, they should be characterised by some relationships. The increase of information content of hydrological data analysis is expressed by the number of data operations. Four scales can be distinguished (fig. 2).
Figure 2: Data scales in hydrological research (information content increases from the nominal scale through the ordinal and interval scales to the ratio scale)
Transformations from one data scale to another serve to unify variables (tab. 2). The information content (knowledge, the antithesis of uncertainty) and the scale level should not be changed during sampling and/or statistical data analysis. If there is no empirical equivalence scale, the data are evaluated as comparable.
Table 2: Comparison of data scales in hydrology
Scale            Arithmetic operation   Statistical measure
Ratio scale      +, -, ×, /             Geometric mean
Interval scale   +, -                   Arithmetic mean
Ordinal scale    none                   Median, quartiles
Nominal scale    none                   Frequencies only
Nominal scale: No relationship between events; sometimes they are coded by numbers (e.g. lottery, pie charts). No arithmetic operation is possible.
Ordinal scale: Ranking of events or representations, classification of environmental indicators (e.g. EU water quality classes, soil classes etc.). Ordinal comparisons are possible (Class I > Class II), as is the estimation of median and quartiles.
Interval scale: Ordinal scale with equal intervals (e.g. water temperature). Statements on distances and differences between data are allowable. No natural origin (zero point) exists.
Ratio scale: An interval scale with a natural origin, which allows statements on ratios (e.g. concentrations).

One of the most important characteristics of hydrological data is their uncertainty, which can be characterised as a state or condition of incomplete or unreliable knowledge. Sources of uncertainty are:
1. Statistical analysis depends on the a-priori information on the essential hydrological variables considered.
2. Hydrological variables and their rates of change have different scales in time and space.
3. Mostly, only a small set of representative data is available.

The strength of disturbances of the observed data leads to fuzzy effects in interpretation. Figure 3 shows different types of annual water quality data series which can be distinguished by their statistical measures:
Figure 3: Data series of water quality samples (annual series of TW (°C), pH-value, Lf (mS/cm), O2 (mg/l), NH4 (mg/l), NO2 (mg/l), NO3 (mg/l), o-PO4-P (mg/l) and DOC (mg/l), plotted against time in months)
The quality and usability of hydrological data depend strongly on the suitability of the sample and the adequacy of the sampling or monitoring program. The goal of sampling is to get information about the frequency distributions of data indicating environmental states, or about the distribution parameters. These estimates are called sample statistics and form a basis for prognoses of environmental developments in general, but also of hydrological changes. If an investigation is based on samples, the sample statistics depend on the particular sampling environment, on stationary or non-stationary external or internal effects, and on random influences. Sampling frequency depends on hydrologic process dynamics, on the degree of water pollution, on the type of pollution, and on the type of substance. Different results may be obtained if different samples are selected. This variation in the data from sample to sample is called sampling variability. The difference between a statistic and the true population value is called sampling error. It increases if more random factors influence the sampling procedure. There is a margin of uncertainty expressed in terms of the sampling variance of the estimator. Sampling variance is a measure of the precision of the estimates.

Comparison of hydrological data series:
1. Average is time dependent, dispersion is approximately time constant.
2. Average is approximately time constant, dispersion is time dependent.
3. Average and dispersion are time dependent.

Variability within data series is caused by:
1. Environmental influences or factors,
2. Intrinsic factors between water samples,
3. Different sample treatment,
4. Different data treatment.

1.3 Sampling strategies
Ecological data are obtained by field samples and/or laboratory analysis. They are directly observed (direct observations) or indirectly observed (due to calibration of analytical instruments and sensors). Summary data are derived from statistics or from indicators that can be observed only to a restricted extent. Simulated data are obtained by simulation models. Sampling design is based on different procedures. The most commonly used designs are
1. Systematic (periodic) sampling (yearly, monthly, weekly, daily, and hourly).
2. Sampling based on the level of admissible error of the annual mean.
3. Random sampling.
4. Sample size for normally distributed data without trend and periodicities: $n = \left( t_{95}\, v / e(x^{*}) \right)^{2}$ with $t_{95} = 1.96$, the coefficient of variation $v = (s / x^{*}) \cdot 100$ (%) and $e(x^{*}) = 10\,\%$ allowable deviation from the mean.
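The sample size formula in item 4 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function name and the example values are hypothetical, and it assumes that v is the empirical coefficient of variation s/x* in percent and that the result is rounded up to the next whole sample.

```python
import math


def required_sample_size(mean, std_dev, allowable_error_pct=10.0, t_value=1.96):
    """Sample size n = (t * v / e)^2 for normally distributed data.

    v is the coefficient of variation in percent (s / mean * 100) and
    e is the allowable deviation from the mean in percent.
    The function name and defaults are illustrative assumptions.
    """
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    v = abs(std_dev / mean) * 100.0
    n = (t_value * v / allowable_error_pct) ** 2
    # Round up, because a fraction of a sample cannot be taken.
    return math.ceil(n)


# Hypothetical indicator with mean 2.0 mg/l and standard deviation 1.0 mg/l
print(required_sample_size(mean=2.0, std_dev=1.0))
```

A coefficient of variation of 50 % with a 10 % allowable deviation already requires close to one hundred samples, which illustrates why small data sets are named as a typical disturbance of hydrological data analysis.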
The sampling location in space and time can have a very real effect on the quality and usefulness of data in hydrology. Site selection should be made primarily on the basis of the goal of the study as well as of the nature of the hydrologic process or phenomenon under consideration. The optimum number of samples, the frequency of sampling and the spacing can be estimated by preliminary sampling experiments, by conclusions from expert knowledge, by practical experience, or by statistical sampling design formulas and methods. Geostatistical methods can help to determine the optimal spatial distribution of sampling points.

The sampling procedure covers three parts:
1. Hypothesis (program purpose, sampling design, formulation of questions),
2. Observation (sampling techniques, sampling protocol, analytical techniques),
3. Interpretation (data analysis, interpretation of results).

Recommendations for hydrological sampling:
1. The goals and needs for hydrological data collection should be formulated explicitly for each application before sampling is started.
2. Prior knowledge of factors that affect the hydrological variables to be sampled should be available.
3. During sampling, significant changes of external and internal driving forces should not take place.
4. Existing estimates may be sufficient if they were obtained by an unbiased sampling design.
5. Sampling design in hydrology should cover the water budget (surface and groundwater), hydrochemical variables (organic and inorganic substances, metabolites), hydrophysical variables (considering internal and external driving forces), hydrobiological variables (life cycle of plants and organisms, conversion of organic and inorganic substances), microbiological variables, and other variables as required.

Disturbances of data analysis:
Only small sets of representative, regularly sampled data are available.
The power of external and internal driving forces on water quality (hydrological) indicators influences the quality of the data to be obtained.
The a-priori process information on water quality (hydrological) indicators is low.
Water quality (hydrologic) processes possess different rate constants.
1.4 Re-sampling and pre-treatment of data
Series of measurements of hydrological data are time series recorded at discrete points in time, often with unequal sampling intervals. In practice, they often contain missing data or are based on different sampling intervals in time and space. To extract hydrologic process information from single data (events), the data series should be completed and based on a regular sampling grid. The application of static and dynamic statistical methods to such data sets requires equidistant data. Re-sampling generally means data interpolation or, in the case of noisy information, data approximation. Figure 4 gives an overview of these procedures.
Figure 4: Interpolation, approximation and digital filtering of data (flow chart: raw hydrological data are processed by interpolation, which gives equidistant data, by static or dynamic approximation, which gives a functional relationship, or by low pass or high pass digital data filtering, which gives consistent data)
The goal of applying interpolation and approximation methods to incomplete time series is to fill the intervals between two grid points so that series of measurements with small, uniform sampling intervals are obtained. Table 3 contains some commonly used interpolation methods.
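Table 3: Interpolation methods

Method: Nearest neighbour
Algorithm: $\tilde{x}(t) = x_k$ for $t < (t_k + t_{k+1})/2$ and $\tilde{x}(t) = x_{k+1}$ for $t \ge (t_k + t_{k+1})/2$
Characteristics: $\tilde{x} \notin C^{(0)}[t_0, t_n]$ ($\tilde{x}$ discontinuous)

Method: Linear
Algorithm: $\tilde{x}(t) = \frac{x_{k+1} - x_k}{t_{k+1} - t_k}\,(t - t_k) + x_k$
Characteristics: $\tilde{x} \in C^{(0)}[t_0, t_n]$ ($\tilde{x}$ continuous)

Method: Cubic Hermite polynomial
Algorithm: $\tilde{x}(t)|_{[t_k, t_{k+1}]} = a_k t^3 + b_k t^2 + c_k t + d_k$
Characteristics: $\tilde{x} \in C^{(1)}[t_0, t_n]$ ($\tilde{x}$ continuously differentiable)

Method: Cubic spline
Algorithm: $\tilde{x}(t)|_{[t_k, t_{k+1}]} = e_k t^3 + f_k t^2 + g_k t + h_k$
Characteristics: $\tilde{x} \in C^{(2)}[t_0, t_n]$ ($\tilde{x}$, $\tilde{x}'$ continuously differentiable)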
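The first two methods of table 3 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names, the sampling days and the concentrations are hypothetical, the observation times are assumed to be sorted and distinct, and the cubic Hermite and spline variants are left out because they need additional coefficient computations.

```python
from bisect import bisect_right


def bracket_index(t_obs, t):
    """Return k such that t_obs[k] <= t <= t_obs[k + 1]."""
    if not t_obs[0] <= t <= t_obs[-1]:
        raise ValueError("t lies outside the observed time range")
    k = bisect_right(t_obs, t) - 1
    # The last grid point belongs to the last interval.
    return min(k, len(t_obs) - 2)


def nearest_neighbour(t_obs, x_obs, t):
    """x_k below the interval midpoint, x_(k+1) at or above it."""
    k = bracket_index(t_obs, t)
    midpoint = (t_obs[k] + t_obs[k + 1]) / 2
    return x_obs[k] if t < midpoint else x_obs[k + 1]


def linear(t_obs, x_obs, t):
    """Straight line between the two neighbouring observations."""
    k = bracket_index(t_obs, t)
    slope = (x_obs[k + 1] - x_obs[k]) / (t_obs[k + 1] - t_obs[k])
    return slope * (t - t_obs[k]) + x_obs[k]


def resample(t_obs, x_obs, step, method):
    """Evaluate an interpolation method on an equidistant grid."""
    n_steps = int((t_obs[-1] - t_obs[0]) // step)
    grid = [t_obs[0] + i * step for i in range(n_steps + 1)]
    return grid, [method(t_obs, x_obs, t) for t in grid]


# Hypothetical, unequally spaced samples (day of year, concentration in mg/l)
t_obs = [0, 10, 24, 30]
x_obs = [2.0, 4.0, 1.2, 3.0]

grid, x_nearest = resample(t_obs, x_obs, 6, nearest_neighbour)
_, x_linear = resample(t_obs, x_obs, 6, linear)
print(grid)
print(x_nearest)
print(x_linear)
```

The nearest neighbour result jumps at the interval midpoints, whereas the linear result is continuous, which matches the characteristics column of table 3.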
Results of interpolation of water quality data based on biweekly sampling intervals are presented in figure 5 and in figure 6 for monthly sampling intervals.
Figure 5: Results of interpolation for two-weekly sampled data (River Spree, NH4-N in mg/l against t in months; four panels compare the raw data with the 14 d spline, nearest neighbour, cubic and linear interpolation)
The effectiveness of interpolation procedures can be evaluated by standard error estimations. Results for the biweekly data sets are presented in table 4.
Table 4: Standard error of data series with biweekly sampling interval
Year  Method     NH4-N  NO2-N  NO3-N  o-PO4-P  DOC
1991  neighbour  0.39   0.065  0.38   0.082    1.41
1991  linear     0.36   0.052  0.35   0.081    1.33
1991  spline     0.37   0.053  0.36   0.081    1.36
1991  cubic      0.37   0.053  0.36   0.081    1.35
1992  neighbour  0.25   0.041  0.46   0.116    0.70
1992  linear     0.23   0.033  0.41   0.114    0.64
1992  spline     0.24   0.038  0.43   0.115    0.67
1992  cubic      0.24   0.038  0.43   0.115    0.65
1993  neighbour  0.21   0.041  0.43   0.080    0.46
1993  linear     0.18   0.037  0.37   0.079    0.42
1993  spline     0.20   0.035  0.39   0.079    0.44
1993  cubic      0.20   0.035  0.39   0.079    0.43
1994  neighbour  0.12   0.020  0.27   0.020    1.47
1994  linear     0.09   0.017  0.23   0.019    1.14
1994  spline     0.11   0.019  0.23   0.019    1.41
1994  cubic      0.11   0.019  0.23   0.019    1.41
1995  neighbour  0.09   0.016  0.22   0.027    0.73
1995  linear     0.08   0.015  0.20   0.025    0.69
1995  spline     0.10   0.016  0.22   0.026    0.71
1995  cubic      0.10   0.016  0.22   0.026    0.71
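Reading table 4 column by column is easier with a small helper. The sketch below takes the NH4-N standard errors of table 4 and picks the interpolation method with the smallest value for each year; the dictionary layout and variable names are illustrative assumptions, and ties would be resolved in favour of the method listed first.

```python
# Standard errors for NH4-N (biweekly sampling), copied from table 4
standard_error_nh4 = {
    1991: {"neighbour": 0.39, "linear": 0.36, "spline": 0.37, "cubic": 0.37},
    1992: {"neighbour": 0.25, "linear": 0.23, "spline": 0.24, "cubic": 0.24},
    1993: {"neighbour": 0.21, "linear": 0.18, "spline": 0.20, "cubic": 0.20},
    1994: {"neighbour": 0.12, "linear": 0.09, "spline": 0.11, "cubic": 0.11},
    1995: {"neighbour": 0.09, "linear": 0.08, "spline": 0.10, "cubic": 0.10},
}

# For every year, select the method with the smallest standard error.
best_method = {
    year: min(errors, key=errors.get)
    for year, errors in standard_error_nh4.items()
}

for year, method in best_method.items():
    print(year, method, standard_error_nh4[year][method])
```

For NH4-N the linear method has the smallest standard error in each of the five years listed.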
Figure 6 contains some interpolation results for monthly sampled data sets of water quality indicators of River Spree at Berlin.
Figure 6: Results of interpolation for (nearly) monthly sampled data (River Spree, NH4-N in mg/l against t in months; four panels compare the raw data with the 28 d spline, nearest neighbour, cubic and linear interpolation)
Standard error estimates for monthly sampled water quality data sets are given in table 5.
Table 5: Standard error of data series with monthly sampling interval
Year  Method     NH4-N  NO2-N  NO3-N  o-PO4-P  DOC
1991  neighbour  0.55   0.079  0.49   0.083    1.57
1991  linear     0.44   0.075  0.41   0.083    1.47
1991  spline     0.49   0.082  0.46   0.084    1.49
1991  cubic      0.46   0.078  0.44   0.084    1.48
1992  neighbour  0.33   0.063  0.43   0.119    1.00
1992  linear     0.29   0.058  0.40   0.115    0.82
1992  spline     0.30   0.066  0.48   0.115    0.96
1992  cubic      0.30   0.066  0.48   0.115    0.89
1993  neighbour  0.23   0.030  0.41   0.081    0.57
1993  linear     0.20   0.036  0.39   0.081    0.52
1993  spline     0.21   0.035  0.42   0.081    0.54
1993  cubic      0.21   0.035  0.42   0.081    0.53
1994  neighbour  0.18   0.028  0.37   0.027    2.54
1994  linear     0.15   0.025  0.35   0.025    2.15
1994  spline     0.19   0.026  0.37   0.026    2.41
1994  cubic      0.19   0.026  0.37   0.026    2.41
1995  neighbour  0.11   0.021  0.32   0.031    0.83
1995  linear     0.09   0.020  0.24   0.028    0.74
1995  spline     0.10   0.022  0.29   0.029    0.83
1995  cubic      0.10   0.022  0.29   0.029    0.83
Table 6 contains some selected results of an interpolation study for rivers with different speed of flow.
Table 6: Interpolation methods for rivers with different hydraulic regime
Variable          Spree                       Dahme                       Havel                       Oder
Ammonia           linear                      linear, spline              linear                      linear
Nitrite           linear                      linear                      linear                      linear
Nitrate           linear                      linear                      linear                      linear
Phosphate         linear, spline, polynomial  linear, spline, polynomial  linear, spline, polynomial  linear
DOC               linear                      linear                      linear                      linear
UV absorp.        linear                      linear                      linear, spline, polynomial  linear, spline
Turbidity         linear                      linear                      linear                      linear
Conductivity      spline, polynomial          spline, polynomial          spline, polynomial          linear, polynomial
Dissolved oxygen  linear                      linear                      linear                      linear
The application of interpolation methods leads to equidistant data, while approximation methods result in functional relationships which can be used as estimates of reference functions (or estimated reference data) if no other reference (e.g. from literature or from former experience) is available. The application of digital data filters yields consistent data. In each case, all three types of procedures (interpolation, approximation and digital data filtering) deliver data sets which can be used for modelling, simulation and optimisation as well as for decision making.
2. Probability distribution functions and statistical measures
Relationships between random events are analysed by probability calculus and mathematical statistics. In contrast, deterministic processes have well-known results of theoretical and practical experiments. A random event is an event which may occur under certain conditions, but does not have to occur. Two types of random variables have to be distinguished: discrete and continuous. A random variable X is called discrete if it takes finitely or countably infinitely many values $x_1, x_2, \dots, x_n$. A continuous random variable can take any value within its region of definition.

2.1 Probability functions of ecological data
In general, the probability distribution function of a random variable is defined by $F(x) = P(X < x)$, where x takes all real values. The probability distribution function of a discrete random variable $X = \{x_1, x_2, \dots, x_n\}$ with single probabilities $P(X = x_i) = p_i$ $(i = 1, \dots, n, \dots)$ is given by $F(x) = \sum_{x_i < x} P(X = x_i) = \sum_{x_i < x} p_i$. The probability distribution function of a continuous random variable is defined by $F(x) = P(X < x) = \int_{-\infty}^{x} f(\xi)\,d\xi$. Here f(x) is called the probability density function, where always $f(x) \ge 0$. The main characteristics of F(x) and f(x) are:
1. F(x) is a monotone non-decreasing continuous function of x which takes values between 0 and 1.
2. $\lim_{x \to +\infty} F(x) = F(+\infty) = \int_{-\infty}^{+\infty} f(x)\,dx = 1$ and $\lim_{x \to -\infty} F(x) = F(-\infty) = 0$.
3. $f(x) = dF(x)/dx = F'(x)$, if F(x) is continuously differentiable.
4. For $x_1 < x_2$: $F(x_2) - F(x_1) = P(X < x_2) - P(X < x_1) = P(x_1 \le X \le x_2) = \int_{x_1}^{x_2} f(x)\,dx$. The formula defines the area under the function f(x) between the abscissa values $x_1$ and $x_2$.
5. For any fixed value a: $P(X = a) = 0$.
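For a discrete random variable the distribution function defined above is a simple sum and can be sketched as follows. The function names, the four classes and their probabilities are hypothetical and serve only to illustrate the definition F(x) = P(X < x) with the strict inequality used in the text.

```python
def distribution_function(values, probabilities, x):
    """F(x) = P(X < x): sum of all p_i whose value x_i lies below x."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p for x_i, p in zip(values, probabilities) if x_i < x)


def interval_probability(values, probabilities, x1, x2):
    """F(x2) - F(x1), i.e. P(x1 <= X < x2) for a discrete variable."""
    return (distribution_function(values, probabilities, x2)
            - distribution_function(values, probabilities, x1))


# Hypothetical water quality classes 1 to 4 with their probabilities
classes = [1, 2, 3, 4]
p = [0.1, 0.4, 0.3, 0.2]

print(distribution_function(classes, p, 1))    # no class lies below 1
print(distribution_function(classes, p, 2.5))  # classes 1 and 2
print(distribution_function(classes, p, 5))    # all classes
print(interval_probability(classes, p, 2, 4))  # classes 2 and 3
```

For a discrete variable the result depends on whether the interval limits are included, whereas for a continuous variable it does not, because P(X = a) = 0 for any fixed a (characteristic 5).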
In practical problems it is sometimes impossible to determine the probability distribution function of a random variable X. For this reason, the probability distribution function can be characterised by estimates of its parameters. The most important parameters are the expectation (or average) of X, $EX$ or $\mu$, and the dispersion (variance), $D^2X$ or $\sigma^2$.

The expectation of a discrete random variable X which takes values $x_i$ with the associated probabilities $p_i$ is defined by $\mu = EX = \sum_i x_i p_i$ for $i = 1, \dots, \infty$. The dispersion of a discrete random variable X is defined by $\sigma^2 = D^2X = E(X - EX)^2 = \sum_i (x_i - \mu)^2 p_i$ for $i = 1, \dots, \infty$.

The expectation of a continuous random variable X with probability density f(x) is defined by $\mu = EX = \int_{-\infty}^{+\infty} x f(x)\,dx$. The dispersion of a continuous random variable X with probability density f(x) is defined by $\sigma^2 = D^2X = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx = \int_{-\infty}^{+\infty} x^2 f(x)\,dx - \mu^2$.

The coefficient of variation of a random variable X with $\mu \ne 0$ is defined by $\nu = \sigma / \mu$ (%).

Special discrete probability distributions are: discrete equal probability distribution, binomial probability distribution, hypergeometrical probability distribution, Poisson probability distribution, geometrical probability distribution. Special continuous probability distributions are: continuous equal probability distribution, Gaussian (normal or bell shaped) probability distribution, logarithmic Gaussian probability distribution, exponential probability distribution, Weibull probability distribution.

2.2 Normal and skewed probability distribution functions
The most important probability distribution of a random variable X is the Gaussian probability distribution. Its probability density is given by $f(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$, while the probability distribution function is given by $F(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{x} e^{-(\xi - \mu)^2 / (2\sigma^2)}\,d\xi$. Most regression methods and multivariate statistical procedures are based on the assumption that the random variables to be analysed follow a Gaussian probability distribution function. Often, probability density functions (frequency distributions) indicate a skewed probability distribution (fig. 7).
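The parameters and the Gaussian functions defined in this section can be evaluated with the Python standard library. The sketch below is illustrative only: the function names and the example values are hypothetical, the coefficient of variation is returned in percent, and the Gaussian distribution function is expressed through the error function `math.erf`, which is a standard closed form of the integral rather than the notation used in the text.

```python
import math


def expectation(values, probabilities):
    """mu = sum of x_i * p_i for a discrete random variable."""
    return sum(x * p for x, p in zip(values, probabilities))


def dispersion(values, probabilities):
    """sigma^2 = sum of (x_i - mu)^2 * p_i."""
    mu = expectation(values, probabilities)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probabilities))


def coefficient_of_variation(values, probabilities):
    """sigma / mu in percent; only defined for mu != 0."""
    mu = expectation(values, probabilities)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for mu = 0")
    return math.sqrt(dispersion(values, probabilities)) / mu * 100.0


def gaussian_density(x, mu, sigma2):
    """f(x; mu, sigma^2) of the Gaussian distribution."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


def gaussian_distribution(x, mu, sigma2):
    """F(x; mu, sigma^2), written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / math.sqrt(2 * sigma2)))


# Hypothetical discrete variable
values = [1, 2, 3, 4]
probabilities = [0.1, 0.4, 0.3, 0.2]
print(expectation(values, probabilities))
print(dispersion(values, probabilities))
print(coefficient_of_variation(values, probabilities))

# Standard Gaussian distribution (mu = 0, sigma^2 = 1)
print(gaussian_density(0.0, 0.0, 1.0))
print(gaussian_distribution(1.96, 0.0, 1.0))
```

The value of the standard Gaussian distribution function at 1.96 is close to 0.975, which is where the factor t(95) = 1.96 in the sample size formula of section 1.3 comes from.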

Figure 7: Skewed frequency distribution of a hydrological variable (histogram with a frequency axis from 0 to 70 and a value axis from 1.0 to 4.0)
Figure 8 contains some probability density functions which differ in form and shape. The upper panel contains the normal (Gaussian) probability distribution as well as
Figure 8: Examples of frequency distributions
2.3 Comparison of expectations
A simple test of normality can be carried out by comparing the positions of the mean, mode and median of a probability density function. For a Gaussian distribution the arithmetic mean, the median and the mode coincide at the same position on the abscissa.
Figure 9: Comparison of mean, median and mode
In the case of a skewed probability distribution the arithmetic mean, median and mode differ from each other (fig. 9). Table 7 lists sample statistics of heavy metal concentrations observed in a freshwater lake. The statistical computations were carried out with SPSS. Judged by the differences between mean, median and mode and by the skewness values, none of these water quality variables follows a normal (Gaussian) probability distribution.
Table 7: Statistical measures of heavy metal concentrations in a freshwater lake
Measure      Al       Pb      Cd      Cr      Fe      Cu       Ni      Zn
Mean         28.58    1.79    0.28    1.77    0.15    20.45    0.71    24.25
Median       28.40    1.30    0.14    0.45    0.04    14.02    0.23    17.35
Mode         20.40    1.27    0.04    0.33    0.03    11.10    0.17    17.00
g. mean      24.95    1.31    -       0.54    0.06    15.13    -       21.30
Variance     204.64   1.77    0.18    7.59    0.04    384.96   0.77    282.05
Std. dev.    14.31    1.33    0.43    2.76    0.19    19.62    0.88    16.79
Std. error   4.52     0.42    0.13    0.87    0.06    6.20     0.28    5.31
Min          7.20     0.24    0.00    0.01    0.00    6.32     0.00    15.00
Max          53.30    3.91    1.40    8.71    0.60    68.50    2.50    70.10
Range        46.10    3.67    1.40    8.70    0.60    62.18    2.50    55.10
Skewness     0.45     0.74    2.47    2.17    1.74    2.04     2.74    1.37
Excess       -0.15    -0.76   6.31    4.68    3.03    3.94     7.86    0.67
2.4 Statistical measures
Averages, variances and correlation coefficients are often called statistical measures. In this chapter, measures of expectation and dispersion are presented. Correlation measures are given in chap. 4.4.
Statistical measures of expectation (averages):
1. Arithmetic mean: $x^* = \frac{1}{n} \sum_{i=1}^{n} x_i$
2. Empirical median: $\tilde{x}$
3. Empirical mode: M
4. Geometric mean: x
5. Weighted arithmetic mean: x*g
6. Weighted geometric mean: lg x
Statistical measures of dispersion (variances):
1. Range (spanning width): $R = x_{max} - x_{min}$
2. Empirical variance: $s^2$
3. Empirical standard deviation: $s = \sqrt{s^2}$
4. Empirical coefficient of variation: $v = (s / x^*) \cdot 100$ (%)
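The measures listed above can be computed with a few lines of code. The sketch below uses Python's standard `statistics` module. The function name `describe` and the sample values are hypothetical, and the empirical variance is taken with the divisor n - 1, as in chap. 4.4.

```python
import math
import statistics

def describe(sample):
    """Return measures of expectation and dispersion for a list of numbers."""
    n = len(sample)
    mean = statistics.mean(sample)
    variance = statistics.variance(sample)  # empirical variance, divisor n - 1
    std_dev = math.sqrt(variance)
    return {
        "mean": mean,
        "median": statistics.median(sample),
        "mode": statistics.mode(sample),
        # geometric mean via logarithms; defined for positive values only
        "geometric_mean": math.exp(sum(math.log(x) for x in sample) / n),
        "range": max(sample) - min(sample),
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100.0,
    }

# Hypothetical concentration values (not taken from table 7)
data = [2.0, 4.0, 4.0, 8.0]
measures = describe(data)
for name, value in measures.items():
    print(name, round(value, 4))
```

Because the geometric mean is computed from logarithms, a sample containing zero concentrations has no geometric mean in this sketch.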
3. Statistical test procedures

In sample statistics the characteristics of interest are often expressed in terms of sample parameters such as average or variance. Other questions arise from comparing two or more samples; they may be expressed by the differences of averages.
3.1 Introduction
A statistical hypothesis is a statement about the sample distribution of some random environmental variables. Hypothesis testing consists of comparing statistical measures called test criteria (or test statistics), deduced from the data sample, with the values these criteria take on the assumption that a given hypothesis is correct. In hypothesis testing one examines a Null hypothesis H0 against one or more alternative hypotheses H1, H2, ..., Hn which are stated explicitly or implicitly. To reach a decision about the hypothesis, a significance level $\alpha$ is selected which should be small (0.05, 0.01 or 0.001). The confidence coefficient is given by $1 - \alpha$. For hypothesis testing the test criterion (or test statistic) is set up. When this statistic falls into the region of acceptance, the Null hypothesis is not rejected. When it falls into the region of rejection, the Null hypothesis is rejected. If the Null hypothesis is true, the probability of the test statistic falling into the region of rejection is equal to $\alpha$. It is expressed in %-values.
3.2 General procedure of hypothesis testing
1. The Null hypothesis H0 and an alternative hypothesis H1 are formulated.
2. The significance level $\alpha$ is selected.
3. The test statistic is chosen.
4. The region of rejection of the test statistic is determined on the basis of its probability distribution and the significance level.
5. The test statistic is calculated from the data set.
6. Decision: The Null hypothesis is rejected and the alternative hypothesis is accepted when the value of the test statistic falls into the region of rejection. The Null hypothesis is accepted if the value of the test statistic does not fall into the region of rejection.
3.3 Rules of decision
From sampled data an average m was calculated and is now compared with a fixed number (standard value) K. The Null hypothesis H0: m = K is tested against the alternative hypothesis H1: m ≠ K. The significance level $\alpha = 0.05$ is selected. The test statistic is chosen: $t = \frac{|m - K|}{s} \sqrt{n}$. If the test statistic falls into the region of acceptance of the Null hypothesis, that means $t_{\alpha/2} < t < t_{1-\alpha/2}$, H0 cannot be rejected. The power of the test depends on the sample size n. The bigger the sample size (the more information is available), the stronger the confidence of the test.
Rules of decision:
1. If t* < t(95), then a difference between m and K cannot be ascertained.
2. If t(95) ≤ t* < t(99), then there is probably a difference between m and K.
3. If t(99) ≤ t* < t(99,9), then a significant difference exists between m and K.
4. If t(99,9) < t*, then a highly significant difference exists between m and K.
The rules of decision can be adapted to all test procedures. A change of test procedure can lead to other (sharper) results of hypothesis testing.
3.4 Selected test procedures
t Test (Student Test)
Goal: Comparison of a sample average with a standard value.
Prerequisite: n, x*, s, $\mu_0$.
Test statistic: $t_{calc} = \frac{|x^* - \mu_0|}{s} \sqrt{n}$, where x* is the sample mean, $\mu_0$ the expectation value of the ensemble, s the standard deviation and n the sample size.
Decision: Acceptance if tcalc < ttab, otherwise rejection (cf. table 8).
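The test statistic of the t test and the four rules of decision of chap. 3.3 can be sketched in code. The example uses the dissolved oxygen values of the worked example in this chapter together with the critical values quoted there. The function names are hypothetical, and assigning the boundary case t = t(99,9) to the last rule is an assumption, because the rules leave it open.

```python
import math

def t_statistic(sample_mean, standard_value, s, n):
    """Test statistic t = |mean - standard value| / s * sqrt(n)."""
    return abs(sample_mean - standard_value) / s * math.sqrt(n)

def decide(t_calc, t95, t99, t999):
    """Apply the four rules of decision to a calculated test statistic."""
    if t_calc < t95:
        return "difference cannot be ascertained"
    if t_calc < t99:
        return "probable difference"
    if t_calc < t999:
        return "significant difference"
    return "highly significant difference"

# Critical values as quoted in the DO example of this chapter
T95, T99, T999 = 2.060, 2.787, 3.725

# x* = 5.9 mg/l, standard value 6 mg/l, s = 0.3, n = 25
t_calc = t_statistic(5.9, 6.0, 0.3, 25)
verdict = decide(t_calc, T95, T99, T999)
print(round(t_calc, 2), verdict)

# Same test with a sample mean of 5.8 mg/l
t_calc_low = t_statistic(5.8, 6.0, 0.3, 25)
verdict_low = decide(t_calc_low, T95, T99, T999)
print(round(t_calc_low, 2), verdict_low)
```

Keeping the critical values as explicit arguments mirrors the table look-up of the manual procedure. They could also be taken from a statistics library.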
Table 8: Table of t Test (according to Kaiser and Gottschalk 1974)
f = n-1   P(95)    P(99)    P(99,9)
1         12,71    63,66    636,62
2         4,30     9,92     31,60
3         3,18     5,84     12,92
4         2,78     4,60     8,61
5         2,57     4,03     6,86
6         2,45     3,71     5,96
7         2,37     3,50     5,41
8         2,31     3,36     5,04
9         2,26     3,25     4,78
10        2,23     3,17     4,59
12        2,18     3,06     4,32
14        2,15     2,98     4,14
16        2,12     2,92     4,02
18        2,10     2,88     3,92
20        2,08     2,85     3,85
50        2,01     2,68     3,50
100       1,98     2,63     3,39
200       1,97     2,60     3,34
300       1,97     2,59     3,33
500       1,96     2,59     3,31
700       1,96     2,58     3,30
∞         1,96     2,58     3,29
Example:
After waste water input into a river, DO measurements were carried out to check the water quality and to answer the question whether the river water fulfils the requirements for water quality class II according to the LAWA regulations: x* = 5,9 mg/l, $\mu_0$ = 6 mg/l, n = 25, s = 0,3.
Test statistic: $t_{calc} = \frac{|x^* - \mu_0|}{s} \sqrt{n}$, here tcalc = (|5,9 - 6|/0,3)·√25 = (0,1/0,3)·5 = 1,67.
Comparison of tcalc and ttab for f = n - 1 = 24: tcalc = 1,67; t(95) = 2,060; t(99) = 2,787; t(99,9) = 3,725.
Decision: If tcalc < ttab, then accept x*.
Interpretation: The sample average is smaller than the standard value of LAWA, so taken at face value the water quality standard of LAWA is not fulfilled. The t test shows, however, that the sample average does not differ significantly from the standard value. The sample average has to be accepted. Of course, the waste water input leads to a lower water quality, but there is no significant difference from the quality standard value. For x* = 5,8 mg/l (tcalc = 3,33) a significant difference would exist between the average and the standard.
Comparison of means
Test statistic: $t = \frac{|x^* - x^{**}|}{s_d} \sqrt{\frac{n^* n^{**}}{n^* + n^{**}}}$, where x* is the first sample mean, x** the second sample mean, s* the first standard deviation, s** the second standard deviation, n* the first sample size, n** the second sample size, with $f = n^* + n^{**} - 2$ degrees of freedom and $s_d = \sqrt{\frac{(n^* - 1) s^{*2} + (n^{**} - 1) s^{**2}}{n^* + n^{**} - 2}}$.
Decision: Acceptance if tcalc < ttab, otherwise rejection (cf. table 8).
Comparison of variances (F Test)
Goal: Evaluation of the standard deviations of two homogeneous data sets.
Prerequisite: n1, n2, s1, s2.
Test statistic: $F = (s^* / s^{**})^2 \geq 1$, where s* is the standard deviation of the first sample and s** is the standard deviation of the second sample; the larger standard deviation is placed in the numerator.
Decision: Acceptance if Fcalc < Ftab, otherwise rejection (cf. tables 9a to 9c).
Table 9a: Table of F Test (according to Kaiser and Gottschalk 1974) for P(95)
f2\f1   1        5        10       20       ∞
1       161,4    230,2    241,9    248,0    254,3
2       18,51    19,30    19,39    19,44    19,50
3       10,13    9,01     8,74     8,66     8,53
4       7,71     6,26     5,91     5,80     5,63
5       6,61     5,05     4,68     4,56     4,36
10      4,96     3,33     2,91     2,77     2,54
12      4,75     3,11     2,69     2,54     2,30
14      4,60     2,96     2,60     2,39     2,13
16      4,49     2,85     2,49     2,28     2,01
18      4,41     2,77     2,41     2,19     1,92
20      4,35     2,71     2,35     2,12     1,84
22      4,30     2,66     2,30     2,07     1,78
25      4,24     2,61     2,24     2,01     1,71
30      4,17     2,53     2,16     1,93     1,62
40      4,08     2,45     2,07     1,84     1,51
60      4,00     2,37     1,99     1,75     1,39
∞       3,84     2,21     1,83     1,57     1,00
Table 9b: Table of F Test (according to Kaiser and Gottschalk 1974) for P(99)
f2\f1   1        5        10       20       ∞
1       4052     5764     6056     6208     6366
2       98,49    99,25    99,40    99,45    99,50
3       34,12    28,71    27,23    26,69    26,12
4       21,20    15,98    14,54    14,02    13,46
5       16,26    11,39    10,04    9,55     9,02
10      10,04    5,64     4,85     4,41     3,91
12      9,33     5,06     4,30     3,86     3,36
14      8,86     4,70     3,94     3,51     3,00
16      8,53     4,44     3,69     3,25     2,75
18      8,29     4,25     3,51     3,08     2,57
20      8,10     4,10     3,37     2,94     2,42
22      7,94     3,99     3,25     2,83     2,31
25      7,77     3,86     3,13     2,70     2,17
30      7,56     3,70     2,97     2,55     2,01
40      7,31     3,51     2,80     2,37     1,80
60      7,08     3,34     2,63     2,20     1,60
∞       6,64     3,02     2,31     1,87     1,00
Table 9c: Table of F Test (according to Kaiser and Gottschalk 1974) for P(99,9)
f2\f1   1          5          10         20         ∞
1       4,1·10^5   5,8·10^5   6,1·10^5   6,2·10^5   6,4·10^5
2       998,2      999,3      999,4      999,5      999,5
3       167,5      134,6      129,2      126,5      123,5
4       74,14      51,71      48,05      46,16      44,05
5       47,04      29,75      26,91      25,40      23,78
10      21,04      10,48      8,75       7,80       6,76
12      18,64      8,89       7,28       6,40       5,42
14      17,14      7,95       6,40       5,55       4,60
16      16,12      7,27       5,81       4,99       4,06
18      15,38      6,81       5,38       4,59       3,67
20      14,82      6,46       5,07       4,28       3,38
22      14,38      6,19       4,82       4,05       3,15
25      13,89      5,89       4,56       3,80       2,90
30      13,29      5,53       4,23       3,49       2,59
40      12,61      5,13       3,87       3,14       2,23
60      11,97      4,76       3,53       2,81       1,90
∞       10,83      4,10       2,95       2,25       1,00
Example:
Two small data sets of BOD data are available from the laboratory analysis of water quality: x11 = 30,4 mg/l, x12 = 30,1 mg/l, x13 = 30,5 mg/l, x14 = 30,9 mg/l, x15 = 29,2 mg/l and x21 = 30,5 mg/l, x22 = 30,3 mg/l, x23 = 30,5 mg/l, x24 = 30,4 mg/l, x25 = 30,2 mg/l, x26 = 30,8 mg/l, x27 = 30,1 mg/l. The sample statistics are x* = 30,2; s* = 0,638; n* = 5; x** = 30,4; s** = 0,231; n** = 7.
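The sample statistics of both BOD data sets and the variance ratio used in the F test can be recomputed from the raw values. The sketch below relies on Python's standard `statistics` module. The helper name `summary` is hypothetical, and the degrees of freedom are taken as sample size minus one for each data set.

```python
import statistics

# BOD data (mg/l) of the two data sets quoted in the example
set1 = [30.4, 30.1, 30.5, 30.9, 29.2]
set2 = [30.5, 30.3, 30.5, 30.4, 30.2, 30.8, 30.1]

def summary(sample):
    """Return mean, empirical standard deviation (divisor n - 1) and size."""
    return statistics.mean(sample), statistics.stdev(sample), len(sample)

mean1, s1, n1 = summary(set1)
mean2, s2, n2 = summary(set2)

# Variance ratio with the larger standard deviation in the numerator, so F >= 1
if s1 >= s2:
    f_calc, df_num, df_den = (s1 / s2) ** 2, n1 - 1, n2 - 1
else:
    f_calc, df_num, df_den = (s2 / s1) ** 2, n2 - 1, n1 - 1

print(round(mean1, 2), round(s1, 3), round(mean2, 2), round(s2, 3))
print(round(f_calc, 2), df_num, df_den)
```

The ratio computed from the unrounded standard deviations differs slightly from a hand calculation with the rounded values 0,638 and 0,231.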
Test statistic: $F = (s^* / s^{**})^2 \geq 1$, here F = (0,638/0,231)^2 = 7,6281 ≥ 1.
Comparison of Fcalc and Ftab for f* = n* - 1 = 4 and f** = n** - 1 = 6: Fcalc = 7,6281; F(95) = 4,53; F(99) = 9,15; F(99,9) = 21,92.
Decision: If Fcalc < Ftab: No difference between standard deviations. Here Fcalc exceeds F(95) but stays below F(99) and F(99,9), so a difference cannot be ascertained at the 99% level.
Interpretation: Both data sets can be combined.
Outlier Test (NALIMOV-Test)
Goal: Detection of outliers within a data set, testing the homogeneity of the data set.
Prerequisite: Data set with n > 3, x*, s.
Test statistic: $r = \frac{|x^+ - x^*|}{s} \sqrt{\frac{n}{n - 1}}$, where x+ is the value suspected to be an outlier, x* is the mean of the sample, s is the standard deviation of the sample, and n is the sample size. Choice of significance level $\alpha = 0.05$, degrees of freedom f = n - 2.
Decision: Acceptance if rcalc < rtab, otherwise rejection (cf. table 10).
Table 10: Table of r test (according to Kaiser and Gottschalk 1974)
f = n-2   P(95)    P(99)    P(99,9)
1         1,409    1,414    1,414
2         1,645    1,715    1,730
3         1,757    1,918    1,982
4         1,814    2,051    2,178
5         1,848    2,142    2,329
6         1,870    2,208    2,447
7         1,885    2,256    2,540
8         1,895    2,294    2,616
9         1,903    2,324    2,678
10        1,910    2,348    2,730
12        1,920    2,385    2,812
14        1,926    2,412    2,874
16        1,931    2,432    2,291
18        1,935    2,447    2,205
20        1,937    2,460    2,990
50        1,951    2,529    3,166
100       1,956    2,553    3,227
200       1,958    2,564    3,265
300       1,958    2,566    3,271
500       1,959    2,570    3,279
700       1,959    2,572    3,283
∞         1,960    2,576    3,291
Example: A small data set of BOD data is available from the laboratory analysis of water quality: x1 = 30,4 mg/l, x2 = 30,1 mg/l, x3 = 30,5 mg/l, x4 = 30,9 mg/l, x5 = 29,2 mg/l. The last value is suspected to be an outlier. That would mean the data set is inhomogeneous. x* = 30,2; s = 0,638; n = 5.
Test statistic: r = (|29,2 - 30,2|/0,638)·√(5/(5-1)) = (1,0/0,638)·√(5/4) = 1,567·1,118 = 1,752.
Comparison of rcalc and rtab for f = n - 2 = 3: rcalc = 1,752; r(95) = 1,757; r(99) = 1,918; r(99,9) = 1,982.
Decision: If rcalc < rtab, then accept x5: 1,752 < 1,757.
Result and interpretation: The value x5 is not an outlier and belongs to the data set. The data set itself seems to be homogeneous. If a value has been identified as an outlier, the average and the variance have to be re-calculated without it and the test has to be repeated.
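The outlier test of the example above can be reproduced with a short script. This is a sketch under the assumption that the test statistic is r = |x+ - x*|/s · √(n/(n-1)). The function name `nalimov_r` is hypothetical, and the critical value 1.757 is the r(95) value quoted in the example.

```python
import math
import statistics

def nalimov_r(suspect, mean, s, n):
    """Test statistic r = |x+ - mean| / s * sqrt(n / (n - 1))."""
    return abs(suspect - mean) / s * math.sqrt(n / (n - 1))

bod = [30.4, 30.1, 30.5, 30.9, 29.2]  # BOD data (mg/l) of the example
R95 = 1.757                           # r(95) for f = n - 2 = 3, as quoted

# With the rounded sample statistics used in the example
r_rounded = nalimov_r(29.2, 30.2, 0.638, len(bod))
outlier_rounded = r_rounded >= R95

# With mean and standard deviation computed from the raw data
r_exact = nalimov_r(29.2, statistics.mean(bod), statistics.stdev(bod), len(bod))
outlier_exact = r_exact >= R95

print(round(r_rounded, 3), outlier_rounded)
print(round(r_exact, 3), outlier_exact)
```

With the rounded statistics the value lies just below r(95). With the unrounded mean of 30.22 mg/l the statistic rises to about 1.79, so the decision for this borderline value is sensitive to rounding.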
4. Linear regression and correlation analysis

A regression analysis is required for problems in which stochastic dependencies (stochastic cause-effect relationships) have to be described by functions of one or several variables. Linear regression analysis is one of the best studied statistical methods. The goal of a simple or multiple linear regression analysis is to determine a linear relationship between two or more measurable (or observable) variables or characteristics X and Y of a hydrological system. A sample of size n consists of n pairs of data (x1, y1), (x2, y2), ..., (xn, yn) (or n-tuples of data) which can be considered as realisations of a two-dimensional (or n-dimensional) random vector (X, Y).
4.1 Steps of linear regression
1. Step: Scatter-plot of the variables of interest (fig. 10).
[Plot: pH (7,6 to 9,0) against Temp (0 to 30)]
Figure 10: Scatterplot of hydrological variables
2. Step: Estimate the relationship (positive or negative) between the variables.
Directions of relationships:
1. Positive relationship: increasing values of X and increasing values of Y.
2. Negative relationship: increasing values of X and decreasing values of Y.
3. No relationship between X and Y (e. g. parallels to the axes).
The relationships can be strong or weak.
3. Step: Formulate the (linear) model equation (fig. 11): pH = 7.868 + 0.025 Temp
[Plot: linear regression between Temp and pH, observed values and fitted line]
Figure 11: Linear relationship between variables
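The coefficients of a model equation of this type are usually estimated by the method of least squares. The following sketch shows the calculation for the simple linear model y = a + bx. The function name `fit_line` and the temperature and pH values are hypothetical and are not the data behind figure 11.

```python
def fit_line(x, y):
    """Least-squares estimates a (intercept) and b (slope) of y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    b = sxy / sxx            # slope
    a = y_mean - b * x_mean  # intercept: the line passes through the means
    return a, b

# Hypothetical paired observations of water temperature and pH
temp = [0.0, 5.0, 10.0, 15.0, 20.0]
ph = [7.9, 8.0, 8.1, 8.3, 8.4]

a, b = fit_line(temp, ph)
print(round(a, 3), round(b, 4))
```

The fitted line always passes through the point of the two sample means, which is why the intercept follows directly from the slope.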
4.2 Confidence region of regression line
4. Step: Calculate the confidence region of the regression line.
The general model of linear regression is given by y = a + bx. Using the confidence intervals of a and b, a confidence region of the (mean) linear model EY = a + bx can be defined by $g_u < EY < g_o$, where $g_u = y^* - s_{y^*} t$ and $g_o = y^* + s_{y^*} t$. The limits of confidence are hyperbolas lying symmetrically around the linear regression model y* = a* + b*x. The band is narrowest at x = x* and widens as x moves away from x*, so the confidence statements become less precise towards the ends of the data range. The width L of the confidence band depends on $s_{y^*}$ and can be calculated by $L = 2 s_{y^*} t$.
4.3 The power of linear regression
The strength of a relationship is expressed by the empirical (linear) correlation coefficient $r = \frac{\sum (x_i - x^*)(y_i - y^*)}{\sqrt{\sum (x_i - x^*)^2 \sum (y_i - y^*)^2}}$. By means of this formula (explanation see chapter 4.4) the next step of the linear regression procedure is derived.
5. Step: Calculate the power of the relationship: r = 0.493 or $B = r^2$ = 0.243.
The calculation algorithm is presented in chapter 4.4. To derive statistical characteristics of a linear regression model the following cases should be distinguished:
1. b high, r high, s low,
2. b high, r low, s high,
3. b low, r low, s low,
4. b low, r very low, s high.
4.4 Empirical covariance and statistical measures of correlation
A correlation analysis answers the question about the strength and direction of a linear (but not strictly functional) relationship between two or more variables. The power or intensity of such a relationship is expressed by correlation. Measures of correlation are the correlation coefficient r, the performance index $B = r^2$ or the partial correlation coefficient $r_{xy,z}$. Combining data series of different water quantity or water quality variables which refer to two or more measurable characteristics yields sets of pairs of data (x1, y1), (x2, y2), ..., (xn, yn) or n-tuples of data (fig. 12).
Figure 12: Scatterplot of a bivariate relationship.
These sets of data can be seen as realisations of a two- or multi-dimensional stochastic vector (X, Y, ...). A normal probability distribution of the data pairs or data tuples is a (strong) prerequisite.
A visualisation of a relationship between three variables is possible but in some cases not really helpful. The information content is high but cannot be extracted very clearly (fig. 13).
Figure 13: 3-D scatterplot of variables
Such relationships are characterised by statistical measures which are denoted as correlation measures. In principle, arithmetic means and empirical variances of the data series are used: $x^* = \frac{1}{n} \sum x_i$ and $y^* = \frac{1}{n} \sum y_i$, $s_x^2 = \frac{1}{n-1} \sum (x_i - x^*)^2$ and $s_y^2 = \frac{1}{n-1} \sum (y_i - y^*)^2$. A new data series with n pairs of data $(x_i, y_i)$, i = 1, ..., n is formed by two variables {X} and {Y}. The empirical covariance $s_{xy}$ is calculated as follows:
$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - x^*)(y_i - y^*) = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i y_i - n x^* y^* \right)$.
$s_{xy}$ can be positive or negative. For small values of $x_i$ the difference $x_i - x^*$ is negative; for big values of $x_i$ it is positive. The same holds for the data $y_i$. For this reason, a negative covariance characterises a relationship in which big values $x_i$ are mostly connected with small values $y_i$ and vice versa. By normalising $s_{xy}$ with the empirical standard deviations $s_x$ and $s_y$ one gets the empirical coefficient of correlation $r_{xy}$:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$.
Because of $s_{xy} = s_{yx}$, also $r_{xy} = r_{yx}$ is valid. $r_{xy}$ is a measure of the strength and direction of a linear relationship between the hydrological variables X and Y. Statistical measures of correlation between two or more hydrological variables are mainly based on the assumption that the data sets are subsets of Gaussian distributed data sets. The rank correlation procedure works without assuming a normal probability distribution of the data set to be analysed.
Empirical bivariate correlation coefficient: $r = \frac{\sum (x_i - x^*)(y_i - y^*)}{\sqrt{\sum (x_i - x^*)^2 \sum (y_i - y^*)^2}}$
Performance index (coefficient of determination): $B = r^2$
Partial correlation coefficients:
$r_{xy,z} = \frac{r_{xy} - r_{xz} r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}$
$r_{xz,y} = \frac{r_{xz} - r_{xy} r_{yz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{yz}^2)}}$
$r_{yz,x} = \frac{r_{yz} - r_{xy} r_{xz}}{\sqrt{(1 - r_{xy}^2)(1 - r_{xz}^2)}}$
Multiple correlation coefficient for three variables x, y, z with x = f(y, z): $R_{x,yz} = \sqrt{\frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}}$
Multiple performance index: $B_{x.yz} = \frac{r_{xy}^2 + r_{xz}^2 - 2 r_{xy} r_{xz} r_{yz}}{1 - r_{yz}^2}$
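The correlation measures listed above translate directly into code. The following sketch assumes the standard forms of the formulas (with the square roots in the denominators). All function names, the paired data and the correlation values passed to the partial and multiple measures are hypothetical.

```python
import math

def pearson_r(x, y):
    """Empirical bivariate correlation coefficient r."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Partial correlation of x and y with the influence of z removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

def multiple_performance_index(r_xy, r_xz, r_yz):
    """Multiple performance index B for x = f(y, z)."""
    return (r_xy ** 2 + r_xz ** 2 - 2 * r_xy * r_xz * r_yz) / (1 - r_yz ** 2)

# Hypothetical paired data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = pearson_r(x, y)
B = r ** 2
print(round(r, 4), round(B, 4))

# Hypothetical pairwise correlations between three variables
print(round(partial_r(0.6, 0.5, 0.4), 4))
print(round(multiple_performance_index(0.6, 0.5, 0.0), 4))
```

The multiple correlation coefficient is the square root of the multiple performance index.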
SPEARMAN's rank correlation (valid for small sample sizes; a normal probability distribution is not necessary):
$r_S = 1 - \frac{6 \sum_{i=1}^{n} (R(x_i) - R(y_i))^2}{n (n^2 - 1)} = 1 - \frac{6 \sum_{i=1}^{n} D_i^2}{n (n^2 - 1)}$
Table 11 contains data and an explanation of the ranking procedure for a SPEARMAN test.
Table 11: Data and procedure of rank correlation
xi     R(xi)   yi    R(yi)   Di     Di2
0,5    5,5     4     3       2,5    6,25
0,8    7,5     6     5       2,5    6,25
1,1    10      2     1       9      81
0,5    5,5     10    8       -2,5   6,25
0,4    4       8     6       -2     4
0,3    2       12    10      -8     64
0,9    9       5     4       5      25
0,8    7,5     3     2       5,5    30,25
0,3    2       9     7       -5     25
0,3    2       11    9       -7     49
Sum                                 297
Result: rS = -0,8.
Comparison of rS and rStab (positive values only): For n ≤ 30 the table of probability values of rS has to be used. For n > 30 the table of the standardised normal probability distribution should be used. Tabulated values: rStab(95) = 0,5515; rStab(99) = 0,7333; rStab(99,9) = 0,8667.
Decision: If |rS| ≥ rStab, the hypothesis of no correlation is rejected. In the example |rS| = 0,8 exceeds rStab at the 95% and 99% levels, but not at the 99,9% level.
Result and interpretation: A relatively strong negative correlation exists between both data sets.
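The ranking procedure of table 11 and the resulting coefficient can be checked with a short script. The sketch assigns tied values the mean of the rank positions they occupy, which reproduces the ranks of the table. The function names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values tied with the value at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rs(x, y):
    """r_S = 1 - 6 * sum(D_i^2) / (n * (n^2 - 1)) with D_i = R(x_i) - R(y_i)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Data of table 11
x = [0.5, 0.8, 1.1, 0.5, 0.4, 0.3, 0.9, 0.8, 0.3, 0.3]
y = [4, 6, 2, 10, 8, 12, 5, 3, 9, 11]

rank_x = average_ranks(x)
rs = spearman_rs(x, y)
print(rank_x)
print(round(rs, 3))
```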
5. Nonlinear regression analysis
If a linear regression model is not valid or insufficient, other regression models should be tested. From this statement the following step of the (linear) regression procedure is derived:
6. Step: Find other model types if the linear model is insufficient (fig. 14).
Figure 14 contains some standard nonlinear regression models computed by means of SPSS. The results are presented in table 12.
[Plot: pH (7,6 to 9,0) against Temp (0 to 30); observed values with fitted curves: linear, logarithmic, inverse, squared, cubic, composed, power, S-shaped, growth, exponential, logistic]
Figure 14: Linear and nonlinear regression curves
Table 12: Results of nonlinear regression models
Model   B       b0       b1        b2        b3
LIN     0.243   7.8683   0.0252
LOG     0.198   7.6798   0.2194
INV     0.158   8.3568   -1.2875
QUA     0.308   8.1097   -0.0310   0.0022
CUB     0.432   7.4486   0.2131    -0.0191   0.0005
COM     0.238   7.8731   1.0030
POW     0.194   7.6972   0.0262
S       0.156   2.1219   -0.1544
GRO     0.238   2.0635   0.0030
EXP     0.238   7.8731   0.0030
LGS     0.238   0.1270   0.9970
When comparing the performance indexes of these standard models, the best statistical model is the cubic one. But this model represents the data cloud by 43.2% only; the remaining 56.8% are not described by the model. As an overall outcome of this analysis, all of these models should be rejected and other types of nonlinear models should be investigated.
5.1 Polynomial regression
The basic model is given by $y = a_0 + \sum_{i=1}^{n} a_i x^i$, where n is called the order of the polynomial. Figure 15 shows polynomials of different order. Each of the polynomials represents the given data set with a relatively high degree of performance. For the 6th and 7th order polynomials the performance is B = 1.
Figure 15: Examples of polynomial regression
Comparing the graphs allows different interpretations. For the polynomial of 7th order the graph indicates negative values which do not exist. The advantage of polynomial regression is that it provides an algorithm for calculating the existing nonlinear relationship between hydrological variables. Disadvantages are the high number of coefficients and results that are sometimes physically unrealistic. The best models are not the ones whose graphs pass through all data points. Other model types used in water quality management are multiple linear or nonlinear regression models (e. g. DO(t) = a0 + a1TW + a2Q + a3BSB or DO(t) = a0+ a1TW + a2Q + a3BSB + a4TW + a5Q + a6BSB + a7TW) or models derived from control theory (e. g. stochastic transfer method). A continuous dynamic process is described by a time discrete model by applying the z-transformation to a difference equation, G(z) = B(z-1)/A(z-1) +(z)
5.2 Periodic regression
The basic relationship is given by $y = a + b_1 \sin x + b_2 \cos x$. The equation represents the simplest form of periodic regression, the so-called Fourier polynomial. In an extended form this method is called Fourier analysis (see chapter 7). Figure 16 presents the water temperature of a reservoir at three depth levels (0 m, 10 m, 25 m) together with the approximating graphs.
Figure 16: Periodic regression of water temperature in a reservoir
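Because the periodic model y = a + b1 sin x + b2 cos x is linear in its coefficients, these can be estimated by least squares via the normal equations. The sketch below is one possible implementation in plain Python. The function names and the synthetic monthly data are assumptions, and x is taken as the phase angle 2πt/T of a cycle with known period T.

```python
import math

def solve_linear_system(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (m[i][n] - tail) / m[i][i]
    return solution

def fit_periodic(x, y):
    """Least-squares estimates of a, b1, b2 in y = a + b1 sin x + b2 cos x."""
    basis = [[1.0, math.sin(v), math.cos(v)] for v in x]
    normal = [[sum(row[i] * row[j] for row in basis) for j in range(3)]
              for i in range(3)]
    rhs = [sum(row[i] * yi for row, yi in zip(basis, y)) for i in range(3)]
    return solve_linear_system(normal, rhs)

# Synthetic example: 12 monthly values over one annual cycle (T = 12)
x = [2.0 * math.pi * month / 12.0 for month in range(12)]
y = [10.0 + 6.0 * math.sin(v) - 3.0 * math.cos(v) for v in x]

a, b1, b2 = fit_periodic(x, y)
print(round(a, 3), round(b1, 3), round(b2, 3))
```

The period T has to be fixed beforehand, which reflects the restriction of periodic regression to fixed cycling periods.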
It can clearly be seen that water temperature (and all other cyclic hydrological variables) can be approximated very well by periodic functions. The advantage of this family of regression functions is the visualisation of a cyclic process; the disadvantage is that the functions are valid for fixed cycling periods only.
5.3 Trend functions
Medium-term and long-term temporal and spatial developments (trends) of hydrological variables can be estimated by simple, explicitly given functions. Parameter estimation is done by the method of least squares. Figure 17 shows the development of BOD along a river stretch, which follows a polynomial of 2nd order.
[Plot: BOD (mg/l, 0,0 to 2,5) at the sampling points TeK0030, SPK0010, SPK0020, Hv0190, Hv0200; trend polynomial y = 0,0908x^2 - 0,5374x + 2,6386, R^2 = 0,9501]
Figure 17: Polynomial trend function for BOD in a river
Other examples of linear and nonlinear trend functions are presented in figures 18 to 20.
[Plot: o-PO4-P (mg/l, 0,00 to 0,20) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190, Hv0200; linear trend y = 0,0166x + 0,0854, R^2 = 0,8938]
Figure 18: Linear trend of phosphate phosphorus in a channel
The linear function (also denoted as a polynomial of 1st order) follows the increasing trend of the phosphate phosphorus load due to waste water input in a low flow channel with acceptable accuracy. The deviations of the regression line from the measurements are small. For the same river stretch, the approximating 2nd order polynomial of water flow (fig. 19) shows stronger deviations after the confluence of the main river with a channel. The reasons are changing hydraulic conditions and increasing values of water flow. The stationary or uniform flow conditions of the first part of the water body are now disturbed. Judged by the performance index the graph should be acceptable. But the regression model is not able to compensate for the positive jump in water flow because it works with fixed parameters (coefficients). Therefore, another regression model should be used.
[Plot: flow (m3/s, 0 to 70) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190, Hv0200; trend polynomial y = 3,3914x^2 - 18,053x + 51,117, R^2 = 0,809]
Figure 19: Quadratic trend function of water flow
On the other hand, for the same river stretch the trend of chlorophyll-a is expressed by a 2nd order polynomial again (fig. 20).
[Plot: Chlorophyll-a (µg/l, 0 to 80) at the sampling points 25014, TeK0030, SPK0010, SPK0020, Hv0190, Hv0200; trend polynomial y = 1,4627x^2 - 6,4221x + 67,115, R^2 = 0,6459]
Figure 20: Quadratic trend function of chlorophyll-a
The performance index is lower than that for water flow in fig. 19 because of some disturbances caused by hydrophysical phenomena. But the trend follows the computed polynomial. Taking into account the variations in the chlorophyll measurements, the trend polynomial is quite acceptable. Table 13 gives a survey of the trend functions used to estimate the development of water quality in a river. All polynomials are of 2nd order. The signs in the last column indicate significance on a 95% probability level.
Table 13: Trend functions of water quality in the River Havel
Water quality indicator    Trend         R
Water flow                 polynomial    0,8126
Temperature                polynomial    0,6177
Conductivity               polynomial    0,1971
Chloride                   polynomial    0,0382
DO                         polynomial    0,3858
BOD                        polynomial    0,4264
CSV                        polynomial    0,7611
NH4-N                      exponential   0,5669
NO2-N                      exponential   0,4879
NO3-N                      exponential   0,4746
O-PO4-P                    exponential   0,8683
TP                         polynomial    0,0822
SiO2                       polynomial    0,8888
Suspended matter           polynomial    0,0227
Chlorophyll-a              polynomial    0,6032
Inorg. part of biomass     polynomial    0,6742
Loss of org. matter        polynomial    0,1418
P (95%) + + + + + + + + + + + + -
As can be seen from table 13, polynomial and exponential trend functions are sufficient to describe the changing water quality mathematically. Interpretations of trend functions can be given as follows:
Linear trend: $y(t) = a_0(t) + a_1(t)\, x(t)$. (Interpretation of parameters: $a_0$ mean initial value, $a_1$ mean rate of change)
Squared trend: $y(t) = a_0(t) + a_1(t)\, x(t) + a_2(t)\, x^2(t)$. (Interpretation of parameters: $a_0$ mean initial value, $a_1$ mean rate of change, $a_2$ mean process acceleration)
Polynomial trend: $y(t) = a_0(t) + a_1(t)\, x(t) + a_2(t)\, x^2(t) + \ldots + a_n(t)\, x^n(t)$. (Interpretation of parameters is mostly impossible.)
Exponential trend: $x(t) = x(0)\, e^{-kt} + E$. (Interpretation according to 1st order kinetics: x(0) initial concentration value, k rate of change, E random quota.)
5.4 Comparison of regression functions
Different nonlinear models can be applied to describe one and the same data set.
Figure 21: Comparison of different regression functions for the same data set
By comparing the initial and the final reach of the regression functions the best functional relationship is selected (fig. 21). The linear model also seems to be suitable. As can be seen in part H, the middle range of all computed models shows very small variations, while the initial and the final parts of the graphs show a spreading of the curves. The quality of fit can be evaluated by the following measures, where $\hat{y}_i$ denotes the value computed by the model and $y^*$ the mean of the observations:
Linear coefficient of determination (performance index): $R^2 = B = \frac{\sum (\hat{y}_i - y^*)^2}{\sum (y_i - y^*)^2}$,
Nonlinear performance index: $B_{nl} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{(n - 1)\, s_y^2}$,
Residual sum of squares: $S_R = \sum (y_i - \hat{y}_i)^2$, or
Residual dispersion: $s^2 = S_R / (n - m - 1)$ (n number of data, m number of parameters).
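The measures of fit listed above can be evaluated for any regression function once the model values are available. The following sketch assumes the usual reading of the formulas, with the residuals taken between observed and computed values. The function name and the example numbers are hypothetical.

```python
def fit_quality(observed, computed, m):
    """Return residual sum of squares, nonlinear performance index and
    residual dispersion for a model with m parameters."""
    n = len(observed)
    mean_obs = sum(observed) / n
    # residual sum of squares S_R
    s_r = sum((yo - yc) ** 2 for yo, yc in zip(observed, computed))
    # total sum of squares, equal to (n - 1) * s_y^2
    s_total = sum((yo - mean_obs) ** 2 for yo in observed)
    b_nl = 1.0 - s_r / s_total
    residual_dispersion = s_r / (n - m - 1)
    return s_r, b_nl, residual_dispersion

# Hypothetical observations and values computed by a one-parameter model
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
computed = [1.1, 1.9, 3.2, 3.9, 4.9]

s_r, b_nl, s2 = fit_quality(observed, computed, m=1)
print(round(s_r, 4), round(b_nl, 4), round(s2, 4))
```

For a linear model fitted by least squares the nonlinear performance index coincides with the linear coefficient of determination. For other model types the two measures can differ.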
6. Time series analysis
The distinction between discrete and continuous variables is not a clear dichotomy because continuous processes (seen from a physical point of view of understanding nature) will be observed at discrete time events. Therefore, mostly random variables are observed. 6.1 Dynamic behaviour of time series Freshwater ecosystems may be seen as switching networks where inputs are transformed into outputs by an operator which describes the transient behaviour of ecological processes (fig. 22). The overall operator transforms input signals into output signals: y(t) = x(t) where the signals will be smoothed (damped), and there exists some redundancy between input and output signals.
[Diagram: input signal x(t), transfer operator, output signal y(t)]
Figure 22: Schematic diagram of a transfer process
Therefore, water related processes are represented by time varying signals. In figure 23, NO3-N raw data are described by a polynomial trend as follows: NO3-N(t) = 1,8987 - 0,0754 t + 0,0028 t^2 - 0,00003 t^3. An exact mathematical (or functional) description of the random fluctuations is not possible. The function describes more or less the mean behaviour of the process.
Figure 23: Approximation of a time varying process by a function
6.2 Description of time series in time and in frequency domain
Hydrological systems can be seen as stochastic transfer systems described by system state variables and parameters. They are characterised by measurable inputs, non-measurable (stochastic) disturbances and measuring errors. Disturbances, input signals and measurement errors are superimposed and produce the output signals. In the time domain, hydrological time series are described mathematically by time domain functions (cf. transfer functions, pp. 8 and 9). In the frequency domain, hydrological time series are represented by Fourier transforms of correlation functions, by coherency functions as well as by wavelets.
6.3 Stationary processes
Because of time lags between input and output processes, a stationary state is reached only when all transient processes have decayed. Therefore, only some statistical characteristics of the signals can be captured. If the statistical characteristics do not change in time, the processes are called stationary processes. Process averages and dispersions then hardly change in time. Therefore, stationary random processes can be investigated on different time intervals within $-\infty < t < +\infty$. Statistical characteristics of stationary random processes can be expressed by
1. the probability density function p(x) of signals X(t),
2. the auto-correlation function $\varphi_{xx}(\tau)$,
3. the spectral power density function $S_{xx}(\omega)$.
A time varying process is expressed by a stochastic signal X(t). For each time step $t_n$ one measured value $X_n(t)$ is obtained. When the process is described by an analytical (deterministic) function f(t), the time behaviour can be predicted completely. For a stochastic signal the further development of the process can be predicted only for a short time interval, and only some statistical statements on the future development of the process X(t) can be given: $\mathrm{Prob}(X(t_{n+1}) \leq x) = P(x)$,
or $\mathrm{Prob}(a < X(t) \leq b) = \int_a^b p(x)\,dx$. The Gaussian distribution with a bell-shaped density is one of the most important probability density distributions, where $p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x - x^*)^2}{2\sigma^2}\right)$. Important expectations are the linear average $E(x) = \int x\, p(x)\,dx$ and the squared average $E(x^2) = \int x^2 p(x)\,dx$.
6.4 Correlation and spectral functions
The probability density function gives information about the probability that the amplitude of the process X(t) at time t lies between x and $x + \Delta x$: $\mathrm{Prob}(x < X(t) \leq x + \Delta x) \approx p(x)\,\Delta x$. No statements are made about changes of X(t) over time. It cannot be seen whether a process contains lower and/or higher frequencies. Therefore, multiple probability distribution functions are necessary to describe the time varying process behaviour. The probability that X(t) lies between $x_1$ and $x_1 + \Delta x_1$ at time $t = t_1$ and between $x_2$ and $x_2 + \Delta x_2$ at time $t = t_2 = t_1 + \tau$ (after $\tau$ time units) is approximately given by $\mathrm{Prob}(x_1 < X(t_1) \leq x_1 + \Delta x_1,\; x_2 < X(t_2) \leq x_2 + \Delta x_2) \approx p(x_1, x_2)\,\Delta x_1 \Delta x_2$. The parameter $\tau$ gives information on the statistical coupling of the data $x(t_1)$ and $x(t_1 + \tau)$.
The auto-correlation function (ACF) gives information on the inner correlation between data separated by the distance $\tau$ on the time axis: $\varphi_{xx}(\tau) = \iint x(t)\, x(t + \tau)\, p[x(t), x(t + \tau)]\, dx(t)\, dx(t + \tau)$.
The cross-correlation function (CCF) gives information on the statistical correlation of two different processes X(t) and Y(t): $\varphi_{xy}(\tau) = \iint x(t)\, y(t + \tau)\, p[x(t), y(t + \tau)]\, dx(t)\, dy(t + \tau)$.
By transforming the time correlation functions into the frequency domain one gets the auto-power spectrum or the cross-power spectrum. The auto-power spectrum $S_{xx}(\omega)$ of x(t) is the Fourier transform of the ACF: $S_{xx}(\omega) = S_{xx}(-\omega) = \frac{1}{2\pi} \int_{-\infty}^{+\infty} \varphi_{xx}(\tau)\, e^{-j\omega\tau}\, d\tau$.
The auto-power spectrum of an ecological process or signal is visualised by a periodogram. It represents the dominant frequency of the process. It gives the spectrum of a stationary signal, which is a distribution of the variance of the signal as a function of frequency. The frequency components that account for the largest share of the variance are revealed. Each peak represents the part of the variance of the signal that is due to a cycle of a different period or length. Significant periodicity in the signal induces a sharp peak in the periodogram. The auto-covariance function is the time domain counterpart of the periodogram. The periodogram of water temperature (figure 24) shows a single distinct peak which indicates the major cyclic behaviour. The low frequency component is responsible for the general tendency of the indicator.
Figure 24: Periodogram of water temperature of the Lower Havel River
Figure 25: Periodogram of pH
The periodogram of pH in figure 25 shows that the highest variance is displayed by a low frequency. Small fluctuations are not dominant and can be neglected.
Only long term changes are responsible for the overall observed behaviour of the indicator. The periodogram of pH is similar to that of dissolved oxygen presented in figure 26. High variances at low frequencies are observed. This means that the general tendency of this indicator is determined by long term changes.
Figure 26: Periodogram of dissolved oxygen.
For the indicator of phytoplankton biomass the periodogram is shown in figure 27. It shows low frequency components, which exhibit the highest variances, and some small fluctuations at higher frequencies. The low frequency components determine the long term behaviour of the indicator. Two distinct peaks reveal two cycles of different periods and amplitudes.
Figure 27: Periodogram of chlorophyll-a
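The relation between the auto-correlation function and the periodogram can be tried out on a small synthetic series. The following is a minimal sketch in plain Python, not the procedure used for figures 24 to 27. The test series, the function names `autocorrelation` and `periodogram` and the scaling of the periodogram ordinates are assumptions chosen for illustration.

```python
import math

def autocorrelation(x, max_lag):
    """Sample auto-correlation function for lags 0..max_lag."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    c0 = sum(v * v for v in d) / n
    return [sum(d[t] * d[t + k] for t in range(n - k)) / (n * c0)
            for k in range(max_lag + 1)]

def periodogram(x):
    """Periodogram ordinates at the frequencies k/n, k = 1..n//2."""
    n = len(x)
    mean = sum(x) / n
    d = [v - mean for v in x]
    ordinates = []
    for k in range(1, n // 2 + 1):
        re = sum(d[t] * math.cos(2 * math.pi * k * t / n) for t in range(n))
        im = sum(d[t] * math.sin(2 * math.pi * k * t / n) for t in range(n))
        ordinates.append((k / n, (re * re + im * im) / n))
    return ordinates

# Hypothetical series: a strong cycle of 120 time steps plus a weak cycle of 15 steps
n, period = 360, 120
x = [10 + 8 * math.cos(2 * math.pi * t / period) + math.sin(2 * math.pi * t / 15)
     for t in range(n)]

acf = autocorrelation(x, period)
peak_freq, peak_power = max(periodogram(x), key=lambda p: p[1])
print("dominant period:", round(1 / peak_freq))
print("ACF at half and full period:", round(acf[60], 2), round(acf[120], 2))
```

The largest periodogram ordinate marks the cycle that carries most of the variance, while the ACF changes sign at half the dominant period and is positive again at the full period.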
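The cross-power spectrum $S_{xy}(\omega)$ of two stochastic ecological processes $x(t)$ and $y(t)$ is the Fourier transform of the CCF:

$S_{xy}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{xy}(\tau)\,e^{-j\omega\tau}\,d\tau$.

It is a complex function. The coherency function $Co_{xy}(\omega)$ is a measure of the synchronicity of two signals. It is calculated on the basis of the periodograms of both signals by

$Co_{xy}(\omega) = \frac{|S_{xy}(\omega)|^2}{S_{xx}(\omega)\,S_{yy}(\omega)}$,

where $|S_{xy}(\omega)| = \sqrt{\mathrm{Re}(S_{xy}(\omega))^2 + \mathrm{Im}(S_{xy}(\omega))^2}$, and the phase shift between both signals is given by $\theta(\omega) = \arctan\left(\mathrm{Im}(S_{xy}(\omega)) / \mathrm{Re}(S_{xy}(\omega))\right)$. The limitation of the CCF to a finite range of lags is taken into account by what is called a window function $h(\tau)$:

$\tilde{S}_{xy}(\omega) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \varphi_{xy}(\tau)\,h(\tau)\,e^{-j\omega\tau}\,d\tau = \int_{-\infty}^{\infty} S_{xy}(\nu)\,H(\omega - \nu)\,d\nu$,

where $H(\omega)$ is the Fourier transform of $h(\tau)$ (in the same convention as above), which distorts $S_{xy}(\omega)$ to $\tilde{S}_{xy}(\omega)$.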
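As an illustration of the coherency function and the phase shift, the sketch below estimates both from periodograms averaged over non-overlapping segments. Averaging is needed because a coherency computed from a single raw periodogram pair equals 1 at every frequency. The two test signals, the segment length, the function names and the use of `atan2` (which resolves the quadrant that the plain arc tangent leaves open) are assumptions made for this example only.

```python
import cmath
import math
import random

def dft(segment):
    """Discrete Fourier transform of a mean-removed segment for k = 0..n//2."""
    n = len(segment)
    mean = sum(segment) / n
    return [sum((segment[t] - mean) * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n))
            for k in range(n // 2 + 1)]

def coherency_and_phase(x, y, seg_len):
    """Estimate the coherency and the phase shift from segment-averaged periodograms."""
    n_freq = seg_len // 2 + 1
    sxx = [0.0] * n_freq
    syy = [0.0] * n_freq
    sxy = [0j] * n_freq
    for start in range(0, len(x) - seg_len + 1, seg_len):
        fx = dft(x[start:start + seg_len])
        fy = dft(y[start:start + seg_len])
        for k in range(n_freq):
            sxx[k] += abs(fx[k]) ** 2
            syy[k] += abs(fy[k]) ** 2
            sxy[k] += fx[k].conjugate() * fy[k]
    coherency = [0.0] * n_freq
    phase = [0.0] * n_freq
    for k in range(1, n_freq):  # k = 0 carries no information after mean removal
        coherency[k] = abs(sxy[k]) ** 2 / (sxx[k] * syy[k])
        phase[k] = math.atan2(sxy[k].imag, sxy[k].real)
    return coherency, phase

# Hypothetical signals: the same 20-step cycle, y delayed by 5 steps, both with noise
random.seed(1)
n, period, delay = 800, 20, 5
x = [math.cos(2 * math.pi * t / period) + random.gauss(0, 0.5) for t in range(n)]
y = [math.cos(2 * math.pi * (t - delay) / period) + random.gauss(0, 0.5) for t in range(n)]

coherency, phase = coherency_and_phase(x, y, seg_len=100)
k_cycle = 100 // period  # frequency index of the common cycle
print(round(coherency[k_cycle], 2), round(phase[k_cycle], 2))
```

With this sign convention a delay of a quarter period in y shows up as a phase shift close to $-\pi/2$ at the frequency of the common cycle, where the coherency is close to 1.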
7. Analysis of cycling processes
Cycling processes in hydrology are natural. In fig. 28 some examples of cycling processes with different periods and frequencies are presented.
[Plot: time series of Q (m³/s), E (µS/cm), Tw (°C), pH and O2 (mg/l) against time (a), 1985 to 1995]
Figure 28: Cycling water quality indicators
Such processes are caused mostly by natural external driving forces but also by natural internal driving forces. They exhibit different time and frequency behaviour of the water quality (hydrological) processes. Water quality processes are characterised by different time parameters such as time delay, threshold values, altering, physiological parameters and others. State transitions take place on intervals $(a_i(t), b_i(t))$ with probability densities $w_i(t)$ of time delays of system variables and probabilities $p_i(t)$ for each realisation of a state transition:

For $a_i(t) \le w_i(t) \le b_i(t)$: $p_i(t) = \int w_i(t)\,dt$.

On the other hand, hydrological variables often vary with high frequencies because of random changes of internal system states and/or fluctuations of variables. Switching processes of input variables take place at certain different time events. Time delays in the courses of action of system components lead to retardations in the changes of system states and to redundancies in the data transfer.

A state transition $i$ can be characterised by the quadruple $\{a_i(t), b_i(t), w_i(t), p_i(t)\}$.

A classification of hydrological systems can be given by its characteristics of signals and by the type of change of dynamic properties (table 14).
Table 14: Classification of hydrological systems
Classification | | Remark
Characteristics of signals | Modulation | Change of amplitudes, frequencies and phases of signals
 | Quantification | Discretisation of time domain of amplitudes and duration interval of signals
Adaptability of system | adaptive | Change of system states, change of inputs and disturbances, change of parameters, change of system structure
 | non-adaptive | Fixed parameters, no change of ecosystem structure
7.1 Introduction

Mathematical equations describe either the time dependency (function of time $t$), which is called description in the time domain, or the frequency dependency (function of frequency, i.e. cycles per time unit), which is called description in the frequency domain. Mostly, cycling (or periodic) processes in a hydrological context are caused by natural external driving forces. On the other hand, aperiodic hydrological processes are mainly influenced by artificial (man-made) external driving forces. Another distinction can be made by the ability to reproduce a time-varying process. If a process can be reproduced and forecast correctly, it is called a deterministic process. Otherwise it is called a non-deterministic or stochastic (random) process. Each deterministic process $x(t)$ is characterised by its time development (or behaviour) $x = x(t)$ with $-\infty < t < +\infty$. A harmonic process is described by a trigonometric function

$x(t) = x_0 \cos(\omega_i t + \varphi_i)$ with $-\infty < t < +\infty$,

where $\omega_i = 2\pi/T_i$ is the basic cycling frequency (circular frequency), $T_i$ is the period of the cycle, and $\varphi_i$ is the phase shift.
7.2 Fourier analysis

A periodic process with period $T_0$ is described by a Fourier series of the form

$x(t) = \frac{a_0}{2} + \sum_{i=1}^{\infty} \left[ a_i \cos(i\omega_0 t) + b_i \sin(i\omega_0 t) \right]$,

with $\omega_0 = 2\pi/T_0$ the frequency of the basic cycle and $T_0$ the period of the cycle. The coefficients $a_i$ and $b_i$ are calculated as follows:

$a_i = \frac{2}{T_0} \int_0^{T_0} x(t) \cos(i\omega_0 t)\,dt$, $b_i = \frac{2}{T_0} \int_0^{T_0} x(t) \sin(i\omega_0 t)\,dt$ and $a_0 = \frac{2}{T_0} \int_0^{T_0} x(t)\,dt$,

so that $a_0/2$ is the mean of the process over one period. The Fourier polynomial is the approximation of a cycling process with minimum mean squared deviation. The amplitudes of the approximating function are then given by $A_i = \sqrt{a_i^2 + b_i^2}$. Phase shifts are given in the interval $[0, 2\pi]$ by $\varphi_i = \arctan(b_i/a_i)$. Figure 29 shows a Fourier approximation of the global radiation process. It can be seen that the approximation is shifted relative to the real cycles because its frequency is fixed. This causes some error.
[Plot: global radiation (W/m²) against time (a), 1996 to 2003; legend: raw data, component with max. amplitude (f = 1/352 d)]
Figure 29: Fourier approximation of global radiation
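The calculation of Fourier coefficients, amplitude, phase shift and the share of variance explained by the basic cycle can be sketched for a sampled series as follows. This is an illustration under stated assumptions, not the computation behind figure 29: the test series and its parameters are hypothetical, the integrals are replaced by sums over equidistant samples, the factor 2/n assumes a series that covers whole cycles, and the function name `fourier_coefficients` is chosen freely.

```python
import math

def fourier_coefficients(x, samples_per_cycle, harmonic):
    """Discrete estimates of a_i and b_i for harmonic i of the basic cycle.

    Assumes equidistant samples covering a whole number of basic cycles.
    """
    n = len(x)
    w0 = 2 * math.pi / samples_per_cycle
    a = 2 / n * sum(x[t] * math.cos(harmonic * w0 * t) for t in range(n))
    b = 2 / n * sum(x[t] * math.sin(harmonic * w0 * t) for t in range(n))
    return a, b

# Hypothetical temperature-like series: 36 samples per year over 5 years
samples_per_year = 36
w0 = 2 * math.pi / samples_per_year
x = [12.0 - 7.0 * math.cos(w0 * t) - 6.5 * math.sin(w0 * t) + 2.0 * math.cos(2 * w0 * t)
     for t in range(5 * samples_per_year)]

mean = sum(x) / len(x)
a1, b1 = fourier_coefficients(x, samples_per_year, harmonic=1)
amplitude = math.hypot(a1, b1)                # A_1 = sqrt(a_1^2 + b_1^2)
phase = math.atan2(b1, a1) % (2 * math.pi)    # phase shift mapped to [0, 2*pi)

total_variance = sum((v - mean) ** 2 for v in x) / len(x)
explained = (amplitude ** 2 / 2) / total_variance  # share of the yearly cycle
print(round(a1, 3), round(b1, 3), round(amplitude, 3), round(explained, 3))
```

A harmonic of amplitude $A$ contributes $A^2/2$ to the variance, which is why the explained share follows directly from the amplitude of the dominant cycle.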
Fourier approximations can be used to explain the variance of a cycling process by its basic frequency. Table 14 gives an example of the usefulness of this method for physical, chemical and biological environmental or ecological variables. The 3rd column of table 14 contains the values of total variance of the time series under consideration. The last column contains the values of variance which are explained by the dominant cycle contained in the time series. The best results are obtained for physical variables, followed by chemical variables. Insufficient results are obtained for biological variables.
Table 14: Fourier analysis of water quality indicators
Indicator | Reservoir | Total variance (%) | Average | Std. dev. | Variance of the yearly cycle (% of total variance)
TEMP | Saidenbach | 90.0 | 12.0 | 44.43 | 84.62
TEMP | Neunzehnhain | 90.0 | 11.9 | 33.41 | 74.35
TEMP | Kliava | 95.7 | 11.1 | 56.80 | 92.76
TEMP | Slapy | 95.9 | 12.0 | 52.25 | 90.63
DO | Saidenbach | 71.6 | 10.5 | 2.90 | 22.74
DO | Neunzehnhain | 76.2 | 10.1 | 1.98 | 22.19
DO | Kliava | 75.9 | 10.2 | 4.37 | 37.35
DO | Slapy | 76.2 | 7.7 | 8.88 | 34.81
CHA | Saidenbach | 36.8 | 5.4 | 32.87 | 1.94
CHA | Neunzehnhain | 35.4 | 1.7 | 1.59 | 1.96
Example: Approximation of water temperature of reservoirs (yearly dominant harmonic cycle):

Reservoir Saidenbach: TEMP(t) = 12.0 + 1.458 cos((6π/180)t) - 4.462 sin((6π/180)t)
Reservoir Neunzehnhain: TEMP(t) = 11.9 + 0.693 cos((6π/180)t) + 4.415 sin((6π/180)t)
Reservoir Kliava: TEMP(t) = 11.1 - 6.650 cos((9π/180)t) - 7.820 sin((9π/180)t)
Reservoir Slapy: TEMP(t) = 12.0 - 7.073 cos((10π/180)t) - 6.684 sin((10π/180)t)

7.3 Digital data filter

Digital filter functions transfer sequences of input signals to sequences of output signals by compressing or decompressing noisy information contained in the measured signals of hydrological processes. The results of applying digital filters are consistent data series which can be used for modelling, simulation and optimisation in hydrological sciences. Basic filter functions are derived from an ideal low pass filter:
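Ideal low pass

$|H(\omega)|^2 = \dfrac{1}{1 + F^2(\omega)}$

Here $\omega$ denotes the frequency normalised to the critical frequency and $n$ the filter order.

Butterworth filter (power low pass)

$|H(\omega)|^2 = \dfrac{1}{1 + \omega^{2n}}$

(Amplitude response should be as flat as possible in the pass band).

Chebyshev filter, type 1

$|H(\omega)|^2 = \dfrac{1}{1 + \varepsilon^2 c_n^2(\omega)}$

($c_n$: Chebyshev polynomial of order $n$; $\varepsilon$: ripple factor (or eccentricity), $\varepsilon = 0.1526$. In the pass band a ripple is accepted. The transition from pass band to stop band is steeper than for the Butterworth filter).

Chebyshev filter, type 2 (inverse Chebyshev filter)

$|H(\omega)|^2 = \dfrac{1}{1 + \varepsilon^{*2} c_n^2(\omega)}$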
with * = 2/(1-).
(In the stop band a ripple is accepted.)

Elliptic filter (Cauer filter)

$|H(\omega)|^2 = \dfrac{1}{1 + \varepsilon^2 F_n^{*2}(\omega)}$

(Ripples arise in the pass band and in the stop band. This gives the steepest transition between both frequency bands.)

To get an acceptable transfer behaviour only filters of order 1 to 3 should be used. Figures 30 to 33 show the transfer behaviour of digital filters for different water quality time series. Higher order filters show rippling transfer behaviour and cause nonlinear effects in the output sequences of signals. This leads to misinterpretations and unexplainable events within the data series.
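The amplitude responses of the Butterworth filter and of the Chebyshev filter of type 1 can be evaluated directly from their standard definitions. The following sketch assumes a frequency normalised to the critical frequency, the Chebyshev polynomial computed by its three-term recurrence, and the ripple factor 0.1526 quoted above. The function names are chosen for this example only.

```python
import math

def chebyshev_polynomial(n, w):
    """Chebyshev polynomial c_n(w): c_0 = 1, c_1 = w, c_n = 2*w*c_(n-1) - c_(n-2)."""
    previous, current = 1.0, w
    if n == 0:
        return previous
    for _ in range(n - 1):
        previous, current = current, 2 * w * current - previous
    return current

def butterworth_gain(w, n):
    """Amplitude response of a Butterworth low pass of order n."""
    return 1 / math.sqrt(1 + w ** (2 * n))

def chebyshev1_gain(w, n, ripple=0.1526):
    """Amplitude response of a Chebyshev type 1 low pass of order n."""
    return 1 / math.sqrt(1 + (ripple * chebyshev_polynomial(n, w)) ** 2)

# Gains below, at and above the normalised critical frequency for orders 1 to 3
for order in (1, 2, 3):
    row = [(round(butterworth_gain(w, order), 3), round(chebyshev1_gain(w, order), 3))
           for w in (0.5, 1.0, 2.0)]
    print(order, row)
```

At the normalised frequency 1 the Butterworth gain is $1/\sqrt{2}$ for every order, whereas the Chebyshev gain equals $1/\sqrt{1 + \varepsilon^2}$, the lower edge of the accepted pass band ripple.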
[Plot: standard error of O2 (mg/l) against the reciprocal of the critical frequency (d), 50 to 350; curves for filter orders 1 to 6]
Figure 30: Selection of filter order of a Butterworth filter for DO
Higher order filters lead to an oscillating (wavy) transfer behaviour during the filtering process, as can be seen in figs. 30 and 31.
[Plot: standard error of pH-value; horizontal axis from 50 to 350; curves for filter orders 1 to 6]
Figure 31: Selection of filter order of a Butterworth filter for pH
Figures 30 and 31 show this behaviour for a Butterworth filter. Chebyshev 1 filters for chlorophyll-a and for water temperature (figs. 32 and 33) demonstrate the disturbances within the transfer process.
[Plot: standard error of total chlorophyll-a (mg/l); horizontal axis from 50 to 350]
Figure 32: Chebyshev 1 filter for chlorophyll-a
[Plot: standard error of water temperature (°C); horizontal axis from 50 to 350]
Figure 33: Chebyshev 1 filter for water temperature
The first step of digital data filtering procedures is the selection of a complete hydrological time series. If the data series contains some gaps interpolation methods should be used to get a time series with equidistant data. This is a strong prerequisite for all further steps. Fig. 34 shows such a data series for the variable conductivity of the Oder River at Frankfurt.
[Plot: conductivity (µS/cm) against time (a), 1993 to 2000]
Figure 34: Original data series of conductivity
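Filling gaps to obtain an equidistant series, the first step described above, can be sketched with linear interpolation. The sample days and conductivity values below are hypothetical, and the function name `interpolate_to_daily` is an assumption of this example. Other interpolation methods may be preferable for long gaps.

```python
def interpolate_to_daily(days, values):
    """Linear interpolation of (day, value) samples onto every integer day."""
    points = list(zip(days, values))
    series = []
    for (t0, x0), (t1, x1) in zip(points, points[1:]):
        for t in range(t0, t1):
            series.append(x0 + (x1 - x0) * (t - t0) / (t1 - t0))
    series.append(values[-1])  # the last sample closes the series
    return series

# Hypothetical conductivity samples with gaps between days 2 and 5 and days 6 and 10
days = [0, 1, 2, 5, 6, 10]
values = [800.0, 810.0, 805.0, 835.0, 830.0, 790.0]

daily = interpolate_to_daily(days, values)
print(len(daily), [round(v, 1) for v in daily])
```

The result contains one value per day between the first and the last sampling day, so that the subsequent spectral analysis and filtering operate on equidistant data.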
In the next step the critical frequency is calculated from the spectral density function (fig. 35). The 95% confidence region should be selected as confidence band. For the example a critical frequency fg = 0.053 was used.
[Plot: power density (conductivity), logarithmic scale, against frequency (1/d), 0 to 0.5; curves: power density, upper bound and lower bound of the 95% confidence interval; critical frequency fg marked]
Figure 35: Selection of critical frequency of the filter
The last step consists of computation of the digital filter and reconstruction of the original data series. In case of the Oder River an elliptic filter was used to reconstruct the original time series and to get a consistent time series for modelling (fig. 36).
[Plot: conductivity (µS/cm) against time (a), 1980 to 2000; curves: raw data, elliptic filter (1. order, f < 0.041 1/d)]
Figure 36: Reconstruction of a time series by an elliptic filter
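The last step, the computation of the filter and its application to the equidistant series, can be illustrated with a first-order digital low pass. The sketch below does not reproduce the elliptic filter used for the Oder River. It assumes a first-order Butterworth low pass obtained by the bilinear transform, daily sampling, the critical frequency 0.053 1/d from the example, and a hypothetical test series. All function names are chosen for this example.

```python
import cmath
import math

def first_order_lowpass(fc):
    """Coefficients of a first-order digital low pass (bilinear transform).

    fc is the critical frequency in cycles per sampling interval (0 < fc < 0.5).
    """
    k = math.tan(math.pi * fc)
    b0 = k / (1 + k)
    a1 = (k - 1) / (k + 1)
    return b0, a1

def apply_filter(x, b0, a1):
    """y[n] = b0 * (x[n] + x[n-1]) - a1 * y[n-1], started at the first value."""
    y = [x[0]]
    for n in range(1, len(x)):
        y.append(b0 * (x[n] + x[n - 1]) - a1 * y[-1])
    return y

def gain(f, b0, a1):
    """Amplitude response of the filter at frequency f (cycles per sampling interval)."""
    z = cmath.exp(-2j * math.pi * f)
    return abs(b0 * (1 + z) / (1 + a1 * z))

fc = 0.053  # critical frequency in 1/d, daily data assumed
b0, a1 = first_order_lowpass(fc)

# Hypothetical daily series: slow cycle (period 200 d) plus fast cycle (period 4 d)
x = [800 + 100 * math.sin(2 * math.pi * t / 200) + 40 * math.sin(2 * math.pi * t / 4)
     for t in range(400)]
y = apply_filter(x, b0, a1)
print(round(gain(1 / 200, b0, a1), 3), round(gain(fc, b0, a1), 3), round(gain(1 / 4, b0, a1), 3))
```

The slow cycle passes almost unchanged, the gain at the critical frequency is $1/\sqrt{2}$, and the fast cycle is strongly damped, which is the behaviour expected of a low pass of order 1.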
Table 15 gives an overview of the digital data filters used to produce consistent time series for modelling of freshwater ecosystems. Filters given in parentheses refer to the 99% significance level. All others refer to the 95% significance level.
Table 15: Digital data filters for water quality variables
Indicator: CHA DOC Conduct. NH4-N NO2-N NO3-N DO o-PO4-P pH Q TW
Filter 1. order: Elliptic Elliptic Butterworth Elliptic Chebyshev1 Elliptic Elliptic Elliptic Butterworth Elliptic Chebyshev1 Elliptic Butterworth Elliptic Chebyshev1 Butterworth Chebyshev1 Elliptic Butterworth Chebyshev1 Elliptic
Filter 2. order: (Elliptic) (Elliptic) Butterworth (Elliptic) (Elliptic) (Elliptic) Butterworth (Elliptic) Butterworth Chebyshev2 Butterworth Chebyshev1 Butterworth Chebyshev1 Chebyshev2 Elliptic
Filter 3. order: (Elliptic) (Elliptic) Butterworth Elliptic Chebyshev1 (Elliptic) (Elliptic) (Elliptic) Butterworth Elliptic Chebyshev1 (Elliptic) Butterworth Elliptic Chebyshev1 Chebyshev2 Butterworth Elliptic Chebyshev1 Butterworth Elliptic Chebyshev1 Chebyshev2
7.4 Wavelets

Wavelet analysis has proven quite useful for time scale based signal analysis. It is a solution for the time scale analysis problem because it offers an effective approach to extract both the time localisation and the frequency content of a time series. It has the ability to decompose time series into several sub-series which may be associated with particular time scales. As a result, the interpretation of features in hydrological time series may be facilitated by first applying an appropriate wavelet transform and subsequently interpreting each individual sub-series. The following questions can be effectively answered with the help of wavelet analysis:

1. What is the dominant scale of variation influencing the observed general tendency of the indicator?
2. Are the variations from one day to the next more prominent than the variations from one week to the next?
3. Are the statistical variations in the hydrological indicator homogeneous across time?
4. What are the time dependent variations such as the presence of trends?
5. How are two indicators related on a scale by scale basis? How do they covary at different scales?

Wavelet analysis imitates the windowed Fourier analysis by using basis functions (wavelets) that are better suited to capture the local behaviour of nonstationary signals. The wavelet transformation is a function of two variables $W(u,s)$ obtained by projecting a signal $X(t)$ onto a particular wavelet and is given by

$W(u,s) = \int_{-\infty}^{\infty} X(t)\,\psi_{u,s}(t)\,dt$, $\psi_{u,s}(t) = \frac{1}{\sqrt{s}}\,\psi\left(\frac{t - u}{s}\right)$,

where $\psi_{u,s}(t)$ is a translated and dilated version of the original wavelet function $\psi$. The coefficients that are obtained are a function of the location parameter $u$ and the scale parameter $s$. Applying shifted and scaled versions of a wavelet function decomposes the signal into simpler components. It is the shifting and scaling process that makes this representation possible. It is referred to as multiresolution analysis. The wavelet transform is usually applied in the form of a filter bank comprising two filters. The scaling filter, known as the father wavelet, is a low pass filter, while the wavelet filter, known as the mother wavelet, is a high pass filter. Given a signal $X(t)$ of length $n = 2^j$, the filtering procedure can be performed a maximum of $j$ times, giving rise to $j$ different wavelet scales. The wavelet coefficients or detail coefficients are produced by the wavelet filter, while the scaling filter gives rise to the smooth version of the signal used at the next scale. Given the respective father and mother wavelets,
$\varphi_{J,k}(t) = 2^{-J/2}\,\varphi\left(\frac{t - 2^J k}{2^J}\right)$, $\int \varphi(t)\,dt = 1$

and

$\psi_{j,k}(t) = 2^{-j/2}\,\psi\left(\frac{t - 2^j k}{2^j}\right)$, $\int \psi(t)\,dt = 0$,

where $\varphi_{J,k}$ is the father wavelet and $\psi_{j,k}$ is the mother wavelet with the scale parameter $s$ being restricted to the dyadic scale $2^j$. If a signal $f(t)$ is projected onto these basis functions, the coefficients

$S_{J,k} = \int f(t)\,\varphi_{J,k}(t)\,dt$

and

$d_{j,k} = \int f(t)\,\psi_{j,k}(t)\,dt$

are obtained, with $S_{J,k}$ being the coefficients for the father wavelet at the maximum scale $2^J$ (the smooth coefficients) and $d_{j,k}$ being the detail coefficients from the mother wavelet at all scales from 1 to the maximal scale $J$. Based on these coefficients, the function $f(t)$ can be represented by

$f(t) = \sum_k S_{J,k}\,\varphi_{J,k}(t) + \sum_k d_{J,k}\,\psi_{J,k}(t) + \ldots + \sum_k d_{1,k}\,\psi_{1,k}(t)$

and can equally be represented by

$f(t) = S_J + D_J + D_{J-1} + \ldots + D_j + \ldots + D_1$,

where $S_J = \sum_k S_{J,k}\,\varphi_{J,k}(t)$ and $D_j = \sum_k d_{j,k}\,\psi_{j,k}(t)$.
Multiresolution decomposition (MRD) reveals the variations at different scales denoted by d. Figure 37 shows the details of the multiresolution analysis of dissolved oxygen sampled at daily intervals.
Figure 37: Multiresolution analysis details of dissolved oxygen signal sampled at daily intervals
The details reveal the high frequency variations present in the dissolved oxygen time series and provide an additive decomposition of the high frequency variation on a scale by scale basis. The notations d1, d2, d3, d4, d5, d6 and d7 denote the variations occurring at 1 day, 2 days, 4 days, 8 days, 16 days, 32 days and 64 days respectively. This progressive decomposition reveals the differences in fluctuations from one scale to another. It effectively shows that the lower scales are less important compared to the higher scales of variation.
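The additive decomposition of a signal into a smooth and details on dyadic scales can be sketched with the Haar wavelet, for which the smooth at level j is simply the mean over blocks of 2^j values. This is an illustration only: the wavelet used for figure 37 is not reproduced here, the daily test series is hypothetical, and the function names are assumptions of this example.

```python
import math

def block_means(x, width):
    """Replace each block of `width` consecutive values by its mean (Haar smooth)."""
    smooth = []
    for start in range(0, len(x), width):
        block = x[start:start + width]
        smooth.extend([sum(block) / width] * width)
    return smooth

def haar_mra(x, levels):
    """Return the smooth S_J and the details [D_1, ..., D_J] of a Haar decomposition."""
    if len(x) % 2 ** levels != 0:
        raise ValueError("series length must be a multiple of 2**levels")
    details = []
    previous = list(x)
    for j in range(1, levels + 1):
        smooth = block_means(x, 2 ** j)
        details.append([p - s for p, s in zip(previous, smooth)])  # D_j = S_(j-1) - S_j
        previous = smooth
    return previous, details

# Hypothetical daily series of 256 values: slow cycle plus a fast 3-day fluctuation
x = [9 + 3 * math.cos(2 * math.pi * t / 256) + 0.5 * math.sin(2 * math.pi * t / 3)
     for t in range(256)]

smooth, details = haar_mra(x, levels=7)
rebuilt = [smooth[t] + sum(d[t] for d in details) for t in range(len(x))]
print(max(abs(a - b) for a, b in zip(x, rebuilt)))
```

Adding the smooth and all details restores the original series, and each detail has zero mean, in line with the condition that the mother wavelet integrates to zero.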
Multiresolution analysis (MRA) filters information in the signal at different scales represented by a. In fig. 38 an example of a MRA and MRD is given for long-term observations of dissolved oxygen in a eutrophic freshwater ecosystem. When all high frequency events are removed from the original signal s, the basic nature of the cycling process emerges. This can be seen at level a7.
Figure 38: Wavelet analysis of DO
Figure 38 reveals that the variations occurring at a time scale of 1 day are of relatively low intensity and are not able to influence the general tendency observed in the dissolved oxygen signal. However, the fluctuations occurring at higher time scales such as scale 8 are strong enough to influence the long term behaviour of the signal. Hence, the long term tendency observed in the dissolved oxygen time series is significantly influenced only by the fluctuations occurring at the higher scales and not the lower scales. At the lower scales, the fluctuations are higher during the warmer months than during the colder months. At the higher scales such as scale 32, the fluctuations are high throughout the year. It is therefore useful to examine the variance at different scales to quantify these variations.
An overview on the respective frequencies is given in table 15.

Table 15: Frequencies and scales of MRA and MRD
MRA scale | MRD scale | Frequency
a1 | d1 | 1
a2 | d2 | 2
a3 | d3 | 4
a4 | d4 | 8
a5 | d5 | 16
a6 | d6 | 32
a7 | d7 | 64
a8 | d8 | 128
a9 | d9 | 256
a10 | d10 | 512
a11 | d11 | 1024
a12 | d12 | 2048
The variance of a signal can equally be decomposed by using this technique. For a signal $X_t$ the time varying variance at the scale $S_j$ of a wavelet coefficient $W_{j,t}$ can be calculated by

$\sigma_{x,t}^2(S_j) = \frac{1}{2 S_j}\,\mathrm{var}(w_{j,t})$.

Similar to the wavelet variance of a univariate signal, the wavelet covariance decomposes the covariance between two signals on a scale by scale basis by

$\sum_{j=1}^{\infty} \gamma(S_j) = \mathrm{cov}(x_{1,t}, x_{2,t})$.
The wavelet variance shown in figure 39 reveals the intensity of variation from one scale to the next of the dissolved oxygen time series. This graphical representation of the wavelet variance enables the researcher to answer questions concerning the dominant scale of variation in the time series, the homogeneity of variations from one scale to the next, the importance of the variations at one scale compared to the variations occurring at another scale.
[Plot: wavelet variance (logarithmic axis, 0.05 to 2.00) against wavelet scale (1, 2, 4, 8, 16, 32); each estimate (*) with upper (U) and lower (L) bounds]
Figure 39: Wavelet variance of dissolved oxygen with db4
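The decomposition of the variance by scale can be checked numerically. The sketch below uses the orthonormal Haar transform instead of the db4 wavelet of figure 39, estimates the variance of the wavelet coefficients by their mean square, and takes the scales as 1, 2, 4, ... The test series and all function names are hypothetical.

```python
import math

def haar_dwt(x):
    """Full orthonormal Haar transform; detail coefficients per level, finest first."""
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = [(approx[2 * k], approx[2 * k + 1]) for k in range(len(approx) // 2)]
        details.append([(b - a) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx

def wavelet_variance(x):
    """Wavelet variance var(w_j) / (2 * S_j) for the scales S_j = 1, 2, 4, ..."""
    details, _ = haar_dwt(x)
    result = []
    for j, w in enumerate(details, start=1):
        scale = 2 ** (j - 1)
        mean_square = sum(c * c for c in w) / len(w)  # variance estimate of w_j
        result.append((scale, mean_square / (2 * scale)))
    return result

# Hypothetical series of 64 values: slow cycle plus a fast alternating component
x = [9 + 3 * math.cos(2 * math.pi * t / 64) + 0.5 * (-1) ** t for t in range(64)]

variances = wavelet_variance(x)
mean = sum(x) / len(x)
total_variance = sum((v - mean) ** 2 for v in x) / len(x)
for scale, value in variances:
    print(scale, round(value, 4))
print(round(sum(v for _, v in variances), 4), round(total_variance, 4))
```

For a series whose length is a power of two, the wavelet variances of all scales add up to the variance of the series, so each value shows how much of the total variation is attributable to its scale.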