
OLAP Data Analysis with DB2

1 Data Analysis and OLAP


Aggregate functions summarize large volumes of data.

Online Analytical Processing (OLAP): interactive analysis of data. It allows data to be summarized and viewed in different ways in an online fashion (with negligible delay).

OLAP data is modeled multi-dimensionally, in terms of dimension attributes and measure attributes. Given a relation used for data analysis, we can identify some of its attributes as measure attributes, since they measure some value and can be aggregated upon (e.g., number of sales, inhabitants, passengers). Some of the other attributes of the relation are identified as dimension attributes, since they define the dimensions on which measure attributes and summaries of measure attributes are viewed.
© 2005 Jens Teubner, André Seifert, University of Konstanz

1.1 Cross Tabulation and its Relational Representation

A cross tabulation, also referred to as a pivot table, is a table where values of one dimension attribute form the row headers, values of another dimension attribute form the column headers, and the value in each cell is (an aggregate of) the measure attribute for the dimension values that specify the cell. The table below is an example of a cross tab of population by sex and country:

                                       Country
Sex      Australia    Denmark     Germany      Netherlands   United States   Total
Male     9,913,658    2,676,377   40,413,132   8,079,392     143,957,558     205,040,117
Female   9,999,486    2,737,015   42,011,477   8,238,807     149,070,013     212,056,798
Total    19,913,144   5,413,392   82,424,609   16,318,199    293,027,571     417,096,915

In relational DBMSs, cross tabs are represented as relations. The value all is used to represent aggregates. The SQL:1999 standard uses null values in place of all; DB2 displays such null values as a minus sign (-).

Country         Sex      Population
Australia       male     9,913,658
Australia       female   9,999,486
Australia       all      19,913,144
Denmark         male     2,676,377
Denmark         female   2,737,015
Denmark         all      5,413,392
Germany         male     40,413,132
Germany         female   42,011,477
Germany         all      82,424,609
Netherlands     male     8,079,392
Netherlands     female   8,238,807
Netherlands     all      16,318,199
United States   male     143,957,558
United States   female   149,070,013
United States   all      293,027,571
all             male     205,040,117
all             female   212,056,798
all             all      417,096,915
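The step from the cross tab to its relational representation can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the helper name to_relation are assumptions made for this example, and the figures are the ones from the cross tab in the text.

```python
# Cross tab from the text, stored as population[sex][country].
crosstab = {
    "male":   {"Australia": 9_913_658, "Denmark": 2_676_377, "Germany": 40_413_132,
               "Netherlands": 8_079_392, "United States": 143_957_558},
    "female": {"Australia": 9_999_486, "Denmark": 2_737_015, "Germany": 42_011_477,
               "Netherlands": 8_238_807, "United States": 149_070_013},
}

def to_relation(tab):
    """Flatten a two-dimensional cross tab into (country, sex, population) rows,
    adding 'all' rows for the per-country, per-sex, and grand totals."""
    rows = []
    countries = sorted({c for per_sex in tab.values() for c in per_sex})
    for country in countries:
        for sex in tab:
            rows.append((country, sex, tab[sex][country]))
        rows.append((country, "all", sum(tab[sex][country] for sex in tab)))
    for sex in tab:
        rows.append(("all", sex, sum(tab[sex].values())))
    rows.append(("all", "all", sum(sum(v.values()) for v in tab.values())))
    return rows

relation = to_relation(crosstab)
for row in relation:
    print(row)
```

Each cell of the cross tab, including the totals in the margins, becomes exactly one tuple of the relation.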


1.2 OLAP Terminology

The operation of changing the dimensions used in a cross tab is called pivoting.

Suppose an analyst wishes to see a population cross tab on country and sex for a fixed value of the size of the states of the respective countries, for example 10,000 km², instead of the sum across all states. Such an operation is referred to as slicing. If values of multiple dimensions are fixed, the operation is called dicing.

The operation of moving from finer-granularity data to a coarser granularity is called a roll-up. The opposite direction, that of moving from coarse-granularity data to finer-granularity data, is called a drill-down.
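As a rough illustration of these terms, the following Python sketch applies them to a tiny relation. The relation, its state names, its numbers, and the helper names aggregate and fix are all made up for this example; they are not part of the population data used elsewhere in the text.

```python
from collections import defaultdict

# Hypothetical fact rows: (country, state, sex, population). Numbers are made up.
facts = [
    ("Germany", "Bavaria", "male", 6),
    ("Germany", "Bavaria", "female", 7),
    ("Germany", "Hesse", "male", 3),
    ("Germany", "Hesse", "female", 3),
    ("Denmark", "Zealand", "male", 1),
    ("Denmark", "Zealand", "female", 2),
]
DIMS = ("country", "state", "sex")

def aggregate(rows, dims):
    """Sum the measure over the given dimension attributes (a GROUP BY)."""
    idx = [DIMS.index(d) for d in dims]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)

def fix(rows, **fixed):
    """Keep only the rows whose dimension values match the fixed ones."""
    return [r for r in rows if all(r[DIMS.index(d)] == v for d, v in fixed.items())]

fine = aggregate(facts, ("country", "state"))    # fine granularity
rolled_up = aggregate(facts, ("country",))       # roll-up: state level to country level
sliced = aggregate(fix(facts, state="Bavaria"), ("country", "sex"))         # slice
diced = aggregate(fix(facts, country="Germany", sex="female"), ("state",))  # dice
```

Going from rolled_up back to fine corresponds to a drill-down; choosing a different pair of dimensions in the call to aggregate corresponds to pivoting.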


2 Extended Aggregation

2.1 SQL-92 vs. SQL-99

The aggregation functionality of SQL-92 is quite limited. Many very useful aggregates are either very hard or impossible to specify: data cube operations; complex aggregates (e.g., median, variance); binary aggregates (e.g., correlation, regression curves); and ranking queries (e.g., assign each football team a rank based on its total number of points, goal difference, and goals scored).

The SQL-99 OLAP extensions provide a variety of aggregation functions to address the above limitations. They are supported by DB2 version 6.


2.2 Extended Aggregation in DB2

The GROUP BY and GROUPING SETS clauses are used to group individual rows into combined sets based on the values in one or more columns. ROLLUP and CUBE are short-hand forms of particular types of GROUPING SETS clause.

2.2.1 Cube Operation

The CUBE operation computes the union of GROUP BYs on every subset of the specified attributes. Consider the following example query:
SELECT country, sex, sum(population) FROM population GROUP BY CUBE(country, sex);


This computes the union of 2^n = 4 groupings (n = 2) of the population relation:

{(country, sex), (country), (sex), ()},

where () denotes an empty group by list. For each grouping, the result contains the null value for attributes not present in the grouping. The query above computes the relational representation of the population cross tab that we saw earlier.

The function grouping() can be used to identify which rows come from which grouping set. A value of 1 indicates that the corresponding field is null because the row comes from a grouping set that does not involve this column. Otherwise, the value is 0.
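The "union of GROUP BYs" reading of CUBE can be tried out with Python's built-in sqlite3 module. SQLite has no CUBE operator, so the sketch below emulates it by running one plain GROUP BY per subset of the attributes; the helper name cube is an assumption of this example, and only two of the countries from the cross tab are loaded to keep it short.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, sex TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?)",
    [("Denmark", "male", 2676377), ("Denmark", "female", 2737015),
     ("Germany", "male", 40413132), ("Germany", "female", 42011477)],
)

def cube(conn, table, dims, measure):
    """Emulate GROUP BY CUBE(dims): one GROUP BY per subset of dims, with NULL
    (None in Python) for the attributes that are not part of the grouping.
    Identifiers are pasted into the SQL text, so this is for illustration only."""
    result = []
    for k in range(len(dims), -1, -1):
        for subset in combinations(dims, k):
            cols = ", ".join(d if d in subset else "NULL" for d in dims)
            sql = f"SELECT {cols}, sum({measure}) FROM {table}"
            if subset:
                sql += " GROUP BY " + ", ".join(subset)
            result.extend(conn.execute(sql).fetchall())
    return result

cube_rows = cube(conn, "population", ("country", "sex"), "population")
for row in cube_rows:
    print(row)
```

With two attributes, four queries are executed, one for each of the groupings (country, sex), (country), (sex), and ().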


Example:
SELECT country, sex, sum(population), grouping(country) AS country_flag, grouping(sex) AS sex_flag FROM population GROUP BY CUBE(country, sex);

You can use a CASE expression in the SELECT clause to replace such nulls (displayed as -) by a value such as 'all'. For example, replace country in the previous query by:
CASE WHEN grouping(country) = 1 THEN 'all' ELSE country END AS country
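The reason for testing grouping(country) rather than country IS NULL is that the data itself may contain nulls. The following Python/sqlite3 sketch illustrates this. SQLite provides neither CUBE nor grouping(), so the flag is written by hand as a constant in each branch of a UNION ALL; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, sex TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO population VALUES (?, ?, ?)",
    [("Denmark", "male", 10), ("Denmark", "female", 20), (None, "male", 5)],
)  # made-up numbers; the last row has a genuinely unknown country

# UNION ALL emulation of GROUP BY CUBE(country). The hand-written country_flag
# plays the role of grouping(country): 0 = grouped by country, 1 = aggregate row.
labelled = conn.execute("""
    SELECT CASE WHEN country_flag = 1 THEN 'all' ELSE country END AS country,
           total, country_flag
    FROM (
        SELECT country, sum(population) AS total, 0 AS country_flag
        FROM population GROUP BY country
        UNION ALL
        SELECT NULL, sum(population), 1 FROM population
    )
    ORDER BY country_flag, total
""").fetchall()
print(labelled)
```

Only the aggregate row is relabelled as 'all'; the group of rows whose country is unknown keeps its null.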


2.2.2 Rollup Operation

The ROLLUP operation generates the union of GROUP BYs on every prefix of the specified list of attributes. The following example query:
SELECT country, sex, sum(population) FROM population GROUP BY ROLLUP(country, sex);

generates the union of three groupings:


{(country, sex), (country), ()}
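The prefix rule is easy to state in code. The small Python function below is a sketch of which group by lists a ROLLUP produces (the function name rollup is chosen for this example); it does not compute any aggregates.

```python
def rollup(attrs):
    """Group by lists generated by ROLLUP(attrs): every prefix, longest first."""
    return [tuple(attrs[:k]) for k in range(len(attrs), -1, -1)]

print(rollup(("country", "sex")))
```

A ROLLUP over n attributes therefore yields n + 1 groupings, compared with 2^n for a CUBE over the same attributes.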

Rollup can be used to generate aggregates at multiple levels of a hierarchy. Suppose there exists a dimension stretch in the population relation that can be used to aggregate by town, state, and country.


Then the query


SELECT country, state, town, sex, sum(population) FROM population GROUP BY ROLLUP(country, state, town, sex);

would give a hierarchical summary by sex and by stretch.

Multiple rollups and cubes can be used in a single GROUP BY clause. Each generates a set of group by lists; the cross product of these sets gives the overall set of group by lists. The following example query:
SELECT year, country, sex, sum(population) FROM population GROUP BY ROLLUP (year), ROLLUP(country, sex);

generates the groupings:


{(year), ()} X {(country,sex), (country), ()}

= {(year,country,sex), (year,country), (year), (country,sex), (country), ()}
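The cross product of group by lists can be spelled out in Python as follows. The helper names rollup, cube, and combine are assumptions of this sketch; it enumerates only the groupings, not the aggregated rows.

```python
from itertools import combinations, product

def rollup(attrs):
    """Group by lists of ROLLUP(attrs): every prefix, longest first."""
    return [tuple(attrs[:k]) for k in range(len(attrs), -1, -1)]

def cube(attrs):
    """Group by lists of CUBE(attrs): every subset, largest first."""
    return [s for k in range(len(attrs), -1, -1) for s in combinations(attrs, k)]

def combine(*grouping_sets):
    """Cross product of several sets of group by lists: pick one list from each
    set and concatenate the picks."""
    return [sum(choice, ()) for choice in product(*grouping_sets)]

# GROUP BY ROLLUP(year), ROLLUP(country, sex)
rollups = combine(rollup(("year",)), rollup(("country", "sex")))
# GROUP BY CUBE(year), CUBE(country, sex)
cubes = combine(cube(("year",)), cube(("country", "sex")))
print(rollups)
print(cubes)
```

The same combine function covers combinations of CUBE clauses: CUBE(year), CUBE(country, sex) produces eight groupings, the same ones as CUBE(year, country, sex).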



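Multiple CUBE clauses are allowed but not always useful, because the combined result can equal that of a single CUBE over all the attributes, here CUBE(year, country, sex). The following example query: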
SELECT year, country, sex, sum(population) FROM population GROUP BY CUBE (year), CUBE(country, sex);

would generate the groupings:


{(year), ()} X {(country,sex), (country), (sex), ()}

= {(year,country,sex), (year,country), (year,sex), (year), (country,sex), (country), (sex), ()}


2.2.3 Grouping Sets Operation

The GROUPING SETS clause lets us obtain multiple GROUP BY result sets with a single statement. Nested (i.e., enclosed in a second pair of parentheses) and non-nested GROUPING SETS sub-phrases are distinguished: a nested list of columns works as a single simple GROUP BY; a non-nested list of columns works as separate simple GROUP BY clauses, one per column, whose results are then combined in an implied UNION ALL.


Example:

GROUPING SETS clause                           Equivalent to
GROUP BY GROUPING SETS ((year,country,sex))    GROUP BY year, country, sex
GROUP BY GROUPING SETS (year,country,sex)      GROUP BY year UNION ALL GROUP BY country UNION ALL GROUP BY sex
GROUP BY GROUPING SETS (year,(country,sex))    GROUP BY year UNION ALL GROUP BY country, sex


Multiple GROUPING SETS in the same GROUP BY are combined together as if they were simple fields in a GROUP BY list. Example:

GROUPING SETS clauses                                                           Equivalent to
GROUP BY GROUPING SETS (year), GROUPING SETS (country), GROUPING SETS (sex)    GROUP BY year, country, sex
GROUP BY GROUPING SETS (year), GROUPING SETS ((country,sex))                   GROUP BY year, country, sex
GROUP BY GROUPING SETS (year), GROUPING SETS (country,sex)                     GROUP BY year, country UNION ALL GROUP BY year, sex
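The rules for nested and non-nested lists, and for several GROUPING SETS clauses in one GROUP BY, can be checked with a short Python sketch. The encoding is an assumption of this example: a bare column is written as a string, a nested (parenthesised) list as a tuple, and the function names grouping_sets and group_by are hypothetical.

```python
from itertools import product

def grouping_sets(*elements):
    """One GROUPING SETS clause. A bare column (a string) is a grouping of its
    own; a nested list (a tuple) is one composite grouping."""
    return [(e,) if isinstance(e, str) else tuple(e) for e in elements]

def group_by(*clauses):
    """Several GROUPING SETS clauses in one GROUP BY: take one grouping from
    each clause and concatenate them (cross product). The resulting group by
    lists are the ones combined by the implied UNION ALL."""
    return [sum(choice, ()) for choice in product(*clauses)]

print(grouping_sets("year", ("country", "sex")))
print(group_by(grouping_sets("year"), grouping_sets("country", "sex")))
```

Each tuple in the returned list stands for one simple GROUP BY.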


ROLLUP and CUBE are short-hand forms of particular types of GROUPING SETS clause.

A ROLLUP expression displays sub-totals for the specified fields. Example:
GROUP BY ROLLUP(year, country, sex)
is equivalent to
GROUP BY GROUPING SETS((year,country,sex), (year,country), (year), ())

A CUBE expression displays a cross tab of the sub-totals for the specified fields. Example:
GROUP BY CUBE(year, country, sex)
is equivalent to
GROUP BY GROUPING SETS((year,country,sex), (year,country), (year,sex), (country,sex), (year), (country), (sex), ())


Ranking
Ranking is done in conjunction with an ORDER BY clause. Given the relation population(country, number), find the rank of each country.
SELECT country, rank() OVER (ORDER BY number DESC) AS n_rank FROM population

The ORDER BY inside OVER only determines the ranks; an outer ORDER BY clause is required to return the query results in sorted order.


SELECT country, rank() OVER (ORDER BY number DESC) AS n_rank FROM population ORDER BY n_rank

Ranking may leave gaps: if multiple rows have equal values, they all get the same rank. If two countries share the top population number, both have rank 1 and the next country has rank 3. The function dense_rank() does not leave such gaps, i.e., the next dense rank would be 2.
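The difference between rank() and dense_rank() can be observed with Python's sqlite3 module, assuming the bundled SQLite is version 3.25 or later (the first release with window functions). The country names and numbers below are made up to force a tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, number INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?)",
                 [("A", 80), ("B", 80), ("C", 60), ("D", 17)])  # made-up values

ranks = conn.execute("""
    SELECT country,
           rank()       OVER (ORDER BY number DESC) AS n_rank,
           dense_rank() OVER (ORDER BY number DESC) AS n_dense_rank
    FROM population
    ORDER BY n_rank, country
""").fetchall()
for row in ranks:
    print(row)
```

Countries A and B tie for rank 1; country C then gets rank 3 but dense rank 2.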

Example query: Find the rank of the countries within each sex in terms of their population size.
SELECT country, rank() OVER (PARTITION BY sex ORDER BY number DESC) AS n_rank FROM population ORDER BY sex, n_rank
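A partitioned ranking can be run as a sketch in Python's sqlite3 module (SQLite 3.25 or later is assumed for window function support). The table layout population(country, sex, number) is an assumption of this example; the figures are taken from the cross tab in Section 1.1, and sex is added to the SELECT list so that the partitions are visible in the output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, sex TEXT, number INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", [
    ("Denmark", "male", 2676377), ("Denmark", "female", 2737015),
    ("Germany", "male", 40413132), ("Germany", "female", 42011477),
    ("Netherlands", "male", 8079392), ("Netherlands", "female", 8238807),
])

# The rank restarts at 1 within each partition (each value of sex).
per_sex = conn.execute("""
    SELECT sex, country,
           rank() OVER (PARTITION BY sex ORDER BY number DESC) AS n_rank
    FROM population
    ORDER BY sex, n_rank
""").fetchall()
for row in per_sex:
    print(row)
```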

Multiple independent rankings can be specified in the same query. Example:


SELECT country, rank() OVER (ORDER BY number DESC) AS n_rank_desc, rank() OVER (ORDER BY number ASC) AS n_rank_asc, rank() OVER (ORDER BY year ASC) AS y_rank_asc FROM population ORDER BY n_rank_desc


DB2 provides special syntax for top-n (rank) queries. Example query: Find the five most populated countries.
SELECT country, rank() OVER (ORDER BY number DESC) AS n_rank FROM population ORDER BY n_rank FETCH FIRST 5 ROWS ONLY
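One point worth keeping in mind with a row-count cut-off is how it interacts with ties. The Python/sqlite3 sketch below illustrates this under two assumptions: SQLite 3.25 or later is available, and SQLite's LIMIT is used as a stand-in for DB2's FETCH FIRST n ROWS ONLY, which SQLite does not support. The data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, number INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?)",
                 [("A", 90), ("B", 80), ("C", 80), ("D", 10)])  # made-up values

ranked = """
    SELECT country, rank() OVER (ORDER BY number DESC) AS n_rank
    FROM population
"""
# Row-count cut-off (LIMIT here, FETCH FIRST 2 ROWS ONLY in DB2):
# exactly two rows are returned, so countries tied at rank 2 may be split.
first_two = conn.execute(ranked + " ORDER BY n_rank, country LIMIT 2").fetchall()

# Rank cut-off: every row whose rank is at most 2 is kept, including ties.
rank_le_two = conn.execute(
    f"SELECT country, n_rank FROM ({ranked}) WHERE n_rank <= 2 ORDER BY n_rank, country"
).fetchall()
print(first_two)
print(rank_le_two)
```

Which of the two behaviours is wanted depends on whether "top n" is meant as n rows or as all rows within the first n ranks.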

When writing the ORDER BY clause, one can specify whether null values are counted as high or low. By default they are counted as high: for an ascending field they come last, and for a descending field they come first. NULLS FIRST or NULLS LAST overrides the default. Example:
SELECT country, rank() OVER (ORDER BY number DESC NULLS LAST) AS n_rank FROM population ORDER BY n_rank


Windowing
Windowing constructs allow us to compute things like cumulative totals or running averages. For example: given population values for each country and year, calculate the average population for each country and year on the basis of the current, previous, and next year. Query in SQL:
SELECT country, year, avg(population) OVER (ORDER BY country, year ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS p_avg FROM population order by country, year, p_avg;
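A minimal sketch of both uses, a centred moving average and a cumulative total, using Python's sqlite3 module (SQLite 3.25 or later is assumed). The table holds made-up values for a single hypothetical country X, so that the effect of the window frame is easy to follow by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, year INTEGER, population INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", [
    ("X", 2001, 10), ("X", 2002, 20), ("X", 2003, 60), ("X", 2004, 40),
])  # made-up values for one hypothetical country

# Average over the previous, current, and next row. At the first and last
# row the frame is clipped, so only two values are averaged there.
moving_avg = conn.execute("""
    SELECT year,
           avg(population) OVER (ORDER BY country, year
                                 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS p_avg
    FROM population
    ORDER BY year
""").fetchall()

# Cumulative total: the frame runs from the first row up to the current row.
running_total = conn.execute("""
    SELECT year,
           sum(population) OVER (ORDER BY year ROWS UNBOUNDED PRECEDING) AS p_sum
    FROM population
    ORDER BY year
""").fetchall()
print(moving_avg)
print(running_total)
```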


Examples of other window clause specifications:
ROWS UNBOUNDED PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
RANGE BETWEEN 10 PRECEDING AND CURRENT ROW
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
RANGE BETWEEN CURRENT ROW AND CURRENT ROW

Windowing can also be done within partitions. For example: find the average male and female population for each country and year on the basis of the current, previous, and next year. Query in SQL:
SELECT country, sex, year, avg(population) OVER (PARTITION BY sex ORDER BY country, year ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS p_avg FROM population ORDER BY country, sex, year, p_avg;
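A window frame never extends beyond the partition of the current row. The Python/sqlite3 sketch below (SQLite 3.25 or later assumed) makes this visible by computing the same three-row average once without partitions and once with PARTITION BY country. Partitioning by country is a variation chosen for this illustration, not the query from the text, and the two countries X and Y and their values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE population (country TEXT, year INTEGER, population INTEGER)")
conn.executemany("INSERT INTO population VALUES (?, ?, ?)", [
    ("X", 2001, 10), ("X", 2002, 20), ("Y", 2001, 120), ("Y", 2002, 220),
])  # made-up values for two hypothetical countries

# No partitions: the frame slides over all rows ordered by country and year,
# so the last row of X and the first row of Y see each other's values.
unpartitioned = conn.execute("""
    SELECT country, year,
           avg(population) OVER (ORDER BY country, year
                                 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS p_avg
    FROM population ORDER BY country, year
""").fetchall()

# With partitions: the frame is clipped at the partition boundary.
partitioned = conn.execute("""
    SELECT country, year,
           avg(population) OVER (PARTITION BY country ORDER BY year
                                 ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING) AS p_avg
    FROM population ORDER BY country, year
""").fetchall()
print(unpartitioned)
print(partitioned)
```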