
Data Warehousing and OLAP Technology for Data Mining

What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
From data warehousing to data mining
1

What is a Data Warehouse?


Defined in many different ways, but not rigorously.
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." -- W. H. Inmon
Data warehousing: the process of constructing and using data warehouses.
2

Data Warehouse: Subject-Oriented


Organized around major subjects, such as customer, product, sales.
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
3

Data Warehouse: Integrated


Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among the different data sources.
E.g., Hotel price: currency, tax, breakfast covered, etc.

When data is moved to the warehouse, it is converted.


4

Data Warehouse: Time-Variant


The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse contains an element of time, explicitly or implicitly, but the key of operational data may or may not contain a time element.
5

Data Warehouse: Non-Volatile


A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, or concurrency control mechanisms.
Requires only two operations in data accessing:

initial loading of data and access of data.


6

Data Warehouse vs. Heterogeneous DBMS


Traditional heterogeneous DB integration:
Build wrappers/mediators on top of the heterogeneous databases.
Query-driven approach: when a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
Complex information filtering; queries compete for resources.

Data warehouse: update-driven, high performance


Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis

Data Warehouse vs. Operational DBMS


OLTP (on-line transaction processing)
  Major task of traditional relational DBMS
  Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
OLAP (on-line analytical processing)
  Major task of data warehouse systems
  Data analysis and decision making
Distinct features (OLTP vs. OLAP):
  User and system orientation: customer vs. market
  Data contents: current, detailed vs. historical, consolidated
  Database design: ER + application vs. star + subject
  View: current, local vs. evolutionary, integrated
  Access patterns: update vs. read-only but complex queries
8

OLTP vs. OLAP


users: OLTP = clerk, IT professional; OLAP = knowledge worker
function: OLTP = day-to-day operations; OLAP = decision support
DB design: OLTP = application-oriented; OLAP = subject-oriented
data: OLTP = current, up-to-date, detailed, flat relational, isolated; OLAP = historical, summarized, multidimensional, integrated, consolidated
usage: OLTP = repetitive; OLAP = ad-hoc
access: OLTP = read/write, index/hash on primary key; OLAP = lots of scans
unit of work: OLTP = short, simple transaction; OLAP = complex query
# records accessed: OLTP = tens; OLAP = millions
# users: OLTP = thousands; OLAP = hundreds
DB size: OLTP = 100 MB-GB; OLAP = 100 GB-TB
metric: OLTP = transaction throughput; OLAP = query throughput, response time

9

Why Separate Data Warehouse?


High performance for both systems:
  DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
  missing data: decision support requires historical data which operational DBs do not typically maintain
  data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled
10

From Tables and Spreadsheets to Data Cubes


A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
  Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year).
  Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The top-most 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
11

Cube: A Lattice of Cuboids


0-D (apex) cuboid: all
1-D cuboids: time, item, location, supplier
2-D cuboids: (time, item), (time, location), (time, supplier), (item, location), (item, supplier), (location, supplier)
3-D cuboids: (time, item, location), (time, item, supplier), (time, location, supplier), (item, location, supplier)
4-D (base) cuboid: (time, item, location, supplier)
12

Conceptual Modeling of Data Warehouses


Modeling data warehouses: dimensions & measures
  Star schema: a fact table in the middle connected to a set of dimension tables
  Snowflake schema: a refinement of the star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
  Fact constellation: multiple fact tables share dimension tables, viewed as a collection of stars, therefore also called galaxy schema or fact constellation
13

Example of Star Schema


time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); measures: units_sold, dollars_sold, avg_sales

14
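A minimal SQL sketch of this star schema, assuming the table and column names from the slide; the data types and the sample roll-up query are illustrative, not prescribed by the slide.

-- Two of the dimension tables (branch and location are analogous)
CREATE TABLE time_dim (
  time_key        INT PRIMARY KEY,
  day             DATE,
  day_of_the_week VARCHAR(12),
  month           INT,
  quarter         INT,
  year            INT
);

CREATE TABLE item_dim (
  item_key      INT PRIMARY KEY,
  item_name     VARCHAR(100),
  brand         VARCHAR(50),
  type          VARCHAR(50),
  supplier_type VARCHAR(50)
);

-- Fact table: one foreign key per dimension plus the numeric measures
CREATE TABLE sales_fact (
  time_key     INT REFERENCES time_dim(time_key),
  item_key     INT REFERENCES item_dim(item_key),
  branch_key   INT,
  location_key INT,
  units_sold   INT,
  dollars_sold DECIMAL(12,2)
);

-- A typical analysis query: total dollars sold per brand and quarter
SELECT i.brand, t.quarter, SUM(f.dollars_sold) AS total_sales
FROM sales_fact f
JOIN item_dim i ON f.item_key = i.item_key
JOIN time_dim t ON f.time_key = t.time_key
GROUP BY i.brand, t.quarter;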

Example of Snowflake Schema


time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_key (references the supplier dimension)
supplier dimension: supplier_key, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city_key (references the city dimension)
city dimension: city_key, city, province_or_state, country
Sales fact table: time_key, item_key, branch_key, location_key (foreign keys); measures: units_sold, dollars_sold, avg_sales
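A hedged SQL sketch of the normalized pieces that distinguish the snowflake schema from the star schema above (types assumed; the _sf suffixes are only to keep the names distinct from the star-schema sketch):

-- item now references a separate supplier dimension
CREATE TABLE supplier_dim (
  supplier_key  INT PRIMARY KEY,
  supplier_type VARCHAR(50)
);

CREATE TABLE item_dim_sf (
  item_key     INT PRIMARY KEY,
  item_name    VARCHAR(100),
  brand        VARCHAR(50),
  type         VARCHAR(50),
  supplier_key INT REFERENCES supplier_dim(supplier_key)
);

-- location now references a separate city dimension
CREATE TABLE city_dim (
  city_key          INT PRIMARY KEY,
  city              VARCHAR(50),
  province_or_state VARCHAR(50),
  country           VARCHAR(50)
);

CREATE TABLE location_dim_sf (
  location_key INT PRIMARY KEY,
  street       VARCHAR(100),
  city_key     INT REFERENCES city_dim(city_key)
);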
15

Example of Fact Constellation


time dimension: time_key, day, day_of_the_week, month, quarter, year
item dimension: item_key, item_name, brand, type, supplier_type
branch dimension: branch_key, branch_name, branch_type
location dimension: location_key, street, city, province_or_state, country
shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Sales fact table: time_key, item_key, branch_key, location_key; measures: units_sold, dollars_sold, avg_sales
Shipping fact table: time_key, item_key, shipper_key, from_location, to_location; measures: dollars_cost, units_shipped
The time, item, and location dimension tables are shared between the two fact tables.

16

Measures: How are measures computed?
A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.

17

Measures: Three Categories


distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function to all the data without partitioning.
E.g., count(), sum(), min(), max().

algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function.
E.g., avg(), min_N(), standard_deviation().

holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
E.g., median(), mode(), rank().
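To make the three categories concrete, a hedged SQL illustration over the sales_fact and location_dim tables assumed in the earlier sketch: sum() is distributive, avg() is algebraic because it can be rebuilt from the two distributive aggregates sum() and count(), and a median is holistic because no fixed-size sub-aggregate suffices.

-- Distributive: per-city sums can simply be added to obtain country totals
SELECT l.country, l.city, SUM(f.dollars_sold) AS city_sales
FROM sales_fact f JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.country, l.city;

-- Algebraic: avg() derived from two distributive aggregates (sum, count)
SELECT l.country,
       SUM(f.dollars_sold)            AS total_sales,
       COUNT(*)                       AS n_rows,
       SUM(f.dollars_sold) / COUNT(*) AS avg_sales
FROM sales_fact f JOIN location_dim l ON f.location_key = l.location_key
GROUP BY l.country;

-- Holistic: a median needs the full value distribution of each group
-- (syntax varies by DBMS), e.g.
--   PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY f.dollars_sold)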

18

A Concept Hierarchy: Dimension (location)
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
Example hierarchy for the location dimension (office < city < country < region < all):
  all
  region: Europe, ..., North_America
  country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
  city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
  office: L. Chan, ..., M. Wind
19

View of Warehouses and Hierarchies

A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.

Specification of hierarchies:
  Schema hierarchy: day < {month < quarter; week} < year
  Set-grouping hierarchy: {1..10} < inexpensive (a SQL sketch of such a grouping follows below)
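A hedged SQL sketch of a set-grouping hierarchy of the {1..10} < inexpensive kind, mapping raw prices into named ranges (the price bands and the price column are illustrative assumptions):

SELECT item_name,
       CASE
         WHEN price BETWEEN 1  AND 10  THEN 'inexpensive'      -- {1..10} < inexpensive
         WHEN price BETWEEN 11 AND 100 THEN 'moderately_priced'
         ELSE                               'expensive'
       END AS price_range
FROM item_dim;   -- assumes item_dim carries a price attribute for this example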

20

Multidimensional Data
Sales volume as a function of product, month, and region.
Dimensions: Product, Location, Time
Hierarchical summarization paths:
  Product: Industry > Category > Product
  Location: Region > Country > City > Office
  Time: Year > Quarter > Month > Day (and Week > Day)
21

A Sample Data Cube


[Figure: a 3-D sample data cube with dimensions Date (1Qtr-4Qtr, sum), Product (TV, PC, VCR, sum), and Country (India, U.S.A., Canada, sum); the highlighted cell holds the total annual sales of TVs in India.]

22

Cuboids Corresponding to the Cube


0-D (apex) cuboid: all
1-D cuboids: product, date, country
2-D cuboids: (product, date), (product, country), (date, country)
3-D (base) cuboid: (product, date, country)

23

Browsing a Data Cube

Visualization
OLAP capabilities
Interactive manipulation


24

Typical OLAP Operations


Roll up (drill-up): summarize data

by climbing up hierarchy or by dimension reduction


Drill down (roll down): reverse of roll-up

from higher-level summary to lower-level summary or detailed data, or introducing new dimensions
Slice and dice:

project and select


Pivot (rotate):

reorient the cube, visualization, 3D to series of 2D planes.


Other operations

drill across: involving (across) more than one fact table
drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
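A hedged SQL illustration of roll-up, drill-down, and slice over the star schema sketched earlier (table and column names follow that sketch):

-- Roll-up: climb the time hierarchy from month to quarter
SELECT t.quarter, SUM(f.dollars_sold) AS sales
FROM sales_fact f JOIN time_dim t ON f.time_key = t.time_key
GROUP BY t.quarter;

-- Drill-down: back to the finer month level (or add a new dimension)
SELECT t.quarter, t.month, SUM(f.dollars_sold) AS sales
FROM sales_fact f JOIN time_dim t ON f.time_key = t.time_key
GROUP BY t.quarter, t.month;

-- Slice: select on one dimension (year = 2003), project the rest
SELECT i.brand, SUM(f.dollars_sold) AS sales
FROM sales_fact f
JOIN time_dim t ON f.time_key = t.time_key
JOIN item_dim i ON f.item_key = i.item_key
WHERE t.year = 2003
GROUP BY i.brand;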
25

A Star-Net Query Model
The querying of multidimensional databases can be based on a starnet model. A starnet model consists of radial lines emanating from a central point, where each line represents a concept hierarchy for a dimension. Each abstraction level in the hierarchy is called a footprint. These represent the granularities available for use by OLAP operations.

[Figure: star-net query model with one radial line per dimension and one footprint per abstraction level. Shipping Method: AIR-EXPRESS, TRUCK; Time: DAILY, QTRLY, ANNUALLY; Location: CITY, COUNTRY, REGION; Customer: ORDER, CONTRACTS; Organization: SALES PERSON, DISTRICT, DIVISION; Product: PRODUCT ITEM, PRODUCT LINE, PRODUCT GROUP; Promotion. Each circle is called a footprint.]

26

Designing and rolling out a data warehouse is a complex process consisting of the following activities:

Define the architecture, do capacity planning, and select the storage servers, database and OLAP servers, and tools.
Integrate the servers, storage, and client tools.
Design the warehouse schema and views.
Define the physical warehouse organization, data placement, partitioning, and access methods.
Connect the sources using gateways, ODBC drivers, or other wrappers.
Design and implement scripts for data extraction, cleaning, transformation, load, and refresh.
Populate the repository with the schema and view definitions, scripts, and other metadata.
Design and implement end-user applications.
Roll out the warehouse and applications.
27

What does the data warehouse provide for business analysts?


A competitive advantage, by presenting relevant information from which to measure performance and make critical adjustments in order to help win over competitors.
Enhanced business productivity, since it is able to quickly and efficiently gather information that accurately describes the organization.
Facilitated customer relationship management, since it provides a consistent view of customers and items across all lines of business, all departments, and all markets.
Potential cost reduction, by tracking trends, patterns, and exceptions over long periods of time in a consistent and reliable manner.

To design an effective data warehouse, one needs to understand and analyze business needs and construct a business analysis framework.

28

Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse:
Top-down view
  allows selection of the relevant information necessary for the data warehouse. This information matches the current and coming business needs.

Data source view


exposes the information being captured, stored, and managed by operational systems. This information may be documented at various levels of detail and accuracy, from individual data source tables to integrated data source tables.
29

Design of a Data Warehouse: A Business Analysis Framework
Four views regarding the design of a data warehouse (contd.):
Data warehouse view
  consists of fact tables and dimension tables. It represents the information that is stored inside the data warehouse, including pre-calculated totals and counts, as well as information regarding the source, date, and time of origin, added to provide historical context.

Business query view


sees the perspectives of data in the warehouse from the viewpoint of the end user.
30

Building and using a data warehouse is a complex task since it requires business skills, technology skills, and program management skills:
Business skills
  Building a data warehouse involves understanding how such systems store and manage their data, how to build extractors that transfer data from the operational system to the data warehouse, and how to build warehouse refresh software that keeps the data reasonably up-to-date with the operational systems' data. Using a data warehouse involves understanding and translating the business requirements into queries that can be satisfied by the data warehouse.
Technology skills
  Data analysts are required to understand how to make assessments from quantitative information and derive facts based on conclusions from historical patterns and trends, to extrapolate trends based on history and look for anomalies or paradigm shifts, and to present coherent managerial recommendations based on such analysis.
Program management skills
  Involves the need to interface with many technologies, vendors, and end users in order to deliver results in a timely and cost-effective manner.
31

Data Warehouse Design Process


Top-down, bottom-up approaches or a combination of both:
  Top-down: starts with overall design and planning (mature)
  Bottom-up: starts with experiments and prototypes (rapid)
  Combined: exploits the strategic nature of the top-down approach while retaining the rapid implementation and opportunistic application of the bottom-up approach
From a software engineering point of view:
  Waterfall: structured and systematic analysis at each step before proceeding to the next
  Spiral: rapid generation of increasingly functional systems, short turn-around time, quick turn-around

32

Typical data warehouse design process

Choose a business process to model, e.g., orders, invoices, etc. If the business process is organizational and involves multiple object collections, a data warehouse model should be followed. If the process is departmental and focuses on the analysis of one kind of business process, a data mart model should be chosen.
Choose the grain (atomic level of data) of the business process. The grain is the fundamental, atomic level of data to be represented in the fact table for this process, e.g., individual transactions, individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item, customer, supplier, warehouse, transaction type.
Choose the measures that will populate each fact table record. Typical measures are numeric additive quantities like dollars_sold and units_sold.
33

Multi-Tiered Architecture
[Figure: multi-tiered data warehouse architecture. Data sources (operational DBs and other sources) are extracted, transformed, loaded, and refreshed, under a monitor & integrator, into the data storage tier (data warehouse and data marts, with metadata). An OLAP-engine tier of OLAP servers serves a front-end tools tier of analysis, query, reporting, and data mining tools.]


34

Bottom tier: A warehouse database server


Data from operational databases and external sources are extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Ex: ODBC (Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).

35

The middle tier: OLAP server


Typically implemented using either
  a relational OLAP (ROLAP) model, that is, an extended relational DBMS that maps operations on multidimensional data to standard relational operations; or
  a multidimensional OLAP (MOLAP) model, that is, a special-purpose server that directly implements multidimensional data and operations.
36

The top tier: client


Contains query and reporting tools, analysis tools, and/or data mining tools (e.g., trend analysis, prediction, etc.).

37

Three Data Warehouse Models


Enterprise warehouse
  collects all of the information about subjects spanning the entire organization
Data mart
  a subset of corporate-wide data that is of value to a specific group of users. Its scope is confined to specific, selected groups, such as a marketing data mart
Independent vs. dependent (directly from warehouse) data mart

Virtual warehouse
  A set of views over operational databases
  Only some of the possible summary views may be materialized
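A minimal sketch of the virtual-warehouse idea in SQL: summary views defined directly over an operational table (the operational table name and its columns are assumptions for illustration):

CREATE VIEW monthly_sales_v AS
SELECT EXTRACT(YEAR FROM order_date)  AS year,
       EXTRACT(MONTH FROM order_date) AS month,
       product_id,
       SUM(amount) AS dollars_sold
FROM operational_orders
GROUP BY EXTRACT(YEAR FROM order_date),
         EXTRACT(MONTH FROM order_date),
         product_id;

-- Only some of the views may be materialized (syntax varies by DBMS), e.g.
-- CREATE MATERIALIZED VIEW monthly_sales_mv AS SELECT ... ;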
38

Data Warehouse Development: A Recommended Approach


[Figure: recommended development approach. Start by defining a high-level corporate data model; through model refinement, build individual and distributed data marts; refine further toward a multi-tier data warehouse in which an enterprise data warehouse feeds dependent data marts.]


39

OLAP Server Architectures


Relational OLAP (ROLAP)
  Uses a relational or extended-relational DBMS to store and manage warehouse data, and OLAP middleware to support missing pieces
  Includes optimization of the DBMS backend, implementation of aggregation navigation logic, and additional tools and services
  Greater scalability
Multidimensional OLAP (MOLAP)
  Array-based multidimensional storage engine (sparse matrix techniques)
  Fast indexing to pre-computed summarized data
  Many MOLAP servers adopt a two-level storage representation to handle sparse and dense data sets: the dense subcubes are identified and stored as array structures, while the sparse subcubes employ compression technology for efficient storage utilization
40

OLAP Server Architectures


Hybrid OLAP (HOLAP)
  Benefits from the greater scalability of ROLAP and the faster computation of MOLAP
  Ex: a HOLAP server may allow large volumes of detail data to be stored in a relational database, while aggregations are kept in a separate MOLAP store
Specialized SQL servers
  Specialized support for SQL queries over star/snowflake schemas

41

Data storage
  ROLAP: data stored as relational tables in the warehouse; detailed and lightly summarized data available; all data access from the warehouse storage
  MOLAP: data stored in specialized multidimensional databases; large multidimensional arrays form the storage structures; various summary data kept in proprietary databases (MDDBs); moderate data volumes; summary data access from the MDDB, detailed data access from the warehouse

Underlying technologies
  ROLAP: use of complex SQL to fetch data from the warehouse; ROLAP engine in the analytical server creates data cubes on the fly; multidimensional views provided by the presentation layer
  MOLAP: creation of pre-fabricated data cubes by the MOLAP engine; proprietary technology to store multidimensional views in arrays, not tables; high-speed matrix data retrieval; sparse matrix technology to manage data sparsity in summaries

Functions and features
  ROLAP: known environment and availability of many tools; limitations on complex analysis functions; drill-through to the lowest level is easier, drill-across not always easy
  MOLAP: faster access; large library of functions; easy analysis irrespective of the number of dimensions; extensive drill-down and slice-and-dice capabilities
42

Efficient Data Cube Computation


Data cube can be viewed as a lattice of cuboids
  The bottom-most cuboid is the base cuboid
  The top-most cuboid (apex) contains only one cell
How many cuboids are there in an n-dimensional cube where dimension i has $L_i$ levels?

$T = \prod_{i=1}^{n} (L_i + 1)$
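A worked instance of the formula, with illustrative level counts not taken from the slide: a cube with $n = 3$ dimensions whose hierarchies have 4, 3, and 4 levels (excluding the virtual top level all) yields

$T = \prod_{i=1}^{3} (L_i + 1) = (4+1)\times(3+1)\times(4+1) = 100$ cuboids,

and with 10 dimensions of 4 levels each the count grows to $5^{10} \approx 9.8 \times 10^{6}$, which is why full materialization quickly becomes impractical.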

Materialization of data cube
  Materialize every cuboid (full materialization), none (no materialization), or some (partial materialization)
  Selection of which cuboids to materialize:
Based on size, sharing, access frequency, etc.
43

Cube Operation
Cube definition and computation in DMQL:
  define cube sales [item, city, year]: sum(sales_in_dollars)
  compute cube sales
Transform it into a SQL-like language (with a new operator cube by, introduced by Gray et al., 1996):
  SELECT item, city, year, SUM(amount)
  FROM SALES
  CUBE BY item, city, year
This requires computing the following group-bys:
  (item, city, year), (item, city), (item, year), (city, year), (item), (city), (year), ()
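For reference, the cube by operator was later standardized; most modern SQL dialects can compute the same eight group-bys in a single statement with GROUP BY CUBE (a sketch; exact support varies by DBMS):

SELECT item, city, year, SUM(amount) AS total_amount
FROM sales
GROUP BY CUBE (item, city, year);   -- generates all 2^3 = 8 group-bys, including ()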
44

Cube Computation: ROLAP-Based Method


Efficient cube computation methods
  ROLAP-based cubing algorithms (Agarwal et al., 1996)
  Array-based cubing algorithm (Zhao et al., 1997)
  Bottom-up computation method (Beyer & Ramakrishnan, 1999)

ROLAP-based cubing algorithms


Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples
Grouping is performed on some subaggregates as a partial grouping step. These partial groupings may be used to speed up the computation of other subaggregates
Aggregates may be computed from previously computed aggregates, rather than from the base fact table
45

Cube Computation: ROLAP-Based Method (2)


Hash/sort-based methods (Agarwal et al., VLDB'96):
  Smallest-parent: computing a cuboid from the smallest previously computed cuboid
  Cache-results: caching the results of a cuboid from which other cuboids are computed, to reduce disk I/Os
  Amortize-scans: computing as many cuboids as possible at the same time to amortize disk reads
  Share-sorts: sharing sorting costs across multiple cuboids when a sort-based method is used
  Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used
46

Multi-way Array Aggregation for Cube Computation


Partition arrays into chunks (a small subcube which fits in memory).
Compressed sparse array addressing: (chunk_id, offset).
Compute aggregates in a multiway manner by visiting cube cells in an order that minimizes the number of times each cell must be visited, reducing memory access and storage cost.

[Figure: a 3-D array over dimensions A (a0-a3), B (b0-b3), and C (c0-c3), partitioned into 64 chunks numbered 1 to 64; chunk 1 = a0b0c0 and chunk 64 = a3b3c3.]

What is the best traversing order to do multi-way aggregation?

47

3-D data array containing the three dimensions A, B, C.


The 3-D array is partitioned into small memory chunks; here, into 64 chunks.
Dimension A is organized into four equisized partitions, a0, a1, a2, a3. Dimensions B and C are similarly organized into four partitions each.
Chunks 1, 2, ..., 64 correspond to the subcubes a0b0c0, a1b0c0, ..., a3b3c3, respectively.
Suppose the sizes of the array for dimensions A, B, C are 40, 400, 4000, respectively. The size of each partition in A, B, C is therefore 10, 100, 1000, respectively.

48

3-D data array containing the three dimensions A, B, C.


Full materialization of the corresponding data cube involves the computation of all the cuboids defining this cube. These cuboids consist of:
  The base cuboid, denoted by ABC (from which all of the other cuboids are directly or indirectly computed). This cuboid is already computed and corresponds to the given 3-D array.
  The 2-D cuboids, AB, AC, and BC, which respectively correspond to the group-bys AB, AC, and BC. These cuboids must be computed.
  The 1-D cuboids, A, B, and C, which respectively correspond to the group-bys A, B, and C. These cuboids must be computed.
  The 0-D (apex) cuboid, denoted by all, which corresponds to the group-by (); that is, there is no group-by here. This cuboid must be computed.

49

Multi-way Array Aggregation for Cube Computation

[Figure: the same 64-chunk 3-D array, used to illustrate the order in which chunks are scanned.]

50

Multi-way Array Aggregation for Cube Computation


51

In computing the BC cuboid, we will have scanned each of the 64 chunks.
Is there a way to avoid having to rescan all of these chunks for the computation of other cuboids, such as AC and AB? The answer is: yes. This is where the multiway computation idea comes in.
Ex: when chunk 1 (i.e., a0b0c0) is being scanned (for the computation of the 2-D chunk b0c0 of BC), all of the other 2-D chunks relating to it, namely b0c0, a0c0, and a0b0, on the three 2-D aggregation planes BC, AC, and AB, should be computed as well. Multiway computation aggregates to each of the 2-D planes while a 3-D chunk is in memory.
52

How do different orderings of chunk scanning and of cuboid computation affect the overall data cube computation efficiency?

The sizes of dimensions A, B, and C are 40, 400, and 4000, respectively. Therefore the largest 2-D plane is BC (400 * 4000 = 1,600,000 cells), the second largest 2-D plane is AC (40 * 4000 = 160,000 cells), and the smallest 2-D plane is AB (40 * 400 = 16,000 cells).
Suppose that the chunks are scanned in the order shown, from chunk 1 to 64. By scanning in this order, one chunk of the largest 2-D plane, BC, is fully computed for each row scanned.
That is, b0c0 is fully aggregated after scanning the row containing chunks 1 to 4; b1c0 is fully aggregated after scanning chunks 5 to 8, and so on.
53

In comparison, the complete computation of one chunk of the second largest 2-D plane, AC, requires scanning 13 chunks. For example, a0c0 is fully aggregated only after scanning chunks 1, 5, 9, and 13.
Finally, the complete computation of one chunk of the smallest 2-D plane, AB, requires scanning 49 chunks. For example, a0b0 is fully aggregated only after scanning chunks 1, 17, 33, and 49. Hence, AB requires the longest scan of chunks in order to complete its computation.

54

To avoid bringing a 3-D chunk into memory more than once, the minimum memory requirement for holding all relevant 2-D planes in chunk memory, according to the chunk ordering 1 to 64, is as follows: 40 * 400 (for the whole AB plane) + 10 * 4000 (for one row of the AC plane) + 100 * 1000 (for one chunk of the BC plane) = 16,000 + 40,000 + 100,000 = 156,000 memory units.
Suppose, instead, that the chunks are scanned in the order 1, 17, 33, 49, 5, 21, 37, 53, and so on. That is, the scan first aggregates towards the AB plane, then towards the AC plane, and lastly towards the BC plane.

55

The minimum memory requirement for holding the 2-D planes in chunk memory would then be: 400 * 4000 (for the whole BC plane) + 10 * 4000 (for one row of the AC plane) + 10 * 100 (for one chunk of the AB plane) = 1,600,000 + 40,000 + 1,000 = 1,641,000 memory units. This is more than 10 times the memory requirement of the scan ordering 1 to 64.
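Restating the two memory bounds from above side by side:

$40 \times 400 + 10 \times 4000 + 100 \times 1000 = 156{,}000$ (scan order 1, 2, ..., 64: whole AB plane, one row of AC, one chunk of BC)

$400 \times 4000 + 10 \times 4000 + 10 \times 100 = 1{,}641{,}000$ (scan order 1, 17, 33, 49, ...: whole BC plane, one row of AC, one chunk of AB)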

56

Multi-Way Array Aggregation for Cube Computation (Cont.)


Method: the planes should be sorted and computed according to their size in ascending order.
Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane.
Limitation of the method: it computes well only for a small number of dimensions.
If there are a large number of dimensions, bottom-up computation and iceberg cube computation methods can be explored instead.

57

Discovery-Driven Exploration of Data Cubes


Hypothesis-driven: exploration by the user; huge search space
Discovery-driven (Sarawagi et al., 1998):
  Pre-compute measures indicating exceptions, to guide the user in the data analysis at all levels of aggregation
  Exception: a data cube cell value that is significantly different from the value anticipated, based on a statistical model
  Visual cues such as background color are used to reflect the degree of exception of each cell
  Computation of the exception indicators (model fitting and computing the SelfExp, InExp, and PathExp values) can be overlapped with cube construction
58

Discovery-Driven Exploration of Data Cubes


Three measures are used as exception indicators to help identify data anomalies. These measures indicate the degree of surprise that the quantity in a cell holds, with respect to its expected value. They are computed and associated with every cell, for all levels of aggregation:
  SelfExp: indicates the degree of surprise of the cell value, relative to other cells at the same level of aggregation.
  InExp: indicates the degree of surprise somewhere beneath the cell, if we were to drill down from it.
  PathExp: indicates the degree of surprise for each drill-down path from the cell.
59

Discovery-Driven Exploration of Data Cubes


Suppose you would like to analyze the monthly sales as a percentage difference from the previous month. The dimensions involved are item, time, and region.
To view the exception indicators, click on a button marked highlight exceptions on the screen.
This translates the SelfExp and InExp values into visual cues, displayed with each cell. The darker the color, the greater the degree of exception.

Example: dark, thick boxes for sales during July, August, and September signal the user to explore the lower-level aggregations of these cells by drilling down. Drill-down can be executed along the aggregated item or region dimensions.
60

Discovery-Driven Exploration of Data Cubes


Which path has more exceptions?
To find out, you select a cell of interest and trigger a path exception module that colors each dimension based on the PathExp value of the cell. This value reflects the degree of surprise of that path.

61

Examples: Discovery-Driven Data Cubes

62

How are the exception values computed?


The SelfExp, InExp, and PathExp measures are based on a statistical method for table analysis.
They take into account all of the group-bys (aggregations) in which a given cell value participates.
A cell value is considered an exception based on how much it differs from its expected value, where its expected value is determined with a statistical model.
The difference between a given cell value and its expected value is called a residual. The larger the residual, the more the given cell value is an exception.
63

The comparison of residual values requires us to scale the values based on the expected standard deviation associated with the residuals. A cell value is therefore considered an exception if its scaled residual value exceeds a pre-specified threshold. The SelfExp, InExp, and PathExp measures are based on this scaled residual.
The expected value of a given cell is a function of the higher-level group-bys of the given cell.
Ex: given a cube with the three dimensions A, B, and C, the expected value for a cell at the i-th position in A, the j-th position in B, and the k-th position in C is a function of the coefficients $\gamma$, $\gamma_i^{A}$, $\gamma_j^{B}$, $\gamma_k^{C}$, $\gamma_{ij}^{AB}$, $\gamma_{ik}^{AC}$, and $\gamma_{jk}^{BC}$ of the statistical model used.
The coefficients reflect how different the values at more detailed levels are, based on generalized impressions formed by looking at higher-level aggregations. In this way, the exception quality of a cell value is based on the exceptions of the values below it. Thus, when seeing an exception, it is natural for the user to further explore it by drilling down.
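In the Sarawagi et al. framework the expected value described above is typically written as an additive model over these coefficients, and the scaled residual compared against the threshold takes the form below; this is a hedged restatement of the standard formulation, not text from the slide:

$\hat{y}_{ijk} = \gamma + \gamma_i^{A} + \gamma_j^{B} + \gamma_k^{C} + \gamma_{ij}^{AB} + \gamma_{ik}^{AC} + \gamma_{jk}^{BC}$, with $s_{ijk} = |y_{ijk} - \hat{y}_{ijk}| / \sigma_{ijk}$,

where $y_{ijk}$ is the actual cell value, $\hat{y}_{ijk}$ its expected value, and $\sigma_{ijk}$ the associated standard deviation; a cell is flagged when the scaled residual $s_{ijk}$ exceeds the pre-specified threshold.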

64

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


Multi-feature cubes (Ross et al., 1998): compute complex queries involving multiple dependent aggregates at multiple granularities.
Many complex data mining queries can be answered by multi-feature cubes without any significant increase in computational cost.
Ex: purchase data, where an item is purchased in a sales region on a business day (year, month, day). The shelf life in months of a given item is stored in shelf. The item price and sales (in dollars) at a given region are stored in price and sales, respectively.

65

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


Query1: A simple data cube query. Find the total sales in 2003, broken down by item, region, and month, with subtotals for each dimension.
To answer Query1, a data cube is constructed that aggregates the total sales at the following eight different levels of granularity: {(item, region, month), (item, region), (item, month), (region, month), (item), (region), (month), ()}, where () represents all.
Query1 uses a simple data cube since it does not involve any dependent aggregates.

66

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


What is meant by dependent aggregates?
Query2: A complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2003 for each group, and the total sales among all maximum-price tuples.
Query2 can be specified concisely using extended SQL syntax as follows:

select item, region, month, MAX(price), SUM(R.sales)
from purchases
where year = 2003
cube by item, region, month: R
such that R.price = MAX(price)

67

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


The tuples representing purchases in 2003 are first selected.
The cube by clause computes aggregates (or group-bys) for all possible combinations of the attributes item, region, and month. It is an n-dimensional generalization of the group by clause.
The attributes specified in the cube by clause are the grouping attributes. Tuples with the same value on all grouping attributes form one group.

68

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


Let the groups be g1, ..., gr. For each group of tuples gi, the maximum price max_gi among the tuples forming the group is computed.
The variable R is a grouping variable, ranging over all tuples in group gi whose price is equal to max_gi (as specified in the such that clause).
The sum of sales of the tuples in gi that R ranges over is computed and returned with the values of the grouping attributes of gi.
The resulting cube is a multi-feature cube in that it supports complex data mining queries for which multiple dependent aggregates are computed at a variety of granularities. Ex.: the sum of sales returned in Query2 is dependent on the set of maximum-price tuples for each group.
69

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


Query3: An even more complex query. Grouping by all subsets of {item, region, month}, find the maximum price in 2003 for each group. Among the maximum-price tuples, find the minimum and maximum item shelf lives. Also find the fraction of the total sales due to tuples that have minimum shelf life within the set of all maximum-price tuples, and the fraction of the total sales due to tuples that have maximum shelf life within the set of all maximum-price tuples.

70

Complex Aggregation at Multiple Granularities: Multi-Feature Cubes


Multi-feature cube graph for Query3: the initial node R0 leads to R1 = {MAX(price)}; R1 leads to both R2 = {MIN(R1.shelf)} and R3 = {MAX(R1.shelf)}.
The multi-feature cube graph illustrates the aggregate dependencies in the query.

71

There is one node for each grouping variable, plus an additional initial node, R0.
Starting from node R0, the set of maximum-price tuples in 2003 is first computed (node R1).
The graph indicates that grouping variables R2 and R3 are dependent on R1, since a directed line is drawn from R1 to each of R2 and R3.
In a multi-feature cube graph, a directed line from grouping variable Ri to Rj means that Rj always ranges over a subset of the tuples that Ri ranges over. When expressing the query in extended SQL, we write "Rj in Ri".
Ex.: the minimum shelf life tuples at R2 range over the maximum-price tuples at R1, i.e., "R2 in R1". Similarly, the maximum shelf life tuples at R3 range over the maximum-price tuples at R1, i.e., "R3 in R1".

72

From the graph, we can express Query3 in extended SQL as follows:


select item, region, month, MAX(price), MIN(R1.shelf), MAX(R1.shelf),
       SUM(R1.sales), SUM(R2.sales), SUM(R3.sales)
from purchases
where year = 2003
cube by item, region, month: R1, R2, R3
such that R1.price = MAX(price) and
          R2 in R1 and R2.shelf = MIN(R1.shelf) and
          R3 in R1 and R3.shelf = MAX(R1.shelf)
73

Query1: For each customer, find the longest call and the area code to which it was made.
In the query we shall use the following relation: CALLS(FromAC, FromTel, ToAC, ToTel, Date, Length).
The CALLS relation stores the calls placed on a telephone network over the period of one year. It includes the From number (area code and telephone number), the To number (area code and telephone number), the date, and the length of the call. Each From number corresponds to a customer.
74

select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Length = max(Length)

75

Query2: For each customer, show the average length of calls made to area codes 022 and 011 (in the same output record).

76

select FromAC, FromTel, avg(R.Length), avg(S.Length)
from CALLS
group by FromAC, FromTel : R, S
such that R.ToAC = 022 and S.ToAC = 011

77

Query3: For each customer, show the number of calls made during the first 6 months that exceeded the average length of all calls made during the year, and the number of calls made during the second 6 months that exceeded the same average length.

78

select FromAC, FromTel, count(X.*), count(Y.*)
from CALLS
group by FromAC, FromTel : X, Y
such that (X.Date < 2003/07/01 and X.Length > avg(Length)) and
          (Y.Date > 2003/06/30 and Y.Length > avg(Length))

79

Query4: Suppose we are interested in those customers for whom the total length of their calls during summer (June to August) exceeds one-third of the total length of their calls during the entire year. For these customers, we would like to find the longest call made during the summer period, and the area code to which it was made.

80

select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Date > 2003/05/31 and R.Date < 2003/09/01
having sum(R.Length) * 3 > sum(Length) and
       R.Length = max(R.Length)

81

Data Warehouse Usage


Three kinds of data warehouse applications:
  Information processing
    supports querying, basic statistical analysis, and reporting using tables, charts, and graphs
  Analytical processing
    multidimensional analysis of data warehouse data
    supports basic OLAP operations: slice-dice, drilling, pivoting
  Data mining
    knowledge discovery from hidden patterns
    supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools
Differences among the three tasks
82

Data Warehouse Usage


Do OLAP systems perform data mining? Are OLAP systems actually data mining systems?
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
OLAP tools are targeted towards simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process.
In this sense, data mining goes one step beyond traditional on-line analytical processing.
83

Data Warehouse Usage


Data mining covers a much broader spectrum than simple OLAP operations because it not only performs data summary and comparison, but also performs association, classification, clustering, time-series analysis, and other data analysis tasks.
Data mining is not confined to the analysis of data stored in data warehouses. It may analyze data existing at more detailed granularities than the summarized data provided in a data warehouse. It may also analyze transactional, spatial, textual, and multimedia data that are difficult to model with current multidimensional database technology.

84

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)


Why online analytical mining?
Among the many different paradigms and architectures of data mining systems, on-line analytical mining (OLAM), which integrates on-line analytical processing (OLAP) with data mining and mining knowledge in multidimensional databases, is particularly important for the following reasons:
  High quality of data in data warehouses: a DW contains integrated, consistent, cleaned data
  Available information processing infrastructure surrounding data warehouses: comprehensive information processing and data analysis infrastructures have been or will be systematically constructed surrounding data warehouses, including accessing, integration, consolidation, and transformation of multiple heterogeneous databases, ODBC/OLE DB connectivity, Web accessing, reporting, and OLAP tools
85

From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)


OLAP-based exploratory data analysis
  OLAP provides facilities for data mining on different subsets of data and at different levels of abstraction, by drilling, dicing, pivoting, etc. This, together with data/knowledge visualization tools, will greatly enhance the power and flexibility of exploratory data mining.
On-line selection of data mining functions
  Often a user may not know what kinds of knowledge she would like to mine. By integrating OLAP with multiple data mining functions, OLAM provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

86

An OLAM Architecture
[Figure: OLAM architecture. Layer 4 (user interface): a user GUI API accepts mining queries and returns mining results. Layer 3 (OLAP/OLAM): an OLAM engine and an OLAP engine, both working through a data cube API. Layer 2 (multidimensional database): the MDDB plus metadata, built via filtering and integration. Layer 1 (data repository): databases and the data warehouse, accessed through a database API with data cleaning, filtering, and data integration.]

87

An OLAM Architecture
An OLAM server performs analytical mining on data cubes in a similar manner as an OLAP server performs on-line analytical processing.
OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in the data analysis via a cube API.
A metadata directory is used to guide the access of the data cube.
The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections.
Since an OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and so on, it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server.
88

The capability of OLAP to provide multiple and dynamic views of summarized data in a data warehouse sets a solid foundation for successful data mining.
OLAP sets a good example for interactive data analysis and provides the necessary preparation for exploratory data mining.

89
