Objectives
Purpose of online analytical processing (OLAP) and how OLAP differs from data warehousing.
Key features of OLAP applications.
Potential benefits associated with successful OLAP applications.
Rules for OLAP tools and the main types of tools, including: multi-dimensional OLAP (MOLAP), relational OLAP (ROLAP), and managed query environment (MQE).

Objectives
Main data mining operations, including predictive modeling, database segmentation, link analysis, and deviation detection.
Relationship between data mining and data warehousing.

OLAP Concepts
The growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities.
Key developments include:
Online analytical processing (OLAP)
SQL extensions for complex data analysis
Data mining tools.
Introducing OLAP
The dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data, Codd (1993).
Describes a technology that uses a multi-dimensional view of aggregate data to provide quick access to strategic information for purposes of advanced analysis.
Introducing OLAP
Enables users to gain a deeper understanding and knowledge about various aspects of their corporate data through fast, consistent, interactive access to a wide variety of possible views of the data.
Allows users to view corporate data in such a way that it is a better model of the true dimensionality of the enterprise.
Introducing OLAP
Can easily answer 'who?' and 'what?' questions; however, the ability to answer 'what if?' and 'why?' questions distinguishes OLAP from general-purpose query tools.
Types of analysis range from basic navigation and browsing (slicing and dicing), to calculations, to more complex analyses such as time series and complex modeling.
OLAP Benchmarks
The OLAP Council published an analytical processing benchmark referred to as the APB-1 (OLAP Council, 1998). The aim is to measure a server's overall OLAP performance rather than the performance of individual tasks.

OLAP Benchmarks
OLAP applications are judged on their ability to provide just-in-time (JIT) information, a core requirement of supporting effective decision-making.
Assessing a server's ability to satisfy this requirement is more than measuring processing performance; it also includes its abilities to model complex business relationships and to respond to changing business requirements.
OLAP Benchmarks
APB-1 uses a standard benchmark metric called AQM (Analytical Queries per Minute).
AQM represents the number of analytical queries processed per minute, including data loading and computation time. Thus, AQM incorporates data loading performance, calculation performance, and query performance into a single metric.
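The AQM arithmetic can be sketched as below. The query count and timings are hypothetical; a real APB-1 run defines the measurement procedure precisely.

```python
# Illustrative only: AQM folds load, computation, and query time into one rate.
# All figures here are hypothetical, not real benchmark results.
def aqm(num_queries, load_minutes, computation_minutes, query_minutes):
    """Analytical Queries per Minute over the whole benchmark run."""
    total_minutes = load_minutes + computation_minutes + query_minutes
    return num_queries / total_minutes

# e.g. 1200 queries; 10 min load, 5 min pre-computation, 15 min query execution
print(aqm(1200, 10, 5, 15))  # 40.0 queries per minute
```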
OLAP Benchmarks
Publication of APB-1 benchmark results must include both the database schema and all code required for executing the benchmark.
An essential requirement of all OLAP applications is the ability to provide users with JIT information, to make effective decisions about an organization's strategic directions.
OLAP Applications
JIT information is computed data that usually reflects complex relationships and is often calculated on the fly. Also, as data relationships may not be known in advance, the data model must be flexible.
OLAP Applications
Although OLAP applications are found in widely divergent functional areas, all have the following key features:
multi-dimensional views of data
support for complex calculations
time intelligence.
Multi-dimensional Views of Data
A core requirement of building a realistic business model.
Provides the basis for analytical processing through flexible access to corporate data.
The underlying database design that provides the multi-dimensional view of data should treat all dimensions equally.
Support for Complex Calculations
Must provide a range of powerful computational methods, such as that required by sales forecasting, which uses trend algorithms such as moving averages and percentage growth.
Mechanisms for implementing computational methods should be clear and non-procedural.
Time Intelligence
A key feature of almost any analytical application, as performance is almost always judged over time.
The time hierarchy is not always used in the same manner as other hierarchies.
Concepts such as year-to-date and period-over-period comparisons should be easily defined.
OLAP Benefits
Increased productivity of end-users.
Reduced backlog of applications development for IT staff.
Retention of organizational control over the integrity of corporate data.
Reduced query drag and network traffic on OLTP systems or on the data warehouse.
Improved potential revenue and profitability.
Choice of representation is based on the types of queries the end-user may ask.
Compare representations: a three-field relational table versus a two-dimensional matrix.
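The two representations can be sketched in Python: the same sales data held as a three-field relational table and as a two-dimensional matrix. The cities, quarters, and revenue figures are hypothetical.

```python
# Three-field relational table: one (city, quarter, revenue) tuple per fact.
rows = [
    ("Aberdeen", "Q1", 29726), ("Aberdeen", "Q2", 28664),
    ("Glasgow",  "Q1", 43555), ("Glasgow",  "Q2", 48244),
]

# Two-dimensional matrix: one row per city, one column per quarter.
matrix = {}
for city, quarter, revenue in rows:
    matrix.setdefault(city, {})[quarter] = revenue

# The matrix gives direct cell access, which suits cross-tabular queries.
print(matrix["Glasgow"]["Q2"])   # 48244
```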
Example of a three-dimensional query:
What is the total revenue generated by property sales for each type of property (Flat or House) in each city, in each quarter of 1997?
Compare the relational and multi-dimensional representations of this query's data.
Uses multi-dimensional structures to store data and the relationships between data.
Multi-dimensional structures are best visualized as cubes of data, and cubes within cubes of data. Each side of a cube is a dimension. A cube can be expanded to include other dimensions.
Multi-dimensional query response time depends on how many cells have to be added on the fly.
As the number of dimensions increases, the number of the cube's cells increases exponentially.
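The cell-count growth can be illustrated directly: the number of cells in a cube is the product of its dimension cardinalities, so every added dimension multiplies the total. The dimension names and sizes below are hypothetical.

```python
# Sketch: cube size = product of dimension cardinalities.
from math import prod

dims = {"city": 50, "quarter": 4, "propertyType": 2}
print(prod(dims.values()))      # 400 cells

dims["salesStaff"] = 100        # add one more dimension
print(prod(dims.values()))      # 40000 cells
```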
The majority of multi-dimensional queries use summarized, high-level data.
The solution is to pre-aggregate (consolidate) all logical subtotals and totals along all dimensions.
Pre-aggregation is valuable, as typical dimensions are hierarchical in nature.
(e.g. Time dimension hierarchy: years, quarters, months, weeks, and days)
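Pre-aggregation along a time hierarchy can be sketched as below: monthly figures are rolled up into quarterly subtotals and a yearly total. The sales figures are hypothetical.

```python
# Sketch: pre-aggregating subtotals along a months -> quarters -> year hierarchy.
monthly = {"Jan": 10, "Feb": 12, "Mar": 8, "Apr": 15, "May": 11, "Jun": 9}
quarter_of = {"Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
              "Apr": "Q2", "May": "Q2", "Jun": "Q2"}

# Roll up month-level cells into quarter-level subtotals.
quarterly = {}
for month, sales in monthly.items():
    q = quarter_of[month]
    quarterly[q] = quarterly.get(q, 0) + sales

# Roll up once more to the year level.
yearly = sum(quarterly.values())
print(quarterly)   # {'Q1': 30, 'Q2': 35}
print(yearly)      # 65
```

Once these subtotals are stored, a query for "sales in Q1" reads one pre-computed cell instead of summing three month-level cells on the fly.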
A predefined hierarchy allows logical pre-aggregation and, conversely, allows for a logical drill-down.
Supports common analytical operations:
Consolidation: aggregation of data, such as simple roll-ups or complex expressions involving inter-related data.
Drill-down: the reverse of consolidation; displays the detailed data that comprises the consolidated data.
Slicing and dicing (also called pivoting): the ability to look at the data from different viewpoints.
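The three operations above can be sketched on a toy cube stored as a coordinate-keyed dictionary. This is a Python illustration with hypothetical figures; real OLAP servers implement these operations natively.

```python
# Toy cube: {(city, quarter, propertyType): revenue}
cube = {
    ("Glasgow", "Q1", "Flat"): 12, ("Glasgow", "Q1", "House"): 20,
    ("Glasgow", "Q2", "Flat"): 15, ("Aberdeen", "Q1", "Flat"): 9,
}

# Slicing: fix one dimension (quarter = Q1) to obtain a 2-D sub-cube.
q1_slice = {k: v for k, v in cube.items() if k[1] == "Q1"}

# Consolidation (roll-up): total revenue per city across the other dimensions.
by_city = {}
for (city, _, _), revenue in cube.items():
    by_city[city] = by_city.get(city, 0) + revenue

# Drill-down: from a city total back to the detailed cells behind it.
glasgow_detail = {k: v for k, v in cube.items() if k[0] == "Glasgow"}

print(by_city)  # {'Glasgow': 47, 'Aberdeen': 9}
```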
Can store data in a compressed form by dynamically selecting physical storage organizations and compression techniques that maximize space utilization.
Dense data (i.e., data that exists for a high percentage of cells) can be stored separately from sparse data (i.e., where a significant percentage of cells are empty).
The ability to omit empty or repetitive cells can greatly reduce the size of the cube and the amount of processing.
Allows analysis of exceptionally large amounts of data.
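The idea of omitting empty cells can be sketched as a coordinate-keyed dictionary that stores only the populated cells, compared against a dense representation that materializes every cell. The dimension sizes and values are hypothetical.

```python
# Dense representation: every cell exists, empty or not.
dims = (50, 4, 2)                 # city x quarter x propertyType
dense_cells = 50 * 4 * 2          # 400 cells

# Sparse representation: only cells that actually hold data are stored.
sparse = {
    (0, 0, 1): 20, (0, 1, 0): 15, (7, 2, 1): 33,
}
print(dense_cells, len(sparse))   # 400 3

# Empty cells are simply absent; lookups default to 0.
print(sparse.get((9, 3, 0), 0))  # 0
```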
In summary, pre-aggregation, dimensional hierarchy, and sparse data management can significantly reduce the size of the cube and the need to calculate values on the fly.
Removes the need for multi-table joins and provides quick and direct access to arrays of data, thus significantly speeding up the execution of multi-dimensional queries.
In 1993, E.F. Codd formulated twelve rules as the basis for selecting OLAP tools, including:
Multi-dimensional conceptual view
Transparency
Accessibility
Consistent reporting performance
Client-server architecture
Generic dimensionality
There are proposals to redefine or extend the rules, for example, to also include:
Comprehensive database management tools
Ability to drill down to detail (source record) level
Incremental database refresh
SQL interface to the existing enterprise environment
OLAP tools are categorized according to the architecture of the underlying database. The three main categories of OLAP tools are:
Multi-dimensional OLAP (MOLAP or MD-OLAP)
Relational OLAP (ROLAP), also called multi-relational OLAP
Managed query environment (MQE)
Multi-dimensional OLAP (MOLAP)
Uses specialized data structures and multi-dimensional Database Management Systems (MDDBMSs) to organize, navigate, and analyze data.
Data is typically aggregated and stored according to predicted usage to enhance query performance.
Uses array technology and efficient storage techniques that minimize the disk space requirements through sparse data management.
Provides excellent performance when the data is used as designed, and the focus is on data for a specific decision-support application.
Traditionally, MOLAP products require a tight coupling with the application layer and presentation layer.
Recent trends segregate the OLAP engine from the data structures through the use of published application programming interfaces (APIs).
The underlying data structures are limited in their ability to support multiple subject areas and to provide access to detailed data.
Navigation and analysis of data is limited, because the data is designed according to previously determined requirements.
MOLAP products require a different set of skills and tools to build and maintain the database, thus increasing the cost and complexity of support.
Relational OLAP (ROLAP)
The fastest-growing style of OLAP technology.
Supports RDBMS products using a metadata layer, which avoids the need to create a static multi-dimensional data structure and facilitates the creation of multiple multi-dimensional views of the two-dimensional relation.
To improve performance, some products use SQL engines to support the complexity of multi-dimensional analysis, while others recommend, or require, the use of highly denormalized database designs such as the star schema.
Development of tools to facilitate the development of multi-dimensional applications (software that converts the two-dimensional relation into a multi-dimensional structure).
Development of an option to create persistent, multi-dimensional structures, with facilities to assist in the administration of these structures.
Managed Query Environment (MQE)
A relatively new development.
Provides limited analysis capability, either directly against RDBMS products, or by using an intermediate MOLAP server.
Delivers selected data directly from the DBMS, or via a MOLAP server, to the desktop (or local server) in the form of a datacube, where it is stored, analyzed, and maintained locally.
Promoted as being relatively simple to install and administer, with reduced cost and maintenance.
The architecture results in significant data redundancy and may cause problems for networks that support many users.
The ability of each user to build a custom datacube may cause a lack of data consistency among users.
Only a limited amount of data can be efficiently maintained.
SQL OLAP Extensions
SQL is promoted as easy to learn, non-procedural, free-format, DBMS-independent, and an international standard.
However, a major disadvantage of SQL has been its inability to represent many of the questions most commonly asked by business analysts.
IBM and Oracle jointly proposed OLAP extensions to SQL early in 1999, since adopted as an amendment to SQL.
Database vendors including IBM, Oracle, Informix, and Red Brick Systems have already implemented portions of the specifications in their DBMSs.
Red Brick Systems was the first to implement many essential OLAP functions (as Red Brick Intelligent SQL (RISQL)), albeit in advance of the standard.
Designed for business analysts: a set of extensions that augments SQL with a variety of powerful operations appropriate to data analysis and decision-support applications, such as ranking, moving averages, comparisons, market share, and this year versus last year.
Show the quarterly sales for branch office B003, along with the year-to-date figures.

SELECT quarter, quarterlySales, CUME(quarterlySales) AS "Year-to-Date"
FROM BranchSales
WHERE branchNo = 'B003';
Show the first six monthly sales for branch office B003 without the effect of seasonality.

SELECT month, monthlySales,
       MOVINGAVG(monthlySales) AS "3-MonthMovingAvg",
       MOVINGSUM(monthlySales) AS "3-MonthMovingSum"
FROM BranchSales
WHERE branchNo = 'B003';
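The effect of the moving-average and moving-sum functions can be sketched in Python. A 3-value window and the sales figures are assumptions for illustration; the exact RISQL window semantics are not shown above.

```python
# Sketch: 3-month moving sum and moving average over monthly sales.
monthly_sales = [100, 120, 80, 150, 110, 90]   # hypothetical figures

def moving_window(values, n):
    """For each position, aggregate over the last n values seen so far."""
    sums, avgs = [], []
    for i in range(len(values)):
        window = values[max(0, i - n + 1): i + 1]
        sums.append(sum(window))
        avgs.append(sum(window) / len(window))
    return sums, avgs

sums, avgs = moving_window(monthly_sales, 3)
print(sums)     # [100, 220, 300, 350, 340, 350]
print(avgs[2])  # 100.0  (average of 100, 120, 80)
```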
Data Mining
The process of extracting valid, previously unknown, comprehensible, and actionable information from large databases and using it to make crucial business decisions (Simoudis, 1996).
Involves the analysis of data and the use of software techniques for finding hidden and unexpected patterns and relationships in sets of data.
Data Mining
Reveals information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive.
Patterns and relationships are identified by examining the underlying rules and features in the data.
Tends to work from the data up, and the most accurate results normally require large volumes of data to deliver reliable conclusions.
Data Mining
Starts by developing an optimal representation of the structure of sample data, during which time knowledge is acquired and extended to larger sets of data.
Data mining can provide huge paybacks for companies who have made a significant investment in data warehousing.
A relatively new technology, but already used in a number of industries.
Retail / Marketing
Identifying buying patterns of customers
Finding associations among customer demographic characteristics
Predicting response to mailing campaigns
Market basket analysis
Banking
Detecting patterns of fraudulent credit card use
Identifying loyal customers
Predicting customers likely to change their credit card affiliation
Determining credit card spending by customer groups
Medicine
Characterizing patient behavior to predict surgery visits
Identifying successful medical therapies for different illnesses
There are recognized associations between the applications and the corresponding operations.
(e.g. Direct marketing strategies use database segmentation.)
Techniques are specific implementations of the data mining operations.
Each operation has its own strengths and weaknesses.
Data mining tools sometimes offer a choice of operations to implement a technique.
Predictive Modeling
Uses generalizations of the real world and the ability to fit new data into a general framework.
Can analyze a database to determine the essential characteristics (model) of the data set.
Predictive Modeling
A model is developed using a supervised learning approach, which has two phases: training and testing.
Training builds a model using a large sample of historical data called a training set.
Testing involves trying out the model on new, previously unseen data to determine its accuracy and physical performance characteristics.
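The train-then-test workflow can be sketched as below. The "model" here is a deliberately trivial one-feature threshold classifier, and all data is hypothetical; the point is the split between building the model on a training set and measuring accuracy on unseen records.

```python
# Sketch of supervised learning: train on historical data, test on unseen data.
train = [(1.0, 0), (2.0, 0), (6.0, 1), (7.0, 1)]   # (feature, class) pairs
test  = [(1.5, 0), (6.5, 1), (2.5, 0)]

# Training: place a threshold midway between the two class means.
mean0 = sum(x for x, y in train if y == 0) / 2
mean1 = sum(x for x, y in train if y == 1) / 2
threshold = (mean0 + mean1) / 2

# Testing: classify previously unseen records and measure accuracy.
predict = lambda x: 1 if x >= threshold else 0
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(threshold, accuracy)   # 4.0 1.0
```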
Predictive Modeling
Applications of predictive modeling include customer retention management, credit approval, cross-selling, and direct marketing.
Two techniques are associated with predictive modeling: classification and value prediction, distinguished by the nature of the variable being predicted.
Classification
Used to establish a specific predetermined class for each record in a database from a finite set of possible class values.
Two specializations of classification: tree induction and neural induction.
Value Prediction
Used to estimate a continuous numeric value that is associated with a database record.
Uses the traditional statistical techniques of linear regression and nonlinear regression.
Relatively easy to use and understand.
Linear regression attempts to fit a straight line through a plot of the data, such that the line is the best representation of the average of all observations at that point in the plot.
The problem is that the technique only works well with linear data and is sensitive to the presence of outliers (i.e., data values that do not conform to the expected norm).
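Fitting such a line can be sketched with an ordinary least-squares computation. The data points are hypothetical and chosen to be nearly linear.

```python
# Sketch: ordinary least-squares fit of y = a + b*x for value prediction.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]   # hypothetical, roughly linear observations

n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n

# Slope: covariance of x and y divided by the variance of x.
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

print(round(b, 3), round(a, 3))   # slope close to 2, intercept close to 0.1
print(a + b * 6)                  # predicted value for an unseen x = 6
```

A single extreme outlier in `ys` would drag both the slope and intercept noticeably, which is the sensitivity the text describes.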
Although nonlinear regression avoids the main problems of linear regression, it is still not flexible enough to handle all possible shapes of the data plot.
Statistical measurements are fine for building linear models that describe predictable data points; however, most data is not linear in nature.
Data mining requires statistical methods that can accommodate non-linearity, outliers, and non-numeric data.
Applications of value prediction include credit card fraud detection and target mailing list identification.
Database Segmentation
The aim is to partition a database into an unknown number of segments, or clusters, of similar records.
Uses unsupervised learning to discover homogeneous sub-populations in a database to improve the accuracy of the profiles.
Database Segmentation
Less precise than other operations, thus less sensitive to redundant and irrelevant features.
Sensitivity can be reduced by ignoring a subset of the attributes that describe each instance, or by assigning a weighting factor to each variable.
Applications of database segmentation include customer profiling, direct marketing, and cross-selling.
Database Segmentation
Associated with demographic or neural clustering techniques, distinguished by:
the allowable data inputs
the methods used to calculate the distance between records
the presentation of the resulting segments for analysis.
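The unsupervised-segmentation idea can be sketched with a few iterations of k-means on one-dimensional records. k-means stands in here for the demographic and neural techniques named above, purely for illustration; the ages and the naive initialization are assumptions.

```python
# Sketch: segmenting customer ages into two clusters with k-means.
ages = [22, 25, 27, 61, 64, 68]        # hypothetical customer records
centroids = [ages[0], ages[-1]]        # naive initialization: 22 and 68

for _ in range(10):
    clusters = {0: [], 1: []}
    # Assignment step: put each record with its nearest centroid.
    for a in ages:
        nearest = min((abs(a - c), i) for i, c in enumerate(centroids))[1]
        clusters[nearest].append(a)
    # Update step: move each centroid to the mean of its cluster.
    centroids = [sum(v) / len(v) for v in clusters.values()]

print(centroids)   # two segment profiles: younger vs older customers
```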
Link Analysis
Aims to establish links (associations) between records, or sets of records, in a database.
There are three specializations:
Associations discovery
Sequential pattern discovery
Similar time sequence discovery
Applications include product affinity analysis, direct marketing, and stock price movement.
Associations discovery finds items that imply the presence of other items in the same event. Affinities between items are represented by association rules.
(e.g. 'When a customer rents property for more than 2 years and is more than 25 years old, in 40% of cases the customer will buy a property.' This association happens in 35% of all customers who rent properties.)
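The two percentages attached to such a rule are its confidence (how often the consequent holds given the antecedent) and its support (how often the whole rule holds in the data). They can be sketched as below; the four toy customer records are hypothetical and yield different numbers from the example rule.

```python
# Sketch: support and confidence for the rule
# "rents > 2 years AND age > 25  =>  buys a property".
customers = [
    {"long_renter": True,  "over_25": True,  "buys": True},
    {"long_renter": True,  "over_25": True,  "buys": False},
    {"long_renter": False, "over_25": True,  "buys": True},
    {"long_renter": True,  "over_25": False, "buys": False},
]

antecedent = [c for c in customers if c["long_renter"] and c["over_25"]]
both = [c for c in antecedent if c["buys"]]

support = len(both) / len(customers)       # rule holds across all records
confidence = len(both) / len(antecedent)   # rule holds given the antecedent
print(support, confidence)                 # 0.25 0.5
```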
Sequential pattern discovery finds patterns between events, such that the presence of one set of items is followed by another set of items in a database of events over a period of time.
(e.g. Used to understand long-term customer buying behavior.)
Similar time sequence discovery finds links between two sets of data that are time-dependent, and is based on the degree of similarity between the patterns that both time series demonstrate.
(e.g. Within three months of buying property, new home owners will purchase goods such as cookers, freezers, and washing machines.)
Deviation Detection
A relatively new operation in terms of commercially available data mining tools.
Often a source of true discovery, because it identifies outliers, which express deviation from some previously known expectation and norm.
Deviation Detection
Can be performed using statistics and visualization techniques, or as a by-product of data mining.
Applications include fraud detection in the use of credit cards and insurance claims, quality control, and defects tracing.
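A minimal statistical form of deviation detection can be sketched by flagging values that fall far from the mean. The two-standard-deviation cutoff and the claim amounts are assumptions for illustration.

```python
# Sketch: flag values more than 2 standard deviations from the mean.
from statistics import mean, pstdev

claims = [120, 130, 125, 118, 122, 540]   # one suspicious insurance claim
mu, sigma = mean(claims), pstdev(claims)

outliers = [x for x in claims if abs(x - mu) > 2 * sigma]
print(outliers)   # [540]
```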
Data Mining Tools
There are a growing number of commercial data mining tools on the marketplace.
Important characteristics of data mining tools include:
Data preparation facilities
Selection of data mining operations
Product scalability and performance
Facilities for visualization of results.
Data Mining and Data Warehousing
A major challenge in exploiting data mining is identifying suitable data to mine.
Data mining requires a single, separate, clean, integrated, and self-consistent source of data.
A data warehouse is well equipped for providing data for mining.
Data quality and consistency are a prerequisite for mining, to ensure the accuracy of the predictive models. Data warehouses are populated with clean, consistent data.
It is advantageous to mine data from multiple sources to discover as many interrelationships as possible. Data warehouses contain data from a number of sources.
Selecting the relevant subsets of records and fields for data mining requires the query capabilities of the data warehouse.
The results of a data mining study are useful only if there is some way to further investigate the uncovered patterns. Data warehouses provide the capability to go back to the data source.