
DWBI Essential Guide

Data Warehouse/Mart
Chapter 1 Data Warehouse Definition- What is a Data Warehouse?
A Data Warehouse is a repository of data picked from transaction systems, filtered and transformed to make it available for data analysis and reporting. It is a repository of data which can provide most or all of the data and information requirements of an enterprise, which means it pulls data from production and other sources. Once the data is pulled into an offline staging area, it is cleansed, transformed and loaded in a sanitized, uniform and well-organized manner, so that you can run queries, reports and all kinds of analysis on it.

What is not a Data Warehouse?


A Data Warehouse typically does not provide online information, as most of the data extraction and consolidation happens in end-of-day batch processing. For online, transaction-based queries, OLTP (Online Transaction Processing) systems are used. A Data Warehouse is not business intelligence or OLAP. It is a repository of sanitized and consolidated data, which can be used for any purpose including Business Intelligence. The usage can be for transaction reports, data mining, data analysis, statistical forecasting, valuation systems and so on. Technically speaking, we can have a good data warehouse without good business intelligence (which is a combination of Data Analysis + Data Mining + Performance Management Reporting).

Data Warehouse vs. Operational Data Store (ODS) - They are different
A Data Warehouse is an 'offline' integration of data, whereas an Operational Data Store is an 'online' integration of data. An ODS is used when data at a transaction (processing as well as querying) level is dispersed across various systems, and one needs to bring it together on an online basis. For example- let us say that you want a single view of a customer to be used by customer service, whereby they can also update the data in that single view on an online basis. However, the data on the customer (OPD records, hospitalization records, diagnostic records, pharmaceutical purchase records ...) is lying in different databases. An ODS could be a good choice here. The above-said concept of ODS is an ideal one. Another option is for an Operational Data Store to be used for online queries where the information it provides is not real-time, but pertains to the last end of day. For example, you have a single customer view, but it does not include the transactions which the customer has done today. For this kind of need, sometimes even the Data Warehouse repository can be used.

Data Warehouse vs. Data Mart in Business Intelligence


A Data Warehouse is an enterprise-level repository, which is a combination of various data marts. The Data Warehouse carries all the dimensions and measures required, and it ensures the integrity of the same dimensions and measures across different data marts. A data mart is a limited set of dimensions and measures used for a specific business theme, populated out of the Data Warehouse data sets. Typically an organization's business intelligence agenda starts with a few data marts before maturing into a full-blown data warehouse. However, most of the design & development concepts apply equally to a data mart. Please refer to De-Normalized Data Warehouse/Data Mart for a detailed comparison between a Data Warehouse and Data Marts.


Chapter 2 Data Warehouse Components and Framework


The Data Warehouse framework starts with extracting data from source systems, transforming and cleansing it, before loading it into the repository. It ends with the data being accessed, analyzed, mined and dashboarded using end-user tools.

This page presents a high-level listing of the components linked to a Data Warehouse. Please refer to Business Intelligence Architecture for the complete big picture on the relevance and positioning of the Data Warehouse.


Source systems and Databases


Source systems are all those 'transaction/production' raw-data providers from which the details are pulled out to make them suitable for Data Warehousing. The sources can be quite diverse: production databases like Oracle, Sybase and SQL Server; Excel sheets; databases of small-time applications like MS Access; ASCII/flat data files.

Data Staging 'Area'


The data staging area is the place where all 'grooming' is done on the data after it is pulled from the source systems. The end point of grooming is for the data to be loaded into the 'analysis or presentation server'. Data staging covers most of the 'back-bone' activities of a Data Warehouse, which are typically also the biggest analytical and technical challenge of a project. These activities are 'Extraction' and 'Transformation'.

ETL-Data Extraction
Data Extraction is an activity which pulls the data from various data sources. Most of these sources are production systems or are used for transaction-level work.

ETL-Data Transformation
If Data Extraction is mining the iron ore, Transformation is creating the steel billets. Transformation makes sure that the transaction-level raw data is transformed into a form (while still being detailed) so that it can be loaded into the 'presentation/loaded' area.

ETL-Presentation/Loaded 'Area'
This is the repository where the data is finally loaded after going through all the work of Extraction and Transformation. This becomes the ultimate source of information for purposes ranging from queries to advanced data modeling.

Dimensional Model
The presentation area has a data model which is different from that of the production system. This is called the Dimensional Model. It is the way data is organized in a data warehouse. This concept is dealt with in a fair degree of detail, as it is the engine of the Data Warehouse.

Meta Data
The Meta Data subject is covered in a separate section. It contains all the business and technical designs, rules, locations etc. of all the data, starting from Extraction through to final data usage.

End User Tools and Applications


Data is cooked for consumption. There is a long list of applications to which the data can be put, and of tools which can make it happen. This includes the reporting, publishing, analysis, modeling and mining tools.

Data-Warehouse Administration and Tools


A data warehouse is a large platform with a large number of users, data sources and data targets. Just like production systems, it has to be administered in terms of performance, timelines and availability. This also includes activity logging, data security, backup and archiving.

Data- Marts
The entire Data Warehouse section is equally applicable to a Data Mart. A Data Mart is a data repository with a more restricted and short-term perspective. Please refer to De-Normalized Data Warehouse/Data Mart for similarities and differences between a Data Warehouse and a Data Mart.

OLAP Servers & Data Marts


While the Data Warehouse can be accessed by any end-user tool or application, it also feeds the downstream OLAP layer. For example, HR may want its own data mart on separate servers for confidentiality reasons. Similarly, people who are traveling may need their own offline data mart.


Chapter 3 Data Warehouse Challenges and Issues


A Data Warehouse initiative is more challenging than a transactional system; you had better read this to understand what you are in for. In spite of proven ROIs for well-implemented projects, the proportion of Data Warehouse project failures is fairly high. Failures can take various forms:

- Functional- The project is not able to deliver the functionality and analysis capabilities.
- Technical- The technology platform and services don't work.
- Publishing- The availability of data is not as per expectations.
- Usage- Even if the project is well delivered, the capabilities and information are not used.
- Achievement of business goals- Even if the information is used, it is not able to drive the expected business goals.

The driver behind these failures is that the hype & glamour of Data Warehousing have overtaken the diligence (especially when Data Warehouse and Data Management initiatives and knowledge are yet to become mass awareness and expertise). The diligence is required because Data Warehouse projects are quite different from business systems projects. While typical business systems projects may suffer from similar challenges, their intensity is much higher in Data Warehouse projects. Here are the Data Warehouse challenges which are unique:

Data Warehouse vs. OLTP Transaction Systems


Business Benefits
- Transaction/Production Business System: Tangible benefits in terms of functional capabilities, business processes that will be automated, headcount reduction etc. Typically, the process getting automated is being done manually, and there is enough visible pain at the ground level and with customers.
- Data Warehouse System: There is a lesser proportion of initiatives where 'heaven will fall' if the project is not done. The benefits can be appreciated by fewer people, and much fewer at the ground level.

Usage
- Transaction/Production Business System: A business system, once implemented, drives its own usage, as it typically automates a business process.
- Data Warehouse System: A Data Warehouse platform has a lesser compulsion for usage, unless there are critical operational reports required.

Measure of Usage
- Transaction/Production Business System: One can specify the measure of usage for a business system in terms of processed units and number of users.
- Data Warehouse System: While the number of users and the number of queries does represent the level of usage, it by no means suggests that the usage is resulting in delivery of the final outcome.

Skills and Expertise Requirements- Business
- Transaction/Production Business System: A business system requires expertise on business process knowledge.
- Data Warehouse System: More knowledge is required, horizontally and vertically. One needs much higher domain experience as well as cross-functional knowledge for effective business role fulfillment in a Data Warehouse project. The domain expertise also includes all three levels (strategic, managerial and operational).

Business Requirements
- Transaction/Production Business System: Defining and prioritizing the business requirements is easier, as a business system automates an existing process and/or severely needed business functionality.
- Data Warehouse System: What analysis one needs, why, and what one will do once it is available are questions which demand and challenge the management and strategic thought process. Unlike a business process, analysis for any problem can be done in a hundred different ways. Therefore, business requirements tend to change throughout a Data Warehouse project.

Business users availability and engagement
- Transaction/Production Business System: Business users are more available and engaged. It is easier to provide and confirm the requirements of a business process automation.
- Data Warehouse System: It is difficult to define the information and analysis needs. Business users are too busy doing day-to-day work to dwell upon these questions.

The demands on the Database
- Transaction/Production Business System: The queries and data access are predictable, as they are driven by the mapping of transaction types, instances etc. A typical transaction touches only certain tables and certain records. Mostly, the large and all-encompassing processing happens in end-of-day processing.
- Data Warehouse System: A Data Warehouse cannot predict the kind and incidence of queries on the system. A query can access all the tables and records.

Variety of front-end applications
- Transaction/Production Business System: A business system has pre-defined back-end and front-end applications accessing the back-end database.
- Data Warehouse System: A Data Warehouse could have new front-end applications added on an ongoing basis. This includes OLAP tools, data mining applications, business performance management applications, and online user query and reporting applications.

Expectations of flexibility to enhancements
- Transaction/Production Business System: A typical business system has an ever-increasing list of enhancements. However, it is expected that the enhancements will take time and the system will go through well-spaced-out releases.
- Data Warehouse System: A Data Warehouse is expected to provide granular enhancements in most cases. Its design has to be flexible enough to incorporate new dimensions, measures and source systems without unsettling the foundations.

Chapter 4 Data Warehouse Purpose and Objective- Why is Data Warehouse Needed?
A Data Warehouse is designed to meet business needs which a transaction system cannot (and vice versa). The big question is: why do we have to create a Data Warehouse? The reasons are as follows. As a starter, let's say you want to know the monthly variations in the 3-month running average of your customer balances over the last twelve months, grouped by products + channels + customer segments. Let's see why you need a data warehouse for this purpose.
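To make the ask concrete, here is roughly what that question translates to in SQL. This is a hedged sketch against a hypothetical balance_fact table (month_id, product_id, channel_id, segment_id, balance); the table and column names are illustrative, not from any specific system:

```sql
-- Monthly variations in the 3-month running average of customer balances,
-- grouped by product + channel + customer segment (hypothetical schema).
WITH monthly AS (
    SELECT product_id, channel_id, segment_id, month_id,
           SUM(balance) AS total_balance
    FROM balance_fact                      -- restrict to the last 12 months
    GROUP BY product_id, channel_id, segment_id, month_id
),
smoothed AS (
    SELECT product_id, channel_id, segment_id, month_id,
           AVG(total_balance) OVER (
               PARTITION BY product_id, channel_id, segment_id
               ORDER BY month_id
               ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
           ) AS running_avg_3m             -- 3-month running average
    FROM monthly
)
SELECT product_id, channel_id, segment_id, month_id, running_avg_3m,
       running_avg_3m - LAG(running_avg_3m) OVER (
           PARTITION BY product_id, channel_id, segment_id
           ORDER BY month_id
       ) AS month_on_month_variation       -- the monthly variation
FROM smoothed;
```

Even as a sketch, it shows why such analysis touches a lot of data: every row of the fact table in the window, aggregated and re-scanned for the running average.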

Keeping Analysis/Reporting and Production Separate


If you run the above-said query on your production systems, you will find that it locks all your tables and eats up most of your resources, as it accesses a lot of data and does a lot of calculations. This brings production work to a virtual halt. Imagine hundreds of such queries running at the same time on your production systems. Reporting and analysis work typically accesses data across the database tables, whereas production work typically accesses a specific customer, product or channel record at a point of time. That's why it is important to have the information-generation work done from an offline platform (aka the Data Warehouse). A purpose of the Data Warehouse is to keep analysis/reporting (non-production use of data) separate from production data.

Information Integration from multiple systems- Single point source for information
As an example- let's say you have different systems for a loan product vs. a credit card product. The above-said query, if run on production, will need to pick the data on a real-time basis from these systems. This will make the query extremely slow, and it will need to do the joins in intermediate tables or in run-time memory. Moreover, the result will not be reliable, as at a particular point of time the databases may not be in sync, since much of such syncing happens in end-of-day batch runs.

DW purpose for Data Consistency and Quality


Organizations are riddled with tens of important systems from which their information comes. Each of these systems may carry the information in different formats and may also hold out-of-sync information (different customer ID formats, mismatches in supplier statuses). By bringing the data from these disparate sources to a common place, one can effectively undertake to bring uniformity and consistency to the data (refer to Cleansing and Data Transformation).

High Response Time- Production Databases are tuned to expected transaction load
Even if you run the above-said query on an offline database, it will take a lot of time if the database design is the same as that of production. This is because production databases are created to cater to production work. In production systems, there is some expected level of intensity for different kinds of actions. Therefore, the indexing, normalization and other design considerations are for given transaction loads. However, the Data Warehouse has to be ready for fairly unexpected loads and types of queries, which demands a high degree of flexibility and quick response time.


High Response Time- Normalized vs. Dimensional Data Modeling

Production/source system databases are typically normalized to ensure integrity and non-redundancy of data. This kind of design is fine for transactions, which involve a few records at a time. However, for large analysis and mining queries, the response time on normalized databases will be slow, given the joins that have to be created.
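To illustrate the join burden, here is a hedged sketch of the same question asked against a hypothetical normalized schema and a hypothetical dimensional one (all table and column names are made up for illustration):

```sql
-- In a normalized (3NF) source schema, "sales by state and product group"
-- walks a chain of joins through the entity tables:
SELECT st.state_name, pg.group_name, SUM(ol.amount) AS sales
FROM order_line ol
JOIN orders o         ON o.order_id    = ol.order_id
JOIN product p        ON p.product_id  = ol.product_id
JOIN product_group pg ON pg.group_id   = p.group_id
JOIN customer c       ON c.customer_id = o.customer_id
JOIN city ci          ON ci.city_id    = c.city_id
JOIN state st         ON st.state_id   = ci.state_id
GROUP BY st.state_name, pg.group_name;

-- Against a dimensional model the same question is a single star join,
-- because the hierarchies are flattened into the dimension tables:
SELECT dl.state_name, dp.group_name, SUM(f.sales_amount) AS sales
FROM sales_fact f
JOIN dim_location dl ON dl.location_key = f.location_key
JOIN dim_product  dp ON dp.product_key  = f.product_key
GROUP BY dl.state_name, dp.group_name;
```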

Data Warehouse objective of providing an adaptive and flexible source of information


It's easier for users to define the production work and functionalities they want, but difficult to define the analysis they need. The analysis needs keep changing, and the Data Warehouse has the capability to adapt quickly to the changing requirements. Please refer to 'Dimension Modeling'.

Establish the foundation for Decision Support


The decision-making process of an organization involves analysis, data mining, forecasting, decision modeling etc. A common point which can provide consistent, quality data with high response time is the core enabler for making fast and informed decisions.

Chapter 5 Data Warehouse Dimensional Model- Concepts and Components

The dimensional model is the equivalent of the logical data design of a Data Warehouse, and more. It is simpler in design and suits the purpose of a data warehouse. Dimensional modeling is a logical design technique that seeks to present the data in a standard, intuitive framework that allows for high-performance access. It is inherently dimensional, and it adheres to a discipline that uses the relational model with some important restrictions. Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. (See Figure.) This characteristic 'star-like' structure is often called a star join. A fact table, because it has a multi-part primary key made up of two or more foreign keys, always expresses a many-to-many relationship. The most useful fact tables also contain one or more numerical measures, or 'facts', that occur for the combination of keys that define each record. In the figure, the facts are Units_Sold, Dollars_Sold, and Avg_sales. The most useful facts in a fact table are numeric and additive. Additivity is crucial because data warehouse applications almost never retrieve a single fact table record; rather, they fetch back hundreds, thousands, or even millions of these records at a time, and the only useful thing to do with so many records is to add them up. Dimension tables, by contrast, most often contain descriptive textual information and the attributes (also called classification attributes) which are used for analysis. Dimension attributes are used as the source of most of the interesting constraints in data warehouse queries, and they are virtually always the source of the row headers in the SQL answer set.

Fact Table and Dimension Tables in a Dimensional Model Schema


Let's consider a Data-Warehouse cube. This cube has 4 dimensions and two measures. This means that for every combination of values of these 4 dimensions there will be two coordinate values. For example: Co-ordinate [City (X), Product (Y), Channel (Z), Month] = [Sales (Quantity), Sales (Value)], or [NY, Standard Desk-top, Mail, September 2005] = [2000 units, $15000]. In the dimensional modeling schema, the FACT table contains the coordinate values against the lowest granularity of all the possible combinations of dimensions. The dimension tables contain the details of the dimensions, which include the attributes of the dimensions, including all the higher-level hierarchies. The link between the fact table and each associated dimension table is through a dimension key, which is the lowest-granularity primary key of the dimension table.
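As a minimal sketch of how this cube example could be laid out as tables, using hypothetical names (an actual design would depend on the platform and requirements):

```sql
-- One dimension table per dimension of the cube.
CREATE TABLE dim_city    (city_key    INT PRIMARY KEY, city_name VARCHAR(50), state VARCHAR(50));
CREATE TABLE dim_product (product_key INT PRIMARY KEY, product_name VARCHAR(50), product_line VARCHAR(50));
CREATE TABLE dim_channel (channel_key INT PRIMARY KEY, channel_name VARCHAR(50));
CREATE TABLE dim_month   (month_key   INT PRIMARY KEY, month_name VARCHAR(20), quarter VARCHAR(10), year INT);

-- The fact row for [NY, Standard Desk-top, Mail, September 2005]
-- = [2000 units, $15000] is one row keyed by the four dimension keys.
CREATE TABLE sales_fact (
    city_key       INT REFERENCES dim_city,
    product_key    INT REFERENCES dim_product,
    channel_key    INT REFERENCES dim_channel,
    month_key      INT REFERENCES dim_month,
    sales_quantity INT,            -- Sales (Quantity)
    sales_value    DECIMAL(12,2),  -- Sales (Value)
    PRIMARY KEY (city_key, product_key, channel_key, month_key)  -- multi-part key
);
```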



Fact Table- The central linkage in Dimensional Modeling


A fact table contains the values of all the measures for the set of dimensions linked to it, at the combination of the lowest level of granularity of those dimensions. The measures are typically numeric, and can undergo mathematical aggregation and analysis.
Families of FACT Tables

- Chains and circles
- Heterogeneous products
- Transactions and snapshots
- Aggregates

Dimension Table- What does and should it contain


The dimension table contains all the information on the dimension. This includes:

a. The primary key (with an equivalent foreign key in the fact table).
b. All attributes of the dimension. These include:

- The hierarchy attributes- Consider a business hierarchy of pin-code to city to district to state to country for the location dimension. Each hierarchy element will be an attribute.
- Textual as well as code attributes- The location code as well as the name of the location. This is required because both could be used for different reasons by different users. A power user could be looking for the location code (NY01), whereas an end user could be looking for a more explicit header (New York).
- All parallel hierarchies- A product could have different hierarchies, depending upon whether the CFO or the Head of Sales is looking at it. This enables analysis to be done on all hierarchies as well as across hierarchies.
- Surrogate primary key (the link to the FACT table)- These keys are used because the production keys could change or could be reused. For example, a bill number could be reused after 5 years, or a part number (especially in FMCG) could be reused after a few years.
- Production or source system key- This is required for auditability and the link back to the extraction data and source systems.
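Putting the above together, here is a sketch of what such a location dimension could look like, with hypothetical column names:

```sql
CREATE TABLE dim_location (
    location_key       INT PRIMARY KEY,  -- surrogate key, joined to the fact table
    location_code      VARCHAR(10),      -- code attribute, e.g. 'NY01' (power users)
    location_name      VARCHAR(50),      -- textual attribute (end users)
    pin_code           VARCHAR(10),      -- lowest hierarchy level
    city               VARCHAR(50),
    district           VARCHAR(50),
    state              VARCHAR(50),
    country            VARCHAR(50),      -- full hierarchy flattened into the row
    source_location_id VARCHAR(20)       -- production/source system key, for auditability
);
```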

Chapter 6 Dimensional Model Schemas- Star, Snow-Flake and Constellation

A dimensional model can be organized in a star schema or a snow-flake schema.

Dimensional Model Star Schema using Star Query

The star schema is perhaps the simplest data warehouse schema. It is called a star schema because the entity-relationship diagram of this schema resembles a star, with points radiating from a central table. The center of the star consists of a large fact table, and the points of the star are the dimension tables. A star schema is characterized by one or more very large fact tables that contain the primary information in the data warehouse, and a number of much smaller dimension tables (or lookup tables), each of which contains information about the entries for a particular attribute in the fact table. A star query is a join between a fact table and a number of dimension tables. Each dimension table is joined to the fact table using a primary key to foreign key join, but the dimension tables are not joined to each other. The cost-based optimizer recognizes star queries and generates efficient execution plans for them.

A typical fact table contains keys and measures. For example, in the sample schema, the fact table, sales, contains the measures quantity_sold, amount, and average, and the keys time_key, item_key, branch_key, and location_key. The dimension tables are time, branch, item and location. A star join is a primary key to foreign key join of the dimension tables to a fact table. The main advantages of star schemas are that they:

- Provide a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
- Provide highly optimized performance for typical star queries.
- Are widely supported by a large number of business intelligence tools, which may anticipate or even require that the data-warehouse schema contains dimension tables.
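As a sketch, a star query against the sample schema described above could look like this (the branch dimension is omitted for brevity; names are assumed from the description):

```sql
-- Each dimension joins to the fact on its key; the dimension tables are
-- never joined to each other.
SELECT t.year, l.state, i.item_name,
       SUM(s.quantity_sold) AS quantity_sold,
       SUM(s.amount)        AS amount
FROM sales s
JOIN time     t ON t.time_key     = s.time_key
JOIN item     i ON i.item_key     = s.item_key
JOIN location l ON l.location_key = s.location_key
WHERE t.year = 2005
GROUP BY t.year, l.state, i.item_name;
```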

Snow-Flake Schema in Dimensional Modeling

The snowflake schema is a more complex data warehouse model than a star schema, and is a type of star schema. It is called a snowflake schema because the diagram of the schema resembles a snowflake. Snowflake schemas normalize dimensions to eliminate redundancy. That is, the dimension data is grouped into multiple tables instead of one large table. For example, a location dimension table in a star schema might be normalized into a location table and a city table in a snowflake schema. While this saves space, it increases the number of dimension tables and requires more foreign key joins. The result is more complex queries and reduced query performance. The figure above presents a graphical representation of a snowflake schema.

Fact Constellation Schema

This schema is used mainly for aggregate fact tables, or where we want to split a fact table for better comprehension. The split of a fact table is done only when we want to focus on aggregation over a few facts & dimensions.



Chapter 7 Dimensional Modeling vs. Relational Modeling

Dimensional modeling is different from OLTP normalized modeling; it enables analysis and querying through massive and unpredicted queries, something which a relational model is ill-equipped to handle.

How is a dimensional model different from an E-R diagram?

An E-R diagram (used in OLTP or transactional systems) has a highly normalized model (even at a logical level), whereas a dimensional model aggregates most of the attributes and hierarchies of a dimension into a single entity. An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas the dimensional model has logically grouped sets of star-schemas. The E-R diagram is split as per the entities. A dimensional model is split as per the dimensions and facts. In an E-R diagram all attributes of an entity, textual as well as numeric, belong to the entity table. In a dimensional model, a 'dimension' entity has mostly the textual attributes, and the 'fact' entity has mostly numeric attributes.

Dimensional modeling is a better approach for Data warehouse compared to standard Data Model.
The dimensional model has a number of important data warehouse advantages that the ER model lacks. The first advantage of the dimensional model is that there is a standard framework with standard types of joins. All dimensions can be thought of as symmetrically equal entry points into the fact table. The logical design can be done independent of expected query patterns. The user interfaces are symmetrical, the query strategies are symmetrical, and the SQL generated against the dimensional model is symmetrical. In other words:

You will never find attributes in fact tables or facts in dimension tables. If you see a non-fact field in the fact table, you can assume that it is a key to a dimension table.

The second advantage of the dimensional model is that it is smoothly extensible to accommodate unexpected new data elements and new design decisions. All existing tables (both fact and dimension) can be changed in place by simply adding new data rows in the table. Data should not have to be reloaded. Typically, no query tool or reporting tool needs to be reprogrammed to accommodate the change. All old applications continue to run without yielding different results. You can make the following graceful changes to the design after the data warehouse is up and running:

- Adding new unanticipated facts (that is, new additive numeric fields in the fact table), as long as they are consistent with the fundamental grain of the existing fact table.
- Adding completely new dimensions, as long as there is a single value of that dimension defined for each existing fact record.
- Adding new, unanticipated dimensional attributes.
- Breaking existing dimension records down to a lower level of granularity from a certain point in time forward.

The third advantage of the dimensional model is that there is a body of standard approaches for handling common modeling situations in the business world. Each of these situations has a well-understood set of alternatives that can be specifically programmed in report writers, query tools, and other user interfaces. These modeling situations include:

- Slowly changing dimensions, where a 'constant' dimension such as Product or Customer actually evolves slowly and asynchronously. Dimensional modeling provides specific techniques for handling slowly changing dimensions, depending on the business environment.
- Heterogeneous products, where a business such as a bank needs to:
  o Track a number of different lines of business together within a single common set of attributes and facts, but at the same time
  o Describe and measure the individual lines of business in highly idiosyncratic ways using incompatible measures.



Chapter 8 Foundation & Conformed Dimensions and Facts in Data Warehouse Dimensional Model
The Data Warehouse is a repository which feeds data marts and other downstream systems. It has to be designed to have a global, re-usable set of dimensions and measures.

Data Warehouse modeling has two components:


- Foundation- to support medium to long-term capabilities, without the need to unsettle the structure time and again.
- Phases- the individual phases of development of Data Marts eventually merge into the enterprise-wide Data Warehouse.

A project has to address both the foundation and phase elements. Every stage in the Data Warehouse project will address these two elements in a distinct and overt manner. For dimensional modeling, the following foundation-setting elements work like reusable components. They will be the same across the Data Marts/Data Warehouse for current and future phases of development:

Standard set of foundation or conformed dimensions. This means that:

- Dimensions are super-sets of all possible attributes for that dimension. For example, the customer 'age' attribute may not be required for sales analysis, but is required for credit analysis. Therefore, when creating the standard dimensions, one makes the superset of attributes.
- Dimensions include all possible levels of business hierarchy. For example, a portfolio analysis of a channel may not require the branch-level location, but the agent productivity analysis could.
- Dimensions include not only categories, but descriptive textual attributes as well, wherever needed. For example, a textual detail for a location code could be needed for distribution analysis, but may not be needed for portfolio analysis.
- Make the dimension most granular. Many a times the analysis does not need to go down to the most granular level of customer ID. But in case a customer moves from his existing customer segment, the whole dimensional model could run into issues if the dimension starts from the customer group upwards.

Examples of foundation dimensions are- Customer, Location, Channel, Sales Lead etc.

Standard set of foundation or conformed facts. This means that:

- A fact table will include all possible units of measure for a given set of dimensions. For example, sales by numbers could need only the number of 'Crates' in one data mart and 'Pieces' in the other. However, both units for the given measure should be included even if there is a standard conversion rate, as these standard conversion rates keep changing with time.
- A fact table logically groups a business instance. For example, you could require the distribution of a 'product' to retail outlets for distribution analysis. However, you will require the fact on the final sale to the end customer for sales analysis. As a guideline, highly linked business processes should get combined in a single fact.

Standard set of foundation measures. This means that:


All the measures and their possible units are to be listed out. Measures are most susceptible to having confusing definitions or being misnamed. Detailed formulas behind measures are a must. Refer to Sales Revenue Fact-Measure as an example.

Examples of foundation measures are- Sales Measures, Customer Measures, etc.



Chapter 9 Slowly Changing Dimensions (SCD) in Dimensional Modeling

The dimensional model has to address some complex situations like slowly changing dimensions.

Slowly Changing Dimensions


Entities change over time. Customer demographics, product characteristics, classification rules, statuses of customers etc. lead to changes in the attributes of dimensions. In a transaction system, many a times the change is overwritten and the track of the change is lost. For example, a source system may have only the latest customer PIN code, as it is needed to send the marketing and billing statements. However, a data warehouse needs to maintain all the previous PIN codes as well, because we need to track how many customers move to new locations and at what frequency. A key benefit of a Data Warehouse is to provide historical information, which is typically over-written (and thus lost) in the transaction systems. How slowly changing dimensions are handled in a dimensional model is a key determinant of that benefit.



There are three ways to handle the same:


Slowly Changing Dimension method 1 (In short SCD 1)

This is the way most source systems handle it- overwrite the attribute value. For example, if a customer's marital status has moved from 'Unmarried' to 'Married', we overwrite 'Unmarried' with 'Married'. Similarly, if an insurance policy status has moved from 'Lapsed' to 'Re-instated', the new status is overwritten on the old status. This is obviously done when we are not analyzing the historical information.
Slowly Changing Dimension Method 2 (in short SCD 2)

This is the true-blue technique to deliver precise historical analysis. It is used when there is more than one change in the attributes of an entity, and we need to track the date of change of the attribute. In this method, a new record is added, whereby the new record is given a separate identifier as the primary key. We cannot use the production key as the primary key here, as it has not changed (the Customer ID has remained the same, while the value of its attribute 'marital status' has changed). This new identifier is called the surrogate key. Apart from adding a new record and providing a new primary (surrogate) key, the validity date for this new record is also added. For example- you have a dimension table with customer_ID '110002' and marital status 'Single'. Over time, the customer gets married and also moves to a new location. The customer dimension records will be:
Surrogate Key | Customer ID | Marital Status | Valid Date    | Date of Birth | City
1100021       | 110002      | Single         | Sept 23, 2004 | Jan 8, 1982   | Palo Alto
1100022       | 110002      | Married        | Oct 25, 2005  | Jan 8, 1982   | Palo Alto
1100023       | 110002      | Married        | Nov 23, 2005  | Jan 8, 1982   | San Francisco

Slowly changing dimension method 3 (SCD 3)

This is a mid-way between method 1 and method 2. Here we don't add an additional record, but add a new field for the 'old attribute value'. However, this has limitations. This method has to know from the beginning which attributes will change, because a new field/attribute has to be added in the design for every attribute that can change. Secondly, an attribute can change at most once in the lifetime of the entity, or at least in the lifetime of the data warehouse.
Surrogate Key | Customer ID | Marital Status | Date of Birth | City          | Marital Status (Old) | City (Old)
1100021       | 110002      | Married        | Jan 8, 1982   | San Francisco | Single               | Palo Alto

NOTE- The term 'slowly changing dimension' is used because it is a universally acknowledged term. However, the same methods apply to fast-changing dimensions as well.

Data Analysis/OLAP
Data Analysis/OLAP is the most fundamental way to make sense out of your data. It involves looking at the data from all possible angles: slicing & dicing on various dimensions, drilling up/down, applying filters, exception highlighting, graphs and other presentation tools, and doing time-trending analysis. Whether you are doing a pivot in Excel or creating advanced views in an upmarket OLAP tool, most of the usage of data in today's world falls within the realm of Data Analysis/OLAP. It is essentially a post-graduate course before you go for a fellowship in Data Mining.

Chapter 1 Online Analytic Processing (OLAP)-Overview


OLAP in Business Intelligence- What is OLAP?
This topic provides a high-level concept of OLAP and how it fits into the BI framework. It presents different OLAP options and shows that OLAP is a layer between the Data Warehouse and the BI end-user tools. Online Analytic Processing is the capability to store and manage data in a way that it can be effectively used to generate actionable information. As you are aware from the Business Intelligence Architecture, OLAP sits between the Data Warehouse and the end-user tools. OLAP is explained in detail in OLAP vs. Data Warehouse.

OLAP makes Business Intelligence happen, broadly by enabling the following:


- Transforming the data into multi-dimensional cubes
- Summarized, pre-aggregated and derived data
- Strong query management
- Multitude of calculation and modeling functions

A data-warehouse may hold data in various formats: dimensional (with a high degree of de-normalization) or highly relational (like 3rd normal form). As a separate note- we have covered the entire data-warehouse chapter on the basis of dimensional-modeling-based storage. Most of the concepts in the data-warehouse chapter remain the same irrespective of the kind of storage and data modeling one needs to do. The detailed differences between OLAP and the Data Warehouse are given in OLAP Layer. OLAP provides the building blocks to enable analysis (like rich functions, multi-dimensional models, analysis types...). Mostly it is the end-user tools (like business modeling tools, data mining tools, performance reporting tools...), which sit on top of OLAP, that provide the rich Business Intelligence user interface. OLAP and the Data Warehouse work in conjunction to provide overall data access for the end-user tools. You may like to refer to BI Architecture Scenarios for better background. There are different ways to store the data in an OLAP + Data Warehouse combination. While you can refer to OLAP Architectures in BI Architecture, here is the brief:

MOLAP: OLAP storing the data in multi-dimensional mode. To put it simplistically, there is one array for each combination of dimensions and associated measures. In this storage method there is no connection between the MOLAP database and the data-warehouse database for query purposes. It means that a user cannot drill down from MOLAP summary data to the transaction-level data of the data-warehouse.



ROLAP: OLAP storing the data in relational form in a dimensional model. This is a de-normalized form in a relational table structure. The ROLAP database of the OLAP server can be linked to the Data-warehouse database.

HOLAP: The aggregate data is stored in the multi-dimensional model in the OLAP database, and the transaction-level data is stored in relational form in the data-warehouse database. There is a linkage between the summary MOLAP database of OLAP and the relational transactional database of the Data-warehouse. This gives you the best of both worlds.

Chapter 2 Basic Data Analysis Types- Building Blocks


Drill (vertical) and Cross (horizontal) Navigation and Analysis

These are the methods of moving vertically and horizontally within the dimensional structure of the Data-warehouse and OLAP. These terms are used more in the context of OLAP, because typically various end-user Business Intelligence tools sit on top of OLAP, which in turn sits over the Data-Warehouse.

Drill-down Navigation

It is a method of exploring more detailed data, done by revealing lower-level data than was previously displayed. For instance, you can drill down from state to city to offices. The available levels depend on the granularity of the data in the OLAP and data warehouse.
Drill-link

A URL hyperlink to a destination, defining the parameters for the drill, such as the document name and prompt answers. When the document is viewed on the Web, a user can click the link to navigate to the link's destination.
Roll-up Navigation

A method of exploring more widely summarized data. It's an antonym of drill-down. Typically you move up a dimension hierarchy. For example, you have the office-level break-up of sales revenue, and you can roll it up to city, zone, region and country level figures.
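As a sketch against a hypothetical star schema, drill-down and roll-up are the same aggregation asked at different levels of the dimension hierarchy:

```sql
-- Rolled up: sales revenue at the state level.
SELECT dl.state, SUM(f.sales_amount) AS revenue
FROM sales_fact f
JOIN dim_location dl ON dl.location_key = f.location_key
GROUP BY dl.state;

-- Drilled down: the same measure revealed at state -> city -> office level.
SELECT dl.state, dl.city, dl.office, SUM(f.sales_amount) AS revenue
FROM sales_fact f
JOIN dim_location dl ON dl.location_key = f.location_key
GROUP BY dl.state, dl.city, dl.office;
```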

Cross-Dimensional (horizontal) Navigation and Analysis

Cross-dimensional analysis is analysis across multiple dimensions- the key reason why OLAP and its multi-dimensional structure exist. Most business reporting and analysis goes across dimensions. A single-dimension analysis is when you get measures for a single dimension. For example- when one looks for measures like sales, headcount of employees, operating expenses etc. for the 'location' dimension (office, city, state, region, country...). A cross-dimension example would be to look for measures like sales, gross profit etc. for the 'location' dimension (office, city, state, region, country...) for a given set of products, for a given number of quarters. If you top this kind of example with other analysis types (max-min, exception, filtration), you come close to the real-life complexity of a business analysis query. One example can be: identifying the top ten offices where the sales for the 'washing and cleaning' product range are more than the average sales for this product range across all offices, considering only those offices which have been open for more than 3 years and have an average growth of 5% per quarter over the last 4 quarters. Cross-dimensional analysis capability within an OLAP server is also manifested in cross-dimensional navigation. For example- you are seeing a pie-chart of revenue share for different product lines. By clicking on the pie of a given product (general insurance- vehicles), you may like to go to a state-wise split of the revenue of that product. Going further, you may like to click on a given state (New York, California...) and look for the split across channels (telemarketing, sales employees, tied agents, corporate agents, 3rd party brokers...). In the above examples, you are able to seamlessly navigate and drill across due to the cross-dimensional linkages.

Here is the list of cross-dimensional analyses you can perform:


Drill-across Dimensions

You drill across dimensions when you move from one dimension to another. For example, you are looking at the revenue break-up for cities. However, now you want the break-up of revenue for various products (say fax machines, telephones and copiers) within a city (say New York). Within the fax machine product in New York City, you want to find the break-up as per the channels of telemarketing, mailers and direct sales. In the above example you have drilled across the dimensions of 'Location' -> 'Products' -> 'Channel'. This is one of the most important features and is a fundamental capability expected of an OLAP tool.
Drill across Measures

It is similar to drill-across dimensions. For example, you are doing sales revenue analysis and have been able to find the best and least performing offices. However, to have a further understanding of the picture, you now move across measures to find out about the sales transactions of these offices (a low revenue but a higher sales transaction count points to a certain level of activity), the number of sales staff (the low performing offices could have less staff), and the number of months since the office was set up (the new offices, being in the gestation period, could be performing lower).
Drill across Attributes

This is by all means the same as 'drill-across dimensions'. For example, you have the data for revenue in the US as per customer relationship value bands (say USD 10K to USD 20K / USD 20K to USD 50K / USD 50K and above). For the USD >50K band, you want the break-up as per age bands (18 to 25 years / 25 to 40 years / >40 years), and within >40 years, you want the break-up by occupation (self-employed, practicing professional, employed...). In this example we drill across the attributes of relationship value band -> age band -> occupation, all belonging to the customer dimension.

Time Trending Data Analysis


Time trending analysis includes period-to-period comparisons, across-the-periods and within-a-period analysis. Apart from being an important analysis lever, time trending is also core to performance management. One always wants to see how much the needle is moving over time.
Period Analysis- Beginning or End of Period

This is a 'balance-sheet' kind of analysis, where you find out the status of various measures (for example, account balances, number of offices, number of customers, number of defaults, number of patients in admission) at the end or beginning of a period (say end of month, beginning of quarter...).
During the period activity

This is a 'profit & loss' kind of analysis, where you find out the extent of activity done within a period. For example- sales revenue, number of patients admitted, number of festival-package flat-screen TVs sold... in a given month, quarter or week.




Time Trending through Period to Period Comparison

This is a typical business performance parameter. For example- the comparison of sales in the first quarter against the first quarter last year, or the sales in the New Year season this year vs. last year.
Time-Trending across a fixed vs. rolling period range

A fixed range is used when you see the time-trend across periods in a fixed time range: for example, the revenue figures across the twelve months of the calendar year (or business performance year), or YTD (year-to-date) and MTD (month-to-date) analysis, where the starting point of your reference is the beginning of the period. Investment analysis typically uses a rolling period range, where the stock movement across the last twelve months is tracked on a rolling basis. You can refer to Time Dimension to understand more facets of time-related analysis.
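As a sketch of a period-to-period comparison, assuming a hypothetical quarterly_sales view (year, quarter, revenue): with one row per quarter, ordered by time, the comparable value for the same quarter last year sits four rows back.

```sql
SELECT year, quarter, revenue,
       LAG(revenue, 4) OVER (ORDER BY year, quarter) AS same_quarter_last_year,
       revenue - LAG(revenue, 4) OVER (ORDER BY year, quarter) AS growth
FROM quarterly_sales;
```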

Exception Analysis
Exception analysis throws up the areas not meeting expectations.

No one wants to know about what is going as expected. People are interested in finding out what is going worse and what is better than expected.
Range Exception

This is the simplest of all. If the value of a certain measure goes beyond a range of values, the analysis should highlight it. The range can be a 'defined range' (the temperature of a patient should be between 98°F and 102°F) or an 'undefined range' (a stock index movement of more than 2% in either direction).
Value Exception

This analysis identifies whether an attribute or measure value belongs, or does not belong, to a specific list of values. For example, a high 'credit-risk' customer spending on a high-value 'product'.
Conditional Exception

This is when you want to identify the occurrence of certain conditions. For example- let us say you want to highlight the exception instances when a high-value customer relationship has not used your credit card for three months and has been paying only the minimum due amount.
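A sketch of that conditional exception as a query, with hypothetical tables and columns:

```sql
-- High-value relationships with no card usage for three months that have
-- been paying only the minimum due.
SELECT c.customer_id
FROM customer_value c
JOIN card_activity a ON a.customer_id = c.customer_id
WHERE c.value_band = 'HIGH'
  AND a.txn_count_last_3m = 0
  AND a.months_paying_minimum_due >= 3;
```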

Data Min-Max Analysis


Min-Max for Range Analysis

This is to see the maximum and minimum values a measure takes. For example, you want to find out the maximum value of geological disturbances in a given seismic location, or the maximum load generated when you switch on an electrical device.
Outliers Analysis

Outliers are similar to the range analysis, but with a different purpose. Outliers are exception values, less in occurrence, but having an impact on aggregations like sums and averages. For example, you want to find out the average delivery TAT for a product. The results could show that taking out the top 2% of instances brings down the overall average delivery TAT by 10%.
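A sketch of measuring the outlier effect described above, assuming a hypothetical deliveries table (delivery_id, tat_days):

```sql
-- Average delivery TAT with and without the top 2% slowest deliveries.
WITH ranked AS (
    SELECT tat_days,
           PERCENT_RANK() OVER (ORDER BY tat_days) AS pr
    FROM deliveries
)
SELECT AVG(tat_days)                               AS avg_all,
       AVG(CASE WHEN pr <= 0.98 THEN tat_days END) AS avg_without_top_2_pct
FROM ranked;
```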
Standard Deviation Analysis

Any performance is a combination of how one performs on average and how consistent the performance is. 'Standard deviation' is a measure of consistency.

Data Filtration Analysis


Like the other analysis types covered in this chapter, filtration analysis allows you to place filters on your queries. Applying a filter can be seen both as 'exclusion' and as 'selecting specific values for inclusion'. In a simplistic way, filtering is equivalent to the 'where' clause of an SQL query.

Here are the different ways you can do Data Filtration analysis:


Data Filter on specific values of a dimension by direct specifications:

Calculating sales for a select set of offices in a city, or calculating the operating expense across a given set of expense lines. The filtration can be on a combination of dimensions. For example- a select set of office locations which are selling a given line of products.




Data Filter on specific values of a dimension given certain conditions related to a dimension:

Calculating average sales revenue for only those office locations which have been operating for less than 6 months. There can be more complex conditions.
Filter on specific values of a dimension linked to measure values:

Calculating average sales value productivity for only those offices where the sale is less than the average sales per office across all offices.
Filtration analysis on tolerances and outliers:

When you are calculating averages (say), you may like to count out the values which are beyond certain tolerances (outliers). For example, calculation of average write-off values from cancelled credit cards, where the write-off is more than USD 10 and less than USD 20000.
Top and bottom filters:

Example-Filtering out the 'top 10', 'bottom 10', 'top 10%', 'bottom 10%'.
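A sketch of top/bottom filters in SQL, assuming a hypothetical office_sales view (office_id, sales):

```sql
-- Top 10 offices by sales.
SELECT office_id, sales
FROM office_sales
ORDER BY sales DESC
FETCH FIRST 10 ROWS ONLY;   -- LIMIT 10 on engines that use LIMIT

-- Bottom 10% of offices by sales.
SELECT office_id, sales
FROM (SELECT office_id, sales,
             PERCENT_RANK() OVER (ORDER BY sales) AS pct
      FROM office_sales) t
WHERE pct <= 0.10;
```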

Pivoting, Slicing & Dicing Analysis


Slicing means taking out a slice of a cube, given a certain set of select dimensions (customer segment), values (home furnishings...) and measures (sales revenue, sales units...) or KPIs (sales productivity). Dicing means viewing the slices from different angles. For example- revenue for different products within a given state, or revenue for different states for a given product. Slicing and dicing leads to what you can call a Pivot. Pivot is known in the Excel context. The pivot is the standard, basic look and feel of the views you create on OLAP cubes. A pivot creates the ability for you to create width and depth in your view of the data. A pivot is a two-dimensional layout of summary data: the X and Y axes are the dimensions, and the intersection cells for any two dimension values contain the values of the measures.



Here is an example of how you can slice and dice through pivot:
Step 1: Starting layout- You can have the product list on the Y axis (say 10 products) and the quarters (say four quarters) on the X axis. You can have sales value as the measure shown in the table against the intersection of a given product and a quarter. You will have a 10 x 4 matrix.

Step 2: Adding depth cross-dimensionally- Taking a step further, you can add a dimension of locations under the product to give it more depth. Therefore, you can now have different locations (say 3 locations) for each row of product. You will now have a 30 (3 locations for each of the 10 products) x 4 (quarters) matrix.

Step 3: Adding depth within a single dimension- You can also add another dimension like months under quarters. Now you will have 30 x 12 (3 months for each quarter). You can also specify if you want to have sub-totals for every dimension. For example, you can have the sub-totals for locations, products, months and quarters.

Step 4: Pivoting on an axis- You can also pivot your view and transpose the product + location combination onto the X axis and the quarter + month combination onto the Y axis.

Step 5: Adding width- Referring to the starting layout, you can also add dimensions in 'width' instead of 'depth'. For example, instead of having the location dimension under the product, you can add the location dimension adjacent to the product dimension. Therefore, you will have a matrix which on the Y axis has 10 rows (for 10 products) plus 3 rows (for 3 locations), i.e. a 13 x 4 matrix.
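As a sketch, the Step 1 layout corresponds to a pivot query like the following, with hypothetical table and column names; adding dim_location to the GROUP BY would produce the 'depth' of Step 2:

```sql
-- Products on rows, quarters as columns, sales value at the intersections.
SELECT dp.product_name,
       SUM(CASE WHEN dt.quarter = 'Q1' THEN f.sales_value END) AS q1,
       SUM(CASE WHEN dt.quarter = 'Q2' THEN f.sales_value END) AS q2,
       SUM(CASE WHEN dt.quarter = 'Q3' THEN f.sales_value END) AS q3,
       SUM(CASE WHEN dt.quarter = 'Q4' THEN f.sales_value END) AS q4
FROM sales_fact f
JOIN dim_product dp ON dp.product_key = f.product_key
JOIN dim_time    dt ON dt.time_key    = f.time_key
GROUP BY dp.product_name;
```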



Chapter 3 Advanced Data Analysis Types- Building Blocks

OLAP What-if Analysis


What-if analysis is essentially the scenario-building capability of OLAP. You can draw a straight parallel with what-if analysis in MS Excel, with more sophisticated capabilities in OLAP.

This is what you do in what-if analysis:


Create a what-if calculation model:

This is the calculation model on which you are going to apply different scenarios. A profit and loss projection for the next 5 years is one example of a calculation model. A calculation model takes a set of input values to give a set of output values. Each different set of these 'input and output value combinations' is called a what-if analysis scenario.
Creating different sets of what-if scenarios:

Depending upon your needs, you can build different scenarios of input values, apply those scenarios on the calculation model, and generate the output values. Here is a simplistic example:


Let's say that you have a profit and loss projection for the next 5 years as the calculation model. The following can be some of the input values to the model:

- Revenue in year 0
- Expected revenue rate of growth- CAGR (compound annual growth rate)
- Gross profit margin %
- Non-operating expenses
- Income tax rate
- Rate of dividend, etc.

The output values can include:


- Revenue
- Gross margin
- Operating margin
- Profit before tax
- Profit after tax
- Allocation to reserves
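As a sketch of applying scenarios to such a calculation model, assuming a hypothetical scenarios table holding the input values (generate_series is PostgreSQL syntax):

```sql
-- Each scenario row holds the input values; the projection compounds
-- revenue growth over years 1..5 and applies the gross margin.
SELECT s.scenario_name,
       y.yr,
       s.revenue_y0 * POWER(1 + s.revenue_cagr, y.yr)                      AS revenue,
       s.revenue_y0 * POWER(1 + s.revenue_cagr, y.yr) * s.gross_margin_pct AS gross_margin
FROM scenarios s
CROSS JOIN generate_series(1, 5) AS y(yr);
```

In an OLAP tool the model would live in the modeling layer and the scenario inputs/outputs would be written back to the cube, as described next.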

How what-if analysis works in OLAP


An OLAP deployment will have an end-user tool sitting on top of the OLAP server (like MS Excel or a business modeling tool). You will create a calculation model in the modeling tool, and the input and output data for each scenario is stored in OLAP. This is what you call 'write-back', and it is a key value-add of OLAP tools. This is one key area where OLAP differs from the Data Warehouse (which is read-only, for good reasons).
Using a combination of a good end-user tool and an OLAP server, you can do the following:

- All input as well as output data can lie entirely in OLAP.
- Input and output data can lie partially in OLAP and partially in the end-user tool. One reason for partial OLAP storage is that you may like to keep only certain select scenarios in OLAP; the rest of the scenarios could be more temporary or non-serious scenarios, which you may not like to store in OLAP (but do in the end-user tool).
- There is another cut for partial storage. You may like to keep only some part of the input value-set in OLAP and the rest in the analysis tool. This is because some input values may not be aligned to the multi-dimensional design of OLAP. For example, in a P&L calculation model, you may not have tax rate as a measure in the OLAP dimensional model. To write back the tax rate, you may have to change the dimensional model, which may or may not be worth the effort.
- Tagging of scenarios in terms of how you want to present them. For example, 'most probable' to 'least probable' scenarios.



OLAP Data Allocation Analysis


Allocation engines are another differentiator of OLAP vs. Data Warehouse. Simplistically, they allow users to automatically allocate, or in other words 'split', a value into multiple values.

OLAP Allocation Engine enables users to:


- Take a source data
- Define the basis of allocation
- Execute the allocation operation
- Store the allocated values to the target data

A Simple example:

You want to allocate the enterprise IT expense (source) to the IT expense of each line of business/department (target). The 'basis' of allocation is: 70% of the IT expense to be allocated on the basis of LAN IDs, and 30% on the basis of the business revenue generated by the line of business (non-earning departments will not be included in this allocation basis). The operation will be 'proportional' allocation. This means that the expense will be allocated proportionally on the basis of the number of LAN IDs and the business revenue (for 70% and 30% of the enterprise IT expense respectively). From the OLAP perspective, OLAP will pick the source data from within OLAP or from the Data Warehouse (if it is not stored in OLAP). It will apply the allocation basis, and the output values (IT expense for the period calculated for each line of business) will be stored in OLAP (as a write-back). As a side note, the source and target may be outside of OLAP, while still using the allocation engine of the OLAP server.
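A sketch of that proportional allocation as a query, with hypothetical tables (departments, it_expense); in a real OLAP allocation engine this would be configured rather than hand-written:

```sql
-- 70% of the total IT expense split by LAN ID count, 30% by revenue.
-- NULL revenue marks non-earning departments, which get no revenue share
-- (SUM ignores NULLs, COALESCE zeroes their numerator).
SELECT d.dept_id,
       0.70 * t.total_it_expense * d.lan_ids             / SUM(d.lan_ids) OVER ()
     + 0.30 * t.total_it_expense * COALESCE(d.revenue, 0) / SUM(d.revenue) OVER ()
       AS allocated_it_expense
FROM departments d
CROSS JOIN (SELECT SUM(amount) AS total_it_expense FROM it_expense) t;
```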

Here are the typical Allocation Analysis capabilities linked to OLAP:

- The source and basis can be formulas, so you can perform computations on existing data and use the result as the source or basis of the allocation. For example, you can have the IT expense be a formula summing various IT expense lines stored in OLAP (like license fees, data centre operations expenses, network expenses). The basis of allocation is typically a formula.
- You can specify the method of operation of the allocation for a dimension. The operations range from simple to very complex. You can have:

35

DWBI Essential Guide


Proportional allocation Even Allocation Combination of proportional and Even calculation etc. You can specify whether the allocated value is added to OR replaces the existing value of the target cell. Taking the same example of Enterprise IT expense, the whole IT expense is lying in the IT account. When you run allocation engine, IT department will also be allocated certain expense, as it also has got some LAN IDs for its own employees OR on-site vendors. You can specify if you want to replace this value, add to this value (not a correct option) OR store it separately.
o o o

You can specify an amount to add to OR multiply by the allocated value before the result is assigned to the target cell. This is an extension of what you call the 'allocation operation'. Taking the same IT example- Let's say that you have calculated the IT expense figure for each line of business/department. You may like to add a 2% additional expense overhead, as you allocate. This expense is to take care of any special IT initiatives, which may not be linked to business case driven IT projects. You can exclude certain values within a dimension hierarchy so that both the source data and target data is not included in the whole allocation process. Taking the same IT expense example- You may like the a certain expense like (like License Fee for ERP system) not to be included, as that might be part of the overall licensing agreement between your parent company and the ERP vendor. This will ensure that this expense is neither considered as source and nor applied to the target data. Within the allocation operation, you can define the limits and tolerance. For example- not to allocate IT expense to departments, which have less than 20 LAN IDs, OR not to allocate IT expense, where it is less than 10000 USD. This kind of allocation rules, necessitate iterative allocation calculations. You can store different versions of allocations. For example- if same department has two different allocations for IT expense. Say, there is an enterprise IT expense allocated to that department (the example used throughout this page), and there is direct allocation of license fee for a software, which is bought exclusively by this department. You should be able to store both of these allocations in separate cells. You are able to handle allocations for special situations. For example- if the basis is NULL- Say the LAN IDs are null in certain departments. This may not be a real-life scenario, but you are able to define that the allocation should consider this as non-applicable and should not allocate any expense.
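Building on the sketch above, here is an illustrative extension (again with invented names and figures, not any particular engine's syntax) showing a source-line exclusion, a minimum-LAN-ID floor, and the 2% overhead uplift:

    # Illustrative extension of the proportional allocation: exclude a
    # source line, add a 2% uplift, and skip departments below a floor
    # of 20 LAN IDs (their share is re-spread over eligible departments).
    expense_lines = {"license_fee_erp": 200_000.0,   # excluded: parent-company deal
                     "data_centre": 500_000.0, "network": 300_000.0}
    source = sum(v for k, v in expense_lines.items() if k != "license_fee_erp")

    lan_ids = {"sales": 120, "ops": 300, "finance": 80, "lab": 12}
    eligible = {d: n for d, n in lan_ids.items() if n >= 20}   # floor rule
    total = sum(eligible.values())

    allocated = {d: round(source * n / total * 1.02, 2)   # 2% overhead uplift
                 for d, n in eligible.items()}
    print(allocated)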


OLAP Goal-Seek Data Analysis


Goal-seek is what-if analysis in OLAP, but in reverse order. In a typical what-if analysis, a calculation model takes a set of input values and gives a set of output values. Each different set of these 'input and output value combinations' is called a what-if analysis scenario. Depending upon your needs you can build different scenarios of input values, apply those scenarios to the calculation model, and generate the output values. Here we travel from input values to output values. In goal-seeking, the direction is reversed: you have the output values for a scenario, and you want the input values which correspond to the given output values.
OLAP goal-seeking capabilities cover the following scenarios:

- Single input value and single output value.
- Multiple input values and single output value. For example, you can have the same net profit margin (output value) with different combinations of operating and gross margins as input values.
- Multiple input values and multiple output values. For example, you can have the same P&L projections (output values of gross profit, operating profit, net profit...) with different combinations of input values (like revenue growth, gross margins, non-operational expense...).

An OLAP analysis solution can have the following goal-seek capabilities:


- Allows you to define which input values you want to change through goal-seek to achieve the given output values.
- Allows you to define the min-max limits for each input value, as goal-seek generates various options of input-value combinations.
- Allows you to define any tolerances which are acceptable for the output values.
- Apart from min-max limits, you can define various other constraints on the input values.
- You can accept or reject an option created by goal-seek.
- You can tag an option generated by goal-seek, for example 'more probable' and 'less probable'.
- You can store the options generated by goal-seek in OLAP OR in the analysis tool sitting on top of OLAP.
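For intuition, here is a minimal goal-seek sketch in Python: a bisection search for the single input (revenue) that achieves a target output (net profit) within a tolerance, between min-max limits. The model and all figures are invented; OLAP tools ship goal-seek as a built-in capability:

    # Goal-seek: find the revenue that yields a target net profit,
    # within a tolerance, searching between min and max input limits.
    def net_profit(revenue, margin=0.40, fixed=250.0, tax=0.30):
        return (revenue * margin - fixed) * (1 - tax)

    def goal_seek(target, lo=0.0, hi=10_000.0, tol=0.01):
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if net_profit(mid) < target:   # model is increasing in revenue
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    print(round(goal_seek(target=150.0), 2))  # revenue achieving ~150 profit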


Chapter 4 Business Hierarchies in OLAP and Data Warehouse


OLAP and Data Warehouse Dimensional Model Hierarchy


The subject of hierarchies is relevant to both OLAP and BIPM Delivery - Data Warehousing/Marting. Modeling of data is done, both in DW and OLAP, keeping the hierarchies in mind. However, OLAP is the platform where the hierarchies are manifested in their final shape for the purpose of analysis.

Definition of DW/OLAP Hierarchy


Hierarchies are the paths over which any data (OR measure) is summarized. As you perform various vertical and horizontal navigation operations, you move along these hierarchy paths. 'Office> City> State> Country> Continent> Globe' is one such example of a hierarchy for a location dimension. In this hierarchy, office is at the lowest level of the ladder and globe at the highest. So you can roll up the sales revenue measure from the office level all the way up to the global level.

In terms of basic definitions linked to a DW/OLAP hierarchy:

- A dimension level which participates in the hierarchy (a step in the ladder of the hierarchy) is called a Level. For example, 'city' in the location dimension hierarchy is a level.
- The sequence of these levels is called the Path. For example, 'Office> City> State> Country> Continent> Globe' is the hierarchy path.
- The first OR lowest level of the hierarchy is called the Leaf ('office' in the example) and the highest OR last level is called the Root ('Globe' in the example).
- Within two consecutive levels, the higher level is called the Parent level and the lower the Child (for example, 'City' is parent for 'Office' and child for 'State').

Business hierarchies are not limited to Business Intelligence; they have existed since the data model was invented. If you look at a typical Entity Relationship diagram in your transaction system data models, you have child and parent entities, which are nothing but the representation of a hierarchy. Take note that Business Intelligence dimensional modeling in most cases does not invent hierarchies. These hierarchies exist in the data models of transaction OR source systems, and in organizational data models & business processes. For example, if you haven't got a linkage between a Sales unit and a Business unit defined in your transaction system, don't expect your OLAP to have that hierarchy defined. In other words, just like data, the input on hierarchies 'mostly' comes from the source OR Source Systems Mapping.

With reference to entity-relationship diagrams in transaction systems: child and parent entities are reflected in your database design as referential integrity. For example, you have an 'office master table' with a 'city-code field', which is linked to the 'city master table'. The 'city master table' has a 'state-code field', which is linked to the 'state master table'. This is an example of the office> city> state hierarchy of location. A transaction system is able to navigate the information from the lowest level of the hierarchy to the highest. That is why a data warehouse can have its storage in dimensional (de-normalized) form OR relational (normalized) form without impacting the concept of hierarchy.

As you will see in the Additivity of Measures chapter, the hierarchies drive the additivity OR aggregation rules in a big way. You should read this hierarchy chapter before you go to the measures chapter.

There are different kinds of hierarchies, and each hierarchy has a different role and context. Before we go into this classification, let us list the three main factors on which the different kinds of hierarchy structures are created (a small roll-up sketch follows the list).

- The level-cardinality: whether a child level in the hierarchy can belong to one OR more than one parent dimension level.
- The instance-cardinality: whether a child instance in the hierarchy can belong to one OR more than one parent instance.
- The analysis criteria: whether you are using levels within the hierarchy path for one OR more than one analysis criterion. For example, you can use an office for geography as well as sales organization criteria.
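Before classifying the hierarchy types, here is a small Python sketch (illustrative data only) of the office> city> state roll-up described above, with the hierarchy held as child-to-parent maps much like the referential-integrity links in a transaction system:

    # A strict hierarchy as child->parent maps, one per consecutive level
    # of the path Office > City > State; leaf facts roll up along it.
    office_to_city = {"soho": "new_york", "tribeca": "new_york",
                      "bondi": "sydney"}
    city_to_state = {"new_york": "NY", "sydney": "NSW"}

    sales_by_office = {"soho": 120.0, "tribeca": 80.0, "bondi": 95.0}

    sales_by_city, sales_by_state = {}, {}
    for office, amount in sales_by_office.items():
        city = office_to_city[office]                 # parent lookup
        sales_by_city[city] = sales_by_city.get(city, 0.0) + amount
    for city, amount in sales_by_city.items():
        state = city_to_state[city]
        sales_by_state[state] = sales_by_state.get(state, 0.0) + amount

    print(sales_by_city)    # {'new_york': 200.0, 'sydney': 95.0}
    print(sales_by_state)   # {'NY': 200.0, 'NSW': 95.0}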

Types of Data Entity Hierarchies


Strict OR Simple Hierarchies

These are the hierarchies, which can be represented by a tree structure, whereby:

- Each level in the tree has only one possible parent level, AND
- Each instance can belong to only one instance in the parent level, AND
- The criteria for analysis are the same.

Therefore, in a simple hierarchy, a child level has only one parent level and a child instance has only one parent instance. Simple hierarchies can be further categorized into symmetric, asymmetric, generalized and non-covering hierarchies.


Non-Strict Hierarchies

In a non-strict hierarchy

- Each level in the tree has only one possible parent level, AND
- The criteria for analysis are the same, but
- Each instance can belong to more than one instance in the parent dimension level.

Multiple and Alternate Path Hierarchies

In Multiple and Alternate path hierarchy

- Each level in the tree can have more than one possible parent level, AND
- Each instance can belong to more than one instance in the parent dimension level, AND
- The criteria for analysis are the same.

Parallel Path Hierarchies

In this hierarchy structure there is flexibility on all factors:

- Each level in the tree can have more than one possible parent level, AND
- Each instance can belong to more than one instance in the parent dimension level, AND
- The criteria for analysis are different.

Dimensional Model Simple Hierarchy


The simple or strict hierarchy is the simplest form of business hierarchy.
Symmetric Hierarchy is the simplest form of a simple hierarchy. It has:

- All levels in the hierarchy must exist.
- There is only one path from the bottom-most level to the top. In other words, a level cannot exist in any other hierarchy.

For example, you will not have 'city' in any other hierarchy path. A simple example is the same 'Office> City> State> Country> Continent> Globe'.



Asymmetric Hierarchy

An asymmetric hierarchy is the same as a symmetric hierarchy, apart from the fact that lower levels may not exist for some instances. Let's take the example of a sales channel:

3rd party sales agent> Sales Executive> Sales Manager> Sales Area> Sales Zone> Sales Region

In the above, for a certain instance it might be possible that a 3rd party sales agent does not exist and the sale is done directly by the sales executive. Therefore there will be nil instances at the lowest level of this hierarchy in certain cases.
Generalized hierarchy:

In a generalized hierarchy, there may be shared levels within two different hierarchy paths, but an instance at a lower level cannot belong to two instances in the parent. For example:

3rd party sales agent> Sales Executive> Indirect Channel-Sales Manager> Sales Area Manager> Sales Zone head> Sales Region head

Sales Executive> Direct Channel-Sales Manager> Direct Channel-Sales Area Manager> Sales Zone head> Sales Region head

In the above two paths you have shared levels at 'Sales Executive', 'Sales Zone' and 'Sales Region'. However, any instance of sales executive will belong to only one parent: either an Indirect Channel-Sales Manager OR a Direct Channel-Sales Manager. In other words, if a sales executive is working for two managers (one direct and one indirect), it is not a simple hierarchy.
Non-covering hierarchy

In this hierarchy, an instance of an intermediate level may be missing. For example, in the hierarchy below:

Sales Executive> Direct Channel-Sales Manager> Direct Channel-Sales Area Manager> Sales Zone head> Sales Region head

For some instances, you may have a direct sales manager reporting directly to the sales zone head, as the zone might be smaller and the sales area manager level may not exist in that zone.


Dimensional Non-Strict Hierarchy


A non-strict hierarchy has one similarity and one dissimilarity with strict hierarchies. In a non-strict hierarchy, each level in a hierarchy path still has only one parent level. However, an instance (or member, or value) in a level can belong to multiple instances in the parent level. For example, as taken from the previous topic on simple hierarchies:

Sales Executive> Direct Channel-Sales Manager> Direct Channel-Sales Area Manager> Sales Zone head> Sales Region head

The above can be made more complex into an example of a non-strict and non-covering hierarchy, whereby a sales executive reports to a direct channel sales manager (say, for sales of a certain set of products) and also directly to a direct channel area manager (for sales of a special set of products). If a sales executive works only for a single direct channel sales manager, it is a strict hierarchy. However, if a sales executive works for more than one manager, it is a non-strict hierarchy.

As you go to the Additivity of Measures chapter, you will see that unlike strict and simple hierarchies, you cannot do simple summarization of measures here. For example, you cannot take the sales revenue achieved by a sales executive and roll it up through the two sales managers he reports to. If you do this, you will be double counting.
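A small illustrative sketch of the double-counting trap, and the usual remedy of splitting the child's measure across parents with weights (figures invented):

    # Non-strict hierarchy: one sales executive reports to two managers.
    # Naive roll-up counts the executive's revenue twice; weighting fixes it.
    exec_revenue = {"exec_a": 100.0}
    exec_to_managers = {"exec_a": ["mgr_direct", "mgr_indirect"]}

    naive = {}
    for ex, managers in exec_to_managers.items():
        for m in managers:
            naive[m] = naive.get(m, 0.0) + exec_revenue[ex]
    print(sum(naive.values()))    # 200.0 -- double counted

    weights = {("exec_a", "mgr_direct"): 0.6, ("exec_a", "mgr_indirect"): 0.4}
    weighted = {}
    for (ex, m), w in weights.items():
        weighted[m] = weighted.get(m, 0.0) + exec_revenue[ex] * w
    print(sum(weighted.values()))  # 100.0 -- total preserved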

Multiple Path Hierarchy


In multiple path hierarchies, a dimension level can belong to two different parent levels. However, the criterion of analysis is the same across the multiple paths.



An alternate path hierarchy is when the hierarchy paths merge at certain points (generally once, at the higher levels), whereas it is called a multiple path hierarchy when the paths do not merge.

As an example, the following is an alternate path hierarchy, whereby the sales office level belongs to two different parent levels (direct sales channel area and indirect sales channel area), but the paths merge at the level of sales region. Essentially the hierarchy takes alternate paths to arrive at the same level in the end:

Sales Office> Direct Sales Channel Area> Direct Sales Channel Sector> Direct Sales Channel Zone> Sales Region

Sales Office> Indirect Sales Channel Area> Indirect Sales Channel Sector> Indirect Sales Channel Zone> Sales Region

The following is a multiple path hierarchy; in this example the paths do not merge:

Sales Office> Direct Sales Channel Area> Direct Sales Channel Sector> Direct Sales Channel Zone> Sales Region

Sales Office> Indirect Sales Channel Area> Indirect Sales Channel Sector> Indirect Sales Channel Zone> Indirect Sales Region

As you see in the above examples, though the paths are either alternate OR multiple, the criterion for analysis is the same, namely the sales channel and its related measures. The next topic is parallel hierarchies, which are a combination of multiple OR alternate path hierarchies, but where the criteria also differ.

Parallel Dimensional Hierarchy


Parallel hierarchies are the most flexible hierarchy paths.

In a parallel path dimensional hierarchy system:


- Each level in the tree can have more than one possible parent level, AND
- Each instance can belong to more than one instance in the parent dimension level, AND
- The criteria for analysis are different.


Sales Office> Direct Sales Channel Area> Direct Sales Channel Sector> Direct Sales Channel Zone> Sales Region (sales organization dimension)

Sales Office> City> District> State> Country (location dimension)

If you look at the example, the sales office level belongs to two different parent levels (direct sales channel area and city), an instance of an office (say, the Sydney harbor office) belongs to two different instances (Sydney west sales area and Sydney city), and the criteria for analysis also differ (sales organization and location). Essentially, the difference between parallel and multiple hierarchies lies in the criteria for analysis.

A parallel hierarchy can be a dependent hierarchy, whereby the paths share the same levels, like the following:

Sales Office> Direct Sales Channel Area> Direct Sales Channel Sector> Direct Sales Channel Zone> Sales Region> Country (sales organization dimension)

Sales Office> City> District> State> Country (location dimension)

In the above example, the 'Country' level is shared. Country (like sales office) also appears in two different dimensions.


Chapter 5 Additivity and Aggregation of Measures-Facts in OLAP Analysis


Applying additivity and the correct aggregation methods is fundamental to the success of Business Intelligence. The most common mistakes modelers and designers make are in setting the right hierarchies and establishing the right additivity and aggregation rules. You need to go through the chapter on business dimensional hierarchies before you go through this chapter.

Additivity of Measures-Facts
Additivity of a measure is when you are able to apply the sum operator across all the dimensions. Other aggregations on measures-facts use operators like average, maximum and minimum. OLAP tools nowadays have some capability to automatically enforce the correct additivity and averaging rules, given the hierarchy and the type of measure. However, the burden finally rests on the modelers and designers.
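For instance, a fully additive measure such as sales amount can be rolled up along any dimension of the fact rows, as this minimal sketch (invented rows) shows:

    # A fully additive measure: summing fact rows along any dimension
    # (product, office, or month) gives a meaningful total.
    facts = [
        {"product": "p1", "office": "soho", "month": "jan", "amount": 100.0},
        {"product": "p2", "office": "soho", "month": "jan", "amount": 50.0},
        {"product": "p1", "office": "bondi", "month": "feb", "amount": 70.0},
    ]

    def roll_up(rows, dim):
        out = {}
        for r in rows:
            out[r[dim]] = out.get(r[dim], 0.0) + r["amount"]
        return out

    print(roll_up(facts, "office"))   # {'soho': 150.0, 'bondi': 70.0}
    print(roll_up(facts, "month"))    # {'jan': 150.0, 'feb': 70.0}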

Before we move further, let's take a look at some more aspects, which will be useful:
Completeness of Hierarchy:

This basically means that all the possible instances of a hierarchy path are available, making it complete; there should be no missing data in the tables. For example, if you have country and continent levels in the location dimension, one should expect all the countries of the Europe continent to exist in the tables. Otherwise your summarization for the continent may not work.
Classification vs. descriptive attributes:

A classification attribute of a dimension is the attribute on which the aggregation takes place. A descriptive attribute plays the role of a descriptor and is not a basis of aggregation. You will see that OLAP includes all classification attributes and some descriptive attributes.

Non-Additive Measures-Facts
Non-additivity is when you cannot use the sum operator to generate the needed aggregation. Here are the common non-additive measures:
Ratios and Percentages:

Some examples of ratios and percentages are profit margin, revenue-to-asset ratio, default rate etc. If you add the profit-margin % of all the products of a retail company, you may get a figure of much more than 100%. Therefore you need to first take the sum of the numerator (profits) and the denominator (revenue) across all the products, and then calculate the ratio. When applying aggregation on a ratio, one needs to take the 'ratio of the sums' instead of the 'sum of the ratios'. The same rule applies to percentages, and the same constraint applies to averaging: just like sum, even the average operator fails here.

Solution: Store the numerator and denominator in separate fields and the ratio OR %age in a separate field as a derived measure-fact.
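A quick numeric illustration (invented figures) of why the 'ratio of the sums' differs from the sum OR average of the ratios:

    # Profit margin must be re-derived from summed numerator/denominator,
    # not averaged across products.
    products = [  # (profit, revenue) per product -- illustrative figures
        (20.0, 100.0),   # 20% margin
        (5.0, 10.0),     # 50% margin
    ]
    avg_of_ratios = sum(p / r for p, r in products) / len(products)
    ratio_of_sums = sum(p for p, _ in products) / sum(r for _, r in products)

    print(round(avg_of_ratios, 3))   # 0.35  -- wrong aggregate margin
    print(round(ratio_of_sums, 3))   # 0.227 -- correct: 25 / 110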
Measures of Intensity

These are more of clinical and scientific measurements, for example blood pressure, temperature, gauge pressure, wind speed etc. The handling of these kinds of measures can be a simple average (like: the average blood pressure of the sample of patients with the same medical history and between the ages of 40 and 50 comes out to 140/110). However, the designers could apply very specific rules to calculate the summations (like placing weights on different instances), primarily due to the scientific nature of the measures.

Solution: Use alternate aggregation functions like averages, minimum and maximum. Track the constraints in the meta-data.
Grades and scales

These are the same as measures of intensity, but belong more to the business domain. Some examples are the risk grade of a customer, or the level-of-risk scale associated with a loan.

Solution: Use alternate aggregation functions like averages, minimum and maximum. Track the constraints in the meta-data.
Averages/Maximum/Minimum and similar measures

You may have derived measures in current data OR historical snap-shots. If you have averages OR max-min figures, these are not additive. In other words, attributes which contain not the 'activity' but a 'characterization' measure do not follow the additive path. A 'characterization' measure is one which characterizes the activity: for example, while 'turn-around time' is an activity measure, the average TAT, maximum TAT and minimum TAT for a period characterize the TAT activity.

Solution: Use alternate aggregation functions like minimum (minimum of minimums) and maximum (maximum of maximums). Track the constraints in the meta-data. Note, however, that this solution does not apply to averages.
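An illustrative sketch (invented figures): stored minimums and maximums roll up safely, but an average of averages misleads unless it is weighted by the underlying counts:

    # Rolling up stored averages: use counts as weights, never a plain
    # average of averages. Min/max of stored min/max values are safe.
    branch_stats = [   # (avg turnaround hours, number of cases) -- illustrative
        (2.0, 1000),
        (10.0, 10),
    ]
    plain_avg = sum(a for a, _ in branch_stats) / len(branch_stats)
    weighted_avg = (sum(a * n for a, n in branch_stats)
                    / sum(n for _, n in branch_stats))

    print(plain_avg)               # 6.0   -- misleading
    print(round(weighted_avg, 2))  # 2.08  -- true overall average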

Semi-Additive Measures-Facts
Semi-additivity is when a measure can be aggregated on certain dimensions, but not all the dimensions. Another phrasing: semi-additivity is when the summarization carries an index of inaccuracy. Semi-additivity happens primarily in four scenarios:

Missing OR Dirty Data


There are many reasons for, and manifestations of, missing and dirty data; one can refer to Customer Data Quality and the reasons for bad data quality. Missing OR dirty data gives a wrong picture of the overall summations. For example, if you have empty records for the sales of some offices, you will end up under-reporting the sales figures. Similarly, if you have wrong data in the same case, you may end up under- OR over-reporting.

The solutions to this issue (you can refer to Data Correction techniques in customer data quality for a more detailed listing):

- Don't include those offices in the summarization, and specify that the report does not include the specific instances.
- Fill in an average figure for the instance. For example, if the sales figure for an office is not available for this month, you can (temporarily) assign a value closer to past patterns. One option is to put in the average of the last 12 months' sales. This is generally the preferred solution.
- Apart from putting in the average of past periods, you can also use various extrapolation and forecasting techniques to calculate the stop-gap figure. This, however, is done only if the number of cases of missing OR dirty data is within certain limits (for example, 5% of offices not having data).
- In case of a high proportion of instances having missing OR dirty data, one needs to apply the above-said tricks and also mention the caveat that the data could have an inaccuracy index of some percentage.
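A minimal sketch of the stop-gap fill described above (invented figures): when this month's figure is missing, substitute the trailing 12-month average and flag the substitution so reports can carry the caveat:

    # Fill a missing monthly sales figure with the trailing 12-month
    # average, and record the fact so reports can carry the caveat.
    history = [95.0, 102.0, 99.0, 110.0, 105.0, 98.0,
               101.0, 97.0, 103.0, 108.0, 100.0, 104.0]   # last 12 months
    this_month = None                                      # missing feed

    imputed = this_month is None
    value = sum(history) / len(history) if imputed else this_month

    print(round(value, 2), "(imputed)" if imputed else "")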

Historical Data
Historical data falls in two categories, and both have different treatments:
Historical snap-shots:

Historical snap-shots cause trouble when, over time, you have changed the instances of your dimensions. For example, you may have changed your product categorization (the 'home segment' and 'small business' product segments are now combined and re-categorized into 'hand-held' and 'table-top'), OR your sales locations (the 'New York' and 'New Jersey' sales areas are now combined and split into 5 sales areas, as the company's business has grown).

In this case, when there is an incompatibility across the instances, it is not possible to add the measures across those dimensions over time. Referring to the above example, it might still be possible to add the sales figures for an office instance across time, as that level in the location dimension has not changed. However, adding sales on a 'sales area' basis over the time dimension is not possible.

Solution: You may apply some smart transformation rules to translate the historical categories into the new categories. For example, you may know that 'hand-held' typically formed 30% of the sales in the home and small business segments. You may apply this %age to the historical snap-shots to align them with the current categories.



Slowly Changing dimensions (SCD):

Please refer to Special Situations in Dimensional Modeling to understand what we mean by slowly changing dimensions. In short, a Business Intelligence system stores the various instances of a dimension's changing values as a new record OR a new field, whereas the transaction system typically overwrites them. For example, the ZIP code of a customer may be overwritten by a transaction system as the customer moves to a new address. The Business Intelligence system may append a new record for this change, without erasing the previous ZIP code. This may be needed for sales analysis on the basis of ZIP codes in previous months.

In the case of SCD, one cannot simply summarize the data over the time dimension. For example, if the customer has changed ZIP code, you cannot naively summarize the sales related to the customer, as there are two records for the customer (with two different ZIP codes). One has to take this statement with a little care: it is 'possible' to summarize, but one has to place an extra filter so that you do not double count.

Solution: Track the constraints in the meta-data and apply the right filters.
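An illustrative sketch of the extra filter (invented records): with SCD-style history, summing a customer-level measure over all dimension rows double counts, so one filters down to a single record per customer:

    # Slowly changing dimension: two records for one customer (ZIP change).
    # Summing customer-level measures must filter to one record per key.
    customer_dim = [
        {"cust": "c1", "zip": "10001", "current": False},
        {"cust": "c1", "zip": "07302", "current": True},
    ]
    lifetime_value = {"c1": 500.0}

    naive = sum(lifetime_value[r["cust"]] for r in customer_dim)   # 1000.0
    filtered = sum(lifetime_value[r["cust"]]
                   for r in customer_dim if r["current"])          # 500.0
    print(naive, filtered)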
Snap-shot data

There are many measures which are not the 'activity' for a period, but the state of the measure at a given point of time. An example of such 'at the moment' measures is the line items of a balance sheet (which gives the assets and liabilities at a given point of time); an example of 'activity for a period' is the line items of a profit & loss account (which gives expenses and revenues for a given period). 'Snap-shot' measures cannot be added over the time dimension: if you want the account balance for the year, you do not add up the balances at the end of each of the 12 months. As covered in the averaging of measures topic, these are best handled through averaging.

Solution: Apply other operators like averages/maximum/minimum.
Category data

When you have measures which provide the 'type of magnitude' and not the 'magnitude', it is not possible to add them across some dimensions. To illustrate the difference: 'number of sales units' and 'units of inventory' are value-of-magnitude measures, whereas 'number of product categories sold' and 'number of inventory part types' are type-of-magnitude measures.

An example of the above semi-additivity: if you have sold 2000 different product models across the US this month, and 1500 models in the previous month, you cannot sum them up to get the total number of models sold in the US over the past two months. You need a way to identify the distinct models sold across the period.

Solution: Apply other operators like averages/maximum/minimum.
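A small sketch (invented model IDs) of why such measures need distinct counting rather than summation:

    # "Number of distinct models sold" is not additive over time: the
    # months' counts overlap, so take the distinct union instead.
    models_jan = {"m1", "m2", "m3"}
    models_feb = {"m2", "m3", "m4"}

    wrong = len(models_jan) + len(models_feb)   # 6 -- double counts m2, m3
    right = len(models_jan | models_feb)        # 4 -- distinct models
    print(wrong, right)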

Business Intelligence End to End


Chapter 1 BI Architecture Components
These are the building blocks of the BI architecture. All possible architecture scenarios will have some or all of these components. This chapter endeavors to de-mystify the definitions of often-misused terms.

BI Data Warehouse Source System


These are the feeder systems and the starting point of data flow in the overall BI architecture.

Source systems are all the data-feeding pipes into the Staging Area. Typically, any transformation of the data is done after the data, in its raw OR unchanged form, is picked from the source systems. Further details of source systems can be found under Data Warehouse Design & Architecture in the Data Warehousing section.



Core Source Systems

These systems include the core systems, mostly having well-organized databases, set schedules, and data updated on an online basis. Typical systems are core product manufacturing systems, accounting systems, money management systems, commission/sales compensation systems, and tightly coupled job/workflow systems.
Field OR Front-end Source Systems

These are the systems primarily used by Customer acquisition and retention staff. They include Customer service systems, Sales automation systems, leads management systems and campaign management systems.
Modeling and Analysis systems

This family of systems includes budgeting, planning and forecasting, and pricing and valuation types of systems.
External Data

This is the data, which you receive from regulators, credit bureau, medical bureau, industry associations, market research firms, database marketing companies and other sources. This data (unlike data provided by your suppliers and Customers; which goes into your core OR field systems) follows a standard format generally governed by the supplier.
Non-Database/Desktop Sources

A wealth of information and critical operational data resides in spreadsheets and MS Access tables on the desktops OR local servers of an organization. If you want to shock yourself, just make a study of the number of Excel-based applications which have become a critical part of operational delivery and reporting. The numbers could run into the hundreds, if not thousands.
Off-line Databases

As no organization's data management strategy goes through fully systematic & planned growth, it is possible that you have some offline database used by the users to generate their reports. For the sake of speed, one may be tempted to use it as a source system. However, in the medium to long run this may become counterproductive, because that offline database will eventually be made redundant.


Data Warehouse BI Staging Area


The staging area is the place where all transformation, cleansing and enrichment is done before data can flow further. Data is extracted from the source systems by various methods (typically called Extraction) and placed in normalized form into the Staging Area. Once in the Staging Area, data is cleansed, standardized and reformatted to make it ready for loading into the Data Warehouse loaded area. We cover the broad details here; the details of staging can be referred to in Data Extraction and Transformation Design in Data Warehouse.

The Staging Area is important not only for Data Warehousing, but for a host of other applications as well; therefore it has to be seen from a wider perspective. Staging is an area where sanitized, integrated & detailed data in normalized form exists. The concept of staging is as old as the Stone Age: it is commonsensical to have an offline database to take care of reporting, so staging as a concept has been used in one way OR another by IS managers. It has become branded since the advent & popularity of the Data Warehouse. However, there is much more to it than a mere change of labels. With the advent of the Data Warehouse, the concept of Transformation has gained ground, which provides a high degree of quality & uniformity to the data. The conventional (pre-data warehouse) staging areas used to be plain dumps of the production data. A Staging Area with Extraction & Transformation is therefore the best of both worlds for generating quality transaction-level information.

A staging area is sometimes used for scheduled/production reports: as staging is mostly normalized, the queries run on it have to be predictable in terms of volume and timing (unlike a data warehouse). One may ask why we can't generate the production reports from the data warehouse. There are two answers. One is that by using the Staging Area, one distributes the querying load across two separate areas. The second is that it allows the reports to be produced earlier (as one does not have to wait for the Loading process to be over). While we say this, using the staging area this way is not the desirable option; we recommend all information be taken out of the data warehouse platform.

De-normalized DW- Data Warehouse vs. Data mart


Data Warehouse/Data Mart form the sanitized repository of data which can be accessed for various purposes.


Data Warehouse
A Data Warehouse is the area where the information is loaded in de-normalized Dimensional Modeling form. This subject is dealt with in a fair degree of detail in the Data Warehousing/Marting section. A Data Warehouse is a repository which contains data in a de-normalized dimensional form ACROSS the enterprise. Following are the features of a Data Warehouse:

- A Data Warehouse is the source for most of the end-user tools for Data Analysis, Data Mining and strategic planning.
- It is supposed to be an enterprise-wide repository, open to all possible applications of information delivery.
- It contains uniform & standard dimensions and measures. The details can be referred to in Dimensional Modeling Concepts.
- It contains historical as well as current information. Whereas most transaction systems get their information updated, the data warehouse concept is based upon 'adding' information. For example, if a Customer in a field system undergoes a change in marital status, the system may contain only the latest marital status. However, a Data Warehouse will have two records containing the previous and current marital status. Time-based analysis is one of the most important applications of a data warehouse. The methods of doing this are detailed in Special Situations in Dimensional Modeling.
- It is an offline repository. It is not used OR accessed for online business transaction processing.
- It is read-only: a Data Warehouse platform should not allow write-back by the users. The write-back facility is reserved for the OLAP server, which sits between the Data Warehouse and the end-user platform.
- It contains only actual data: this is linked to being 'read-only'. As a best practice, all non-actual data (like standards, future projections, what-if scenarios) should be managed and maintained in OLAP and the end-user tools.

Data Marts
A Data Mart is a smaller, specific-purpose-oriented data warehouse. A Data Warehouse is a big, strategic platform which needs considerable planning; the difference between a Data Warehouse and Data Marts is like that between planning a city and planning a township. A Data Warehouse is a medium-to-long-term effort to integrate and create a single-point system of record for virtually all applications and needs for data. A data mart is a short-to-medium-term effort to build a repository for a specific analysis. The differences between a Data Warehouse and a Data Mart are as follows:
Scope & Application: application independent vs. specific application
- Data Warehouse: A single-point repository whose data can be used for any foreseeable application.
- Data Mart: Created for a specific purpose. For example, a data mart created to analyze customer value: the designer of the data mart knows what kind of broad queries could be placed and that the data will be used for OLAP.

Domain: domain independent vs. specific domain
- Data Warehouse: Can be used for any domain including Sales, Customer, operations, finance etc.
- Data Mart: Specific to a given domain. You will generally not find a data mart which serves the Sales as well as the operations domain at the same time.

Control: centralized vs. decentralized by user area
- Data Warehouse: Control and management are centralized.
- Data Mart: Typically owned by a specific function/sub-function.

Planning: planned vs. organic, possibly not planned
- Data Warehouse: A strategic initiative which comes out of a blueprint. It is not an immediate response to an immediate problem, and it has many foundation elements which cannot be developed in an ad-hoc manner, for example the standard sets of dimensions & measures.
- Data Mart: A response to a critical business need. It is developed to provide quick gratification to the users, and given that it is owned & managed at a functional level, it grows with time.

Data: historical, detailed & summarized vs. some history, detailed and summarized
- Data Warehouse: A good data warehouse captures the history of transactions by default, even if there is no immediate need, because a data warehouse always tries to be future-proof. For example, it will capture changes in a Customer's marital status by default.
- Data Mart: The level of history captured is governed by the business need. A data mart may not capture the marital status changes if it is created to profile/segment a Customer on the basis of spending patterns only.

Sources: many internal & external sources vs. few internal & external sources
- Data Warehouse: Many sources, an obvious outcome of the Data Warehouse being a generic resource. That is also why the staging design for a data warehouse takes much more time compared to that of a data mart.
- Data Mart: Few sources; a limited purpose leads to limited sources.

Life Cycle: stand-alone strategic initiative with a long life vs. typically part of a business project with any life span
- Data Warehouse: An outcome of a company's strategy to make data an enterprise resource; if there is any other trigger, chances are that it may not achieve its objectives. It is a long-term foundation of an enterprise.
- Data Mart: Comes into being due to a business need; for example, a Risk Portfolio Analysis data mart could be part of an Enhancing Risk Management initiative. It starts with a given objective and can have a life span ranging from one year to endless, because some applications are core and business-as-usual to an enterprise. The life of a data mart could be shortened if a Data Warehouse comes into being.

OLAP Server Layer and capabilities- Why is OLAP needed?

OLAP sits between the Data Warehouse and the end-user tool. The various OLAP capabilities and OLAP architectures are covered in separate sections; this topic provides the what & why. A Data Warehouse/Data Mart is a repository with a relational database structure. OLAP is an optional layer which sits between the Data Warehouse/Data Mart and the end-user tools.

As data is moved from various transaction systems into the warehouse, it must be stored in a way that maximizes system flexibility, manageability and overall accessibility. Because the information stored in the warehouse is read-only, historic in nature and includes detailed transaction data, the best data warehousing technology is the relational database. Both Data Warehouses (comprehensive, enterprise-wide) and data marts (subject- OR application-specific) must be accessible to a wide variety of users to satisfy their information needs.

Why is OLAP needed? The OLAP layer/server provides capability which cannot be met by productivity tools sitting directly on the relational database of the warehouse. The Data Warehouse's first & foremost function is to provide a sanitized repository (system of record) of current and historical details for various purposes (which include analytics, data mining, strategic planning, enterprise reporting and so on); in other words, it is the provider of the Data. OLAP sits over the Data Warehouse to enable the end-user tools to translate the same into information.

In brief, various OLAP Capabilities are:


- The ability to scale to large volumes of data and large numbers of concurrent users.
- Consistent, fast query response times that allow for iterative analysis.
- A calculation engine that includes robust mathematical functions for computing derived data (aggregations, matrix calculations, cross-dimensional calculations, OLAP-aware formulas and procedural calculations).
- A multi-user read/write environment to support users' what-if analysis, modeling and planning requirements.
- The ability to be deployed quickly, adopted easily and maintained cost-effectively.
- Robust data-access security and user management.
- Availability of a wide variety of viewing and analysis tools to support different user communities.

Difference between Data Warehouse and OLAP


Purpose
- Scope of content: Data Warehouse - across the enterprise, functions and processes; OLAP - linked to a subject OR function.
- Role: Data Warehouse - system of record and reference point for BI; OLAP - data analytics and end-user BI enablement.

Data Positioning
- Level of detail: Data Warehouse - detailed transaction-level data, with some aggregate tables; OLAP - summarized data storage.
- Data structure: Data Warehouse - dimensional model in relational database form; OLAP - dimensional model in multi-dimensional form (while there are some OLAPs with RDBMS format).
- Level of pre-calculated data: Data Warehouse - limited (pre-calculated data at a transaction level can lead to significant database growth); OLAP - high.
- Data volumes: Data Warehouse - gigabytes to terabytes; OLAP - gigabytes.

Access
- User access rights: Data Warehouse - read-only; OLAP - read and write (if write-backs are allowed in the OLAP).
- Access mode: Data Warehouse - data retrieval, one-way; OLAP - iterative, to and fro.
- Ease of access: Data Warehouse - highly IT-assisted; OLAP - less IT-assisted, more ease of use.

ODS- Operational Data Store


An Operational Data Store is a centralized repository of data put up for online operational use. The Operational Data Store is not as popular a term as the Data Warehouse; however, it can be an extremely useful application. An ODS is an integration platform for data from various Source Systems, used for operational purposes. This is unlike a Data Warehouse, which is used for non-operational/processing purposes. ODS is more of a concept than an architecture; a Staging Area can be used as an ODS.


The following kinds of Operational Data Stores are possible:


Non-Dynamic, Read-only, One Way:

These Operational Data Stores are updated typically on an overnight basis, and are used where real-time updating is not required. This kind of ODS is typically used for taking a single view of an entity (Customer, supplier, ...) which is not possible from a single source system. For example, Customer service OR direct marketing may need a single Customer view to assess all the transactions the Customer has done. As soon as a Customer calls up, his single view (from the operational data store) comes up to help the direct marketer take a call on the Sales pitch. The difference between a Data Warehouse and an ODS here is that from a Data Warehouse you can take out a report on the single Customer view; however, a Data Warehouse will not be open to online access by thousands of direct marketing agents accessing hundreds of thousands of Customers. The de-normalized structure of a Data Warehouse is not suited for high-volume standard queries; a normalized ODS is much better suited for the same.
Dynamic, Read-only, One Way:

This ODS is the same as the previous type, with the only difference that it is updated on an online basis. This is required for applications where one can't wait for an overnight refresh. For example, let's take a high-volume asset management and trading company. One always wants to maintain a cap on exposure to a sector, industry, fund OR commodity. If the investment systems are different, an ODS is required for the user to check the exposure on an online basis before committing any further investment.
Dynamic, Read-only, both ways:

This ODS is used as a platform for systems to exchange data, by routing the same through the ODS. For example, a CRM system (used for up-sell/cross-sell) might be interested in finding out the risk profile of an existing Customer, which exists in the risk management system. However, due to various reasons it cannot access it directly, and the Customer codes used in the two systems are different. An ODS is used to extract data from both systems and assign a common Customer code. Apart from giving a single Customer view, it can also update the risk profile of the Customer (as picked from the risk management system) into the CRM system.



Dynamic with Write-Back facility:

This is the closest an ODS can get to being an OLTP. This kind of ODS accepts information from the online user and updates it in its own database and also in the databases of the source systems. For example, direct marketing personnel bring up the single Customer view for a Customer, make a Sales pitch, and enter the outcome of the Sales call into the ODS, which updates the relevant CRM/Sales systems.

NOTE: A part of what an ODS can do is claimed to be achieved by Enterprise Application Integration (EAI) and Enterprise Information Integration (EII) systems. However, the ODS has not lost its appeal, primarily because EAI/EII tools are typically not completely successful at transforming & cleansing highly disparate & non-standard data. An ODS can achieve this by following standard Extraction & Transformation practices. Please refer to ODS placement for Data Warehouse for more perspective on ODS.

BI & Data Warehouse- End User Tools


End User tools include tools for managing reporting, analytics, Data Mining, Performance and Decision Modeling. End User Tools sit over the Data Warehouse/Staging/OLAP (Refer BI Architecture Scenarios), to provide various capabilities for using the information.
Enterprise Reporting Tools

Production queries and reports are mostly scheduled reports containing transaction-level details. There are standard tools (like Crystal) for this purpose. These reporting tools provide an environment to create, produce and distribute reports, as well as the facility to run ad-hoc detailed queries.
Analytic Applications

Analytic applications undertake slice-dicing, drill up/down, navigation across dimensions and measures, time-trending analysis, exceptions, max-min analysis and, in many cases, what-if analysis as well. They provide many features related to data visualization/presentation and distribution. These analytics are done using the analysis functions within the analytic applications or the calculation functions available within OLAP.



Data Mining Tools

Data mining tools enable us to discover knowledge and patterns which we cannot discover in an analytics application: data mining tools self-discover the information.
Performance Management Tools

Performance Management tools include dashboard managers and allow you to create scorecards. Performance Management goes beyond scorecards: these tools help you create strategy and set up targets, performance measures & standards.
Decision Modeling Tools

These systems allow you to build business models and use the information as an input. For example, you can have a risk profiling system where you have built models to assess the risk of a Customer. The Data Warehouse (OR mart) could provide the single-view Customer information to this system.

Business Intelligence Metadata Model


Metadata is core to BI architecture. It provides a map, a catalogue and reference on data about the business, technical and operational elements of Business Intelligence Components. It spans across Business Metadata, BI technical metadata and Source Systems Metadata.

Overview
Metadata is an extremely important component of BI architecture. A well-designed metadata model enables effective administration, change control, and distribution of the data supporting the BI system. Metadata includes business rules, data sources, summarization levels, data aliases, data transformation rules, technical configurations, data access rights, data usage and much more. In principle, it's data about data.

The most important part of metadata implementation is the integration between the various components of metadata stored in data modeling tools, ETL tools, databases and OLAP tools. In a few implementations, the metadata architecture and its components need to be designed and built, in which case identification and integration of these metadata components need to be done aiming at a robust metadata base.

NOTE: Please read this page in conjunction with Data-Warehouse Metadata and the Enterprise Metadata section.


Metadata Paradigm
The following diagram shows the three principal limbs of integrated metadata model for Data-Warehouse and BI overall.
In BI architecture, the metadata spans across three areas namely,

Business Metadata, BI System Technical Metadata and Source Systems Metadata.

The levels of metadata detail (or abstraction) shown along the limbs indicate the extent of detail of the various metadata areas. These levels map very well across the three limbs of the metadata model. The metadata at the element level of all the areas (i.e. Business Terms, BI Data Elements and Source Data Fields) forms the base of the metadata implementation. It yields the Glossary, which is generated from the most detailed information of all the metadata areas; hence the most voluminous metadata lies in this zone itself.

Links need to be developed among these areas in order to set up the metadata platform which can be utilized by the various user communities as well as the administrators and developers. These links need to be established along the various levels of abstraction of the individual areas, as well as along the same levels of abstraction across the areas. Setting the links among higher levels of abstraction across the metadata areas is not mandatory, but the links must exist at the most detailed level (i.e. the element level) across the areas. A considerable manual effort is involved in incorporating the links, and a similar effort is needed for incorporating any changes in them. The linked-list concept can be utilized in the search engine to reduce this effort: in that case the links need to be set only along the levels within the metadata areas, and across the areas at the element level only. Any query across the metadata areas will pass through the Glossary zone and provide the appropriate link into the other area. But to achieve high performance in such search queries, the linked-list structure needs to be tuned.
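For intuition only, here is a toy Python sketch (all names invented) of element-level links: a cross-area lookup resolves a business term through the glossary zone to its BI element and source field:

    # Element-level metadata links: business term -> BI element -> source
    # field. Cross-area queries hop through the glossary (element) zone.
    glossary = {
        "customer lifetime value": {
            "bi_element": "MART_CUST.CLV_AMT",
            "source_field": "CRM.ACCOUNT.LTV",
        },
    }

    def trace(term):
        """Resolve a business term to its BI column and source field."""
        entry = glossary[term.lower()]
        return entry["bi_element"], entry["source_field"]

    print(trace("Customer Lifetime Value"))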


Levels of Abstraction in Metadata Areas

The following table gives the details of the different levels of abstraction in the metadata areas.

High level
- Business: Business concepts and subject areas identified during business requirements analysis.
- BI Technical: Report categories mapping to the various subject areas.
- Data Source: Source systems, divisions, departments and owners.

Medium level
- Business: Business entities and transaction groups, e.g. Customer, Geography and Financial Transactions.
- BI Technical: Technical entities like tables, multi-dimensional cubes and reports.
- Data Source: Data source entities in the form of tables, files, spreadsheets and data forms.

Low level
- Business: Business terms, yielding the business glossary.
- BI Technical: Data elements used in the various technical entities, e.g. columns in tables, dimensions and measures in cubes, and fields in reports.
- Data Source: Data fields in the source system entities.

High: Concept Level

At the concept level, the report categories are derived from the business subject areas, which are handled by various departments and divisions and hence are taken care of in different systems. These systems from the various departments eventually become the data providers for the BI system.

Medium: Entity Level

At the entity level, the technical entities like tables, cubes and reports are designed with the help of the business entities analyzed during the business/system requirements. The source system entities which contain this information are mapped to the technical entities. These mappings are used for data-load purposes, but at the next level of abstraction, i.e. the element level. The traceability between the business entities and source system entities exists with the system owners. This traceability is utilized to perform gap analysis at the entity level during the requirements gathering and BI analysis phase.

Low: Element Level

The most detailed level of BI metadata exists at the data element level. The business terms from the business metadata are mapped to the data fields in tables and reports, and to the dimensions/measures in the multi-dimensional cubes, which form the technical metadata. The business users extensively use this metadata information. The mappings between the BI data fields and the source system data fields are used in Extraction, Transformation and Loading processing. The traceability between the business terms and the source system data fields aids in identifying the gaps that may exist in the BI system with respect to the business elements as well as the source system. It also helps in identifying the gaps in the existing source systems with respect to the business elements, so that a better BI solution can be implemented.

The element-level information from the three metadata areas yields the Glossary for the metadata implementation. This detailed metadata information forms the base of the metadata model, with links to the higher levels of abstraction as well as the other metadata areas. This is the only zone through which cross-metadata-area search passes, and hence it is important to design this zone for high-performance search engines. Use of linked lists may aid in this case.
Metadata Implementation Classification

The metadata implementation takes place across the multiple principal limbs of the metadata model, and is referenced as per the coverage of implementation around those limbs. The following are the distinct implementations of metadata in any BI system; they are discussed in detail subsequently.



- Back-Room Metadata
- Front-Room Metadata
- Counterpoint Metadata

Back-Room Metadata

Back-Room Metadata spans the Data Source and BI Technical metadata areas, and hence occupies a large scope. It encompasses the following components of metadata.

- ETL Metadata (control as well as process metadata)
- Data Models (primarily, data structures)
- Security Profiles (roles and ACLs)
- Audit Trail (e.g. data usage and actions)

It is primarily utilized by the administrators, supervisors, developers and designers.

Front-Room Metadata

Front-Room Metadata spans the BI Technical Metadata and the Business Metadata, and is primarily used by the business users. Hence it needs to be developed in conjunction with business users. It covers the following components.

- Data Model (data structures masked with business terminology)
- BI system front-end components metadata
- Standard reports
- Data structures in the end-user layer for ad-hoc querying against RDBMS as well as MDDB
- User documents (e.g. manuals, glossary, business information documents, listing of special business events justifying data trends)

Counterpoint Metadata

Counterpoint Metadata, as the name suggests, ensures harmony in the BI system by establishing a trace for the handshake between Back-Room and Front-Room Metadata. Back-Room Metadata ensures that the data from the various data source systems moves into the appropriate BI system elements; Front-Room Metadata then aids in presenting this data to the business users in their native language, i.e. business terminology. By the time the source data reaches the business users, it has passed through a large number of processes and components, and hence it is difficult to trace the data along this long path. In order to shorten this path without sacrificing quality, and to ensure speedy design and impact analysis, Counterpoint Metadata is implemented. It spans the Data Source Metadata and the Business Metadata, giving traceability from the source data to the business information (or vice versa) at various levels of abstraction. It is primarily used by analysts and designers, yielding the following advantages.

- During the requirements analysis and design phases, it is advantageous to understand the gaps in the data sources with respect to the business, before implementing the BI solution.
- During the maintenance and support phase, this metadata enables rapid impact analysis in case of:
  - Changes in business definition
  - Changes in data source
  - Addition of a new data source mapping to existing business functionality
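A minimal sketch of such a counterpoint trace in Python, with hypothetical source fields and business terms; the point is only that a single lookup answers the impact-analysis question:

    # Counterpoint links: source-system field -> business terms it feeds.
    # All field and term names are hypothetical.
    counterpoint = {
        "CRM.CUSTOMERS.NAME": ["Customer Name"],
        "BILLING.INVOICE.AMT": ["Invoice Amount", "Monthly Revenue"],
    }

    def impact_of_source_change(source_field):
        # Business terms affected when a source field changes.
        return counterpoint.get(source_field, [])

    print(impact_of_source_change("BILLING.INVOICE.AMT"))
    # ['Invoice Amount', 'Monthly Revenue']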

Chapter 2 Business Intelligence Architecture Scenarios


These are some of the most common combinations which could exist to enable Business Intelligence.

One tier Data Warehouse


This is the most basic set-up: a staging database accessed directly by end-user tools. It may look like a staid scenario, but even today more than 80% of enterprises work on this set-up.


This is the most basic of the information delivery topologies and does not include any Data Warehouse or Data Mart. In this topology, data is pulled out of the source systems and placed in a common staging database. Most organizations have this basic level of topology; this is not a strategic decision, but an inevitable operational need. This topology can provide an excellent source of production/scheduled reporting and also some degree of low-intensity analysis using front-end analysis tools.

There can be the following levels of sophistication:


Basic Level

The objective is offline reporting, to ensure no impact on production and to enable reporting across systems.

- The entire set of data tables is copied as-is from the source systems. It may not, however, include the log and other control tables.
- No historical information; each load is a simple overwrite.
- Data is in as normalized a state as in the source systems.
- No transformation or standardization of data.
- No aggregates.
- Standard and scheduled production reports.

A minimal sketch of such a load follows.
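The sketch below uses Python's built-in sqlite3 purely as a stand-in for the source and staging databases; the table and column names are hypothetical:

    import sqlite3

    src = sqlite3.connect(":memory:")   # stands in for a source system
    stg = sqlite3.connect(":memory:")   # stands in for the staging database

    src.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    src.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100.0), (2, 250.0)])

    # Basic level: drop and re-create, then copy everything as-is.
    # No history, no transformation, no aggregates.
    stg.execute("DROP TABLE IF EXISTS orders")
    stg.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    rows = src.execute("SELECT id, amount FROM orders").fetchall()
    stg.executemany("INSERT INTO orders VALUES (?, ?)", rows)

    print(stg.execute("SELECT COUNT(*) FROM orders").fetchone())  # (2,)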



Medium Level:

The objective is a more sophisticated and cleaner staging area that also allows some level of aggregate analysis. This should allow reports covering longer time spans.

- Data is pulled out selectively, in terms of tables and of fields within the tables.
- The historical data is appended.
- Data is in a normalized state.
- Some derived attributes and aggregates are generated.

A sketch of the append-style load follows the list.
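Again with sqlite3 and hypothetical names; the load-date column is what lets reports cover time spans:

    import sqlite3
    from datetime import date

    stg = sqlite3.connect(":memory:")
    stg.execute("CREATE TABLE orders_hist (id INTEGER, amount REAL, load_date TEXT)")

    def append_daily_extract(rows, load_date):
        # Medium level: selected fields only, history appended, not overwritten.
        stg.executemany("INSERT INTO orders_hist VALUES (?, ?, ?)",
                        [(r[0], r[1], load_date) for r in rows])

    append_daily_extract([(1, 100.0)], str(date(2024, 1, 1)))
    append_daily_extract([(1, 110.0)], str(date(2024, 1, 2)))
    print(stg.execute("SELECT * FROM orders_hist").fetchall())
    # Both days' versions of order 1 are retained.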

High Level:

The objective is a clean repository with a level of standardization and uniformity. It has the following additional features:

- Interlinking of diverse customer codes, references, product codes etc., by changing the codes or by creating mapping tables. This enables true-blue enterprise-wide reporting (see the mapping sketch after this list).
- Medium level of data cleansing, meaning that the key data elements are cleansed.
- Many derived attributes and aggregates.
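A sketch of such a mapping table in Python, with hypothetical source names and codes; each source's local code resolves to one enterprise-wide key:

    # Cross-reference table: (source system, local code) -> enterprise key.
    code_map = {
        ("CRM", "C-1001"): "ENT-1",
        ("BILLING", "9001"): "ENT-1",   # same customer, different local code
        ("CRM", "C-1002"): "ENT-2",
    }

    def enterprise_key(source, code):
        return code_map.get((source, code), "UNMAPPED")

    print(enterprise_key("BILLING", "9001"))  # ENT-1

Reporting on the enterprise key rather than the local codes is what makes the reporting enterprise-wide.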

If an organization has reached the High Level state (without any Data Warehouse), it has progressed at least 30% of the way on the Information Management journey. By this time the organization has understood most of its data quality issues and resolved some of them. The reporting teams have gone through their first set of Extraction and Transformation experience.

Two tier Data Warehouse Architecture- Independent Data Marts


This is the next evolution state for an enterprise, before we get to a Data Warehouse. It involves a Data Mart, or a set of independent Data Marts, sitting on top of a single or multiple staging databases. A Data Mart evolves as a quick solution to a business need. This results in individual businesses and system owners creating their own data marts to meet their specific needs. A Data Mart does not work in isolation: it needs a staging database to get the data, and OLAP tools to analyze the data it contains. This leads to various variants of this topology.


NOTE- The staging database for production reporting and transaction-level queries can be different from the staging database used for the Data Warehouse/Data Marts.

Variants in the Two Tier BI architecture are as follows:


The Single Staging and Single Data Mart

This is the most basic variant. A single staging area is created, which undergoes the standard Extraction and Transformation process. The data from this staging is loaded into a data mart. The Extraction and Transformation is fairly limited, as we are looking at only the select tables and ETL operations relevant for the given data mart and its purpose. Therefore, this variant doesn't have the concept of standard dimensions and measures. It may not always have a Dimensional Model, if the purpose of the Data Mart does not require too much ad-hoc querying and the volume is also limited. For example, a Data Mart made to analyze reserves, to fine-tune the reserve allocations at the end of every month, may not use a Dimensional Model. This is because the use of this data mart is low volume, the number of users is fairly limited, and the type of analysis is predictable to a certain extent.
Multiple Staging and Multiple Data Marts

This typically is the next stage of evolution, and the most common one in large enterprises. Once a business owner showcases the advantage of a Data Mart, there is a lot of demand for creating new Data Marts. Technology mostly does not get time to work out an overall strategy to have a common staging database to service all these Data Marts. This leads to individual staging areas getting created, each mostly serving one (sometimes more than one) Data Mart. While the upside of this set-up is the quick spread of Data Marts and their business relevance, there are many risky downsides:

- Individual staging databases may end up extracting the same information multiple times from the same source system. This increases the end-of-day extraction load and time window.
- Without common control, the Transformation/Cleansing processes may end up producing inconsistent data across the data marts. This leads to lack of confidence.
- Loss of control and inefficient use of resources.

Single Staging and Multiple Data Marts

This is the most ideal set-up in the given topology. It includes a common staging database, which provides loaded data into multiple independent Data Marts. Let's see how we reach this situation:

The start point is the previous scenario (multiple staging areas and multiple Data Marts). As the downsides of this scenario come into play, a decision is made to have a common staging area.

A common staging area is created quickly. All the information from the source systems is picked up as-is, and the rest of the processes (which used to happen in the individual staging areas) are replicated in the common staging area as-is. Thereafter, the Transformation/Loading processes and data sets follow the same isolated paths as they did in the individual staging areas. At this point, the extraction load on the source systems is taken care of.

As time passes, the Transformation processes happening in isolated fashion for the individual data marts (within the common staging area) are rationalized. One may therefore end up with a common table of transformed and cleansed customer data, while the output data sets for loading into the data marts are still different: two different customer datasets are created to be loaded into the two data marts (see the sketch below).
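A sketch of that end state, with hypothetical fields: one cleansed customer table in the common staging area feeding two different load datasets, one per data mart:

    # One transformed & cleansed customer table in the common staging area.
    customers = [
        {"key": "ENT-1", "name": "ACME CORP", "segment": "SME", "region": "EU"},
        {"key": "ENT-2", "name": "GLOBEX", "segment": "LARGE", "region": "US"},
    ]

    # Two different output datasets, shaped for two different data marts.
    sales_mart_load = [{"key": c["key"], "region": c["region"]} for c in customers]
    risk_mart_load = [{"key": c["key"], "segment": c["segment"]} for c in customers]

    print(sales_mart_load)
    print(risk_mart_load)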


Two tier Data Warehouse Architecture- Staging and Data Warehouse


This is the utopian scenario, where an enterprise staging area supplies data to an enterprise Data Warehouse. This is close to the point of arrival in Data Management. We are talking here of a single staging area leading to an enterprise-wide Data Warehouse. This scenario is covered in detail in the Data Warehouse section.

This topology also has some variants, which are covered here. The details of setting up a Data Warehouse are covered in a separate section. The variants are:
Data Warehouse with a limited coverage

A large organization has a lot of systems, and even the best data warehouse will not be able to cover all of them. Data Warehouse is a relative and not an absolute term. Please refer to the difference between the Data Warehouse/Data Mart and their respective criteria. If there is an initiative which meets the DW criteria, it can be called a data warehouse, even if it does not cover all possible systems. Typically an organization will have its first data warehouse covering the most critical business themes. This will further expand to include more and more themes over time.
Multiple Data warehouses

If an organization is large enough and has hundreds of systems, it may end up with multiple information delivery platforms, each being a Data Warehouse in its own right. However, these Data Warehouses will not have overlapping purposes.

Three Tier Data Warehouse Architecture - Business Intelligence

This scenario has a Data Warehouse supplying data to Data Marts. In environments which are advanced and have grown over time, this is the most probable state. In this topology, the staging area feeds into the Data Warehouse, and the Data Warehouse feeds into the independent Data Marts.

This BI Architecture raises two questions:

The first question is why to have an additional layer of data marts, if a Data Warehouse exists. There are many reasons for the same:


- A Data Mart already in use: if a data mart exists and has been used extensively, this approach ensures complete transparency from a user point of view.
- Reducing network traffic: in large organizations, you may not want users across the globe accessing a central Data Warehouse. Data Marts on local servers are created to allow faster response with reduced WAN traffic.
- Security: a Data Mart creates a physical barrier to deter users from accessing irrelevant information.
- Easy operation: for power users it is important to be able to configure, browse and create views over the limited set of dimensions and measures useful to them.

The second question is the reverse of the first: what's the need for a Data Warehouse, when Data Marts exist? One could directly load the data from the staging database; the concept of foundation dimensions and measures, and the Dimensional Model, can be implemented at the staging database level. The reasons for the relevance of the Data Warehouse in this topology are as follows:

- A Data Warehouse provides quick refreshes for the data marts: a loading process from staging takes a greater amount of time. As the Data Warehouse structure and layout are very similar to a Data Mart's, this refresh takes less time.
- Flexibility in creating or changing the Data Marts: in this topology, a Data Mart can be changed quite fast (unless it introduces a dimension, measure or attribute which does not exist in the Data Warehouse), as the change has no impact on the Data Warehouse design or the ETL process. If loading came directly from staging, changes to the Data Mart design would impact that loading process.
- A Data Warehouse may still be accessed for information outside of a Data Mart: sometimes a data warehouse will be accessed for transaction-level reports or ad-hoc analysis requirements which don't warrant the creation of a Data Mart. The core purpose of the Data Warehouse, to serve as the enterprise system of record, still needs to be fulfilled.

BI Mixed Data Access


In the real world, end-user tools access data from various sources. As the following diagram shows, various tools can access data for various purposes from various sources throughout the information delivery chain.


Some of the reasons for this mixed access are:


- Already existing ties of the staging area with existing data marts.
- Production reports being taken out of the normalized area of staging to spare load on the Data Warehouse.
- Overload on a given data mart or data warehouse, necessitating a parallel but integrated source.

Business Intelligence Metadata Architecture Scenarios


Business Intelligence Metadata Architecture has essentially two scenarios: Centralized and Distributed. This section provides a representation of, and comparison across, these two scenarios.

Business Intelligence Metadata Repository


Before we proceed to understand the key BI Metadata architecture scenarios, let's take a quick look at the core and common component. The metadata component of the BI system should be populated and connected across various components, such as the ETL tool, the database, the data modeling tool and the reporting and querying tool, to the extent the available facilities of the various tools permit. Metadata should be used for scheduling various system management activities, to the extent possible. The sources of metadata are expected to be:

- Data modeling tool
- ETL tool
- RDBMS and MDDB
- Reporting and querying tool

In addition, business metadata can also be stored in an appropriate metadata repository. The capabilities of the BI tools should be exploited for this purpose. The following diagram depicts the position of the metadata repository in the BI architecture.

There are two distinct strategies for managing metadata in a heterogeneous BI environment.

- Centralized Metadata
- Distributed Metadata

Centralized Metadata

The following diagram shows a generic BI system with a centralized metadata management strategy.


The centralized metadata architecture ensures:

- Standardized metadata across different systems
- No replication of metadata across systems, and hence no need for synchronization of metadata across the components used
- No need for maintaining bi-directional connections between various tools for metadata exchange
- Minimal effort in system integration
- Optimal hardware resource requirements

Distributed Metadata

The following diagram gives a representation of a BI system with a distributed metadata management strategy.


The main drawback of the distributed metadata architecture is its metadata distribution mechanism, which goes against the Data Warehousing principle of possessing a single version of truth at a centralized location. While metadata changes less frequently than the data, metadata updates are more complex to deal with. This is because metadata updates affect not only the data that is described (e.g. deletion/insertion of a data element) but also other objects, due to metadata interrelationships (e.g. referential integrity constraints). Also, synchronization of repositories which share metadata with each other needs to be accomplished. In particular, updates of replicated metadata need to be detected and propagated in order to keep this metadata consistent. Furthermore, updated metadata needs to be applied (i.e. integrated) within a repository, e.g. to keep interrelationships with metadata from other repositories consistent.
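As an illustration of the detect-and-propagate step, here is a minimal sketch assuming two repositories that replicate user-profile metadata and a simple version counter; real tools use far richer change-capture mechanisms:

    # Two hypothetical repositories replicating the same user-profile metadata.
    repo_a = {"user_profiles": {"alice": "analyst"}, "version": 1}
    repo_b = {"user_profiles": {"alice": "analyst"}, "version": 1}

    def propagate(source, target):
        # Detect a newer version and push the replicated metadata across.
        if source["version"] > target["version"]:
            target["user_profiles"] = dict(source["user_profiles"])
            target["version"] = source["version"]

    repo_a["user_profiles"]["alice"] = "admin"   # an update in one repository
    repo_a["version"] += 1
    propagate(repo_a, repo_b)                    # keep the replica consistent
    print(repo_b)  # {'user_profiles': {'alice': 'admin'}, 'version': 2}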
Comparison

The aim of metadata management is always to achieve the centralized metadata architecture, but due to the limitations of the tools and their functionality it may not be possible to achieve it at present. Hence the distributed metadata architecture is seen in most implementations. The following comparison covers these two architectures, aspect by aspect.

1. Number of repositories
Centralized: Only one centralized repository is needed.
Distributed: All the tools possess their own metadata repositories.

2. Replication of metadata
Centralized: None.
Distributed: Sometimes it is necessary to replicate the metadata across multiple tools, e.g. user profiles.

3. Tool independence
Centralized: The BI system architecture fully depends upon the tool chosen, as there is only a single tool to take care of the entire BI system.
Distributed: As multiple tools can be involved in this architecture, a set of tools can be chosen to get maximum functionality/facility coverage. But this is usually at the cost of seamless system integration.

4. System integration
Centralized: As only one tool or a set of tools from a single vendor is involved in this architecture, the system integration is seamless and hence compatibility issues of various components do not arise.
Distributed: Integrating tools from various vendors is one of the greatest challenges in this architecture. The number of tool-to-tool connections/compatibility issues and the mapping overhead are significant. Usually a POC is recommended to ensure the compatibility among various tools before the implementation begins.

5. Metadata synchronization
Centralized: No synchronization is necessary.
Distributed: Synchronization of repositories sharing metadata with each other needs to be accomplished. In particular, updates of replicated metadata need to be detected and propagated automatically in order to keep this metadata consistent.

6. Metadata exchange
Centralized: No metadata exchange is necessary.
Distributed: All tools communicate with each other to exchange metadata, generating numerous bi-directional connections. But a few of the tools may not be able to communicate at all, or may need other tools to provide a channel to communicate with them.

7. Hardware capacity requirements
Centralized: Optimum hardware is required.
Distributed: Various tools demand different hardware capacities. The accumulated hardware requirement is always larger than what is needed for the centralized metadata architecture.

8. Example
Centralized: Informatica Suite of Products.
Distributed: Architecture using various products like Informatica Power Center, Business Objects, Hyperion Essbase, MetaCenter etc.

The above comparison gives a clear picture of why the centralized metadata architecture yields a more cost-effective solution than the distributed architecture.


Chapter 3 OLAP Architectures


MOLAP- Multi-dimensional OLAP architecture
MOLAP is the acronym for Multi-Dimensional OLAP. This architecture has summarized data stored in a multi-dimensional database format. A multi-dimensional OLAP fulfils the requirements of analytic applications where you need to access only summarized levels of data.


This method stores the data in multi-dimensional arrays, which is different from the two-dimensional relational structure. The way the storage works, it reduces the need for database space for a small to medium number of dimensions, and can store data more efficiently than a relational database. However, as the dimensions (or their attributes) go up, the storage becomes inefficient relative to an RDBMS. A MOLAP solution gives faster response times to queries, as its method of storage allows quicker access and retrieval of summary information. The details of the storage (multi-dimensional arrays) are not covered at this point; the sketch below illustrates the idea.
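A toy sketch with a hypothetical two-dimensional product-by-month cube: cell access is direct indexing into a dense array, whereas the relational equivalent stores one row per cell and must scan or join:

    # A 2-dimensional cube (product x month) stored as a dense array.
    products = ["P1", "P2"]
    months = ["Jan", "Feb", "Mar"]
    sales = [
        [10.0, 12.0, 9.0],    # P1
        [7.0, 8.0, 11.0],     # P2
    ]

    # Cell lookup is direct array indexing -- no scan, no join.
    def cell(product, month):
        return sales[products.index(product)][months.index(month)]

    print(cell("P2", "Feb"))  # 8.0

    # The relational equivalent: one (product, month, amount) row per cell.
    rows = [(p, m, sales[i][j]) for i, p in enumerate(products)
                                for j, m in enumerate(months)]
    print(len(rows))  # 6 rows for 6 cells

With only a couple of dimensions the array is compact; as dimensions and attributes multiply, most cells are empty and the dense layout wastes space, which is the inefficiency noted above.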

ROLAP- Relational OLAP architecture


ROLAP stores the data in relational format, while presenting it in multi-dimensional format. The ROLAP architecture provides a multi-dimensional view of the data to the user, but stores the data in the relational database format. This architecture follows the logic that relational storage of data can give a much higher level of scalability, can absorb as many dimensions etc. as needed, and can provide faster response times due to indexing and other features.

This logic becomes more valid when you store the data in the RDBMS in a dimensional, partially de-normalized model and not in a traditional E-R normalized model. The way ROLAP works is that the data stays stored in the data warehouse; unlike MOLAP, there is no database as such in a ROLAP server. The ROLAP server contains the analytics objects, which create dynamic aggregations of the RDBMS data in the Data Warehouse and present them to the user, as sketched below.
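A sketch of such dynamic aggregation, assuming a hypothetical star schema (FACT_SALES with REGION, MONTH and SALES_AMT); the ROLAP layer simply generates aggregate SQL against the warehouse on request:

    def rolap_query(dimensions, measures, fact="FACT_SALES"):
        # Build an aggregate SQL statement on the fly from the user's request.
        select = ", ".join(dimensions + ["SUM({0}) AS {0}".format(m) for m in measures])
        group = ", ".join(dimensions)
        return "SELECT {0} FROM {1} GROUP BY {2}".format(select, fact, group)

    print(rolap_query(["REGION", "MONTH"], ["SALES_AMT"]))
    # SELECT REGION, MONTH, SUM(SALES_AMT) AS SALES_AMT
    # FROM FACT_SALES GROUP BY REGION, MONTH

No cube is materialized; the relational engine does the aggregation each time, relying on indexing (and, in practice, aggregate tables) for speed.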

HOLAP- Hybrid OLAP architecture


This is the balance between ROLAP and MOLAP, a combination of the two. This architecture has MOLAP storage of summary data, and this summary data has links to the detailed transaction-level data in the Data Warehouse RDBMS, enabling the user to drill down to the lowest level of detail. Apart from that, it also allows direct access to data in the Data Warehouse which is not considered worthwhile to translate into OLAP cubes.
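A minimal sketch of the drill-through idea, with hypothetical data: summary questions are answered from the MOLAP store, and the drill-down follows a link into the transaction-level rows in the warehouse:

    # Hypothetical summary cube (MOLAP store) and detail rows (warehouse RDBMS).
    cube = {("EU", "Jan"): 17.0}
    detail = [("EU", "Jan", "order-1", 10.0),
              ("EU", "Jan", "order-2", 7.0)]

    def query(region, month, drill_down=False):
        if not drill_down:
            return cube[(region, month)]      # fast summary from the cube
        # Drill-through: follow the link down to transaction-level detail.
        return [r for r in detail if r[0] == region and r[1] == month]

    print(query("EU", "Jan"))                     # 17.0
    print(query("EU", "Jan", drill_down=True))    # the two underlying orders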

OLAP Architecture choice


Each of the architectures has its own context and relevance. When you select an architecture, you have to look beyond the architecture itself. A lot depends on the kind of OLAP capabilities provided by the tool vendor. Commercial aspects are also an important factor. There is an entire tool-vendor evaluation exercise one has to go through to make a final choice. Here, we will cover the textbook scenarios on what to use where, purely from the architecture perspective.



Dimensional Model based HOLAP

The best textbook scenario is when you have a relational database in your Data Warehouse designed as per the Dimensional Model. A Dimensional Model in the Data Warehouse ideally should not carry aggregates; that job should be left to a multi-dimensional cube in OLAP or other reporting/end-user tools accessing the Data Warehouse repository. This keeps the whole arrangement clean, and the division of labor crystal clear. HOLAP allows the Data Warehouse to maintain its core task of keeping the detailed system of record, and provides the best possible design to enable end-user tool productivity. The summarization, aggregation etc. can be handled at the cube level. HOLAP, however, will work only when there is a limit to the number of dimensions in a cube and we are not talking about large amounts of data.
ROLAP

This is useful for very large organizations asking for a large number of cubes, dimensions and transaction details. This may, however, give rise to the need for creating many aggregated schemas within the Data Warehouse. Most MOLAP vendors today harp on the inefficient query response of RDBMS storage. However, this argument partially becomes invalid if the RDBMS design follows the Dimensional Model.

Appendix
