
Building A Data Warehouse on the HP3000:

A Practical Example
Cailean Sherman - Taurus Software
Cheryl Grandy - DISC
Jennie Hou - HP
Taurus Software
1032 Elwell Court
Suite 245
Palo Alto, CA 94303
650-961-1323
cailean@taurus.com
Dynamic Information Systems Corporation
Boulder, CO 80301
303-444-4000
cag@disc.com
Hewlett-Packard
19447 Pruneridge Avenue
Cupertino, CA 95014
408-447-5971
Jennie_Hou@hp.com
408-447-5649
Alvina_Nishimoto@hp.com
Building A Data Warehouse on the HP3000:
A Practical Example

Table of Contents
WHY A DATA WAREHOUSE OR DATA MART?
WHAT IS A DATA WAREHOUSE OR DATA MART?
Technical goals of the data warehouse
Technical structure of the data warehouse
The data
The data model: Re-architected database - star schema
The environment
BUILDING A DATA WAREHOUSE STEP BY STEP
Analyze the business needs and determine value
Determine the scope of your project
Building a target model
Map the data
Extract operational data and load the target
Optimize queries
End user interface and deployment
Determine the value of the mart
CONCLUSION

Overview
This paper explains the steps involved in building a data warehouse on the HP3000 and how that data
can be used for in-depth analysis of a variety of business parameters. It will discuss the nature of data
warehouses and data marts, the technology used to build them, and their value to an organization.
Because data warehouses can adapt to the needs of any organization, this paper will detail the process
one company went through to decide what they wanted in their data mart and their step-by-step
implementation.
Why a data warehouse or data mart?
The literature of data warehousing is studded with sound bites that suggest how companies use data
warehouses and data marts for competitive advantage. No company in its right mind will give away real
secrets, so the examples below are more illustrative than real.

A supermarket found a correlation in its data warehouse that linked sales of beer and diapers on
Friday evenings between 5 and 7 PM. By sending observers into stores at that time, they found
that dads with infants were doing the shopping. The retailer placed the highest-margin diapers
alongside the beer. Sales and profit on diapers soared.

A web retailer of women's clothes sold an expensive mail list to a Weight Loss Clinic for all
customers whose dress size increased from the previous order.

A Managed Care Organization found that 31% of its medical costs came from 8% of its members,
and that the MCO was losing money on that group. By focusing on this segment, the MCO put in
place programs to return the group to profit.

How to find the value in a data mart


The goal of data warehousing is to increase profitability. Any investment proposal that increases net
Earnings Before Interest and Taxes (EBIT) will attract off-budget and immediate funding if the return is
large and the risk small. Driving up EBIT can be done by increasing revenues, cutting costs or improving
margins, separately or in conjunction.
Cutting Costs
Most companies have cut their costs to the bone by reducing headcount and automating processes.
There is usually very little fat left that a data warehouse or mart can help trim, with the exception of
Inventory Management, if that function is unsophisticated.
Increasing revenue and improving margins can be achieved in a number of ways:
1. Discount analysis: In a direct sales operation, you can carefully monitor which rep gives what
discount percentage to which customers and when. Cutting the overall rate of discounting to win
business by 1% increases sales in a $100 million company by $1 million. The same amount
drops straight to EBIT. If your sales are larger, or your gain is more than 1%, the ROI for a Billings
or Bookings data warehouse or mart can be staggering.
2. Customer profitability: You can find out who your best customers are, not only in sales volume,
but also in EBIT, if you can factor in informal estimates of the costs of servicing the customer.
Maybe you are consciously buying market share by selling to some customers at a loss or at
minimal profit. If not, a Sales Analysis warehouse or mart will help identify your most profitable
customers. Your sales force can then find more customers like them, and you can drop or re-price
the less profitable customers.
Customer lifetime value can reveal that you should spend most of your effort on your base rather
than looking for new accounts, with profound implications for your sales organization's
productivity.
3. Product profitability: The same analysis can be applied to products.
Most companies sell a wide range of products to one customer with a wide range of individual gross
margins. If you could satisfy the customer with a product mix with adequate function, with an average 1%
increase in margin, and do it without reducing revenues, the 1% and $1 million or more would drop
straight to EBIT.
If your customer set and your product set both can be improved in this fashion, then you can adopt both
cross-selling and up-selling. The product becomes a weapon for targeting your best customers with new
product and higher margin variants of the products they already buy.
ROI is of little relevance unless you can find a senior line executive who is directly affected by the ROI,
and who is willing to go to bat for the data warehouse or data mart. "Directly affected" means that the ROI
favorably affects their personal Key Performance Indicators (KPIs). VPs of Sales are keenly interested in
anything that promises more sales. VPs of Manufacturing like a firm schedule that makes high margin
products. CFOs drool over anything that offers a sniff at improving EBIT.

What is a data warehouse or data mart?


Let's go to the source for this definition. In the words of Ralph Kimball, one of the gurus of data
warehousing, "A data warehouse is a powerful database model that significantly enhances managers'
ability to quickly analyze large, multidimensional data sets." A data warehouse contains cleansed and
organized data that allows managers to make business decisions based on facts, not on
intuition.
The "data mart" is collected data in a single subject area, usually extracted from the data warehouse or
pulled directly from an operational system.
Experience has been that companies build many marts, and all of these may need similar source data. A
company may start with a Billings Mart containing information on sales derived from invoices. Later they
may decide to have a Finance Mart containing information on profit derived from the same information.
Why should both marts pull this data directly from the source system? A data warehouse can feed both
sets of marts and queries, thereby reducing redundant work.

Technical Goals of the data warehouse or data mart


To provide the information users need to increase profitability, you must meet both of the following goals of
the data warehouse or data mart:
To provide easy access to corporate data:
1. The end user access tools must be easy to use. The end user access tools should live on the PC
in an environment familiar to the end user. They must have a short learning curve and be highly
intuitive. If you have high level managers asking questions of the data warehouse, they must
have a simple way to interact with it.
2. The access should be graphic, so it is easy to spot trends. People have been using Microsoft
Excel for years. Spotting trends in a columnar report listing 500 customers' sales for the quarter
is a very tedious process. However, if you take the same list and chart it graphically, it is easy for
the human eye to spot the winners and losers.

3. The access should be manageable by end users. MIS is no longer the end user of the tools that
access the data. End users are analysts, mid-level managers, even in some cases, high level
managers. They must be able to easily get answers to their questions, and ask new questions, all
without getting MIS involved.
4. The process must be fast. Questions leapfrog, so you have to get answers fast. The very nature
of data analysis is that not all requests are known beforehand, so there's a lot of unpredictable, ad
hoc inquiry. In an analytical environment, the end user is directly interacting with the data, rather
than looking at a printed report. The answers have to be delivered fast, before users lose their
train of thought. If you lose the interest of a user, they won't come back!
To provide data for analysis that is clean and reliable
1. For consistent analysis, the environment must be stable. Now that users can create their own
reports, they must have consistent data. One department doing an analysis must get the same
results as any other. You can't have two managers showing up at a meeting with different sales
figures for last month!
2. Source conflicts must be resolved. Transactional (production) systems often have data stored in
several different applications. Customer addresses, for example, may be in the invoicing system,
as well as the sales order system. Chase Manhattan Bank may be stored as "Chase" in one
system and "Chase Manhattan Bank" in another. A straight load of data for this customer into the
data warehouse would result in two customers being represented, with invoices listed that have
no supporting sales orders and vice versa. These conflicts in the source data need to be
resolved and merged as they move into the analytical environment.
3. Historical analysis must be possible. There are two aspects to historical data that must be
considered. The amount of historical data in the warehouse has to be sufficient to satisfy the need
to spot trends over time. To see the growth in sales of a product over 3 years, perhaps to help
predict sales of future products, you need three years of data. The second historical consideration
is the ability to see the state of affairs at a historical point in time. Take, for example, a situation
where you want to see what your inventory amounts have been over the last year to help
determine if you can carry less inventory and still satisfy demand. You may need to be able to see
what your inventory amount was for product X on the 3rd, 4th, and 5th of July. As data is added to
the data warehouse, it needs to be added and date stamped - not updated like it would be in the
transactional inventory system.
4. Performance of operational systems must not be impaired. It's a given that the Data Warehouse
will run on a separate, dedicated system to avoid conflicts with the performance needs of the
operational systems. Extracting production data to populate the data warehouse will have some
impact on the operational system. Consideration must be given to whether full loads of data are
practical, or whether incremental loads are needed, using only changes since the last load.

Structure of the data warehouse


To create a data warehouse, there are several technical requirements involved. Creating data to be
analyzed requires that the data be subject-oriented, integrated, time referenced and non-volatile. Making
sure that the data can be accessed quickly and can meet the ad hoc queries that users need requires that
the data be organized in a new database design, the star or multidimensional data model. And there is
a whole new set of systems and tools required to build and maintain this new data environment.

The Data

Subject-oriented
Data warehouses group data by subject rather than by activity. Transactional systems are organized
around activities - claim processing, ordering product, shipping product. Data organized by transaction
cannot answer questions asked by subject, such as "how many shoes did we sell in Europe this year?"
This request would require heavy searching of sales orders and customer address records. If the
transactional system does not have a key or index on the country in the address record, the query would
most likely be very slow.
Data used for analysis needs to be organized around subjects: customers, sales, and products. Because
of the design that will be used for the data warehouse or data mart, if you ask for the sales in Europe, you
are able to search through very few records to get the answer to your question.

Integrated
Integrated refers to de-duplicating data and merging it from many sources into one consistent location. If
you are going to find your top 20 customers, you must know that HP and Hewlett Packard are the same
company. There must be one customer number for any form of Hewlett-Packard in your data warehouse
or mart. You can't have one from the sales system that is alphanumeric, and one from the shipping
system that is a concatenation of their company name and phone number.
Much of the transformation and loading work that goes into the data warehouse or mart occurs here,
integrating data and standardizing it. There are often problems found in the source data that need to be
corrected as well, such as absence of data.

Time Referenced
The data warehouse user typically asks questions about prior states of being of the data. For example:
"What was my backlog this month last year vs. this year?" or "How many refrigerator units were in my
service department this time two years ago?" To answer these questions, you need to know what the
order status was at a point in time last year, or what a service inventory count was at a historic point in
time.
Data warehouses handle this technically by time stamping data as it is added to the warehouse and by
continuing to append data to the warehouse without replacing or updating existing information. Many
snapshots of data are moved into the warehouse, one on top of another.
In this way, you can accurately see the inventory count at a certain time, or the backorders outstanding on
a specific day. This allows us to answer questions that compare past performance to the present,
providing the ability to spot trends in the data and to see if these trends are consistent across different
product lines. That makes it possible to better forecast future performance with facts rather than intuition.
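A minimal sketch of this append-and-date-stamp approach follows, written in Python with the standard sqlite3 module purely for illustration; the table and column names are hypothetical. Each load appends a new snapshot row rather than updating an existing one, so a point-in-time question becomes a simple filter on the snapshot date.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE inventory_snapshots (
    snapshot_date TEXT,      -- date stamp added at load time
    product_nbr   TEXT,
    on_hand_qty   INTEGER)""")

# Each nightly load appends one row per product; nothing is ever updated.
loads = [
    ("1999-07-03", "X", 120),
    ("1999-07-04", "X", 95),
    ("1999-07-05", "X", 140),
]
conn.executemany("INSERT INTO inventory_snapshots VALUES (?, ?, ?)", loads)

# "What was my inventory of product X on July 4th?"
row = conn.execute(
    "SELECT on_hand_qty FROM inventory_snapshots "
    "WHERE product_nbr = 'X' AND snapshot_date = '1999-07-04'").fetchone()
print(row[0])   # 95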

Nonvolatile - read only


The data warehouse contents are typically not modified or updated. There are several reasons for this:
First, there really is no reason to update a data warehouse. As mentioned, it is a snapshot in time. You
would not want to go back and change history. If a data element is bad in the warehouse, you don't as a
rule want to make changes to your company's data in the warehouse environment. You typically want to
change it in the source (transactional) system or in the load process. In that way, when the next batch of
data gets loaded, the change continues to be implemented, and the history of the change is recorded.

Second, the data warehouse model is designed for queries and bulk loads; updates can be very
expensive. Data warehouse models often replicate data in many places (de-normalize the data). An
example of this would be a Health Insurance Claim Data Warehouse. Each insured member belongs to
an insurance plan and each plan belongs to a group. You may set up a table listing each Plan and the
Group it belongs to. A Group is now repeated several times in the database, once for each Plan that
belongs to it. This design is used to make it easy to query via plan or group. If, however, you need to
change a group name, you have to update many locations, and that can be very costly.
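A minimal sketch of that trade-off, in Python with hypothetical plan and group values, shows why warehouse data is loaded and read rather than updated in place.

# Denormalized: the group name is repeated on every plan row, which makes
# querying by plan or by group a simple filter over one table...
plans = [
    {"plan": "Gold PPO",   "group": "Acme Mfg"},
    {"plan": "Silver HMO", "group": "Acme Mfg"},
    {"plan": "Basic HMO",  "group": "Acme Mfg"},
]

# ...but renaming the group means touching every one of those rows. Multiply
# this by gigabytes of history and heavy indexing, and in-place updates
# become prohibitively expensive.
for row in plans:
    if row["group"] == "Acme Mfg":
        row["group"] = "Acme Manufacturing Inc."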
Third, data warehouse environments are large by nature. They contain historical data, summarized data
and very detailed data. Typically these environments take up gigabytes of space. Updating gigabytes of
data can be very costly.
Fourth, the data warehouse environment is heavily indexed. To allow for the slicing and dicing of data that
an analyst requires, many indexes are placed on the data. To update these environments is very CPU
intensive.
In short, data should simply be loaded, then accessed, never updated.

The Data Model: Re-architected database - Star schema.


Transactional systems are designed to push one transaction at a time through the system at high speed.
The database organization is designed for processing efficiency, with minimum redundancy. This
"normalized" database organization is complex and very inefficient for data analysis purposes.
The data warehouse adopts a data model called a star or snowflake schema that maximizes the
performance of queries. The star consists of one large central fact table (data set) in the center of the star.
It is the only table (data set) in the schema with multiple joins (paths), each connecting it to other satellite
tables (data sets) called "dimension" tables.
The central fact table of a star schema is where the "numerical measurements of the business are stored.
Each of these measurements is taken at the intersection of all the dimensions." (Ralph Kimball)
The fact table is linked to several dimension tables that contain data describing more detail about the fact
they are linked to. These dimensions are used to query quickly into the fact table via keys - the fields in
these dimensions are your query data points.
The data in this organization is deliberately designed for redundancy so that users will find it easy to
create queries and they can run very fast, with minimal linking between tables.
If a database is going to be used for analysis, it should be designed for analysis - hence the star schema.
See the following graphic for an example of a star schema. The fact table is INVOICE_LINES. The rest of
the tables are dimensions.
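As a rough textual stand-in for that diagram, here is a minimal sketch of the star schema expressed as SQL DDL and run through Python's standard sqlite3 module purely for illustration. The five dimension tables match those used later in the case study; the individual column names beyond those discussed in the text are assumptions.

import sqlite3

ddl = """
CREATE TABLE CUSTOMERS    (CUST_ID INTEGER PRIMARY KEY, CUST_NBR TEXT, CUST_NAME TEXT);
CREATE TABLE ADDRESSES    (CUST_ADDR_ID INTEGER PRIMARY KEY, CITY TEXT, COUNTRY TEXT);
CREATE TABLE PRODUCTS     (PRODUCT_ID INTEGER PRIMARY KEY, PART_NBR TEXT, PRODUCT_LINE TEXT);
CREATE TABLE SALES_PEOPLE (SALES_ID INTEGER PRIMARY KEY, REP_NAME TEXT, REGION TEXT);
CREATE TABLE PERIODS      (PERIOD_ID INTEGER PRIMARY KEY, CAL_DATE TEXT, CAL_MONTH TEXT, CAL_YEAR INTEGER);

-- The central fact table: one row per invoice line, carrying a foreign key
-- to every dimension plus the numeric measures.
CREATE TABLE INVOICE_LINES (
    INVOICE_NBR   TEXT,
    LINE_NBR      INTEGER,
    CUST_ID       INTEGER REFERENCES CUSTOMERS,
    CUST_ADDR_ID  INTEGER REFERENCES ADDRESSES,
    PRODUCT_ID    INTEGER REFERENCES PRODUCTS,
    SALES_ID      INTEGER REFERENCES SALES_PEOPLE,
    PERIOD_ID     INTEGER REFERENCES PERIODS,
    QUANTITY      INTEGER,
    LIST_PRICE    REAL,
    BILLED_PRICE  REAL);
"""
sqlite3.connect(":memory:").executescript(ddl)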

You can see from the diagram above that it is easy to find sales using various combinations of criteria
such as a particular product for a calendar month. You can select all of the invoices for that calendar
month then easily find the parts you are inquiring about. What could have been a large serial pass to read
through millions of records in a transactional system turns into a finely tuned pass through a fraction of
the records using the star schema data model.
Relational databases may be used to house these star schemas, and until recently, these databases were
the only environments accessible by the popular end-user tools that did the analysis needed for decision
support. With DISC's introduction of specialized OMNIDEX indexes for Data Warehousing, Image/SQL is
now an advantageous option as a data warehouse repository. In fact the capabilities of an IMAGE
database enhanced with OMNIDEX go far beyond standard relational database technology.

The Environment
To satisfy the goals of the data warehouse and to accommodate the new data structures, several types of
tools are used today to populate the data warehouse, house the data warehouse, index the data for query
optimization, and allow fast user access.

The diagram above shows the different technologies used. Reading from right to left there are Extraction
processes to move data out of the transactional systems and into the warehouse. There is the database
management system that houses the warehouse. There are indexes added to the database to optimize
the queries. And there are the user access tools.

Extraction - ETL
There are two ways to populate a data warehouse. You can use custom programs written in a language
like Cobol. Or you can use tools built specifically to move data.
Custom programs have some advantages: You may have a staff of programmers who are familiar with a
language. Or you may already have many processes written that you can leverage to extract data from
your transactional systems.
The disadvantage of custom programs is the maintenance cost. Data warehouses are constantly
changing, and if you are using a third generation language to populate the warehouse, it will need to be
re-compiled each time you make a change. Each change made to the mapping and transformation
similarly requires more time to implement in a 3GL than in a tool written specifically to transform data.
ETL tools used to populate a data warehouse must be able to perform the following functions:
1. Accept data from many sources
2. Move data to many targets
3. Handle any needed data transformations

4. Allow inserts, deletions or updates of any data


5. Move data in batch
6. Perform fast loads
7. Handle any network in the mix
8. Keep track of mapping - metadata
9. You may also want to be able to detect changes and incrementally update the warehouse.
Sources and targets must not be limited to databases; they must also include external input that may be
coming via files. The sources and targets must all be accessible for inserts, updates and deletes if an ETL
tool is going to handle all of the data manipulation you will require.
Moving data can be done all at once or incrementally. Take, for example, a data warehouse that is
refreshed each night with new information. You can drop yesterday's copy of the data and reload your full
history (drop and load). The advantage of this method is that the logic is simple. The disadvantage is
that if you have a lot of data it can be too time consuming. It can take more time than the batch window
available to populate the target.
Change-oriented extraction grabs only the data elements that have been added, updated or deleted since
the last refresh of the data warehouse. The advantage to this method is that the volume of data is
significantly less than a full refresh. The disadvantage is that you have to find a way to make sure you
track all changes. Using a date stamp on a record only traps the inserts and updates to that table or
dataset. Deletes are lost. You must either use "triggers" in a relational database or find a way to detect
change in a non-relational database to ensure that you retrieve all changes to the data, including any
deletes.
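A minimal sketch of a timestamp-driven incremental extract follows, in Python with sqlite3 purely for illustration (the table and column names are hypothetical). It naturally picks up inserts and updates, but, as described above, it can never see rows deleted since the last run, which is why triggers or a change-capture facility are needed for deletes.

import sqlite3

src = sqlite3.connect(":memory:")   # stands in for the operational source
src.execute("CREATE TABLE cust (cust_no TEXT, cust_name TEXT, last_updated TEXT)")
src.execute("INSERT INTO cust VALUES ('C100', 'Chase Manhattan Bank', '1999-07-05 14:30:00')")

last_run = "1999-07-04 02:00:00"    # stamp saved from the previous refresh

# Inserts and updates are found by comparing a last-updated time stamp...
for cust_no, cust_name, stamp in src.execute(
        "SELECT cust_no, cust_name, last_updated FROM cust WHERE last_updated > ?",
        (last_run,)):
    print("apply to warehouse:", cust_no, cust_name)

# ...but a row deleted since last_run is simply gone, so this query can never
# report it. Deletes have to be captured another way, for example with
# relational triggers or an operating-system-level change-detect component.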
Data transformations required in populating a data warehouse need to be as robust and encompassing as
third generation languages. Data needs to be cleansed, integrated, standardized, made to follow business
rules, manipulated logically and physically, and flagged when incomplete or in error. Problems in the data are
typically flagged and corrected early on in the process. Not only is the data transformation process a
means to reorganize the data, it is a QA process.
The ETL must be able to take advantage of specific DBMS features to facilitate fast loads of large
amounts of data. It should be able to move data across the network without requiring an additional
software layer that can slow performance.

Database management system and indexing


There are two types of databases typically used to house data warehouses: relational databases and
multidimensional databases. Relational databases are familiar to most of us: Oracle, DB2, and SQL
Server. When using a Star schema model, these databases can provide accessibility to the data
warehouse required by the users. Over the last few years, these environments have been enhanced by
their vendors to improve multi-table queries.
Multidimensional databases were developed to accommodate more ad-hoc queries. They handle
unpredictable access and analysis of the data they contain, allowing the data to be examined in many
ways with optimal performance. The disadvantage of multidimensional databases is that they cannot
handle more than 5 or 6 dimensions before the build and refresh times become unmanageable.
Multidimensional databases do not support updates, take a long time to load, do not handle as much data
as standard relational databases (typically only 50 GB), and do not allow data summaries at levels that are
not predefined.
Many users choose to use both relational and multidimensional databases in their data warehouse
architectures. Often the foundation or data warehouse will use a relational database, while the data mart
or single subject area will be populated in a multidimensional (Cube) model.
There are also tools available which can enhance databases (including IMAGE/SQL) with speed and
capabilities that go beyond relational and multidimensional databases, while eliminating their limitations.
As an example, OMNIDEX for Data Warehousing provides unlimited multidimensional analysis across
any number of dimensions.
OMNIDEX for Data Warehousing also eliminates the need for summary tables in relational databases,
replacing them with high-speed dynamic data summaries. Plus it provides fast access to the underlying
detail data.

End user environment


There are many kinds of questions you can ask of your corporate data warehouse. These seem to fall into
three categories of access methods and tools:
Reporting tools provide pre-defined formats and content of data, either to the desktop or as printed reports.
Some of these Report Writers come from a production reporting background; others are newer, web-based
offerings. Typically, queries are pre-written and executed by users with parameters or templates.
Products that fall into this category are Cognos, Actuate, SQR and Microsoft Access.
Online Analytical Processing (OLAP) tools allow you to perform multidimensional analysis of the data
warehouse. Typically, OLAP lets you answer not only the "what happened" questions that reporting tools
cover, but also "why?". To find out why your sales for May were down 50% from last year, you need to
learn more about what makes up the total sales amount you see on your screen. If you can drill down
into the products sold -- where they were sold, what customers bought them, what sales reps sold them --
you may be able to find out why your sales are down and hopefully change this. These features - drill
down, drill up, and pivoting - are found in OLAP products.
Arbor, Business Objects, Brio, Microstrategy, Information Advantage, Microsoft and Cognos all offer OLAP
tools.
Data mining tools use artificial intelligence to predict what may happen in the future. They use complex
statistical analysis to look at the data in your data warehouse, spot trends, and make an educated guess
at what will happen next. Examples of data mining tool companies are Datamind, NHC Software, and IBM.

Building a data mart/data warehouse step by step.


Briefly stated, the steps involved in building a data warehouse or data mart are these:
1. Analyze the business need and determine value.
2. Pick the scope of initial deployment.
3. Build the target model.
4. Map data from production to the mart.

5. Extract operational and external data and load the mart.


6. Optimize queries into the mart.
7. Deploy end user interface.
8. Evaluate value of the mart to the end user.

Our Case Study


This section will use the case study of a company that distributes a range of imported electronic
components throughout North America. The sales channel has both direct reps and distributors.
We initially started working with this company because they thought that they wanted to create a
consolidated reporting environment which merged information from several systems into one central
database. They initially were interested in allowing users access to their sales data for ad hoc queries.
Much of their in-house expertise is on the HP3000, so they elected to put a data mart on an HP3000.
DecisionVault, a data warehousing technology bundle specifically designed for HP3000 customers, was
selected to build and populate the data mart.
DecisionVault comprises BridgeWare from Taurus and Quest, OMNIDEX from DISC, BrioQuery from
Brio Technology, and the HP3000 with IMAGE/SQL. It allowed the customer to create a high-performance,
very cost-competitive data warehouse environment on an HP3000.

Our Case Study - Step 1 Analyze the business needs and determine
value.
We first met with our client's CIO and MIS staff. They initially were interested in a consolidated reporting
environment to satisfy many of the reporting requests coming from users. They estimated that they had
two people working full time writing reports and were hoping to free up these resources.
The goal of the needs analysis for a data warehouse or mart is to get executive sponsorship and a clear
definition of measurable business issues that can be addressed.
It is critically important for MIS to work with their users and find the pains they are facing. This
accomplishes three things:
1. MIS knows what the users want.
2. MIS has user buy-in on the project.
3. MIS will be able to get user funding without having to use the MIS budget.
Finding Pain: A fast way to failure in a Data Warehousing project is to ask the users for a list of the data
items they want to see on reports. Business issues or Pains that VPs face are what is needed. If we can't
find a Business issue, let's spend the time and money on something else. While IT or users may point to
areas of Pain, the only way to specify and value Pain is to uncover it personally with the top executive
team.

Find an executive Sponsor. Without Pain you won't find an Executive Sponsor. Even with Pain, you
need a firm commitment from one of the three or four top executives in the company (CEO, CFO, VP
Sales, VP Operations, VP Manufacturing).
The main reason for failure in Data Warehousing is lack of one or both of these key requirements!
Conducting interviews with users to uncover pains. Only after you have formal support from an
executive should you go to the users. Concentrate on the direct reports of your Executive Sponsor and
their staffs. Ask these interviewees whom they interact with in other functions on the Pain, and go to
interview those people with a view to uncovering new Pain that you can raise with another VP.
The Pain and Interview process is a skilled task. Do not attempt to do this without training or at least
without having read deeply into Kimball's material on the subject. Where you can, use an experienced
business consultant to help you.
In our example company, the VP of Sales was directly paid under a management by objective (MBO)
scheme. His targets included not only the traditional Sales measures, but also Gross Margin on Product.
He couldn't measure his progress accurately from any data on hand, so he couldn't be sure he was
claiming all of his due compensation.
Personal pain like this can be a strong motivator despite the VP's reluctance to articulate the dollar impact.
Underlying his personal issue was the knowledge that he was largely in the dark about who was selling
what, particularly sales to large customers who could buy, either from his salespeople or from his
distributors, with widely varying margin contributions.
In addition, the CFO suspected that gross product margins varied widely. From a prototype built on
billing data, he saw spreads on functionally similar products of up to 20%. He agreed immediately that the
average margin could be improved by 5% with appropriate knowledge and training. The 5% translated to
$5 million in EBIT, which would give an ROI of twenty times the cost of the initial Data Mart.

Our Case Study - Step 2 Pick the scope of initial deployment (data mart
vs. warehouse).
No matter how much ROI you uncover, be sure to start with one data mart, with one functional area; say
Billings or Bookings. Start with the simplest, get the mart operational and prove your ROI before you
even start the next mart. Small bites of a big meal prevent indigestion. Start with the end in mind, and lay
down your first data mart on a foundation that will later support other data marts. Don't extract directly
from production to the Mart, but by way of a staging area or central data warehouse that other data marts
can later draw on.

Begin by listing the potential marts. The process of uncovering an ROI will also uncover the functional
areas that merit a data mart, as well as the priority. In the example company, a number of areas promised
some ROI: Billings, Bookings, Backlog, Inventory and Finances. All of these are candidate marts that will
make up a company-wide data warehouse in time.

For each mart, you need to discover what elements are most important for analysis. This is done by
interviewing the users. You will probably find that the CFO is interested in sales and gross margins by
product and by all the descriptors that can apply to a product, including product family -- line, group,
subcategory, color, size, supplier and so on. Most products have upwards of 100 descriptors that can
reveal areas for margin improvement.
In dimensional modeling, you call the areas of analysis, such as customer and product, the DIMENSIONs.
The descriptors are called attributes, and they become fields in the dimension tables (data sets). Even
though you select one data mart to start, you define the dimensions across all the marts at the outset
because you want CONFORMED dimensions that all the marts can share. For example, you will need the
same product dimension for a billings mart as you will later want for a bookings mart and for a backlog
mart.

Listing the marts will also uncover the facts or numeric measures that underpin the ROI you are seeking.
Almost invariably, facts are numeric and not textual. Text indicators almost certainly belong in a dimension
as an attribute.
In the case of the CFO, gross margin down to the part number will be required. You need to discover how
to compute that margin. Is it list price minus cost, or invoiced price less cost? Is the cost acquisition cost,
or that cost plus carrying costs, or are you adding customer service costs by product, or should you
include freight and insurance?
For our project, numeric indicators were chosen for Billings; these are the items that we now call FACTS
and represent in a FACT table. Cost of Goods was also defined as acquisition cost.
Billings was the chosen area because it offered the most ROI in a relatively simple area of the business.
You can also be sure that the data is reasonably clean since customers generally complain if their
invoices are inaccurate.

Our Case Study - Step 3 Build the target model.


Building a target model can be broken into four steps, which we will cover in more detail:

Selecting the grain for the fact table

Selecting the dimensions

Selecting the individual facts for the fact table

Completing the design and implementing it

Select the grain for the Fact Table


The "grain" or "granularity" is the lowest level of detail at which facts are held.
Our VP of Sales was very interested in sales by customer by fiscal quarter since he gets paid that way.
But if your grain were that coarse - by customer by quarter - you would never be able to see analyses
by salesperson, by product, by week, or by invoice number. You generally want to hold facts at the
deepest level of detail that is possible within the limitations of the source systems providing the data.
From that base level, you can easily provide aggregates, perhaps by week, month, quarter and year.
In our project's Billings Mart, the details found on the individual lines of each invoice, supplemented by
fields from the invoice header, were chosen to provide the finest granularity and therefore the greatest
flexibility in analysis.

Select the Dimensions


When you determined the scope of the project in the section above, you listed all of the attributes by which
you think you will want to perform analysis, such as customer, geographic region, time, and product
descriptions. These are the dimensions or points on your star schema.
Now you select the subset of dimensions that is meaningful in describing and amplifying the meaning of
your selected Fact Table.

For our Invoice Lines Fact Table, our client required dimension tables representing Customers, Products,
Sales People, Addresses and Time.
You saw above some examples of the attributes for the Product dimension. Time in the form of a Periods
table allows every individual day that the Billings Mart may cover to be described, so you can analyze
billings by those attributes, down to the day. Each day clearly falls into a given calendar year, quarter,
month, and week. If fiscal year does not match calendar year, add fiscal year, quarter, month, and week.
You can also have attributes for a day that describe a season, say Back to School or Easter. You could
have an attribute for an annual shutdown, to explain why nothing happened in a given week.
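A minimal sketch of generating such a Periods dimension in Python is shown below; the fiscal-year start month and the season attribute are assumptions used only to illustrate the idea.

from datetime import date, timedelta

def build_periods(start, end, fiscal_start_month=7):
    """Yield one dimension row per calendar day with calendar and fiscal attributes."""
    day = start
    while day <= end:
        # Fiscal year assumed to start in July and be named for the year it ends in.
        fiscal_year = day.year + 1 if day.month >= fiscal_start_month else day.year
        fiscal_month = (day.month - fiscal_start_month) % 12 + 1
        yield {
            "cal_date":       day.isoformat(),
            "cal_year":       day.year,
            "cal_quarter":    (day.month - 1) // 3 + 1,
            "cal_month":      day.month,
            "cal_week":       day.isocalendar()[1],
            "fiscal_year":    fiscal_year,
            "fiscal_quarter": (fiscal_month - 1) // 3 + 1,
            "fiscal_month":   fiscal_month,
            "season":         "Back to School" if day.month in (8, 9) else None,
        }
        day += timedelta(days=1)

# Example: rows for the first few days of July 1999
for row in build_periods(date(1999, 7, 1), date(1999, 7, 3)):
    print(row)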

Select the individual facts for the Fact Table


The Fact Table will contain pointers to the Dimensions that were selected. In this case, an Invoice Header
tells you what customer was billed, at what billing address, and who took the order. These pointers
(foreign keys in database talk) point to the rows (records) in the Customer, Address and Sales People
Dimension Tables. These foreign keys were put in the detailed row (record) in the Fact Table. A detailed
fact row (record) for each invoice number and invoice line number was the chosen grain for this data
mart.
You now need to add the numeric indicators, the individual facts for each detailed fact row. You know that
quantity of product billed and the individual monetary amounts on an invoice line will always sum up
correctly to the invoice level or the month level, so those facts are fully additive, and you can insert them
in your row design. Our billings mart allowed for the quantity, the list price, and the billed price.
Note that some numeric indicators, such as levels or balances, are not fully additive across all
dimensions. For example, adding a percentage discount as a fact in our invoice line would be
meaningless when we summarize. Averages and percentages are not additive at all and shouldn't be
facts. You can re-compute ratios whenever you need them, so we didn't include any of these measures.
You can take numeric data as you find it in the invoicing system, such as quantity and billed amount, and
you can add computed facts if you want. In this case, you might find it useful to compute the difference
between list price and billed price as the Discount Given amount, and the difference between the billed
price and Cost of Goods Sold as the Gross Profit amount. Storing these amounts in the Fact record saves
the analysis programs the effort of computing them on the fly.
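A minimal sketch of deriving those computed facts at load time follows (in Python; the field names track the discussion above, but the record layout itself is hypothetical).

def build_fact_row(invoice_line):
    """Compute the derived measures once at load time so queries never have to."""
    qty    = invoice_line["quantity"]
    listp  = invoice_line["list_price"]
    billed = invoice_line["billed_price"]
    cogs   = invoice_line["cost_of_goods"]       # acquisition cost, per Step 2
    return {
        "quantity":       qty,
        "list_amount":    qty * listp,
        "billed_amount":  qty * billed,
        "discount_given": qty * (listp - billed),   # list price minus billed price
        "gross_profit":   qty * (billed - cogs),    # billed price minus Cost of Goods Sold
    }

print(build_fact_row({"quantity": 10, "list_price": 5.00,
                      "billed_price": 4.50, "cost_of_goods": 3.00}))
# discount_given = 5.0, gross_profit = 15.0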
Complete the design and implement it.
Our design now looked like this. ERwin can be used to help visualize the model.

From here, we implemented surrogate keys instead of the actual customer and product numbers to link
the tables. As you can imagine, there are many reasons not to use the "real" numbers that identify
customers and products or any of the foreign keys that point to the dimension tables. You may re-use
those key values in the transactional system while the number still points to old data in the data mart.
You may add a division to the data mart that has a customer under a different "real" number. Or, say you
have acquired a company for their customers, and you want to count the sales by both companies under
one customer's name.
You may want to have two rows of dimension data that describe different attributes for different periods.
Say your rep sells a new customer. In the data mart, you assign that sale to that rep, and his team, group
and division. Two years later, you re-assign the customer to another rep and division. If you just change
the one existing row, that customer will disappear from the original rep's sales numbers and appear in the
new rep's numbers, even though they might not have worked for the company at that time.
For disk space and speed of join processing, it's more efficient to use small integers for your keys
rather than the long text strings that the "real" numbers may have.
To neatly handle these situations, use surrogate keys that are artificial and never seen by the user. This
adds some complexity each time you want to add a new dimension row, say to handle the sales rep
problem above. It is well worth the effort. You will want to implement surrogate keys for all the dimension
tables, including the Time dimension.
CUST_ID and CUST_ADDR_ID, listed above, are examples of surrogate keys.
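The lookup-or-assign logic behind a surrogate key can be sketched as follows (in Python, with illustrative names; the BridgeWare script in Step 5 implements the same pattern against the key datasets).

_next_id = 0
_key_table = {}   # (source_system, real customer number) -> surrogate CUST_ID

def surrogate_cust_id(source_system, real_cust_nbr):
    """Return the existing surrogate key, or assign the next small integer."""
    global _next_id
    key = (source_system, real_cust_nbr)
    if key not in _key_table:
        _next_id += 1
        _key_table[key] = _next_id
    return _key_table[key]

# The same "real" number coming from two systems maps to two surrogates,
# while re-loading the same customer always returns the same small integer.
print(surrogate_cust_id("SRCAPPL", "HP-001"))    # 1
print(surrogate_cust_id("ACQUIRED", "HP-001"))   # 2
print(surrogate_cust_id("SRCAPPL", "HP-001"))    # 1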
The last step is to physically build the empty files for your Fact and Dimension Tables, so that you can
load and test them with data. In contrast to designing a data warehouse using a relational database, we
did not need to design and build summary tables to aggregate the data at this point, since our client used
OMNIDEX as part of DecisionVault. OMNIDEX will provide the data summaries at high speed using its
own summary indexes.

Our Case Study - Step 4 Map the data from production to Mart
Now that we had built the target mart schema it was time to populate it.
The first step of every data movement project is to develop a mapping document. The mapping document
details where the data is coming from; transformations, editing rules, lookups needed; and what needs to
happen if an error occurs. This document becomes the specification for the extraction routines.
The mapping document is done after the data model for the data mart or warehouse is complete. For the
purposes of illustration, let's look at a simplified version of our data model. In this version, we have only a
single dimension table and the fact table.

Loading of the data mart begins with the dimensions, so we started there.

Create a Customer Dimension Mapping Document


Working with the data analyst most familiar with the source data, we found a source for each of the target
fields. To develop the transformation rules, it is sometimes helpful to print the data for this dimension, e.g.
a list of all customers. Normally seeing sample data will help identify any problem areas, e.g. customers
without addresses, etc.
Together with the data analyst, we developed a minimum criterion for the acceptance of a dimension
record. Our client wanted to accept all customers. Don't worry; moving all the data over to the data mart is
a self-correcting problem. As executive management begins reviewing the data and finds junk, they will

either impose "their" rules or demand that the source data be corrected. Either way, the data will get
"cleaned" up.
Our project's mapping document was developed in Excel. You could use any tool that works for you.
Excel is simple to use as it is easy to "paste" sample data onto a worksheet, and it is easy to email the
documents between the parties involved in developing the data warehouse. Below is our mapping
document for the customer dimension.

You can see in this document that it took four source tables to provide enough data to populate the dimension
table. For those fields that required data from other tables, we made sure to include information on how to
make those linkages from our primary source table (cust) to the other source tables (parent, cust_seg,
and cust_commcn).
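As a rough illustration only (the real document was an Excel spreadsheet), a few of its rows might be captured like this; the source fields follow the load script shown in Step 5, and the transformation notes are paraphrased.

# Illustrative rows from a customer-dimension mapping document, expressed as data.
CUSTOMER_MAPPING = [
    {"target": "CUSTOMERS.CUST_NBR",     "source": "CUST.USER_CUST_NO",
     "rule": "copy as-is"},
    {"target": "CUSTOMERS.CUST_NAME",    "source": "CUST.CUST_NAME",
     "rule": "copy as-is"},
    {"target": "CUSTOMERS.ACTIVE",       "source": "CUST_SEG.ACTIVE_FLAG",
     "rule": "join CUST_SEG on CUST_NO where SEG_NO = 317"},
    {"target": "CUSTOMERS.PHONE_NBR",    "source": "CUST_COMMCN.PHONE_NO",
     "rule": "join CUST_COMMCN on CUST_NO"},
    {"target": "CUSTOMERS.PARENT_NAME",  "source": "CUST (parent row).CUST_NAME",
     "rule": "join CUST on WITHIN_CUST_NO; default to the customer's own name"},
    {"target": "CUSTOMERS.LAST_UPDATED", "source": "(none)",
     "rule": "stamp with the load date and time"},
]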
If there is decoding involved, either create a lookup table or provide another sheet that details the
translation, e.g. this source code gets this description. The mapping document should be complete
enough that an outside programmer could write code from the document.
You should write the mapping documents for all of the dimensions first and then move to the Fact
Table(s). The Fact Table mapping will assume that all of the dimensions exist in the data model before the
fact is added. In other words, you shouldn't have to check if the customer exists, because there shouldn't
be any invoices added that do not have a customer. If your source system doesn't check this referential
integrity, either as part of the application or as part of the database structure, you will need to include the
rules needed to ensure that the dimensional data is in place prior to writing a fact row. The mapping
document is a "working" document that will change as you discover different aspects of the data, both
before and during the load process.

Our Case Study - Step 5 Extract operational and external data and load
the mart.
Once the mapping document for a particular dimension is complete, it is time to begin the development of
the extraction routines. All data marts begin with an initial load that loads the entire applicable contents of
the transactional source application into the data mart. This is a bulk load process that starts with the
dimensions and then loads the fact tables.
As with most initial loads for data marts, we started with the dimensions first. The customer data is key to
the model and is the cornerstone of all of the information within the mart. When we talked to the users
about their customer data, they indicated that the data was clean for a number of reasons: product had
been shipped to the address in the database; the application has an automatic de-duping feature to
ensure that duplicate catalogs are not sent to the same address; and address fields are automatically
updated by the application.

Deciding on a refresh frequency:


Once the initial load is complete, deciding on a methodology to keep the data mart up-to-date is essential.
The choices are to reload the data every month, week, or night, or to develop an incremental load scheme. The
advantage of the reload process is that the same procedures that did the initial load can be used without
any changes. The disadvantage is that the time window available to reload the data may not be long
enough. In addition to this problem, the individual incremental changes, e.g. inventory changing from 10
to 20 with a net change of 10, are lost. You will just be left with the end-of-day inventory levels. Depending
on what kinds of questions your data is trying to answer, this may not be acceptable.
Our client started with a full load process which was executed weekly. As the users started really using
the system, they asked for a daily refresh. They wanted to query against yesterday's results in the billings
mart; they were using it for reporting as well as analysis. In their case, they did not have enough time
available overnight to add another process which moved millions of records across the network and into
the mart. They had to have an incremental update process.
This was one reason they decided to use DecisionVault; the technology bundle allows for both types of
updating strategies. Taurus' BridgeWare product captures the transactions that have modified the
selected databases or files. The transactions can then be applied to the data mart. This update can be
done either in real-time or at some interval that you set up.

Initial data load:


Let's look at the first dimension, CUSTOMERS. The source system for this application was an Oracle
database. This meant that the data was already edited for type, meaning you could depend on numeric
data having numeric values, dates having valid date values, and character data being upshifted and
stripped of trailing spaces. The data cleansing issues were then reduced to finding duplicate customers.
During your review of the customer data, you might find duplicated customers. Our client elected to
load all duplicate customers and force data correction on the source side.
BridgeWare data mapping and movement could have been done either by using the Forklift GUI
development tool or by developing a script in the editor. When we developed scripts, we tried to break them
into small pieces (modules) and then put all the modules together at the end. The scripting broke into
three major components: accessing files, selecting data, and the data movement actions. Let's examine
each of these components separately.

The scripting language has all the power of a third generation programming language. The scripting
language is broken into three major components: accessing files, selecting data, and the action to be
taken on the data. To access a file, you either create it or open it. Open is used for those files that already
exist. In this situation, you are accessing an Oracle database (your source) and an Image database (your
target). The syntax of the open statement is:

open TAG [remote MACHINE LOGININFO] databasetype DATABASEPARMS.
In our case, the client was accessing their source locally and writing to the target remotely:
open PROD ORACLE PROD/pass &
home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
pass=secret mode=1
Once the files are open for access, we tell BridgeWare what data we want from those files. In this
situation, we wanted all of the customer information. There was no selection or filtering done. Selecting or
reading is accomplished with the read statement. The syntax of the read statement is:
read TAG = DBTAG.DATASET [for SELECTION].
In this case, we read the CUST tables from the source database called PROD in the open statement. The
read statement was:
read SRC_CUST = PROD.CUST
As you remember from the mapping document, we needed to read four source records to populate the
single data mart table, CUSTOMERS. BridgeWare allows the creation of local record variables to hold the
various pieces until you have a completed record. To create a local record variable, use the define
statement. The define statement can be told to use the record layout of a file that is already open using
the using clause. To save us from having to define each item, we used this clause:
define CUST using DW.CUSTOMERS
As the various data become available, we set the various fields to their appropriate values using the
setvar statement. For example:
setvar CUST.PARENT_ID = null_id
setvar CUST.PARENT_NBR = null_nbr
setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
When we had the various pieces, we could copy the record from the local record variable to the target
dataset. This is done using the copy statement. The syntax for the copy statement is: copy tag to
dbtag.dataset. So for our example it was:
copy cust to dw.customers
An incomplete script to read data from our source and copy it to our target looks like this:
open PROD ORACLE PROD/pass &
home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image dw &
pass=secret mode=1

define CUST using DW.CUSTOMERS


read SRC_CUST = PROD.CUST
setvar CUST.PARENT_ID = null_id
setvar CUST.PARENT_NBR = null_nbr
setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
copy cust to dw.customers
endread
Let's expand the script to match our project's real world.
open PROD ORACLE PROD/pass &
home=/oracle/product/723 sid=prod
open DW REMOTE hp3k user=mgr.dw image &
dw pass=secret mode=1
define CUST_KEY using DW.CUSTOMER_KEY
define CUST using DW.CUSTOMERS
define PARENT_KEY using DW.PARENT_KEY
define PARENTS using DW.PARENTS
define add_entry : boolean
define error_flag : boolean
define no_id_flag : boolean
define null_nbr : oracle varchar2(12) value 'NONE'
define null_id : oracle number(11)
**
*Add any new Customer Numbers to the
* Customer key table
*(Generate the Surrogate Customer Id)
**
read SRC_CUST = PROD.CUST
setvar add_entry = $true
read C_KEY = DW.CUSTOMER_KEY &
for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO &
and C_KEY.SOURCE_SYSTEM = 'SRCAPPL'
setvar add_entry = $false
endread
if add_entry = $true then
setvar CUST_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
setvar CUST_KEY.SOURCE_SYSTEM = 'SRCAPPL'
setvar CUST_KEY.LAST_UPDATED = $now
copy CUST_KEY to DW.CUSTOMER_KEY
endif
setvar CUST.CUST_ID = 0
read C_KEY = DW.CUSTOMER_KEY &
for C_KEY.CUST_NBR = SRC_CUST.USER_CUST_NO
setvar CUST.CUST_ID = C_KEY.CUST_ID
endread

setvar no_id_flag = $true


if CUST.CUST_ID > 0 then
setvar no_id_flag = $false
setvar CUST.CUST_NBR = SRC_CUST.USER_CUST_NO
read SRC_CUST_SEG = PROD.CUST_SEG &
for SRC_CUST_SEG.CUST_NO = SRC_CUST.CUST_NO &
and SRC_CUST_SEG.SEG_NO = 317
setvar CUST.ACTIVE = SRC_CUST_SEG.ACTIVE_FLAG
endread
read SRC_CUST_COMMCN = PROD.CUST_COMMCN &
for SRC_CUST_COMMCN.CUST_NO = SRC_CUST.CUST_NO
setvar CUST.AREA_NBR = SRC_CUST_COMMCN.AREA_CD
setvar CUST.PHONE_NBR = SRC_CUST_COMMCN.PHONE_NO
endread
setvar CUST.PARENT_ID = null_id
setvar CUST.PARENT_NBR = null_nbr
setvar CUST.PARENT_NAME = SRC_CUST.CUST_NAME
read SRC_PARENT = PROD.CUST &
for SRC_PARENT.CUST_NO = SRC_CUST.WITHIN_CUST_NO
setvar add_entry = $true
read P_KEY = DW.PARENT_KEY &
for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
setvar add_entry = $false
endread
if add_entry = $true then
setvar PARENT_KEY.CUST_NBR = &
SRC_PARENT.USER_CUST_NO
setvar PARENT_KEY.SOURCE_SYSTEM = 'SRCAPPL'
setvar PARENT_KEY.LAST_UPDATED = $now
copy PARENT_KEY to DW.PARENT_KEY
endif
setvar CUST.PARENT_ID = 0
read P_KEY = DW.PARENT_KEY &
for P_KEY.CUST_NBR = SRC_PARENT.USER_CUST_NO &
and P_KEY.SOURCE_SYSTEM = 'SRCAPPL'
setvar CUST.PARENT_ID = P_KEY.PARENT_ID
endread
setvar add_entry = $true
read PRNTS = DW.PARENTS &
for PRNTS.PARENT_ID = CUST.PARENT_ID
setvar add_entry = $false
endread
if add_entry = $true then
setvar PARENTS.PARENT_ID = CUST.PARENT_ID
setvar PARENTS.PARENT_NBR = &
SRC_PARENT.USER_CUST_NO
setvar PARENTS.PARENT_NAME = &
SRC_PARENT.CUST_NAME
setvar PARENTS.LAST_UPDATED = $now

copy PARENTS to DW.PARENTS


endif
setvar CUST.PARENT_NBR = SRC_PARENT.USER_CUST_NO
setvar CUST.PARENT_NAME = SRC_PARENT.CUST_NAME
endread
setvar CUST.CUST_NAME = SRC_CUST.CUST_NAME
setvar CUST.CUST_TYPE = SRC_CUST.CUST_TYPE_CD
setvar CUST.DUNS_NBR = SRC_CUST.DUNS_NO
setvar CUST.DUNS_RATING = SRC_CUST.DUNS_RTNG
setvar CUST.TAX_ID = SRC_CUST.FED_TAX_ID
setvar CUST.SIC_CODE = SRC_CUST.SIC_CD
setvar CUST.LAST_UPDATED = $now
copy CUST to DW.CUSTOMERS
endif
if no_id_flag = $true then
print 'Problem during load for customer ', &
src_cust.cust_no
endif
endread
This script from our example first opens the source and target databases. It creates local record variables
for those tables to which you are going to add records. There are four tables: cust_key (the customer's
surrogate key), customers (contains the data as described in the mapping document in the previous
section), parents (the customer's parent, e.g. HP is the parent for all the various division locations
of HP), and parent_key (the parent's surrogate key). Notice that surrogate keys have been used.
Surrogate keys are used when data can come from more than one system using the same "customer
number" but meaning two different customers. The surrogate creates a unique identifier for the customer
within the data mart.
Next, we defined local variables that are used as switches and intermediate storage.
The script then reads our source. It checks to see if this customer already exists in the data mart by
checking the surrogate key for customer, customer_key. If the customer doesn't exist, create the
surrogate key and write it out. Next go and get the status information from the cust_seg table in your
source database and put it in your record variable for customer. Do the same thing for the customer's
phone number that you get from the cust_commcn table.
The next logic checks for a parent company. If there is a parent, it uses the parent's key. If there is no
parent, it creates the parent record and puts an entry in the parent surrogate key dataset.
Finally, assign the remaining fields and copy your record to the customers dataset in the data mart.

Incremental load:
Once the data mart was loaded and began being used, it was clear that the customer wanted a daily
refresh of data. They did not want to perform a full refresh each night because they did not have a batch
window large enough for a full refresh.
DecisionVault's BridgeWare combines DataBridge with a real-time change-detect component for the
HP3000. On relational databases it is easy to capture changes as they are posted to the database using
triggers. On the HP3000, Image does not have such a feature. Although Image has logging, the log file
doesn't capture enough information to apply changes to the data mart.
BridgeWare's technology is best explained using a diagram.

BridgeWare, a joint development with Quest Software, combines Quest's change-detect technology with
Taurus' ETL. Quest's component, called SharePlex, detects changes in data files on the HP3000 and writes
them to a message file. SharePlex works at the operating system level, detecting changes before they
reach the data file. DataBridge is then used to define how the captured changes should be transformed
and moved.
The message file holds the records of the changes, and contains a header record with the name of the file
changed, the type of transaction, a sequence number, and the date and time the transaction occurred.
Each type of transaction (insert, update, delete) always writes a header plus the appropriate record
images. An insert has the record image for the record inserted. A delete has the record image of the
record before it was deleted. An update has the before and after record image.
As our client moved to an incremental load process, they had to consider how to handle each of the
transaction types: adds, changes, and deletes. The rest of the load process remained the same.
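A minimal sketch of how such captured changes might be applied follows (in Python). The record layout mirrors the header-plus-record-images description above, but the actual message-file format is SharePlex's own; the dict form here is purely illustrative.

def apply_change(target, change):
    """Apply one captured transaction to a target table, modeled as a dict keyed by customer number."""
    kind = change["type"]                  # 'insert', 'update' or 'delete'
    if kind == "insert":
        rec = change["after"]              # an insert carries the inserted image
        target[rec["cust_no"]] = rec
    elif kind == "delete":
        rec = change["before"]             # a delete carries the image before deletion
        target.pop(rec["cust_no"], None)
    elif kind == "update":                 # an update carries both images
        target[change["before"]["cust_no"]] = change["after"]

mart = {}
apply_change(mart, {"type": "insert",
                    "after": {"cust_no": "C100", "cust_name": "Chase"}})
apply_change(mart, {"type": "update",
                    "before": {"cust_no": "C100", "cust_name": "Chase"},
                    "after":  {"cust_no": "C100", "cust_name": "Chase Manhattan Bank"}})
print(mart)   # {'C100': {'cust_no': 'C100', 'cust_name': 'Chase Manhattan Bank'}}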

Our Case Study - Step 6 Optimize queries into the mart.


Once data was loaded into the IMAGE/SQL data warehouse using BridgeWare, it was almost ready to be
queried, analyzed, and retrieved by the end-users using BrioQuery. At this point, performance became an
issue because users can be quite demanding about query retrieval times (think of user demands for
response time on the web).
Query retrieval times are very important; according to Alan Paller of TDWI (The Data Warehousing
Institute), the #1 complaint of data warehousing managers is:
"I should have paid more attention to performance."

We evaluated our options to improve query performance. According to data warehousing literature,
standard options to satisfy user query requests are to:

Read serially through large "fact" tables

Build summary tables (typically with relational databases)

Load the data into a MDD (multidimensional database) and query against it

Serial reads through large tables or data sets are slow and are not viable for any significant amount of
data. Even with extremely large expenditures on CPUs or parallel processors, they are typically not fast
enough for online reporting with multiple concurrent users competing for resources.
At first glance, using a relational database and building summary tables is a possibility. Summary tables
store precalculated and sorted values that can answer specific queries. But we were worried that the
number of tables needed to cover the queries that came out of our user interviews would be prohibitive,
because a different table must be built for almost every query.
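For instance, two of the requests from the interviews would already need two different precomputed tables, roughly like the following. The SQL is illustrative only; table and column names are assumptions.

# Illustrative only - hypothetical table and column names.
# Each distinct grouping would need its own precomputed summary table:
sales_by_region = """
    CREATE TABLE sales_by_region AS
    SELECT s.region_code, SUM(il.billed_price) AS total_sales
    FROM invoice_lines il JOIN salesreps s ON s.salesrep_id = il.salesrep_id
    GROUP BY s.region_code
"""

qty_by_product_month = """
    CREATE TABLE qty_by_product_month AS
    SELECT il.product_id, t.cal_year, t.cal_month_nbr, SUM(il.sales_qty) AS total_qty
    FROM invoice_lines il JOIN periods t ON t.activity_date = il.activity_date
    GROUP BY il.product_id, t.cal_year, t.cal_month_nbr
"""
# ...and so on for every other grouping the users ask about.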
As mentioned above, multidimensional databases are limited in the number of dimensions they can
handle, and they cannot be updated. Typically only 5 or 6 dimensions can be used before the build and
refresh times become unmanageable and another data mart must be built. This was not going to meet our
users' requirements, since they wanted to expand the data mart into a data warehouse over time, and the
refresh times would outgrow their nightly batch window.

Add advanced indexes


This is where DISC's OMNIDEX indexing software came in. OMNIDEX is included in DecisionVault, and
our client had used it before as a high speed keyword search engine for the HP3000. In the past few
years, OMNIDEX had been specifically enhanced for fast and flexible access into data warehouses and
data marts. It provides additional access that is not possible with IMAGE/SQL (or Oracle or SQL Server
databases) alone.
OMNIDEX accomplishes high-speed access for ad hoc queries by supplementing the existing database
with Multidimensional and Aggregation Indexes that are specifically designed for decision support queries.
OMNIDEX provides:

High-speed data retrieval for query optimization

ODBC/Web middleware residing between the database and popular end-user tools such as Brio

Multidimensional analysis, with an unlimited number of dimensions

Dynamic data aggregations, without the need for summary tables

Instant drill-down to detail

Optimized for star or snowflake schema

Fast index build rate and compressed disk space

To show how OMNIDEX was used to deliver the query performance we needed on our data mart, let's
look at the database's star schema design and the retrieval requirements.
The current model has a basic sales fact table and five (5) dimensions (refer again to the star schema
diagram). The dimensions that can be sliced and diced in any combination are:
1. Customers
2. Addresses
3. Products
4. Periods
5. Salespeople
For our client, the primary purpose of the Billing/Sales Analysis data mart was to analyze customer
characteristics and buying patterns, such as:
1. Customer demographics
2. Product performance
3. Sales rep performance
4. Regional preferences
Let's see how OMNIDEX handled multidimensional queries, data aggregations, and drill-down to detail for
this client.

Multidimensional Queries
As with most data warehouse projects, we determined that most of the data warehousing queries which
needed to be made were complex and multidimensional in nature, using multiple criteria against multiple
fields or across multiple dimensions. For example, our end-users rarely wanted to access data by only
one field or dimension, such as finding the number of customers in the state of California. They wanted to
ask complex questions such as how many customers in the state of CA have purchased product B or D in
the last year.
Our client's sales manager wanted to see the sum of sales for his sales reps for their new products during
a certain time period such as last month. The inventory managers might want to see the quantity sold of a
product for the last month and compare it to last year's. A marketing manager might want to see the sum
of sales for some particular states and customer types.
We knew that the performance of these multidimensional queries in a data warehouse environment using
a star or snowflake database design would be greatly enhanced by OMNIDEX multidimensional indexes.
OMNIDEX Multidimensional Indexes are specially designed for unlimited multidimensional access,
allowing complex queries using any number of criteria or dimensions, which was exactly what we needed
to satisfy our requirements.
We placed OMNIDEX Multidimensional indexes on each column in the dimension tables so the end-users
could query by any combination. When an end-user issues a query, a qualifying count based on index
access only is returned without touching the actual data. This speeds the query because it is much more
efficient to service a query request by simply looking in an index or indexes instead of going to the
primary source of data (the database itself). This is true for both pre-defined and ad hoc queries.
Because the end-users quickly obtain the results each time they perform an ad hoc search, they can
decide whether to add criteria to further qualify the selection, aggregate it, drill down into the detail data,
or start over. With the multidimensional indexes applied, our client's end-users were able to query
repeatedly in seconds or minutes instead of hours or days, even with large amounts of data. This allowed
them to interact with and explore the data in unpredictable ways, very quickly. Only when the end-users
wanted to view the detail data was I/O performed against the database.
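In SQL terms, a question like "how many customers in CA purchased product B or D in the last year" corresponds roughly to the qualifying-count query sketched below. The table and column names follow the star schema described here, but the join keys and exact SQL are assumptions, not the SQL actually generated by the query tool.

# Illustrative qualifying-count query; join keys and exact SQL are assumptions.
qualifying_count = """
    SELECT COUNT(DISTINCT c.cust_nbr)
    FROM customers c
    JOIN addresses a      ON a.cust_nbr = c.cust_nbr
    JOIN invoice_lines il ON il.cust_nbr = c.cust_nbr
    JOIN products p       ON p.product_id = il.product_id
    JOIN periods t        ON t.activity_date = il.activity_date
    WHERE a.state_province_code = 'CA'
      AND p.prod_line IN ('B', 'D')     -- "product B or D"
      AND t.cal_year = 1998             -- placeholder for "the last year"
"""
# With multidimensional indexes on the dimension columns, OMNIDEX returns this
# count from the indexes alone, without touching the fact table.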

Data Aggregation
In addition to multidimensional queries, our client's end-users wanted to see data summaries (e.g.,
COUNT, SUM, AVERAGE, MINIMUM, MAXIMUM) such as total sales dollars, number of customers,
average quantity, etc., and group them by a field such as sum of sales by region, or average quantity of
product line B sold by sales rep.
OMNIDEX performs data summaries by using its Aggregation indexes only, summarizing quantities or
amounts dynamically at the time of the query. Up to 1 million values per second can be queried.
We placed an OMNIDEX aggregation index on the fields to be summarized and were able to query at
these very high speeds while still satisfying our client's requirement for heavy ad hoc inquiries.
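As a rough illustration, "average quantity of product line B sold by sales rep" corresponds to a query of the following shape; with the aggregation index on the metric field, OMNIDEX answers it from the index rather than from a summary table. Names are assumptions.

# Illustrative aggregation query (names are assumptions).
avg_qty_by_rep = """
    SELECT il.salesrep_id, AVG(il.sales_qty) AS avg_qty
    FROM invoice_lines il JOIN products p ON p.product_id = il.product_id
    WHERE p.prod_line = 'B'
    GROUP BY il.salesrep_id
"""
# The aggregation index on the quantity metric answers this dynamically,
# with no summary table to build or refresh.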

Drill-down to detail
The multidimensional and aggregation indexes which we set up with the client contain pointers back to
the actual data in the fact table, so the underlying detail data can be instantly accessed and displayed to
the user at any time. This means data can be kept at any level of granularity, quickly aggregated, and
then the detail displayed - a capability that is lost when using multidimensional databases or summary
tables with relational databases.
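Drilling into one aggregated group then amounts to fetching only the matching detail rows, roughly as sketched below (again with assumed names; the rep id is hypothetical).

# Illustrative drill-down: fetch only the detail rows behind one aggregated group.
detail_for_group = """
    SELECT il.*
    FROM invoice_lines il JOIN products p ON p.product_id = il.product_id
    WHERE p.prod_line = 'B' AND il.salesrep_id = 'S047'   -- hypothetical rep id
"""
# Only the fact records selected through the index pointers are read from disk.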

OMNIDEX Installation
Once the star schema database is designed, the general query needs are determined, and data is
loaded, OMNIDEX is ready to be installed. To enhance this client's star schema with OMNIDEX indexes,
an "environment catalog" was set up that defines the database layout (like a data dictionary).
Once the environment catalog was created and compiled, an OMNIDEX consultant worked with the
client's staff to further analyze the query needs. They first determined what snowflake tables would
optimize the query performance for "Group By" fields from large data sets to be used in data
aggregations.
Based on the cardinality of the data (the number of possible values per field) and the size of the
dimension tables, snowflake tables were created to benefit this Sales Analysis data mart. (OMNIDEX has a
program called SNOWGEN that generates the snowflake tables automatically.)

After the snowflakes were created and added to the environment catalog, the OMNIDEX installation job
for the data mart was defined based on the overall query needs. First, all of the fields in the
dimension tables were defined as multidimensional keys (MDKs):
CUSTOMERS: CUST_NBR, CUST_NAME, PARENT_ID, PARENT_NBR, PARENT_NAME, CUST_TYPE, ACTIVE, DUNS_NUMBER, DUNS_RATING, TAX_ID, SIC_CODE, AREA_NBR, PHONE_NBR
ADDRESSES: CUST_NBR, CUST_ADDR_NBR, ADDRESS_LINE_1, ADDRESS_LINE_2, ADDRESS_LINE_3, CITY_NAME, STATE_PROVINCE_CODE, POSTAL_CODE, COUNTRY_NAME
SALESREPS: SALESREP_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, FULL_NAME, OFFICE_CODE, DISTRICT_CODE, REGION_CODE
PRODUCTS: PRODUCT_ID, SKU, PART_DESC, STATUS, PROD_LINE, PRODUCT_TYPE, PROD_GROUP, VENDOR, DIST_PART, QUALITY, INTRINSICS, PACKAGING, ORDER_QTY, MIN_ORD_QTY, MULTIPEL_QUT, NEC_BUDGET_1, NEC_BUDGET_2, NEC_BUDGET_CODE, NEC_PART_NBR, NEC_PLANNER, NEC_PKG_TYPE, NEC_PROD_LINE, NEC_STATUS, NEC_STD_BUS_UNIT, FORECAST_CLASS, DS_ABC_CODE
PERIODS: ACTIVITY_DATE, CAL_YEAR, CAL_QUARTER, CAL_MONTH_NAME, CAL_MONTH_NBR, CAL_MONTH_ABBREV, CAL_WEEK_NBR, DAY_NBR, DAY_NAME, FISCAL_YEAR, FISCAL_QUARTER, FISCAL_WEEK, FISCAL_PERIOD, PUBLIC_HOLIDAY, NON_WORKING_DAY, SPECIAL_DAY1, SPECIAL_DAY2
Then the metric fields to be aggregated were specified from the INVOICE LINES fact table: sales qty, list
price, COGS, billed price, discount given, and gross profit.

Seamless Integration
Once the indexes were created and loaded, it was time to install the OMNIDEX Client Server software on
each user's PC, create some data source definitions, and invoke a "listener" process on the HP3000
server. High-performance queries then became available to the users through ODBC-compliant analysis
and reporting tools.
When a pre-defined or ad hoc query or report issues a request for information, ODBC passes the query
through the OMNIDEX search engine and SQL Optimizer, which accesses the indexes and passes the
qualifying count or data aggregation back to the PC query or reporting tool. The OMNIDEX
Multidimensional and Aggregation indexes allowed a broad range of queries to be optimized with one
set of indexes that took much less build time and disk space than other "standard" data warehousing
options.
If the user requests the actual detail data, the pointers for the selected records are used to quickly
retrieve only those records of interest from the sales fact data set. The OMNIDEX SQL optimizer ensures
that all queries against a star or snowflake schema are completed in the fastest manner, and never cause
slow, I/O-intensive serial reads of the data.
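For a feel of the client side, a generic ODBC session from Python would look like the sketch below. The DSN name is hypothetical and pyodbc is simply a convenient ODBC driver manager for the example; our client accessed the listener through BrioQuery rather than through hand-written code.

# Generic ODBC access sketch (hypothetical DSN; our client used BrioQuery, not Python).
import pyodbc

conn = pyodbc.connect("DSN=SALES_MART")   # data source pointing at the OMNIDEX listener
cur = conn.cursor()
cur.execute("""
    SELECT s.region_code, SUM(il.billed_price) AS total_sales
    FROM invoice_lines il JOIN salesreps s ON s.salesrep_id = il.salesrep_id
    GROUP BY s.region_code
""")
for region, total in cur.fetchall():
    print(region, total)
conn.close()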

Our Case Study - Step 7 The end user interface and deployment
Once we had clean, certified data in our data mart with indexes installed, the Billings data mart was ready
to hand over to the VPs and their staff. We needed to choose a business intelligence tool that users would
like. We knew that any ODBC-compliant tool can access an IMAGE database through the OMNIDEX
interface. Our client chose BrioQuery because of its short learning curve and ease of use.
Brio Technology's range of products allows users to access the data mart from their desktops or from their
web browsers. BrioQuery generates SQL requests automatically that are sent to OMNIDEX so that
OMNIDEX can quickly retrieve the data from the Data Mart. The data comes back to the user initially in
row and column format, ready to be graphed and analyzed.
This sample shows the data requested from the Data Mart on the "Request" line, and below that the
6,999 rows of data that were returned.
Our client chose to develop standard analyses using charts, pivot spreadsheets, and serial reports that
cover most of the detail the VPs want to focus on to meet their ROI goals. Fortunately, Brio's products
are easy enough to use that an Excel user can create new analyses very quickly.
Here is an example bar chart: the analysis shows how the salespeople have been discounting each
quarter. We know that the VP of Sales and the CFO believe they can reduce the discounting rate from an
average of 14% to at least 12% if they can see where the problems are. That 2% is $2 million in sales and
margin at the end of the year - with no effort. We can "drill into" this chart to find out which reps are the
worst offenders and whether any customer or product is particularly responsible. Managers can then act
to correct the worst offenders.
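The numbers behind that chart reduce to a discount-rate roll-up over the fact-table metrics listed earlier (discount given and list price); a sketch of that query, with assumed names, follows. Note that a two-point drop being worth $2 million implies annual billings on the order of $100 million.

# Illustrative discount-rate roll-up per sales rep per quarter (names are assumptions).
discount_by_rep = """
    SELECT il.salesrep_id, t.fiscal_year, t.fiscal_quarter,
           SUM(il.discount_given) / SUM(il.list_price) AS discount_rate
    FROM invoice_lines il JOIN periods t ON t.activity_date = il.activity_date
    GROUP BY il.salesrep_id, t.fiscal_year, t.fiscal_quarter
    ORDER BY discount_rate DESC
"""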
Here's an example of "Green Bar Paper" reporting brought up to date. Note that the output contains
interactive graphic analyses that change as the control breaks on the listing change.
Here's an example of a Pivot table for product margin analysis.
The display is color coded to reflect the user's thresholds for "good" and "bad" performance (blue and
red). The purple horizontal and vertical axes show the attributes of the dimensions that this analysis
uses. The gray tab at the end of each purple axis allows any horizontal dimension to be pivoted to the
vertical and vice versa. The OLAP structure that underlies this analysis is then re-computed.

Our Case Study - Step 8 Value of the mart to the end user

It is important as an MIS organization to investigate the perceived and realized benefits provided by the
data warehouse or mart. Once the users begin using the warehouse or mart they will ALWAYS ask for
changes, and it is important to document the value of the new environment to help prioritize and justify
new enhancements and new marts.
Determining the value of the mart can be done through interviews with end users or even through written
surveys. The following sample questions were pulled from a Data Warehousing Institute survey currently
being conducted by the Survey Research Center at the University of Georgia.
Strategic Benefits:
How did the mart increase our competitive position?
How did it improve relationships with our customers?
How did it improve relationships with suppliers?
How did it increase our ability to conduct one-to-one marketing?
How did it improve our business process?
How did it improve cost efficiency?
How did it decrease the cost of supplying information?
How did it decrease the effort in supplying information?
How did it decrease the time to build new applications?
Information Benefits:
How did the mart improve our consistency?
How did the mart improve integration?
How did the mart improve comprehensiveness?
How did the mart improve availability?
How did the mart improve timeliness?
How did the mart improve accuracy?
How did these improvements affect our business? And what value did this have?

Conclusion
You have seen how Pain is vital to the success of a Data Mart Project, and how we built our client's Data
Mart.
You have seen that, for the first time, there is a viable, comprehensive solution for creating data
warehouses and data marts on an HP3000 from data that had previously been locked up in transactional
databases on the HP3000 or HP9000. Better yet, the data for this solution can be easily assembled by
lower-level MIS personnel and used by - good grief - marketing and executive types to increase the
productivity and profitability of your organization.
We recommend that anyone entering the data warehousing arena start by reading the works of Ralph
Kimball, at a minimum, and attend his workshops. Ralph has codified the principles of data warehousing in
a simple and down-to-earth fashion that addresses the business needs rather than the technology. No
serious data warehousing project should start before the designers have fully absorbed his principles,
particularly in data modeling. Ralph's two main works are:
The Data Warehouse Toolkit and The Data Warehouse Lifecycle Toolkit (ISBN 0-471-25547-5).
