Building A Data Warehouse on the HP3000:
A Practical Example

Cailean Sherman
Taurus Software
1032 Elwell Court, Suite 245
Palo Alto, CA 94303
650-961-1323
cailean@taurus.com

Cheryl Grandy
Dynamic Information Systems Corporation (DISC)
Boulder, CO 80301
303-444-4000
cag@disc.com

Jennie Hou
Hewlett-Packard
19447 Pruneridge Avenue
Cupertino, CA 95014
408-447-5971
Jennie_Hou@hp.com

Alvina Nishimoto
Hewlett-Packard
408-447-5649
Alvina_Nishimoto@hp.com
Table of Contents
WHY A DATA WAREHOUSE OR DATA MART?
WHAT IS A DATA WAREHOUSE OR DATA MART?
Technical goals of the data warehouse
Technical structure of the data warehouse
The data
The data model: Re-architected database - star schema
The environment
BUILDING A DATA WAREHOUSE STEP BY STEP
Analyze the business needs and determine value
Determine the scope of your project
Building a target model
Map the data
Extract operational data and load the target
Optimize queries
End user interface and deployment
Determine the value of the mart
CONCLUSION
Overview
This paper explains the steps involved in building a data warehouse on the HP3000 and how that data
can be used for in-depth analysis of a variety of business parameters. It will discuss the nature of data
warehouses and data marts, the technology used to build them, and their value to an organization.
Because data warehouses can adapt to the needs of any organization, this paper will detail the process
one company went through to decide what they wanted in their data mart and their step-by-step
implementation.
Why a data warehouse or data mart?
The literature of data warehousing is studded with sound bites that suggest how companies use data
warehouses and data marts for competitive advantage. No company in its right mind will give away real
secrets, so the examples below are more illustrative than real.
A supermarket found a correlation in its data warehouse that linked sales of beer and diapers on
Friday evenings between 5 and 7 PM. By sending observers into stores at that time, they found
that dads with infants were doing the shopping. The retailer placed the highest-margin diapers
alongside the beer. Sales and profit on diapers soared.
A web retailer of women's clothes sold a Weight Loss Clinic an expensive mailing list of all
customers whose dress size had increased since their previous order.
A Managed Care Organization found that 31% of its medical costs came from 8% of its members,
and that the MCO was losing money on that group. By focusing on this segment, the MCO put in
place programs to return the group to profit.
What is a data warehouse or data mart?
The "data warehouse" is a collection of organized data that allows managers to make business decisions
based on facts, not on intuition.
The "data mart" is collected data in a single subject area, usually extracted from the data warehouse or
pulled directly from an operational system.
Experience has been that companies build many marts, and all of these may need similar source data. A
company may start with a Billings Mart containing information on sales derived from invoices. Later they
may decide to have a Finance Mart containing information on profit derived from the same information.
Why should both marts pull this data directly from the source system? A data warehouse can feed both
sets of marts and queries, thereby reducing redundant work.
3. The access should be manageable by end users. MIS is no longer the end user of the tools that
access the data. End users are analysts, mid-level managers, and, in some cases, even high-level
managers. They must be able to get answers to their questions easily, and ask new questions, all
without getting MIS involved.
4. The process must be fast. Questions leapfrog, so you have to get answers fast. The very nature
of data analysis is that not all requests are known beforehand, so there's a lot of unpredictable, ad
hoc inquiry. In an analytical environment, the end user is directly interacting with the data, rather
than looking at a printed report. The answers have to be delivered fast, before users lose their
train of thought. If you lose the interest of a user, they won't come back!
To provide data for analysis that is clean and reliable
1. For consistent analysis, the environment must be stable. Now that users can create their own
reports, they must have consistent data. One department doing an analysis must get the same
results as any other. You can't have two managers showing up at a meeting with different sales
figures for last month!
2. Source conflicts must be resolved. Transactional (production) systems often have data stored in
several different applications. Customer addresses, for example, may be in the invoicing system,
as well as the sales order system. Chase Manhattan Bank may be stored as "Chase" in one
system and "Chase Manhattan Bank" in another. A straight load of data for this customer into the
data warehouse would result in two customers being represented, with invoices being listed
with no supporting sales orders and vice versa. These conflicts in the source data need to be
resolved and merged as the data moves into the analytical environment.
3. Historical analysis must be possible. There are two aspects to historical data that must be
considered. The amount of historical data in the warehouse has to be sufficient to satisfy the need
to spot trends over time. To see the growth in sales of a product over 3 years, perhaps to help
predict sales of future products, you need three years of data. The second historical consideration
is the ability to see the state of affairs at a historical point in time. Take, for example, a situation
where you want to see what your inventory amounts have been over the last year to help
determine if you can carry less inventory and still satisfy demand. You may need to be able to see
what your inventory amount was for product X on the 3rd, 4th, and 5th of July. As data is added to
the data warehouse, it needs to be added and date stamped - not updated as it would be in the
transactional inventory system.
4. Performance of operational systems must not be impaired. It's a given that the data warehouse
will run on a separate, dedicated system to avoid conflicts with the performance needs of the
operational systems. Extracting production data to populate the data warehouse will have some
impact on the operational system. Consideration must be given to whether full loads of data are
practical, or whether incremental loads are needed, using only changes since the last load.
The Data
Subject-oriented
Data warehouses group data by subject rather than by activity. Transactional systems are organized
around activities - claim processing, ordering product, shipping product. Data organized by transaction
cannot answer questions asked by subject, such as "how many shoes did we sell in Europe this year?"
This request would require heavy searching of sales orders and customer address records. If the
transactional system does not have a key or index on the country in the address record, the query would
most likely be very slow.
Data used for analysis needs to be organized around subjects: customers, sales, and products. Because
of the design that will be used for the data warehouse or data mart, if you ask for the sales in Europe, you
are able to search through very few records to get the answer to your question.
Integrated
Integrated refers to de-duplicating data and merging it from many sources into one consistent location. If
you are going to find your top 20 customers, you must know that HP and Hewlett Packard are the same
company. There must be one customer number for any form of Hewlett-Packard in your data warehouse
or mart. You can't have one from the sales system that is alphanumeric, and one from the shipping
system that is a concatenation of their company name and phone number.
Much of the transformation and loading work that goes into the data warehouse or mart occurs here,
integrating data and standardizing it. There are often problems found in the source data that need to be
corrected as well, such as absence of data.
Time Referenced
The data warehouse user typically asks questions about prior states of being of the data. For example:
"What was my backlog this month last year vs. this year?" or "How many refrigerator units were in my
service department this time two years ago?" To answer these questions, you need to know what the
order status was at a point in time last year, or what a service inventory count was at a historic point in
time.
Data warehouses handle this technically by time stamping data as it is added to the warehouse and by
continuing to append data to the warehouse without replacing or updating existing information. Many
snapshots of data are moved into the warehouse, one on top of another.
In this way, you can accurately see the inventory count at a certain time, or the backorders outstanding on
a specific day. This allows us to answer questions that compare past performance to the present,
providing the ability to spot trends in the data and to see if these trends are consistent across different
product lines. That makes it possible to better forecast future performance with facts rather than intuition.
Non-volatile
Second, the data warehouse model is designed for queries and bulk loads; updates can be very
expensive. Data warehouse models often replicate data in many places (de-normalize the data). An
example of this would be a Health Insurance Claim Data Warehouse. Each insured member belongs to
an insurance plan and each plan belongs to a group. You may set up a table listing each Plan and the
Group it belongs to. A Group is now repeated several times in the database, once for each Plan that
belongs to it. This design is used to make it easy to query via plan or group. If, however, you need to
change a group name, you have to update many locations, and that can be very costly.
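Sketched in SQL (these table and column names are ours, purely for illustration), the trade-off looks like this:

    -- De-normalized lookup table: the group name is repeated on every plan row.
    CREATE TABLE plans (
        plan_id    INTEGER PRIMARY KEY,
        plan_name  VARCHAR(40),
        group_id   INTEGER,
        group_name VARCHAR(40)    -- repeated once per plan in the group
    );

    -- Querying by plan or by group is easy and fast ...
    SELECT plan_name FROM plans WHERE group_name = 'Acme Group';

    -- ... but renaming a group must update every row that repeats the name.
    UPDATE plans SET group_name = 'Acme Group Inc.' WHERE group_id = 42;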
Third, data warehouse environments are large by nature. They contain historical data, summarized data
and very detailed data. Typically these environments take up gigabytes of space. Updating gigabytes of
data can be very costly.
Fourth, the data warehouse environment is heavily indexed. To allow for the slicing and dicing of data that
an analyst requires, many indexes are placed on the data. To update these environments is very CPU
intensive.
In short, data should simply be loaded, then accessed, never updated.
You can see from the diagram above that it is easy to find sales using various combinations of criteria
such as a particular product for a calendar month. You can select all of the invoices for that calendar
month then easily find the parts you are inquiring about. What could have been a large serial pass to read
through millions of records in a transactional system turns into a finely tuned pass through a fraction of
the records using the star schema data model.
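In SQL terms, such a query against the star schema might look like the sketch below (table and column names are modeled loosely on the dimension tables described later in this paper, and are illustrative only). A handful of small dimension rows qualify, and only the matching fact rows are read:

    -- Sales of one product line in one calendar month: qualify two small
    -- dimensions, then read only the fact rows whose keys match.
    SELECT SUM(f.billed_price) AS sales
    FROM   invoice_lines f
           JOIN products p ON p.product_id = f.product_id
           JOIN periods  d ON d.period_id  = f.period_id
    WHERE  p.prod_line      = 'WIDGETS'
      AND  d.cal_year       = 1999
      AND  d.cal_month_name = 'JUNE';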
Relational databases may be used to house these star schemas, and until recently, these databases were
the only environments accessible by the popular end-user tools that did the analysis needed for decision
support. With DISC's introduction of specialized OMNIDEX indexes for Data Warehousing, Image/SQL is
now an advantageous option as a data warehouse repository. In fact, the capabilities of an IMAGE
database enhanced with OMNIDEX go far beyond standard relational database technology.
The Environment
To satisfy the goals of the data warehouse and to accommodate the new data structures, several types of
tools are used today to populate the data warehouse, house the data warehouse, index the data for query
optimization, and allow fast user access.
The diagram above shows the different technologies used. Reading from right to left there are Extraction
processes to move data out of the transactional systems and into the warehouse. There is the database
management system that houses the warehouse. There are indexes added to the database to optimize
the queries. And there are the user access tools.
Extraction - ETL
There are two ways to populate a data warehouse. You can use custom programs written in a language
like Cobol. Or you can use tools built specifically to move data.
Custom programs have some advantages: You may have a staff of programmers who are familiar with a
language. Or you may already have many processes written that you can leverage to extract data from
your transactional systems.
The disadvantage of custom programs is the maintenance cost. Data warehouses are constantly
changing, and if you are using a third generation language to populate the warehouse, the program will
need to be re-compiled each time you make a change. Each change made to the mapping and
transformation rules similarly takes more time to implement in a third generation language than in a tool
written specifically to transform data.
ETL tools used to populate a data warehouse must be able to perform the following functions:
1. Accept data from many sources
2. Move data to many targets
3. Handle any needed data transformations
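Our client's loads were built with an ETL tool rather than hand-written SQL, but the shape of a typical transformation - moving source rows into a dimension while standardizing and decoding values - can be sketched in SQL (all names below are illustrative):

    -- Copy source customers into the dimension, standardizing the name and
    -- decoding a one-letter type code along the way.
    INSERT INTO customers_dim (cust_nbr, cust_name, cust_type)
    SELECT c.cust_no,
           UPPER(TRIM(c.cust_name)),          -- standardize case and spacing
           CASE c.cust_type
                WHEN 'D' THEN 'DISTRIBUTOR'
                WHEN 'E' THEN 'END USER'
                ELSE 'UNKNOWN'                -- flag absent or bad codes
           END
    FROM   src_customers c;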
Multidimensional (cube) databases, however, cannot hold as much data as standard relational databases (typically only 50 Gb), and data summaries at a level that is not predefined are not allowed.
Many users choose to use both relational and multidimensional databases in their data warehouse
architectures. Often the foundation or data warehouse will use a relational database, while the data mart
or single subject area will be populated in a multidimensional (Cube) model.
There are also tools available which can enhance databases (including IMAGE/SQL) with speed and
capabilities that go beyond relational and multidimensional databases, while eliminating their limitations.
As an example, OMNIDEX for Data Warehousing provides unlimited multidimensional analysis across
any number of dimensions.
OMNIDEX for Data Warehousing also eliminates the need for summary tables in relational databases,
replacing them with high-speed dynamic data summaries. Plus it provides fast access to the underlying
detail data.
Our Case Study - Step 1 Analyze the business needs and determine
value.
We first met with our client's CIO and MIS staff. They initially were interested in a consolidated reporting
environment to satisfy many of the reporting requests coming from users. They estimated that they had
two people working full time writing reports and were hoping to free up these resources.
The goal of the needs analysis for a data warehouse or mart is to get executive sponsorship, and a clear
definition of measurable business issues which can be addressed.
It is critically important for MIS to work with their users and find the pains they are facing. This
accomplishes three things:
1. MIS knows what the users want.
2. MIS has user buy-in on the project.
3. MIS will be able to get user funding without having to use the MIS budget.
Finding Pain: A fast way to failure in a Data Warehousing project is to ask the users for a list of the data
items they want to see on reports. Business issues or Pains that VPs face are what is needed. If we can't
find a Business issue, let's spend the time and money on something else. While IT or users may point to
areas of Pain, the only way to specify and value Pain is to uncover it personally with the top executive
team.
Find an executive Sponsor. Without Pain you won't find an Executive Sponsor. Even with Pain, you
need a firm commitment from one of the three or four top executives in the company (CEO, CFO, VP
Sales, VP Operations, VP Manufacturing).
The main reason for failure in Data Warehousing is lack of one or both of these key requirements!
Conducting interviews with users to uncover pains. Only after you have formal support from an
executive should you go to the users. Concentrate on the direct reports of your Executive Sponsor and
their staffs. Ask these interviewees whom they interact with in other functions regarding the Pain, and go
interview those people with a view to uncovering new Pain that you can raise with another VP.
The Pain and Interview process is a skilled task. Do not attempt to do this without training or at least
without having read deeply into Kimball's material on the subject. Where you can, use an experienced
business consultant to help you.
In our example company, the VP of Sales was directly paid under a management by objective (MBO)
scheme. His targets included not only the traditional Sales measures, but also Gross Margin on Product.
He couldn't measure his progress accurately from any data on hand, so he couldn't be sure he was
claiming all of his due compensation.
Personal pain like this can be a strong motivator despite the VP's reluctance to articulate the dollar impact.
Underlying his personal issue was the knowledge that he was largely in the dark about who was selling
what, particularly sales to large customers who could buy, either from his salespeople or from his
distributors, with widely varying margin contributions.
In addition the CFO had suspected that gross product margins varied widely. From a prototype built on
billing data, he saw spreads on functionally similar products of up to 20%. He agreed immediately that the
average margin could be improved by 5% with appropriate knowledge and training. The 5% translated to
$5 million in EBIT that would give an ROI of twenty times the cost of the initial Data Mart.
Our Case Study - Step 2 Pick the scope of initial deployment (data mart
vs. warehouse).
No matter how much ROI you uncover, be sure to start with one data mart, with one functional area; say
Billings, or Bookings. Start with the simplest, get the mart operational and prove your ROI before you
even start the next mart. Small bites of a big meal prevent indigestion. Start with the end in mind, so lay
down your first data mart on a foundation that will later support other data marts. Don't extract directly
from production to the Mart, but by way of a staging area or central data warehouse that other data marts
can later draw on. Begin by listing the potential marts. The process of uncovering a ROI will also uncover
the functional areas that merit a data mart, as well as the priority. In the example company, a number of
areas promised some ROI: Billings, Bookings, Backlog, Inventory and Finance. All of these are candidate
marts that will make up a company wide data warehouse in time. For each mart, you need to discover
what elements are most important for analysis. This is done by interviewing the users. You will probably
find that the CFO is interested in sales and gross margins by product and by all the descriptors that can
apply to a product, including product family -- line, group, subcategory, color, size, supplier and so on.
Most products have upwards of 100 descriptors that can reveal areas for margin improvement.
In dimensional modeling, you call the areas of analysis, such as customer and product, the DIMENSIONs.
The descriptors are called attributes, and they become fields in the dimension tables (data sets). Even
though you select one data mart to start, you define the dimensions across all the marts at the outset
because you want CONFORMED dimensions that all the marts can share. For example, you will need the
same product dimension for a billings mart as you will later want for a bookings mart and for a backlog
mart.
Listing the marts will also uncover the facts or numeric measures that underpin the ROI you are seeking.
Almost invariably, facts are numeric and not textual. Text indicators almost certainly belong in a dimension
as an attribute.
In the case of the CFO, gross margin down to the part number will be required. You need to discover how
to compute that margin. Is it list price minus cost, or invoiced price less cost? Is the cost acquisition cost,
or that cost plus carrying costs, or are you adding customer service costs by product, or should you
include freight and insurance?
For our project, numeric indicators were chosen for Billings; these are the items that we now call FACTS
and represent in a FACT table. Cost of Goods was defined as acquisition cost.
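Under that chosen definition, the margin facts can be sketched in SQL like this (column names are illustrative, patterned on the metric fields listed later in this paper):

    -- Gross profit per invoice line: invoiced (billed) price less acquisition
    -- cost, with a margin percentage for analysis.
    SELECT invoice_nbr,
           billed_price - cogs                         AS gross_profit,
           100 * (billed_price - cogs) / billed_price  AS margin_pct
    FROM   invoice_lines
    WHERE  billed_price <> 0;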
Billings was the chosen area because it offered the most ROI in a relatively simple area of the business.
You can also be sure that the data is reasonably clean since customers generally complain if their
invoices are inaccurate.
Our Case Study - Step 3 Build the target model
For our Invoice Lines Fact Table, our client required dimension tables representing Customers, Products,
Sales People, Addresses and Time.
You saw above some examples of the attributes for the Product dimension. Time in the form of a Periods
table allows every individual day that the Billings Mart may cover to be described, so you can analyze
billings by those attributes, down to the day. Each day clearly falls into a given calendar year, quarter,
month, and week. If fiscal year does not match calendar year, add fiscal year, quarter, month, and week.
You can also have attributes for a day that describe a season, say Back to School or Easter. You could
have an attribute for an annual shutdown, to explain why nothing happened in a given week.
From here, we implemented surrogate keys instead of the actual customer and product numbers to link
the tables. As you can imagine, there are many reasons not to use the "real" numbers that identify
customers and products or any of the foreign keys that point to the dimension tables. You may re-use
those key values in the transactional system while the number still points to old data in the data mart.
You may add a division to the data mart that has a customer under a different "real" number. Or, say you
have acquired a company for their customers, and you want to count the sales by both companies under
one customer's name.
You may want to have two rows of dimension data that describe different attributes for different periods.
Say your rep sells a new customer. In the data mart, you assign that sale to that rep, and his team, group
and division. Two years later, you re-assign the customer to another rep and division. If you just change
the one existing row, that customer will disappear from the original rep's sales numbers and appear in the
new rep's numbers, even though the new rep might not have worked for the company at that time.
For disk space and speed of join processing, it's also more efficient to use small integers for your keys
rather than the long text strings that the "real" numbers may have.
To neatly handle these situations, use surrogate keys that are artificial and never seen by the user. This
adds some complexity each time you want to add a new dimension row, say to handle the sales rep
problem above. It is well worth the effort. You will want to implement surrogate keys for all the dimension
tables, including the Time dimension.
CUST_ID and CUST_ADDR_ID, listed above, are examples of surrogate keys.
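Here is a minimal sketch of the idea in SQL, with hypothetical names (only CUST_ID and CUST_NBR echo the paper's field lists): the surrogate key is a small integer generated by the load process, while the "real" customer number is kept as an ordinary attribute.

    -- Dimension keyed by a surrogate integer, never by the "real" number.
    CREATE TABLE customers (
        cust_id     INTEGER PRIMARY KEY,  -- surrogate key, never shown to users
        cust_nbr    VARCHAR(20),          -- "real" customer number from the source
        cust_name   VARCHAR(60),
        salesrep_id INTEGER,              -- rep assignment for this version
        eff_date    DATE                  -- when this version took effect
    );

    -- Re-assigning the customer to a new rep adds a second row under a new
    -- surrogate key; facts already booked under cust_id 101 stay with rep 7.
    INSERT INTO customers VALUES (101, 'C-4711', 'Acme Group', 7, DATE '1997-01-15');
    INSERT INTO customers VALUES (352, 'C-4711', 'Acme Group', 9, DATE '1999-01-15');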
The last step is to physically build the empty files for your Fact and Dimension Tables, so that you can
load and test them with data. In contrast to designing a data warehouse using a relational database, we
did not need to design and build summary tables to aggregate the data at this point, since our client used
OMNIDEX as part of DecisionVault. OMNIDEX will provide the data summaries at high speed using its
own summary indexes.
Our Case Study - Step 4 Map the data from production to Mart
Now that we had built the target mart schema it was time to populate it.
The first step of every data movement project is to develop a mapping document. The mapping document
details where the data is coming from; transformations, editing rules, lookups needed; and what needs to
happen if an error occurs. This document becomes the specification for the extraction routines.
The mapping document is done after the data model for the data mart or warehouse is complete. For the
purposes of illustration, let's look at a simplified version of our data model. In this version, we have only a
single dimension table and the fact table.
Loading of the data mart begins with the dimensions, so we started there. Mapping is also where source
data quality problems surface; those building the mart must either impose "their" rules or demand that the
source data be corrected. Either way, the data will get "cleaned" up.
Our project's mapping document was developed in Excel. You could use any tool that works for you.
Excel is simple to use as it is easy to "paste" sample data onto a worksheet, and it is easy to email the
documents between the parties involved in developing the data warehouse. Below is our mapping
document for the customer dimension.
You can see in this document that it took four source tables to provide enough data to populate the dimension
table. For those fields that required data from other tables, we made sure to include information on how to
make those linkages from our primary source table (cust) to the other source tables (parent, cust_seg,
and cust_commcn).
If there is decoding involved, either create a lookup table or provide another sheet that details the
translation, e.g. this source code gets this description. The mapping document should be complete
enough that an outside programmer could write code from the document.
You should write the mapping documents for all of the dimensions first and then move to the Fact
Table(s). The Fact Table mapping will assume that all of the dimensions exist in the data model before the
fact is added. In other words, you shouldn't have to check whether the customer exists, because there
shouldn't be any invoices added that do not have a customer. If your source system doesn't check this
referential integrity, either as part of the application or as part of the database structure, you will need to
include the rules needed to ensure that the dimensional data is in place prior to writing a fact row. The
mapping document is a "working" document that will change as you discover different aspects of the
data, both before and during the load.
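For example (illustrative names again), a fact load can enforce that rule by resolving each source row against the dimensions and setting aside the rows that fail the lookup:

    -- Only source rows that resolve to existing dimension rows become facts.
    INSERT INTO invoice_lines (cust_id, product_id, sales_qty, billed_price)
    SELECT c.cust_id, p.product_id, s.qty, s.price
    FROM   src_invoice_lines s
           JOIN customers c ON c.cust_nbr = s.cust_no
           JOIN products  p ON p.sku      = s.sku;

    -- Rows with no matching customer are diverted for review instead of
    -- silently violating referential integrity.
    SELECT s.*
    FROM   src_invoice_lines s
    WHERE  NOT EXISTS (SELECT 1 FROM customers c WHERE c.cust_nbr = s.cust_no);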
Our Case Study - Step 5 Extract operational and external data and load
the mart.
Once the mapping document for a particular dimension is complete, it is time to begin the development of
the extraction routines. All data marts begin with an initial load that loads the entire applicable contents of
the transactional source application into the data mart. This is a bulk load process that starts with the
dimensions and then loads the fact tables.
As with most initial loads for data marts, we started with the dimensions first. The customer data is key to
the model and is the cornerstone of all of the information within the mart. When we talked to the users
about their customer data, they indicated that the data was clean for a number of reasons: product had
been shipped to the addresses in the database; the application has an automatic de-duping feature to
ensure that duplicate catalogs are not sent to the same address; and address fields are automatically
updated by the application.
The scripting language has all the power of a third generation programming language. It is broken into
three major components: accessing files, selecting data, and the action to be taken on the data. To
access a file, you either create it or open it; open is used for files that already exist. In this situation, we
were accessing an Oracle database (the source) and an Image database (the target) through open
statements.
Incremental load:
Once the data mart was loaded and began being used, it was clear that the customer wanted a daily
refresh of data. They did not want to perform a full refresh each night because they did not have a batch
window large enough for a full refresh.
DecisionVault's BridgeWare combines DataBridge with a real-time change detect component for the
HP3000. On relational databases it is easy to capture changes as they are posted to the database using
triggers. On the HP3000, Image does not have such a feature. Although Image has logging, the log file
doesn't capture enough information to apply changes to the data mart.
BridgeWare's technology is best explained using a diagram.
BridgeWare, a joint development with Quest Software, combines Quest's change-detection technology
with Taurus' ETL. Quest's component, called SharePlex, detects changes in data files on the HP3000 and
writes them to a message file. SharePlex works at the operating system level, detecting changes before
they reach the data file. DataBridge is then used to define how the captured changes should be
transformed and moved.
The message file holds the records of the changes, and contains a header record with the name of the file
changed, the type of transaction, a sequence number, and the date and time the transaction occurred.
Each type of record (insert, update, delete) always writes a header plus the record images which are
appropriate. An insert has the record image for the record inserted. A delete has the record image of the
record before it was deleted. An update has the before and after record image.
As our client moved to an incremental load process, they had to consider how to handle each of their
transaction types: adds, changes, and deletes. The rest of the load process remained the same.
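In our client's case the changes were applied by DataBridge, but purely as an illustration, the three message types map onto the mart in SQL roughly as follows (the staging table and column names are invented, and we assume one message per key per batch; a real load applies messages in sequence-number order):

    -- Inserts: add the after-image as a new row.
    INSERT INTO customers_dim (cust_nbr, cust_name)
    SELECT after_cust_nbr, after_cust_name
    FROM   change_msgs WHERE tran_type = 'I';

    -- Updates: apply the after-image over the row matching the before-image key.
    UPDATE customers_dim
    SET    cust_name = (SELECT m.after_cust_name FROM change_msgs m
                        WHERE  m.tran_type = 'U'
                          AND  m.before_cust_nbr = customers_dim.cust_nbr)
    WHERE  cust_nbr IN (SELECT before_cust_nbr FROM change_msgs WHERE tran_type = 'U');

    -- Deletes: remove the row matching the before-image key.
    DELETE FROM customers_dim
    WHERE  cust_nbr IN (SELECT before_cust_nbr FROM change_msgs WHERE tran_type = 'D');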
Our Case Study - Step 6 Optimize queries
We evaluated our options to improve query performance. According to data warehousing literature, the
standard options to satisfy user query requests are to:
Perform serial reads through the tables or data sets
Build summary tables in a relational database
Load the data into a MDD (multidimensional database) and query against it
Use ODBC/Web middleware residing between the database and popular end-user tools such as Brio
Serial reads through large tables or data sets are slow. They are not viable for any significant amount of
data. Even with extremely large expenditures on CPUs or parallel processors, they are typically not fast
enough for online reporting with multiple concurrent users competing for resources.
At first glance, using a relational database and building summary tables is a possibility. They store precalculated and sorted values that can answer specific queries. But we were worried that the number of
tables needed to do the queries which came out of our user interviews would be prohibitive because a
different table needs to be built for almost every query.
As mentioned above, multidimensional databases are limited in the number of dimensions they can
handle, and they cannot be updated. Typically only 5 or 6 dimensions can be used before the build and
refresh times become unmanageable and another data mart must be built. This was not going to meet our
users' requirements, since they wanted to expand the data mart into a data warehouse over time, and the
refresh times would outgrow their nightly batch window.
The remaining option was ODBC/Web middleware residing between the database and popular end-user
tools such as Brio; combined with OMNIDEX indexes on the IMAGE database, this is the approach we chose.
To show how OMNIDEX was used to deliver the query performance we needed on our data mart let's
look at the database's star schema design and the retrieval requirements.
The current model has a basic sales fact table and five (5) dimensions (refer again to the star schema
diagram). The dimensions that can be sliced and diced in any combination are:
1. Customers
2. Addresses
3. Products
4. Periods
5. Sales People
For our client, the primary purpose of the Billing/Sales Analysis data mart was to analyze customer
characteristics and buying patterns, such as:
1. Customer demographics
2. Product performance
3. Sales rep performance
4. Regional preferences
Let's see how OMNIDEX handled multidimensional queries, data aggregations, and drill-down to detail for
this client.
Multidimensional Queries
As with most data warehouse projects, we determined that most of the data warehousing queries which
needed to be made were complex and multidimensional in nature, using multiple criteria against multiple
fields or across multiple dimensions. For example, our end-users rarely wanted to access data by only
one field or dimension, such as finding the number of customers in the state of California. They wanted to
ask complex questions such as how many customers in the state of CA have purchased product B or D in
the last year.
Our client's sales manager wanted to see the sum of sales for his sales reps for their new products during
a certain time period such as last month. The inventory managers might want to see the quantity sold of a
product for the last month and compare it to last year's. A marketing manager might want to see the sum
of sales for some particular states and customer types.
We knew that the performance of these multidimensional queries in a data warehouse environment using
a star or snowflake database design would be greatly enhanced through the use of OMNIDEX
multidimensional indexes. OMNIDEX Multidimensional Indexes are specially designed for unlimited
multidimensional access, allowing complex queries using any number of criteria or dimensions which we
needed to satisfy our requirements.
We placed OMNIDEX Multidimensional indexes on each column in the dimension tables so the end-users
could query by any combination. When an end-user issues a query, a qualifying count based on index
access only is returned without touching the actual data. This speeds the query because it is much more
efficient to service a query request by simply looking in an index or indexes instead of going to the
primary source of data (the database itself). This is true for both pre-defined and ad hoc queries.
Because the end-users quickly obtain the results each time they perform an ad hoc search, they can
decide whether to add criteria to further qualify the selection, aggregate it, drill down into the detail data,
or start over. With the multidimensional indexes applied, our client's end-users were able to query
repeatedly in seconds or minutes instead of hours or days, even with large amounts of data. This allowed
them to interact with and explore the data in unpredictable ways, very quickly. Only when the end-users
wanted to view the detail data was I/O performed against the database.
Data Aggregation
In addition to multidimensional queries, our client's end-users wanted to see data summaries (e.g.,
COUNT, SUM, AVERAGE, MINIMUM, MAXIMUM) such as total sales dollars, number of customers,
average quantity, etc., and group them by a field such as sum of sales by region, or average quantity of
product line B sold by sales rep.
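In SQL terms, these requests are GROUP BY aggregations, sketched below with illustrative names:

    -- Sum of sales by region.
    SELECT r.region_code, SUM(f.billed_price) AS total_sales
    FROM   invoice_lines f
           JOIN salesreps r ON r.salesrep_id = f.salesrep_id
    GROUP BY r.region_code;

    -- Average quantity of product line B sold, by sales rep.
    SELECT r.full_name, AVG(f.sales_qty) AS avg_qty
    FROM   invoice_lines f
           JOIN salesreps r ON r.salesrep_id = f.salesrep_id
           JOIN products  p ON p.product_id  = f.product_id
    WHERE  p.prod_line = 'B'
    GROUP BY r.full_name;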
OMNIDEX performs data summaries by using its Aggregation indexes only, summarizing quantities or
amounts dynamically at the time of the query. Up to 1 million values per second can be queried.
We placed an OMNIDEX aggregation index on the fields to be summarized and were able to query at
these very high speeds while still satisfying our client's requirement for heavy ad hoc inquiries.
Drill-down to detail
The multidimensional and aggregation indexes which we set up with the client contain pointers back to
the actual data in the fact table, so the underlying detail data can be instantly accessed and displayed to
the user at any time. This means data can be kept at any level of granularity, quickly aggregated, and
then the detail displayed - a capability that is lost when using multidimensional databases or summary
tables with relational databases.
OMNIDEX Installation
Once the star schema database is designed, the general query needs are determined, and data is
loaded, OMNIDEX is ready to be installed. To enhance this client's star schema with OMNIDEX indexes,
an "environment catalog" was set up that defines the database layout (like a data dictionary).
Once the environment catalog was created and compiled, an OMNIDEX consultant worked with the
client's staff to further analyze the query needs. They first determined what snowflake tables would
optimize the query performance for "Group By" fields from large data sets to be used in data
aggregations.
Based on the cardinality of the data (the number of possible values per field) and the size of the
dimension tables, snowflake tables were created to benefit this Sales Analysis data mart. (OMNIDEX has
a program called SNOWGEN that generates the snowflake tables automatically.)
After the snowflakes were created and added to the environment catalog, the OMNIDEX installation job
for the data mart was defined based on the overall query needs. First of all, all of the fields in the
dimension tables were defined as multidimensional keys (MDK):
CUSTOMERS: CUST_NBR, CUST_NAME, PARENT_ID, PARENT_NBR, PARENT_NAME, CUST_TYPE,
ACTIVE, DUNS_NUMBER, DUNS_RATING, TAX_ID, SIC_CODE, AREA_NBR, PHONE_NBR
ADDRESSES: CUST_NBR, CUST_ADDR_NBR, ADDRESS_LINE_1, ADDRESS_LINE_2,
ADDRESS_LINE_3, CITY_NAME, STATE_PROVINCE_CODE, POSTAL_CODE, COUNTRY_NAME
SALESREPS: SALESREP_ID, LAST_NAME, FIRST_NAME, MIDDLE_INITIAL, FULL_NAME,
OFFICE_CODE, DISTRICT_CODE, REGION_CODE
PRODUCTS: PRODUCT_ID, SKU, PART_DESC, STATUS, PROD_LINE, PRODUCT_TYPE,
PROD_GROUP, VENDOR, DIST_PART, QUALITY, INTRINSICS, PACKAGING, ORDER_QTY,
MIN_ORD_QTY, MULTIPEL_QUT, NEC_BUDGET_1, NEC_BUDGET_2, NEC_BUDGET_CODE,
NEC_PART_NBR, NEC_PLANNER, NEC_PKG_TYPE, NEC_PROD_LINE, NEC_STATUS,
NEC_STD_BUS_UNIT, FORECAST_CLASS, DS_ABC_CODE
PERIODS: ACTIVITY_DATE, CAL_YEAR, CAL_QUARTER, CAL_MONTH_NAME, CAL_MONTH_NBR,
CAL_MONTH_ABBREV, CAL_WEEK_NBR, DAY_NBR, DAY_NAME, FISCAL_YEAR, FISCAL_QUARTER,
FISCAL_WEEK, FISCAL_PERIOD, PUBLIC_HOLIDAY, NON_WORKING_DAY, SPECIAL_DAY1,
SPECIAL_DAY2
Then the metric fields to be aggregated were specified from the INVOICE LINES fact table:
Sales qty
List price
COGS
Billed price
Discount given
Gross profit
Seamless Integration
Once the indexes were created and loaded, it was time to install the OMNIDEX Client Server software on
each user's PC, create some data source definitions, and invoke a "listener" process on the HP3000
server. High-performance queries then became available to the users through ODBC-compliant analysis
and reporting tools.
When a pre-defined or ad hoc query or report issues a request for information, ODBC passes the query
through the OMNIDEX search engine and SQL Optimizer, which accesses the indexes and passes the
qualifying count or data aggregation back to the PC query or reporting tool. The OMNIDEX
Multidimensional and Aggregation indexes allowed a broad range of queries to be optimized with one
set of indexes that took much less build time and disk space than other "standard" data warehousing
options.
If the user requests the actual detail data, the pointers for the selected data records are used to quickly
retrieve only those records of interest from the sales fact data set. The OMNIDEX SQL optimizer ensures
that all queries against a star schema or snowflake are completed in the fastest manner, and never cause
slow, I/O-intensive serial reads of the data.
Our Case Study - Step 7 The end user interface and deployment
Once we had clean, certified data in our data mart with indexes installed on it, the Billings data mart was
ready to give to the VPs and their staff. We needed to choose a business intelligence tool that users like.
We knew that any ODBC-compliant tool can access an IMAGE database through the OMNIDEX
interface. Our client chose BrioQuery because of its short learning curve and ease of use.
Brio Technology's range of products allows users to access the data mart from their desktops or from their
web browsers. BrioQuery generates SQL requests automatically that are sent to OMNIDEX so that
OMNIDEX can quickly retrieve the data from the Data Mart. The data comes back to the user initially in
row and column format, ready to be graphed and analyzed.
This sample shows the data requested from the Data Mart on the "Request" line, and below that the
6,999 rows of data that were returned.
Our client chose to develop standard analyses using charts, pivot spreadsheets and serial reports that
cover most of the detail that the VPs want to focus on to meet their ROI goals. Fortunately, Brio's products
are easy enough to use that an Excel user can create new analyses very quickly.
Here is an example bar chart: the analysis shows how the sales people have been discounting each
quarter. We know that the VP of Sales and the CFO believe they can reduce the discounting rate from an
average of 14% to at least 12% if they can see where the problems are. That 2% is $2 million at the end
of the year in Sales and Margin - with no effort. We can "drill into" this chart to find out which reps are the
worst offenders and whether any customer or product is particularly responsible. Managers can then act
to correct the worst offenders.
Here's an example of "Green Bar Paper" reporting brought up to date. Note that the output contains
interactive graphic analyses that change as the control breaks on the listing change.
Here's an example of a Pivot table for product margin analysis.
The display is color coded to reflect the user's thresholds for "good" and "bad" performance (blue and
red). The purple horizontal and vertical axes show the attributes of the Dimensions that this analysis
uses. The gray tab at the end of each purple axis allows any horizontal dimension to be pivoted to the
vertical and vice versa. The OLAP structure that underlies this analysis is then re-computed.
Our Case Study - Step 8 Value of the mart to the end user
It is important as an MIS organization to investigate the perceived and realized benefits provided by the
data warehouse or mart. Once the users begin using the warehouse or mart they will ALWAYS ask for
changes, and it is important to document the value of the new environment to help prioritize and justify
new enhancements and new marts.
Determining the value of the mart can be done through interviews with end users or even through written
surveys. This list of sample questions was pulled from a Data Warehousing Institute survey being
conducted currently by the Survey Research Center at the University of Georgia.
Strategic Benefits:
How did the mart increase our competitive position?
How did it improve relationships with our customers?
How did it improve relationships with suppliers?
How did it increase our ability to conduct one to one marketing?
How did it improve our business process?
How did it improve cost efficiency?
How did it decrease the cost of supplying information?
How did it decrease the effort in supplying information?
How did it decrease the time to build new applications?
Information Benefits:
How did the mart improve our consistency?
How did the mart improve integration?
How did the mart improve comprehensiveness?
How did the mart improve availability?
How did the mart improve timeliness?
How did the mart improve accuracy?
How did these improvements affect our business? And what value did this have?
Conclusion
You have seen how Pain is vital to the success of a Data Mart Project, and how we built our client's Data
Mart.
You have seen that, for the first time, there is a viable, comprehensive solution for creating data
warehouses and data marts on an HP3000 from data that had been previously locked up in transactional
databases on the HP3000 or HP9000. Better yet, the data for this solution can be easily assembled by
lower level MIS personnel and used by - good grief - marketing and executive types to increase the
productivity and profitability of your organization.
We recommend that anyone entering the Data Warehousing arena start by reading the works of Ralph
Kimball at minimum and attend his workshops. Ralph has codified the principles of data warehousing in a
simple and down to earth fashion that addresses the business needs rather than the technology. No
serious data warehousing project should start before the designers have fully absorbed his principles,
particularly in data modeling. Ralph's two main works are:
The Data Warehouse Toolkit. The Data Warehouse Lifecycle Toolkit, ISBN 0-471-25547-5.