DW components

Cooking up a Data Warehouse
Todd Saunders

Todd Saunders is chief solutions architect for
CONNECT: The Knowledge Network. He has
been building systems and organizations
for nearly 20 years and has been involved
specifically in building and implementing data
warehouses, business intelligence solutions, and
database marketing systems for the last 12 years.
tsaunders@connectknowledge.com

Abstract
As a data warehousing professional, you know that your environment has many components that must work together and
interact just so to provide valuable information to your business. However, for colleagues who are beginning to familiarize
themselves with data warehousing, it is not always clear what
those components are and how they affect each other. This
article will provide an analogy to help you explain key
DW components and their interactions to neophytes.

Introduction
When explaining the basic components of a data warehouse environment, the analogy I like to use is that of a
restaurant. This is not a new analogy; in a Web search, I
found articles from several years ago that use it. However,
new techniques and technologies have influenced the way
data warehouses are developed and used, so it's time to
update the analogy.
Because we're all familiar with restaurants, you can use
this analogy to explain data warehousing to people
(friends or colleagues, perhaps) who are unfamiliar with
technology in general or with DW components such
as ETL tools or databases. Providing an easy way to
visualize what is happening in one of these solutions can
go a long way toward effectively communicating how a
data warehouse works. It should be easier for someone
new to data warehousing to visualize and remember how
ingredients are stored in a kitchen by type (for example,
frozen, canned, or fresh) than to visualize how data
is stored in a database according to subject area. The
analogy will provide DW neophytes with a clarifying
context about data warehouse solutions, so when they are
pulled into conversations regarding data structures, they
can discuss them meaningfully.

BUSINESS INTELLIGENCE Journal vol. 14, No. 2

First, let's define our terms. A restaurant is a business that
prepares and serves food to customers. For this article, I
will focus on the food preparation and delivery process of
the restaurant, and not so much on issues related to renting a building or designing the décor. In short, we'll look
at the process of procuring raw materials (food) from
suppliers, storing the food in the kitchen, and preparing
and serving dishes ordered by customers.
When we look at the general process flow that occurs in
our restaurant, we assume placing orders for the ingredients in our dishes is the first step. When the ingredients
arrive, we store them in the appropriate places in our
kitchen: the freezer, the refrigerator, or on the shelf. As
orders are placed by customers, the appropriate ingredients are retrieved by the chef; measured, mixed, combined,
and cooked; and finally delivered to the customer.
How is this relevant to data warehousing? The parallel
between the two is striking, I believe. Purchasing raw
ingredients is analogous to obtaining data from various
source systems in your business. The raw ingredients
(the data) come from source systems such as your ERP
system, sales system, accounting system, or fulfillment
system, to name a few. Occasionally, you may procure
data from outside sources to get information about
prospective customers or your market.
Storing the raw ingredients in the appropriate places is
the equivalent of storing data in a database (i.e., your
data warehouse). As in a restaurant, it is important to
put your resources (in this case, your data) in the right
place so everyone knows what is stored where and how it
can be retrieved.
Preparing the dishes ordered by the customers is analogous to building reports and delivering the reports to
your business users.
To summarize, obtaining raw ingredients, storing the
ingredients, then retrieving the ingredients to prepare
a dish is similar to obtaining data from source systems,
storing the data in a database, and then using that data
to build reports. Get data in, manage it, and then get
information out.

It sounds simple, right? Well, it can be, but the demands
of business users and the complexity of businesses usually
mean that more detail and complexity are needed in the
data warehousing environment to support the business
requirements. Again, our restaurant analogy can help
clarify some of these issues.

Technical Resources
Just as a restaurant needs a top chef to prepare the meals,
businesses need to have high-quality technical people
who can build and manage the data warehouse environment. Just as a poor chef can take good ingredients and
still produce a mediocre dish, a poor technical team can
have good data supplied from the source systems yet still
struggle to produce timely and accurate information. An
experienced database administrator with deep knowledge
of data warehousing is one of the keys to creating a
successful data warehouse environment.

Data Sources
One of the complexities in data warehousing is determining what data to put into the data warehouse. In our
restaurant analogy, this equates to figuring out what raw
ingredients to order. The key is deciding what is going to
be on the menu. As a restaurant owner, you decide what
soups, salads, appetizers, main dishes, and desserts you
will offer. Each of these dishes requires ingredients, so
the complete menu gives you the total list of ingredients.
If one of the desserts you offer is a milk shake, you know
you will need to have milk, ice cream, and flavoring
available in the kitchen.
In the business world, knowing what information is
required by business users will help you determine what
data needs to be available in your data warehouse.

If the business users need a weekly report of accounts
receivable by customer, you know that you will need
data from the accounting department that details the
amounts each customer owes. With that raw data in
the data warehouse, a report of the receivables can be
developed that displays the information needed by the
end user.

Data Granularity
When preparing a meal, you need to know what the
ingredients are as well as what amounts to use. In some
cases, it may make sense to buy ingredients that are
already a complete food item. For example, ice cream
could be served as a dessert by itself or included as an
ingredient in a more complex dessert. If a consumer
wishes to know all the ingredients that are included in a
dessert, it may not be enough to know that ice cream is
one of the ingredients. The consumer may want to know
exactly what ingredients went into the ice cream being
served. The fact that they know it is ice cream may be at
too high a level of aggregation for their needs.
The data that goes into data warehouses faces the same
issues. If a source system can only deliver total sales
amounts by product per week, it may not be possible to
determine what days of the week the most sales are generated by product. This may or may not be important to the
business users, but it is important to find out before the
database is designed so that expectations can be set about
exactly what level of detail will be available for analysis.
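
The granularity point can be made concrete with a small Python sketch. The sample rows and the function names (weekly_totals, best_weekday) are invented for illustration and are not taken from any particular system. Daily rows can always be rolled up to weekly totals, but the question "which weekday sells best?" can only be answered while the daily detail still exists.

```python
from collections import defaultdict
from datetime import date

WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

# Hypothetical daily sales rows: (sale date, product, amount).
daily_sales = [
    (date(2009, 6, 1), "oil filter", 120.0),  # a Monday
    (date(2009, 6, 2), "oil filter", 80.0),   # a Tuesday
    (date(2009, 6, 6), "oil filter", 300.0),  # a Saturday
]

def weekly_totals(rows):
    """Roll daily rows up to (ISO year, ISO week, product) totals."""
    totals = defaultdict(float)
    for day, product, amount in rows:
        iso_year, iso_week, _ = day.isocalendar()
        totals[(iso_year, iso_week, product)] += amount
    return dict(totals)

def best_weekday(rows, product):
    """Return the weekday with the highest sales; requires daily detail."""
    by_weekday = defaultdict(float)
    for day, row_product, amount in rows:
        if row_product == product:
            by_weekday[WEEKDAYS[day.weekday()]] += amount
    return max(by_weekday, key=by_weekday.get)
```

If the source system delivered only the output of weekly_totals, there would be nothing left for best_weekday to work with, which is exactly the expectation that needs to be set before the database is designed.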

Data Updates
Another complexity is timing of data refresh. In the previous example, a business user needed a report delivered
each week of the receivables by customer, but what if
the data warehouse only gets data from the accounting
system once a month? In our restaurant, this would be
like ordering milk once a month. That first week the
cake you make with the milk tastes pretty good, like
it is supposed to. Going into weeks two, three, and four,
the milk might not be providing the flavor and texture
expected. In other words, the milk starts going bad and
causes the end product (the cake) to be bad even if all
the other ingredients are fresh and the cake is made and
delivered to the customer in a timely manner.
If you know that milk is only good for a week, you'll
set up weekly deliveries of fresh milk so that the dishes
produced are good. In the same way, reports that are
developed on a weekly basis will be meaningful only if
the data is no more than a week old. If the data can only
be updated monthly, then reports should be produced
from that data only once a month.
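
One way to picture this rule is as a freshness check that runs before a report goes out. The Python below is only a sketch; the function name is_fresh_enough and its parameters are hypothetical.

```python
from datetime import date, timedelta

def is_fresh_enough(last_refresh, report_date, report_interval_days):
    """Return True if the data is no older than one reporting interval."""
    age = report_date - last_refresh
    return age <= timedelta(days=report_interval_days)

# Weekly report, data refreshed a week ago: the milk is still good.
print(is_fresh_enough(date(2009, 6, 1), date(2009, 6, 8), 7))

# Weekly report, data refreshed three weeks ago: the milk has gone bad.
print(is_fresh_enough(date(2009, 6, 1), date(2009, 6, 22), 7))
```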

Data Standardization
Another complexity is standardization and hygiene.
Food orders can help in explaining what happens in the
standardization process.
The key to standardization and hygiene is getting
everything to look the way we expect it to. If we have
filet mignon on the menu, we need to know how much of
exactly what to order. We can't just place an order for
"meat." We need to make sure we are ordering beef, and
we need to make sure we are ordering beef tenderloin
and not strip steak.
What happens if we are ordering our beef from two different suppliers? One supplier may ship us individual filets.
The other may ship us tenderloins that can be carved into
six filets each. When we want to know how many filet
dinners we'll be able to serve at a given time, we need to
know how many filets and tenderloins are on hand and
how they add up to individual filet mignon meals.
In business, we need to know the number of items in an
order unit. Our business may have one supplier that ships
six oil filters per order and another that ships 24 filters per
order. In our data warehouse, we need to recognize how
many orders have been received from each supplier and
apply the appropriate multiplication to know how many
individual oil filters we have received. It is not meaningful to simply say we have received 20 orders, since we
wouldn't necessarily know how many of those orders
were from the first supplier and how many were from
the second. We may have received anywhere between
120 and 480 filters. The proper standardization of our
warehouse data will tell us exactly.
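
The arithmetic in the oil filter example can be sketched in Python as follows. The supplier keys (supplier_a, supplier_b) and the function name total_filters are made up for illustration; the pack sizes of 6 and 24 come from the example above.

```python
# Filters per order for each supplier, as in the example in the text.
FILTERS_PER_ORDER = {"supplier_a": 6, "supplier_b": 24}

def total_filters(orders_by_supplier):
    """Convert order counts per supplier into a count of individual filters."""
    return sum(FILTERS_PER_ORDER[supplier] * orders
               for supplier, orders in orders_by_supplier.items())

# Twenty orders can mean very different things:
print(total_filters({"supplier_a": 20, "supplier_b": 0}))   # all from the first supplier
print(total_filters({"supplier_a": 0, "supplier_b": 20}))   # all from the second supplier
print(total_filters({"supplier_a": 12, "supplier_b": 8}))   # a mix of the two
```

Only when the warehouse records which supplier each order came from can the 20 orders be turned into one exact number of filters.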
Another form of standardization is recognizing that different terms may mean the same thing. In our example
above, we know that one beef tenderloin equals six filets.
If we order one pound of cilantro from one supplier and
one pound of coriander from a second, we know that
we actually have two pounds of cilantro (since cilantro
and coriander are the same thing and we choose to call
both cilantro).
In business, we may have one division that uses the term
"customers" and another that uses the term "consumers."
In our data warehouse, if the business rules specify, we
can know that a customer is the same as a consumer, even
though the different divisions refer to them with different names. Other businesses have both B2B and B2C
models where customers and consumers are different, a
distinction that needs to be known and tracked.

Data Storage (Database)
We need to know where to store our ingredients. They need
to be organized and kept in specific places based on the
type of ingredient, their attributes, and how they are used.
In our kitchen, we need to keep frozen foods in the freezer,
perishables in the refrigerator, and canned goods on the
shelf. We also need to keep like foods together within each
of those storage areas to make it easier to get to them. It
makes sense to keep the cilantro and coriander together in
the same bin and just call it cilantro. That way, it is quick
and easy for our chef to go to one place and find what he
or she is looking for. On the other hand, it would certainly
make life more difficult if sugar and flour were kept in the
same container and the chef had to try to separate it out
each time a cup of one or the other was needed.
In our business example, we want to keep like data
grouped together. It would be very difficult to manage
and access data if we tried to keep all information in one
big table. Imagine if for every sales transaction you had
to list information about the order (such as item sold,
number of units, amount, and date/time) and all the
information about the customer (name, address, phone,
e-mail, previous purchases, lifetime value, value segment,
etc.), as well as store information (including address,
current manager, and inventory levels). Each record
would be huge. We would have so much redundancy
and eventually conflicting information that our data
warehouse would become useless.

It makes more sense to keep information about customers
in one area, store attributes in another, and sales transactions in another. Our sales transaction record would have
the sales information (item ID, units sold, and amount),
with just a store ID and customer ID that can be used to
find out more information about the store or customer
later if needed.
Organizing the data in this manner will help with data
management as well as data retrieval, a key attribute of
data warehouses: the ability to access and retrieve data
(relatively) quickly.
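
A minimal sketch of this layout, using the sqlite3 module from the Python standard library, might look like the following. The table names, column names, and sample rows are all hypothetical; a real warehouse would hold far more attributes. The point is that the sales record carries only IDs, and the customer and store details are looked up when needed.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with separate customer, store, and sales tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (
            customer_id INTEGER PRIMARY KEY, name TEXT, value_segment TEXT);
        CREATE TABLE stores (
            store_id INTEGER PRIMARY KEY, city TEXT, manager TEXT);
        CREATE TABLE sales (
            sale_id INTEGER PRIMARY KEY,
            item_id TEXT, units INTEGER, amount REAL,
            store_id INTEGER REFERENCES stores(store_id),
            customer_id INTEGER REFERENCES customers(customer_id));
        INSERT INTO customers VALUES (1, 'Pat Example', 'gold');
        INSERT INTO stores VALUES (207, 'Ottumwa', 'Lee Example');
        INSERT INTO sales VALUES (1, 'FILTER-01', 2, 19.98, 207, 1);
        INSERT INTO sales VALUES (2, 'FILTER-01', 1, 9.99, 207, 1);
    """)
    return conn

def sales_with_details(conn):
    """Join each sale to its customer and store using the stored IDs."""
    return conn.execute("""
        SELECT s.sale_id, s.item_id, s.units, c.name, st.city
        FROM sales AS s
        JOIN customers AS c ON c.customer_id = s.customer_id
        JOIN stores AS st ON st.store_id = s.store_id
        ORDER BY s.sale_id
    """).fetchall()
```

Because the customer's name and the store's city are each stored once, a change of store manager or customer address is a single update rather than a rewrite of every sales record.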

ETL
In data warehousing, one of the biggest parts of the
development effort is the ETL process. ETL (extract,
transform, and load) refers to getting (extracting) data
from point A (the source system), transforming it (e.g.,
changing euros to U.S. dollars), and loading it into point B
(the correct table within the data warehouse). It is a much
simpler process in our restaurant example than in a real
data warehouse system.
In our restaurant, the extract consists of placing an order
with a vendor. Once the order arrives, we transform it
(cut up the tenderloin into filets) and load it to its proper
location (freezer, refrigerator, or shelf). As it turns out,
data across different business units within a company
can require transformation and manipulation to make it
compatible with the rest of the data within the warehouse.
This is why ETL typically requires the most effort in a data
warehouse development project.
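
The three ETL steps can be sketched as three small Python functions. This is a toy under stated assumptions: the exchange rate is an invented constant, the source system is an in-memory list, and the "warehouse table" is a plain list. Real ETL tools handle many more cases.

```python
from decimal import Decimal

# Assumed exchange rate, for illustration only. A real pipeline would look
# this up from a maintained reference table.
EUR_TO_USD = Decimal("1.40")

def extract(source_rows):
    """Extract: read rows from the source system (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Transform: express every amount in U.S. dollars."""
    converted = []
    for row in rows:
        amount = Decimal(row["amount"])
        if row["currency"] == "EUR":
            amount = (amount * EUR_TO_USD).quantize(Decimal("0.01"))
        converted.append({"order_id": row["order_id"], "amount_usd": amount})
    return converted

def load(rows, warehouse_table):
    """Load: append the transformed rows to the target table."""
    warehouse_table.extend(rows)
    return warehouse_table

source = [
    {"order_id": 1, "amount": "100.00", "currency": "EUR"},
    {"order_id": 2, "amount": "50.00", "currency": "USD"},
]
table = load(transform(extract(source)), [])
```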

Matching
Matching is another key component of data warehousing.
Going back to the beef example, our restaurant may
have several meat suppliers. One supplier sends us "beef
tenderloin," another "filet mignon," another "filets," and
yet another "beef (filet mignon)." We know these all refer
to the same cut of meat, so we decide on one term, "filet
mignon," and call all of these by that single, standard
name. This way we are able to easily track exactly how
much filet mignon we have on hand.
In business, we may receive sales transactions from the
same Home Depot store, but the store name could have
several variations: "Home Depot, Ottumwa, IA," "Home
Depot #207, IA," "Home Depot Store #207, Ottumwa,"
or other variations. If we are attempting to track sales
across the different Home Depot stores, we need to
know that these are all referring to the same store so
that we can appropriately attribute the sale. Commercial
matching software can be configured to help the data
warehouse recognize that all of these refer to the same
store and aggregate the information correctly.
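
Commercial matching software is far more sophisticated than anything that fits here, but the basic idea can be sketched with a few rules in Python. The reference table KNOWN_STORES, the function match_store, and the canonical key format are all hypothetical; the sketch assumes, as the example names suggest, that store #207 is the Ottumwa store.

```python
import re

# Hypothetical reference data: store number mapped to its city.
KNOWN_STORES = {"207": "ottumwa"}

def match_store(raw_name):
    """Return a canonical key such as 'home depot #207', or None if unmatched."""
    text = raw_name.lower()
    if "home depot" not in text:
        return None
    # Rule 1: an explicit store number settles the match.
    number = re.search(r"#(\d+)", text)
    if number and number.group(1) in KNOWN_STORES:
        return f"home depot #{number.group(1)}"
    # Rule 2: otherwise fall back to the city name.
    for store_number, city in KNOWN_STORES.items():
        if city in text:
            return f"home depot #{store_number}"
    return None

variants = [
    "Home Depot, Ottumwa, IA",
    "Home Depot #207, IA",
    "Home Depot Store #207, Ottumwa",
]
canonical = {match_store(name) for name in variants}
```

All three variants collapse to a single key, so sales recorded under any of them can be attributed to the same store.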

Data Marts
Typically, a data mart contains summarized (or aggregated) data relevant to a particular subject area such as
marketing or sales. In our kitchen, a data mart would be
like food that is partially pre-made to expedite completion of the dish.
Picture one of those fast food Chinese restaurants where
you can choose rice or noodles, then one or several main
courses such as orange chicken, garlic chicken, or beef
and broccoli. The rice and noodles have already been
cooked and are ready to dish, as are the main courses.
This is how the data mart works. You have your raw
ingredients (raw data) stored in the kitchen in the various
storage areas (freezer, refrigerator, or shelf) just like
the data in the tables in the data warehouse. You then
partially prepare the food (cook the rice or make the
orange chicken), much as you would aggregate the data
for the data mart (sum up all the sales by customer or
calculate total parts sold per time period).
When you want to prepare the final dish (orange chicken
on rice), you can quickly scoop the two ingredients
together on a plate. In the case of the data mart, you can
simply select the appropriate time period and see how
many parts were sold without having to go to the data
warehouse and select each and every individual transaction (where some may have been sales transactions, some
were order corrections, and some were returns).
The data has already been prepared, so you know that
when you ask for net parts sold during a time period, the
mart has already applied all the necessary logic to the raw
data to present you with the right answer.
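
As a rough sketch of that pre-cooking step, the Python below builds a tiny "mart" of net parts sold per period from raw transactions. The sample rows, the transaction type names, and the sign conventions (returns are stored as positive units and subtracted; corrections carry their own sign) are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical warehouse rows: (period, transaction type, units).
transactions = [
    ("2009-05", "sale", 10),
    ("2009-05", "return", 2),
    ("2009-05", "correction", -1),
    ("2009-06", "sale", 7),
]

def build_parts_mart(rows):
    """Pre-compute net parts sold per period from raw transactions."""
    net = defaultdict(int)
    for period, kind, units in rows:
        if kind == "return":
            net[period] -= units   # returns reduce the net count
        else:
            net[period] += units   # sales add; corrections carry their own sign
    return dict(net)

parts_mart = build_parts_mart(transactions)
# Asking for a period is now a simple lookup, with the logic already applied.
print(parts_mart["2009-05"])
```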

Reporting
Consider the dishes delivered to the customers at their
tables. The dishes are analogous to the presentation (i.e.,
reporting) layer in a data warehousing environment.
They are the end product. They are what are produced
using our raw ingredients as inputs.
The dishes are ordered by the customers based on choices
from the menu. The menu is not infinite. It has a set
selection from which the customers can choose, because
the kitchen cannot possibly stock all the ingredients necessary to produce any dish that a customer might desire.
Rather, the kitchen is stocked with the raw ingredients
needed to produce any of the items listed on the menu.
When a chef receives an order for veal parmesan, he or
she knows that the necessary ingredients are available in
the kitchen and can find those ingredients and produce
the dish in a timely manner.
Similarly, in our data warehouse environment, the end
users have been identified and their reporting needs captured. These reports are like the menu items. Just as the
menu items require certain raw ingredients, the reports
require certain data. Since the report specifications are
known before the warehouse is built (if the process works
as it should), we can be confident that the needed data is
in the warehouse and available for each report.
Standard Reports
The reporting environment often includes a set of
standard reports. It is useful in many cases for a business
unit to receive the same report every Monday morning
showing sales for the previous week, day, or other time
period, depending on the business need. The point is
that the report arrives at the expected time on a predetermined frequency containing the most recent information.
This is like having the same meal prepared and picked
up or delivered on a regular schedule. Maybe you like to
treat yourself to a favorite meal every Friday for lunch
and have it delivered. You talk to the restaurant, let them
know how you would like the meal prepared, and ask
them to deliver it to your office every Friday at noon. You
get the same food every week and it is prepared with
fresh ingredients each time.
Configurable Reports
Sometimes when reading a menu, you like a particular
dish but would like to exchange one of the ingredients or
side dishes. You might ask the server for soup instead of
salad, or chips instead of fries. Reporting can operate in a
similar fashion.
Reporting environments will often provide a list of standard reports that an end user can select to view. However,
the end user may want to vary that report slightly. For
example, a particular report may show sales by region
by year for the last 10 years, but the end user would
prefer sales by month over the last 12 months. It is often
possible to make reports like this configurable, where
the end user can select some of these parameters (such
as time period or region) but the basic structure of the
report and the supporting data remain the same.
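
A configurable report can be thought of as one function with a few parameters. In the Python sketch below, the sample rows, the function name sales_report, and its parameters (period, region) are hypothetical; the structure of the report stays fixed while the user's choices change what is shown.

```python
from collections import defaultdict

# Hypothetical sales rows: (ISO date string, region, amount).
sales = [
    ("2008-11-03", "East", 100.0),
    ("2009-01-15", "East", 40.0),
    ("2009-01-20", "West", 60.0),
    ("2009-02-02", "East", 25.0),
]

def sales_report(rows, period="year", region=None):
    """Total sales by period; `period` is 'year' or 'month', `region` is optional."""
    # An ISO date starts with YYYY (4 characters) or YYYY-MM (7 characters).
    width = {"year": 4, "month": 7}[period]
    totals = defaultdict(float)
    for day, row_region, amount in rows:
        if region is None or row_region == region:
            totals[day[:width]] += amount
    return dict(sorted(totals.items()))

by_year = sales_report(sales)
east_by_month = sales_report(sales, period="month", region="East")
```

The same rows feed both calls; only the parameters differ, much like asking for soup instead of salad with the same main dish.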

Ad Hoc Reporting
A variation of reporting is the buffet. A buffet has a lot of
food, but not infinite options. It presents the food items
the restaurant has decided the customers are most likely
to want, leaving the customer free to pick and choose
exactly which of those items they wish.
The business analogy to the buffet is ad hoc reporting.
End users can pick and choose which data they would
like on their reports, but they have a finite amount of
data to choose from. However, end users can choose
and combine the data any way they want. The data that
has been made available in the warehouse was based on
gathering information requirements from the end users
and finding the sources of data to put into the warehouse
so the end users can access it. The available data should
serve most or all of the business needs of the end users.
One caution is that the users should have some familiarity with the data and the data structure so they don't end
up selecting data that would be the equivalent of putting
mustard on an ice cream sundae.
Dashboards and Scorecards
In some ways, dashboards and scorecards are like
standard reports, but they tend to present information
in summarized, easy-to-read, graphical formats. For
example, you may have an internal Web site for your
company that displays several charts and graphs showing
key pieces of current information. There may be a graph
showing quarter-to-date sales and how they compare
to the goal. There may be another graph showing
month-to-date profitability by region presented in a
pleasing-to-the-eye, consumable format.
When I think about eye-pleasing, consumable items in
a restaurant, the dessert cart comes to mind. What is
typically shown on a dessert cart is that day's current
selection of desserts in single-serving portions. You can
quickly and easily see what is there without having to read a
menu; you can quickly zero in on the item that is of most
interest to you. All of the desserts have been prepared and
are ready to be served, just as the results on a dashboard
or scorecard have already been calculated. The same
information is probably available in one of the other
standard reports, just as the desserts are probably listed in
the menu. However, the dessert cart, like the dashboard,
presents interesting information in a format that is very
quick to see (visualize) and comprehend.

Summary
It is a little surprising how closely the process of producing meals at a restaurant resembles the processes in a
data warehousing environment. When people you know
are thinking about the components of a data warehouse
solution, this restaurant analogy should be a good
way to help them keep the components and processes
straight and provide a clearer picture of what is going on
in your solution.
Of course, no analogy is perfect, but this one does a
good job of providing an easy-to-understand overview
of what could be a whole new environment for those
new to the technology.
