Chapter I
Business Intelligence
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
Fig. 1.1 The BI pyramid (top to bottom): Data Mining; OLAP; Queries & Reports; Data Warehouse
Many authors speak of BI as being an umbrella term, with various components hanging under this umbrella.
Another way of looking at it is as a pyramid: a Data Warehouse at the base, Queries & Reports and OLAP in the middle layers, and Data Mining at the top.
1.3 History of BI
Up to this point, we have agreed on Business Intelligence as being an umbrella that covers a whole range of concepts.
It is clear that BI has somehow evolved from other concepts. Therefore, when exploring the history of Business
Intelligence, it seems wise to take a look at what preceded Business Intelligence.
The problem with topics such as Business Intelligence, Decision Support Systems and many other acronyms with the
S standing for System is that they are all part of a terribly volatile field. Much has been written about Information
and Support Systems; authors have filled tomes describing the existing systems: how they work, how they should
be built, what the requirements are, and so forth. Unfortunately, little to nothing is written on the history and
development of these systems. What we would have to do is take all these writings, lay them out next to each other,
and compare them. Consider the overview given in the figure below.
Fig. 1.2 A timeline of BI-related technologies, 1975-1998: from transaction systems, relational databases, spreadsheet
software and personal computing, via Executive Information Systems (EIS), Decision Support Systems (DSS), ad hoc query
tools, multidimensional databases, extract files and financial reporting systems, to data warehouses, OLAP, reporting
systems, data mining, analytic applications, Customer Information Files (CIFs), marketing databases, demographic data
providers, match/merge services, the World Wide Web, web analytics, enterprise information portals, personalisation,
closed-loop CRM and Customer Resource Management.
(Source: www.few.vu.nl/en/Images/werkstuk-quarles_tcm39-91416.doc)
The information that is most volatile is what we read on the Internet. Whereas up to about ten years ago authors
wrote their findings down in books and journals, nowadays the easier, faster, cheaper and more accessible way
of publishing is on the World Wide Web. The problem with this medium, however, is that a web page has to be
maintained and updated regularly to keep it and its topics alive. When this does not happen, pages get lost, are
wiped away, or simply contain information that is out of date.
The Database Magazine (also known as DB/M) proved a valuable source of information. DB/M is pinpointed as
a magazine not to be missed by anyone interested in BI. Although it has been published since 1990, it was not
until 1997 that BI received the attention of the authors of DB/M.
Within this framework of CRM, BI is no longer only used by management levels, but BI-tools and techniques are
developed for all organisational levels.
At various points in the report we will see how Business Intelligence can influence CRM. To give a brief example
up front, BI can be used to identify what is called customer profitability: which customer profiles are responsible
for the highest profit? Based on the answer to this question, a company can choose to change their strategy and, for
instance, make special offers to certain customer groups.
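A minimal sketch of such a customer-profitability analysis in Python; the customer segments, revenues and costs below are invented for illustration, and a real analysis would of course run against the order data in the warehouse:

```python
from collections import defaultdict

# Hypothetical order records: (customer_segment, revenue, cost).
orders = [
    ("young urban", 120.0, 90.0),
    ("young urban", 80.0, 60.0),
    ("families", 200.0, 150.0),
    ("seniors", 50.0, 55.0),
    ("families", 150.0, 100.0),
]

def profit_by_segment(orders):
    """Aggregate profit (revenue minus cost) per customer segment."""
    totals = defaultdict(float)
    for segment, revenue, cost in orders:
        totals[segment] += revenue - cost
    # Rank segments from most to least profitable.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for segment, profit in profit_by_segment(orders):
    print(f"{segment}: {profit:+.2f}")
```

The ranked output is exactly the answer to "which customer profiles are responsible for the highest profit?", and a segment with negative profit (here the invented "seniors") is a candidate for a changed strategy.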
Fig. 1.3 A 3-dimensional OLAP cube, with the dimensions style, month and outlet
(Source: www.few.vu.nl/en/Images/werkstuk-quarles_tcm39-91416.doc)
The following three cubes show us how we can look at, respectively: data on all shoe styles sold in all months in
the outlet Amsterdam, data on shoe style sneaker sold in all months in all outlets, and data on all shoe styles sold
in all outlets in the month April.
le
When we combine these three dimensions, we get data on the number of sneakers sold in the outlet Amsterdam in
the month April:
Fig. 1.4 The cube cell at the intersection of sneaker, Amsterdam and April, containing the value 250: the number of
sneakers sold in Amsterdam in April.
If we wanted information about the colours of the sneakers or the sizes sold, we would have to define new
dimensions, giving a 4-, 5- or even higher-dimensional cube. Such cubes can of course no longer be visualised,
but in an OLAP application they are possible.
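The slicing described above can be sketched with a plain Python dictionary keyed by the three dimensions. The figures are invented, apart from the 250 sneakers sold in Amsterdam in April, which echoes the example in the text:

```python
# A tiny 3-dimensional "cube" stored as a dict keyed by
# (style, outlet, month) -> units sold.
cube = {
    ("sneaker", "Amsterdam", "April"): 250,
    ("sneaker", "Amsterdam", "May"):   180,
    ("sneaker", "Rotterdam", "April"): 140,
    ("boot",    "Amsterdam", "April"):  90,
}

def slice_cube(cube, style=None, outlet=None, month=None):
    """Return the sub-cube where every given dimension value matches."""
    return {
        key: value for key, value in cube.items()
        if (style is None or key[0] == style)
        and (outlet is None or key[1] == outlet)
        and (month is None or key[2] == month)
    }

# A slice fixes one dimension: all styles, all months, outlet Amsterdam.
amsterdam = slice_cube(cube, outlet="Amsterdam")
# Fixing all three dimensions addresses a single cell.
cell = slice_cube(cube, style="sneaker", outlet="Amsterdam", month="April")
print(sum(amsterdam.values()))   # total units sold in Amsterdam
print(cell)
```

Fixing one dimension gives the three slices described in the text; fixing all three gives the single cell of Fig. 1.4. An OLAP server does the same thing, only against far larger and higher-dimensional data.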
1.8 FASMI
If we go back in time a few decades we come across Dr. E.F. Codd, a well-known database researcher during the
60s, 70s and 80s. In 1993, Dr. Codd wrote a report titled: Providing OLAP (On-Line Analytical Processing) to
User-Analysts: An IT Mandate, in which he defined OLAP in 12 rules. These rules make up the requirements that
an OLAP application should satisfy. A year later, Nigel Pendse and his co-author Richard Creeth became increasingly
occupied by the phenomenon OLAP. After a critical study of the rules of Dr. Codd, some were discarded and others
lumped together in one feature, and a new definition of OLAP was born:
Fast Analysis of Shared Multidimensional Information (FASMI)
In a later article they go on to describe exactly what they mean by the five separate words that make up this
definition:
Fast means that the system is targeted to deliver most responses to users within about five seconds, with the
simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis means that the system can cope with any business logic and statistical analysis that is relevant for the
application and the user, and keep it easy enough for the target user.
Shared means that the system implements all the security requirements for confidentiality (possibly down to cell
level) and, if multiple write access is needed, concurrent update locking at an appropriate level.
Multidimensional means that the system must provide a multidimensional conceptual view of the
data, including full support for hierarchies and multiple hierarchies, as this is certainly the most logical
way to analyze businesses and organisations.
Information is all of the data and derived information needed, wherever it is and however much is
relevant for the application.
Nigel Pendse declares that this definition was first used by him and his company in early 1995, and that it has not
needed revision in the years since. He states that the definition has now been widely adopted and is cited in over
120 Web sites in about 30 countries. Research with the help of Google revealed there to be 34 countries with one
or more Web site(s) containing the term FASMI. A total of 21 countries host one or more Web site(s) that write
about FASMI in combination with The OLAP Report. The term is widely and globally used. Striking, next to the
mostly English-language sites, is the large number of German (university) sites that include the terms.
We can conclude some points from the history of OLAP:
Multidimensionality is here to stay. Even hard-to-use, expensive, slow and elitist multidimensional products
survive in limited niches; when these restrictions are removed, the market booms. We are about to see the biggest-ever
growth of multidimensional applications.
End-users will not give up their general-purpose spreadsheets. Even when accessing multidimensional databases,
spreadsheets are the most popular client platform. Multidimensional spreadsheets are not successful unless they
can provide full upwards compatibility with traditional spreadsheets, something that Improv and Compete
failed to do.
Most people find it easy to use multidimensional applications, but building and maintaining them takes a
particular aptitude which has stopped them from becoming mass market products. But, using a combination
of simplicity, pricing and bundling, Microsoft now seems determined to prove that it can make OLAP servers
almost as widely used as relational databases.
8/uts
Multidimensional applications are often quite large and are usually suitable for workgroups, rather than
individuals. Although there is a role for pure single-user multidimensional products, the most successful
installations are multi-user, client/server applications, with the bulk of the data downloaded from feeder systems
once rather than many times. There usually needs to be some IT support for this, even if the application is
driven by end-users.
Simple, cheap OLAP products are much more successful than powerful, complex, expensive products. Buyers
generally opt for the lowest cost, simplest product that will meet most of their needs; if necessary, they often
compromise their requirements. Projects using complex products also have a higher failure rate, probably
because there is more opportunity for things to go wrong.
Application              Description
…                        Mostly found in consumer goods industries, retailers and the financial services industry.
Database marketing       Determine who are the best customers for targeted promotions for particular products or services.
Financial reporting      …
Management reporting     Using OLAP-based systems one is able to report faster and more flexibly, with better analysis than the alternative solutions.
Profitability analysis   …
Quality analysis         OLAP tools provide an excellent way of measuring quality over long periods of time and of spotting disturbing trends before they become too serious.
Table 1.2 OLAP application areas
Finding patterns
The idea of Data Mining (DM) is to discover patterns in large amounts of data. Whereas query and even OLAP
functions require human interaction to follow relationships through a data source, data mining programs are able to
derive many of these relationships automatically by analysing and learning from the data values contained in files
and databases (Lewis, 2001). The patterns found in the data can provide information that cannot directly be deduced
from the data itself: patterns and connections that are not straightforward. These invisible patterns are not always
logical or useful.
For instance, for a supermarket chain that operates in several different countries, DM might show that the sales of
yogurt in America are strongly correlated with the sales of bicycles in the UK. Naturally this is a coincidental
connection. But if DM reveals that customers who buy Product X most of the time also purchase Products Y and Z,
it is a very valuable tool for management to help in strategic decision making. Products X, Y and Z
could be placed on shelves located close to each other, or management could choose to make special offers for
these three products at the same time, to increase sales in a short time.
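The Product X, Y, Z example can be quantified with the two standard measures for such co-purchase rules: support (how often the products appear together at all) and confidence (how often buyers of X also take Y and Z). The baskets below are invented for illustration:

```python
# Hypothetical market baskets, one set of products per checkout.
baskets = [
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"X", "Y", "Z"},
    {"X"},
    {"Y", "Z"},
]

def rule_stats(baskets, antecedent, consequent):
    """Support and confidence for the rule antecedent -> consequent."""
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / n                       # fraction of all baskets
    confidence = both / ante if ante else 0.0  # fraction of X-buyers
    return support, confidence

support, confidence = rule_stats(baskets, {"X"}, {"Y", "Z"})
print(f"support={support:.2f} confidence={confidence:.2f}")
```

With these invented baskets the rule X -> {Y, Z} holds in 40% of all baskets, and half of the customers who buy X also buy Y and Z; a data mining tool searches for such rules automatically across millions of baskets.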
Actually, there is nothing new about looking for patterns in data. People have sought patterns in data ever
since human life began. Hunters seek patterns in animal migration behaviour, farmers seek patterns in crop growth,
and politicians seek patterns in voter opinion. A scientist's job is to make sense of data, to discover the patterns that
govern how the physical world works and to encapsulate them in theories that can be used for predicting what will
happen in new situations. The entrepreneur's job is to identify opportunities, that is, patterns in behaviour that can
be turned into a profitable business, and to exploit them.
1.10.1 The Data Mining Process
A quite general view of the Data Mining process is the one offered by Van der Putten (1999):
Fig. 1.5 The data mining process (Van der Putten, 1999): Business Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation and Deployment.
Step                      Description
Business Understanding    …
Data Understanding        Collecting the initial data, describing and exploring these data and verifying its quality.
Data Preparation          …
Modelling                 …
Evaluation                Evaluating the results, reviewing the process and determining the next steps.
Deployment                Plan deployment, plan monitoring and maintenance, producing the final report and reviewing the project.
Table 1.3 The steps of the data mining process
Solution quality
Speed
Solution comprehensibility
Expertise required
In some cases a DM-tool that provides answers very quickly may be preferred, no matter what the quality of the
solution is. In other cases one might want a solution of very high quality, but if this means that the solution
becomes totally incomprehensible, one will have no use for it.
On which page of the web site do visitors enter / leave the site?
How many visitors fill their shopping cart but leave the site without making a purchase?
An article by Carine Joosse (2000) gives a short but interesting description of the different ways of applying data
mining to the Internet. The first is mining the Web itself. An example of this is collecting data from various sites and
categorising, analysing and presenting them on new web pages for the benefit of the web visitor. Another example
concerns search engines on the Web: by registering hits on a word, phrase or synonym, grouping them into categories
and keeping a history, the search engine could be made more powerful. The data mining elements in this are making
predictions, trend analysis, categorising and data reduction.
A second type of Web mining is Web usage mining. The goal of web usage mining is analysing the site navigation:
how do visitors click through the site, how much time do they spend on which part (page) of the site, and at which point
do they enter or leave the site? This form of analysis is also referred to as Clickstream Analysis. Just as important is
to keep records of which visitors finally make a purchase, which visitors start making a purchase (that is, start filling
their virtual shopping cart) but do not buy in the end, and which visitors leave the site without making a purchase.
By combining all these data with the registered customer profiles it is possible to define the types of customers
that are most likely to purchase over the Internet. These customer profiles, in connection with behaviour on the
Web site, can also be used to see if the site should be designed differently.
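A minimal sketch of such a clickstream analysis, assuming sessions have already been reconstructed from the web server log; the page names and sessions below are invented:

```python
from collections import Counter

# One (list of visited pages, purchased?) pair per visitor session.
sessions = [
    (["home", "shoes", "cart", "checkout"], True),
    (["home", "shoes", "cart"], False),          # filled cart, did not buy
    (["search", "shoes"], False),
    (["home", "sale", "cart", "checkout"], True),
]

# On which page do visitors enter and leave the site?
entry_pages = Counter(pages[0] for pages, _ in sessions)
exit_pages = Counter(pages[-1] for pages, _ in sessions)

# How many visitors fill their cart but leave without a purchase?
carted = [s for s in sessions if "cart" in s[0]]
abandoned = sum(1 for pages, bought in carted if not bought)
abandonment_rate = abandoned / len(carted)

print("entries:", dict(entry_pages))
print("exits:", dict(exit_pages))
print(f"cart abandonment: {abandonment_rate:.0%}")
```

These few counters already answer the questions posed earlier in the chapter; joining the sessions with registered customer profiles would be the next step.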
While most authors ascribe the Web Mining tool Clickstream Analysis to the Data Mining field, Nigel Pendse says in
his OLAP Report that it is one of the latest OLAP applications (Pendse, 2001). He also adds Database Marketing
to his list of OLAP applications. In his opinion, determining who the preferred customers are can be done with
brute force data mining techniques (which are slow and can be hard to interpret), or by experienced business users
investigating hunches using OLAP cubes (which is quicker and easier). In other words, here we encounter once
again the vague boundaries that exist between the concepts within Business Intelligence!
Web mining applications of a more advanced level are personalisation and multichannel-analysis. Personalisation
happens when rules are activated in order to offer personalised content to the visitor. A danger in this application is
that the information is not always fully reliable, in the sense that the visitor cannot be categorised correctly. When
individual visitors make use of a large company network, for example, they will not be recognised as separate
visitors. What Multichannel-analysis comes down to is anticipating the behaviour, wishes and possibilities of the
customer in the use of different communication channels.
Business Intelligence: Business Intelligence (BI) is a broad category of applications and technologies for gathering,
storing, analyzing, and providing access to data to help enterprise users make better business decisions.
Decision Support System: A decision support system (DSS) is a computer program application that analyzes business
data and presents it so that users can make business decisions more easily.
Table 1.4 BI vs. DSS definition
The key similarity in these two definitions is making business decisions; in particular, both concepts are
focused on helping to make these decisions in a better and easier way. The other important similarity is that both
involve decision making based on data.
The way Dekker (2002) looks at it is that Data Warehousing and Data Mining have two precursors: DSS and EIS.
DSS is focused on the lower and middle management and makes it possible to look at and analyze data in different
ways. EIS is the precursor focused on the higher management. Given the fact that Data Warehousing and Data
Mining form a large part of Business Intelligence, we could indeed see DSS as the precursor of BI.
The following (Alter, 1999) fully supports Eiben's theory about BI replacing data-driven decision support: a
number of approaches developed for supporting decision making include online analytical processing (OLAP) and
data mining. The idea of OLAP grew out of difficulties analyzing the data in databases that were being updated
continually by online transaction processing systems.
When the analytical processes accessed large slices of the transaction database, they slowed down transaction
processing critical to customer relationships. The solution was periodic downloads of data from the active transaction
processing database into a separate database designed specifically to support analysis work. This separate database
often resides on a different computer, which together with its specialised software is called a data warehouse.
What Alter points out here is that, because of the difficulties of analysing the data to support decision making, the
data are duplicated in a Data Warehouse, on top of which OLAP and Data Mining can be applied without disturbing
transaction processing. In other words, the components that make up Business Intelligence are replacing the
old-fashioned way of performing data-driven decision support on the original transaction processing systems.
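Alter's periodic download can be sketched with SQLite standing in for both systems. The table and column names are invented, both databases are in-memory, and a real warehouse load would of course involve far more transformation:

```python
import sqlite3

# The live transaction-processing (OLTP) database.
oltp = sqlite3.connect(":memory:")
oltp.execute("CREATE TABLE orders (id INTEGER, product TEXT, amount REAL)")
oltp.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "sneaker", 59.95), (2, "boot", 89.00), (3, "sneaker", 59.95)])

# The separate database designed specifically to support analysis work.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales_facts (product TEXT, amount REAL)")

# The periodic extract: read from the OLTP system once and load in bulk,
# so analytical queries never touch the transaction-processing database.
rows = oltp.execute("SELECT product, amount FROM orders").fetchall()
warehouse.executemany("INSERT INTO sales_facts VALUES (?, ?)", rows)

total, = warehouse.execute("SELECT SUM(amount) FROM sales_facts").fetchone()
print(f"total sales staged in the warehouse: {total:.2f}")
```

All subsequent analysis queries run against `sales_facts`, leaving `orders` free to serve the transactions that are critical to customer relationships.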
Turban & Aronson (2001) write that the term Business Intelligence (BI) or Enterprise Systems is used to describe the
new role of the Executive Information System, especially now that data warehouses can provide data in easy-to-use,
graphics-intensive query systems capable of slicing and dicing data (Q&R) and providing active multi-dimensional
analysis (OLAP).
Simon & Shaffer (2001) find the following classification of business intelligence applications to be useful:
Queries and reports (Q&R)
Online analytical processing (OLAP)
Data mining
Executive information systems (EISs)
Why do they include EISs as an application amongst Q&R, OLAP and DM? Don't EISs already have some form
of Q&R and even OLAP-like activities in them? One thing is certain: Turban & Aronson and Simon & Shaffer will
not agree on a definition of BI. The first duo says that BI replaces EIS; the second includes EIS in BI.
With BI-tools it is possible to carry out analyses and reports on virtually all conceivable aspects of the underlying
business, as long as the data about this business come in large amounts and are stored in a Data Warehouse.
Departments known to benefit most from Business Intelligence are (Database) Marketing, Sales, Finance,
ICT (especially the Web) and higher Management.
Recall, from the chapter about Queries & Reports, the remark about Q&R not being as far away from our daily line of
work as it may seem. A very good example of this was the SPSS Report Writer, which is tightly integrated with SPSS.
Another BI-tool integrated with an application many of us use daily is Business Intelligence for Excel, offered by
Business Intelligence Technologies, Inc. This tool, also called BIXL, differs from other BI-tools in this respect: the
product delivers to an end-user's Excel spreadsheet data that can be used for analytical and reporting purposes, from
Microsoft's Analysis Services (and other OLEDB for OLAP cube providers), and adds all-important write-back
capabilities for planning (and budgeting and forecasting) tasks.
The output of a CI-process is Actionable Intelligence, abbreviated by Voorma as AI, not to be confused with the
widely accepted abbreviation of Artificial Intelligence! Actionable Intelligence is the action-focused (actionable)
knowledge, the intelligence that stimulates changes in an organisation's strategy.
For the rest, it is not entirely clear from Voorma's article what the added value of CI is. He names a few initial
steps, like:
Collecting data
Summary
BI is a term introduced by Howard Dresner of Gartner Group in 1989.
Many authors speak of BI as being an umbrella term, with various components hanging under this
umbrella.
BI consists of various levels of analytical applications and corresponding tools that are carried out on top of a
Data Warehouse.
Data warehouse is a collection of integrated, subject-oriented databases designed to support the DSS function,
where each unit of data is specific to some moment of time.
The data warehouse contains atomic data and lightly summarised data.
Data Mart is a part of a Data Warehouse, specifically concentrated on a part of the business, like a single
department.
OLAP is a technology that allows users to carry out complex data analyses with the help of a quick and interactive
access to different viewpoints of the information in data warehouses.
Multidimensional applications are often quite large and are usually suitable for workgroups, rather than
individuals.
Simple, cheap OLAP products are much more successful than powerful, complex, expensive products.
Data mining is the use of data analysis tools to try to find the patterns in large transaction databases.
Data mining is analysis of large pools of data to find patterns and rules that can be used to guide decision making
and predict future behaviour.
The idea of Data Mining (DM) is to discover patterns in large amounts of data.
An area of growing importance for companies trying to sell their products is e-commerce.
With BI-tools it is possible to carry out analyses and reports on virtually all thinkable aspects of the underlying
business.
Actionable Intelligence is the action-focused (actionable) knowledge, the intelligence that stimulates changes
in an organisation's strategy.
References
Business Intelligence The Beginning, [Online] Available at: <http://www.few.vu.nl> [Accessed 25 April 2012].
Pechenizkiy, M., 2006. Lecture 2 Introduction to Business Intelligence, [Online] Available at: <http://www.win.tue.nl/~mpechen/courses/TIES443/handouts/lecture02.pdf> [Accessed 27 April 2012].
Biere, M., 2003. Business Intelligence for the Enterprise, Prentice Hall Professional Publication.
2010. What is Business Intelligence?, [Video Online] Available at: <http://www.youtube.com/watch?v=0aHtHljcAs> [Accessed 27 April 2012].
Recommended Reading
Becerra-Fernandez, I. & Sabherwal, R., 2010. Business Intelligence, John Wiley & Sons Publication.
Howson, C., 2007. Successful Business Intelligence, Tata McGraw-Hill Education Publication.
Whitehorn, M., 1999. Business Intelligence: The IBM Solution, Springer Publication.
Self Assessment
1. Which of the following statements is false?
a. The data warehouse contains atomic data and lightly summarised data.
b. Data warehouse is designed to support the DSS function.
c. Data warehouse is unable to deal with very large amount of data.
d. A data warehouse is a database, with reporting and query tools.
2. __________ is a replication of the data existing in the operational databases.
a. Data warehouse
b. Data Mart
c. DBMS
d. Database
3. Which of the following processes is not included while creating a data warehouse?
a. Extraction
b. Manipulation
c. Transformation
d. Loading
4. _________ is the special-purpose computer language used to provide immediate, online answers to user
questions.
a. Report
b. OLAP
c. Extraction
d. Query
5. _______ is a technology that allows users to carry out complex data analyses with the help of a quick and
interactive access to different viewpoints of the information in data warehouses.
a. OLAP
b. OLATP
c. OLAP-tools
d. OLEDB
6. ________ is a program that makes it comparatively easy for users or programmers to generate reports by
describing specific report components and features.
a. Report
b. OLAP
c. Extraction
d. Query
7. _________is analysis of large pools of data to find patterns and rules that can be used to guide decision making
and predict future behaviour.
a. Data extraction
b. Data warehouse
c. Data mining
d. Data manipulation
8. Which of the following processes is not included in the data mining process?
a. Evaluation
b. Abstraction
c. Modelling
d. Deployment
9. Which of the following systems is not a decision support system?
a. Model-driven
b. Data-driven
c. User-driven
d. System-driven
10. The output of a CI-process is_____________.
a. Business Intelligence
b. Artificial Intelligence
c. Actionable Intelligence
d. Competitive Intelligence
Chapter II
Components of Business Intelligence Tools
Aim
The aim of this chapter is to:
Objectives
The objectives of this chapter are to:
Learning outcome
At the end of this chapter, you will be able to:
2.1 Introduction
Business intelligence is not business as usual. It's about making better decisions more easily and making them more
quickly. Businesses collect enormous amounts of data every day: information about orders, inventory, accounts
payable, point-of-sale transactions, and of course, customers. Businesses also acquire data, such as demographics
and mailing lists, from outside sources. Unfortunately, based on a recent survey, over 93% of corporate data is not
usable in the business decision-making process today.
Consolidating and organising data for better business decisions can lead to a competitive advantage, and learning
to uncover and leverage those advantages is what business intelligence is all about. The amount of business data is
increasing exponentially. In fact, it doubles every two to three years. More information means more competition.
In the age of the information explosion, executives, managers, professionals, and workers all need to be able to
make better decisions faster.
IBM Business Intelligence solutions are not about bigger and better technology; they are about delivering more
sophisticated information to the business end user. BI provides an easy-to-use, shareable resource that is powerful,
cost-effective and scalable to our needs. Much more than a combination of data and technology, BI helps us create
knowledge from a world of information. Get the right data, discover its power, and share the value: BI transforms
information into knowledge. Business Intelligence is the practice of putting the right information into the hands
of the right user at the right time to support the decision-making process.
How do we currently monitor the key or critical performance indicators of our business?
How easily can we answer ad hoc questions with our current reporting systems?
Do we have to wait a long time (hours? days?) for answers to new questions?
Depending on the executive's responses, there are certain needs that, if they surface in his answers, identify the
executive as a BI project prospect. The answers to the previously mentioned questions would point to the following
if he is a candidate:
Dissatisfaction is exhibited with the current reporting systems, especially in terms of flexibility, timeliness,
accuracy, detail, consistency, and integrity of the information across all business users.
Many people in the organisation spend a lot of time re-keying numbers into spreadsheets.
The senior executive is very vague about how key performance indicators are monitored.
Do end users often ask IT to produce queries, reports, and other information from the database?
Do end users frequently re-key data into spreadsheets or word processing packages?
Does our production system suffer from a heavy volume of queries and reports running against the system?
Would we like to see our end users receiving more business benefits from the IT organisation?
The IT staff is a data warehousing prospect if the answers point to problem areas, such as:
End users are relying on IT to perform most or all ad hoc queries and reports.
End users have to re-key data into their spreadsheets on a regular basis.
IT identifies end user dissatisfaction with the current reporting systems and processes.
IT has a large backlog built up of end user requests for queries and reports.
IT is concerned about end user queries and reports that are bogging down the production systems.
How are our monthly management reports and budgets delivered and produced?
Do we spend more time preparing, consolidating, and reporting on the data, or on analyzing performance that
is based on what the data has highlighted?
Do all the companys executives and managers have a single view of key information to avoid inconsistency?
How easy is it to prepare budgets and forecasts, and then to disseminate that critical information?
Can we easily track variances in costs and overhead by cost center, product, and location?
Is the year-end consolidation and reporting cycle a major amount of duplicated effort in data preparation and
validation, and then in consolidation reporting?
The financial staff is a data warehousing prospect if the answers given to these questions are like these:
Personnel like using spreadsheets, but they usually or often need to re-key or reformat data.
They indicate in any way that their preferred reporting tool would be a spreadsheet if they did not have to
constantly re-key great amounts of numbers into them.
They admit that much time is spent in the production of reports and the gathering of information, with less time
actually spent analyzing the data, and they can identify inconsistencies and integrity issues in the reports that
have been produced.
Budget collection is a painful and time-consuming process, and there is very little control available in the
collection and dissemination process.
The monthly management reports involve too much time and effort to produce and circulate, and do not easily allow
queries and analysis to be run against them.
Management information does not go into sufficient detail, especially in terms of expense control and overhead
analysis.
How do we perform ad hoc analysis against our marketing and sales data?
How do we monitor and track the effectiveness of a marketing or sales promotion program?
Do we have to wait a long time (days? weeks?) for sales management information to become available at month
or quarter-end?
Are we and our staff using spreadsheets a lot, and re-keying great amounts of data?
Current reporting is very static and ad hoc requests must be accomplished through IT.
Profitability versus volume and value cannot be easily analyzed, and the measurement of data is inconsistent;
for example, there might be more than one way of calculating margin, profit, and contribution.
There is no concept of re-planning and re-budgeting as it is too difficult to accomplish with the current
systems.
Suppliers cannot be provided with timely information, so it is very difficult to achieve reviews of their
performance.
Getting down to the right level of detail is impossible: for example, to the SKU level in a retail store.
General dissatisfaction is expressed with the current process of information flow and management.
How is the validity of the MRP model checked and how accurate do we think it really is?
How do we handle ad hoc analysis and reporting for raw materials, on-time, and quality delivery?
How do we handle shipments and returns, inventory control, supplier performance, and invoicing?
New projects cannot easily be costed out, and trends in quality, efficiency, cost, and throughput cannot be
analyzed.
The preferred access to information would be via a spreadsheet or an easy-to-use graphical user interface.
Currently there is access to daily information only, which means much re-keying into spreadsheets for trending
analysis and so on is required.
The MRP model cannot easily be checked for accuracy and validity on a constant basis.
Data Mining
Data Warehouse
ODS
Drill down
OLTP
OLTP Server
OLAP
Data Mart
Data Visualisation
Meta Data
Subject-oriented: Data that gives information about a particular subject instead of about a company's ongoing
operations.
Integrated: Data that is gathered into the data warehouse from a variety of sources and merged into a coherent
whole.
Time-variant: All data in the data warehouse is identified with a particular time period.
Calculations and modelling applied across dimensions, through hierarchies and/or across members
OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries, regardless
of database size and complexity. OLAP helps the user synthesize enterprise information through comparative,
personalised viewing, as well as through analysis of historical and projected data in various what-if data model
scenarios. This is achieved through use of an OLAP Server.
2.4.7 OLAP Server
An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and operate
on multi-dimensional data structures. A multi-dimensional structure is arranged so that every data item is located
and accessed, based on the intersection of the dimension members that define that item. The design of the server
and the structure of the data are optimised for rapid ad hoc information retrieval in any orientation, as well as for
fast, flexible calculation and transformation of raw data based on formulaic relationships. The OLAP Server may
either physically stage the processed multi-dimensional information to deliver consistent and rapid response times
to end users, or it may populate its data structures in real-time from relational or other databases, or offer a choice
of both. Given the current state of technology and the end user requirement for consistent and rapid response times,
staging the multi-dimensional data in the OLAP Server is often the preferred method.
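The dimension-intersection addressing described above can be sketched in a few lines of Python; the cube, its dimensions, and its figures are purely illustrative:

```python
# Each data item is located by the intersection of one member from every
# dimension (here: product, region, quarter). All names and values are
# illustrative, not drawn from any real system.
cube = {
    ("Widget", "North", "Q1"): 1200,
    ("Widget", "South", "Q1"): 800,
    ("Gadget", "North", "Q1"): 450,
}

def cell(product, region, quarter):
    """Locate a data item via the intersection of its dimension members."""
    return cube.get((product, region, quarter), 0)

def slice_total(region):
    """Aggregate every cell that intersects the given region member."""
    return sum(v for (p, r, q), v in cube.items() if r == region)

print(cell("Widget", "North", "Q1"))  # 1200
print(slice_total("North"))           # 1650
```

A real OLAP server adds indexing, staging, and precalculation on top of this addressing scheme, but the cell-by-intersection idea is the same.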
2.4.8 Metadata
Metadata is the kind of information that describes the data stored in a database and includes such information as:
A description of tables and fields in the data warehouse, including data types and the range of acceptable
values.
A similar description of tables and fields in the source databases, with a mapping of fields from the source to
the warehouse.
A description of how the data has been transformed, including formulae, formatting, currency conversion, and
time aggregation.
Any other information that is needed to support and manage the operation of the data warehouse.
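As an illustration of the items listed above, a metadata record might look like the following sketch; every table name, field name, and transformation rule shown is hypothetical:

```python
# A hypothetical metadata record covering the items listed above:
# warehouse field descriptions, source-to-warehouse field mapping, and
# the transformation applied during loading.
metadata = {
    "warehouse_table": "sales_fact",
    "fields": {
        "revenue_eur": {
            "type": "DECIMAL(12,2)",
            "range": "0 to 10,000,000",
            "source": "orders.amount_usd",  # mapping back to the source field
            "transformation": "amount_usd converted at the daily USD/EUR rate",
        },
        "sale_week": {
            "type": "INTEGER",
            "source": "orders.sale_date",
            "transformation": "daily dates aggregated to ISO week",
        },
    },
}

# An end user or load process can consult the metadata to see how a
# warehouse field was derived from its source.
print(metadata["fields"]["revenue_eur"]["source"])  # orders.amount_usd
```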
2.4.9 Drill-Down
Drill-down can be defined as the capability to browse through information, following a hierarchical structure. A
small sample is shown in the figure below.
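The same idea can be sketched as a small Python hierarchy; the geography members and values are illustrative:

```python
# Drill-down: browsing from a summarised level to progressively more
# detail along a hierarchy. The geography members and figures here are
# illustrative only.
hierarchy = {
    "Europe": {
        "Germany": {"Berlin": 320, "Munich": 210},
        "France": {"Paris": 400},
    },
}

def total(node):
    """Sum the leaf values under a node of the hierarchy."""
    if isinstance(node, dict):
        return sum(total(child) for child in node.values())
    return node

def drill_down(node):
    """One drill-down step: show each child member with its total."""
    return {member: total(child) for member, child in node.items()}

print(drill_down(hierarchy))            # {'Europe': 930}
print(drill_down(hierarchy["Europe"]))  # {'Germany': 530, 'France': 400}
```

Each call to `drill_down` descends one level of the hierarchy, which is exactly the browsing behaviour the definition describes.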
On operational databases a high number of transactions take place every hour. The database is always up to
date, and it represents a snapshot of the current business situation, commonly referred to as "point in time" data.
Informational databases are usually stable over a period of time and represent a situation at a specific point in
time in the past, which can be described as historical data.
For example, a data warehouse load is usually done overnight. This load process extracts all changes and new records
from the operational database into the informational database. This process can be seen as one single transaction
that starts when the first record gets extracted from the operational database and ends when the last data mart in the
data warehouse is refreshed. The following figure shows some of the main differences between these two database types.
Extracted, detailed, denormalised data organised in a Star-Join Schema to optimise query performance.
Multiple aggregated and precalculated data marts to present the data to the end user.
Departmental data marts to hold data in an organisational form that is optimised for specific requests; new
requirements usually require the creation of a new data mart, but have no further influence on already existing
components of the data warehouse.
Metadata is the major component guaranteeing the success of this architecture, providing ease of use and
navigation support for end users.
The three different stages in aggregating/transforming data offer the capability to perform data mining tasks
in the extracted, detailed data without creating workload on the operational system.
Workload created by analysis requests is totally offloaded from the OLTP system.
The different stages of aggregation in the data are: OLTP data, ODS Star-Join Schema, and data marts.
Metadata and how it is involved in each process is shown with solid connectors.
The horizontal dotted line in the figure separates the different tasks into two groups.
Tasks to be performed on the dedicated OLTP system are optimised for interactive performance and to handle
the transaction-oriented tasks in the day-to-day business.
Tasks to be performed on the dedicated data warehouse machine require high batch performance to handle the
numerous aggregations, precalculation, and query tasks.
2.7.1 Extraction/Propagation
Data extraction / data propagation is the process of collecting data from various sources and different platforms to
move it into the data warehouse. Data extraction in a data warehouse environment is a selective process to import
decision-relevant information into the data warehouse. Data extraction / data propagation is much more than mirroring
or copying data from one database system to another. Depending on the technique, this process is either:
Pulling (Extraction) or
Pushing (Propagation)
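The pulling (extraction) variant can be sketched as follows; the record layout and the cut-off timestamp are illustrative assumptions:

```python
# Pull extraction: the warehouse-side process selects only the
# decision-relevant rows changed since the last run, rather than
# mirroring the whole operational table. Rows and dates are illustrative.
from datetime import datetime

operational_rows = [
    {"id": 1, "amount": 99.0, "updated": datetime(2012, 4, 26, 9, 0)},
    {"id": 2, "amount": 45.5, "updated": datetime(2012, 4, 27, 8, 30)},
    {"id": 3, "amount": 12.0, "updated": datetime(2012, 4, 27, 11, 15)},
]

def extract_changes(rows, last_extract):
    """Pull only the rows changed after the previous extraction run."""
    return [r for r in rows if r["updated"] > last_extract]

changed = extract_changes(operational_rows, datetime(2012, 4, 27))
print([r["id"] for r in changed])  # [2, 3]
```

In the pushing (propagation) variant the roles are reversed: the operational system itself forwards changed records to the warehouse as they occur.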
2.7.2 Transformation/Cleansing
Transformation of data usually involves code resolution with mapping tables (for example, changing 0 to "female"
and 1 to "male" in the gender field) and the resolution of business rules hidden in data fields, such as account numbers. Also,
the structure and relationships of the data are adjusted to the analysis domain. Transformations occur throughout
the population process, usually in more than one step. In the early stages of the process, the transformations are
used more to consolidate the data from different sources, whereas, in the later stages the data is transformed to suit
a specific analysis problem and/or tool.
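The code-resolution step mentioned above can be sketched as a tiny mapping-table transformation; the record layout is illustrative:

```python
# Code resolution with a mapping table: the coded gender field 0/1 is
# resolved to readable values during the population process. The record
# layout is illustrative.
GENDER_MAP = {0: "female", 1: "male"}

def transform(record):
    """Resolve coded fields into their business meaning."""
    out = dict(record)  # leave the source record untouched
    out["gender"] = GENDER_MAP[record["gender"]]
    return out

source_record = {"customer_id": 4711, "gender": 0}
print(transform(source_record))  # {'customer_id': 4711, 'gender': 'female'}
```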
Data warehousing turns data into information; cleansing, on the other hand, ensures that the data warehouse will
have valid, useful, and meaningful information. Data cleansing can also be described as standardisation of data.
Through careful review of the data contents, the following criteria are matched:
Data consolidation (one view), such as householding and address correction
Data aggregation: Change the level of granularity of the information. Example: The original data is stored on a
daily basis; the data mart contains only weekly values. Therefore, data aggregation results in fewer records.
Data summarisation: Add up values in a certain group of information. Example: The data refining process
generates records that contain the revenue of a specific product group, resulting in additional summary records.
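The aggregation example above (daily values rolled up to weekly ones) can be sketched as follows; the dates and amounts are illustrative:

```python
# Data aggregation: change granularity from daily rows to one value per
# ISO week, so the data mart holds fewer records than the source.
# Dates and amounts are illustrative.
from collections import defaultdict
from datetime import date

daily_sales = [
    (date(2012, 4, 23), 100.0),  # Monday, ISO week 17
    (date(2012, 4, 24), 150.0),
    (date(2012, 4, 30), 200.0),  # Monday, ISO week 18
]

def aggregate_weekly(rows):
    """Roll daily records up to ISO-week granularity."""
    weekly = defaultdict(float)
    for day, amount in rows:
        weekly[day.isocalendar()[1]] += amount
    return dict(weekly)

print(aggregate_weekly(daily_sales))  # {17: 250.0, 18: 200.0}
```

Three daily records become two weekly ones; at larger volumes the reduction in record count is what makes the data mart fast to query.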
(Source: capstone.geoffreyanderson.net)
Both database architectures can be selected to create departmental data marts, but the way to access the data in the
databases is different:
To access data from a relational database, common access methods like SQL or middleware products like
ODBC can be used.
Multidimensional databases require specialised APIs to access the usually proprietary database architecture.
Fact tables
Dimension tables
The following is a definition for those two components of the Star-Join Schema:
Fact tables (what are we measuring?): Contain the basic transaction-level information of the business that is
of interest to a particular application. In marketing analysis, for example, this is the basic sales transaction data.
Fact tables are large, often holding millions of rows, and mainly numerical.
Dimension tables (by what are we measuring?): Contain descriptive information and are small in comparison
to the fact tables. In a marketing analysis application, for example, typical dimension tables include time period,
marketing region, product type, etcetera.
Subject-oriented, based on abstractions of real-world entities such as project, customer, organisation,
etcetera.
Estimates response time by showing the number of records to be processed in a query. Holds calculated fields
and pre-calculated formulas to avoid misinterpretation, and contains historical changes of a view.
The data warehouse administrator perspective of metadata is a full repository and documentation of all contents
and all processes in the data warehouse, whereas, from an end user perspective, metadata is the roadmap through
the information in the data warehouse.
2.7.7 Operational Data Source (ODS)
The operational data source can be defined as an updatable set of integrated data used for enterprise-wide tactical
decision making. It contains live data rather than snapshots, and retains only minimal history.
Provide fast access to information for specific analytical needs or user groups.
Represent the end user's view and data interface of the data warehouse.
Summary
Business intelligence is not business as usual. It's about making better decisions more easily and making them
more quickly.
Businesses acquire data, such as demographics and mailing lists, from outside sources.
Consolidating and organising data for better business decisions can lead to a competitive advantage, and learning
to uncover and leverage those advantages is what business intelligence is all about.
Dissatisfaction is exhibited with the current reporting systems, especially in terms of flexibility, timeliness,
accuracy, detail, consistency, and integrity of the information across all business users.
Operational databases are detail-oriented databases defined to meet the needs of sometimes very complex
processes in a company.
A data warehouse is a database where data is collected for the purpose of being analyzed.
The systems used to collect operational data are referred to as OLTP (On-Line Transaction Processing).
Bill Inmon coined the term data warehouse in 1990. His definition is: "A (data) warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process."
A data mart contains a subset of corporate data that is of value to a specific business unit, department, or set
of users.
External data is data that cannot be found in the OLTP systems but is required to enhance the information
quality in the data warehouse.
OLAP is implemented in a multi-user client/server mode and offers consistently rapid response to queries,
regardless of database size and complexity.
An OLAP server is a high-capacity, multi-user data manipulation engine specifically designed to support and
operate on multi-dimensional data structures.
Drill-down can be defined as the capability to browse through information, following a hierarchical structure.
Metadata is the kind of information that describes the data stored in a database.
A summary table on an OLTP system is the most common implementation that is already included in many
standard software packages.
The three different stages in aggregating/transforming data offer the capability to perform data mining tasks in
the extracted, detailed data without creating workload on the operational system.
Data sources can be operational databases, historical data usually archived on tapes and external data.
Data extraction / data propagation is the process of collecting data from various sources and different platforms
to move it into the data warehouse.
Data warehousing turns data into information; cleansing, on the other hand, ensures that the data warehouse will
have valid, useful, and meaningful information.
Data refining is creating subsets of the enterprise data warehouse, which have either a multidimensional or a
relational organisation format for optimised OLAP performance.
References
Reinschmidt, J., Business Intelligence Certification Guide. [pdf] Available at: <capstone.geoffreyanderson.net/
export/.../sg245747.pdf> [Accessed 27 April 2012].
Haag, 2005. Business Driven Technology W/Cd, Tata McGraw-Hill Education Publication.
Schlukbier, A., 2007. Implementing Enterprise Data Warehousing: A Guide for Executives, Lulu.com
Publication.
2011, 1.2.1 BI Tools and Processes, [Video Online] Available at: <http://www.youtube.com/watch?v=ZpBtxKf20zY>
[Accessed 27 April 2012].
Recommended Reading
Panos, V., Vassiliou, Y., Lenzerini, M. & Jarke, M., 2003. Fundamentals of Data Warehouses, 2nd ed. Springer
Publication.
Paredes, J., 2009. The Multidimensional Data Modeling Toolkit: Making Your Business Intelligence Applications,
John Paredes Publication.
Scheps, S., 2008. Business Intelligence For Dummies, John Wiley & Sons Publication.
Self Assessment
1. ______ describes the way data is processed by an end user or a computer system.
a. OLTP Server
b. ODS
c. OLAP
d. OLTP
2. A _________is a database where data is collected for the purpose of being analysed.
a. data mart
b. data warehouse
c. metadata
d. data mining
3. An _________is a high-capacity, multi-user data manipulation engine specifically designed to support and
operate on multi-dimensional data structures.
a. OLAP server
b. OLTP Server
c. ODS
d. OLAP
4. __________is the kind of information that describes the data stored in a database and includes information.
a. Data mart
b. ODS
c. Metadata
d. OLAP
5. ________is the capability to browse through information, following a hierarchical structure.
a. Drill-down
b. Meta data
c. Drill-up
d. Data haunting
6. Which one of the following is not a stage of aggregation in the data?
a. OLTP data
b. ODS Star-Join Schema
c. data marts
d. OLAP
7. _________is the process of collecting data from various sources and different platforms to move it into the
data warehouse.
a. Data aggregation
b. Data extraction
c. Data manipulation
d. Drill-down