Computer Science Faculty

Information Systems Department

Data warehousing & BI


Abdul Rahman Safi
Rafiullah Momand
With materials taken from Dr. Marcela Charfuelan, Dr. Ahsan Abdullah, Dr. Michael
Mannino, and Dr. Jahangir Karimi
Introduction to DW
Contents -- Lectures
• Introduction to Data Warehouse
• De-normalization
• OLAP (Online Analytical Processing)
• Dimensional Modeling
• ETL (Extract Transform Load)
• DQM (Data Quality Management)
• Need For Speed
• Introduction to Knowledge Discovery
• Introduction to Business Intelligence

Tools
• Microsoft SQL Server 2012
• Oracle Database
• MySQL Server
• Pentaho
• Kettle
• Pivot4j
• MicroStrategy
• Talend

Organization
• Lectures: Sunday 08:00 – 08:50 and 09:50 – 10:40, CSFCR02;
  Tuesday 08:00 – 09:45, CSFCR02
• Tutorials: schedule to be adjusted, ISDLab

Evaluation
• Assignments: 45%
  • Theoretical (lecture based): 10%
  • Practical (based on tutorials): 10%
  • Seminar (either a chapter of a book or a paper): 25%
• Mid Term: 15%
• Final: 40%

Introduction
• Targeted Learners: university students, IT professionals, project managers, business analysts

Broad Course Objectives
• Establish an initial foundation of data warehouse background for
business intelligence careers
• Gain conceptual background about business architectures,
management practices, and data warehouse development
methodologies
• Create data warehouse designs, data integration workflows, and pivot
table operations
• Reflect on business architecture selection, data warehouse design
methodologies, and data integration goals and constraints

Historical Overview

What is Business Intelligence (BI)?
• BI is the process of making “intelligent” business decisions based on the
analysis of available data
• Decision support systems are the core of business IT infrastructures; they
enable executives, managers, and analysts to make better and faster
decisions.
• Examples of business decisions:
  • Send vouchers to clients whose consumption is above $N a year
  • Reward the best agents/sellers in a certain area or region
  • Increase the stock of certain related products
  • Send personalized information (advertisements) to selected target customers
  • Close/reduce or expand/open branches of a company
  • Develop marketing campaigns according to customer consumption

Decision Support System in the context of BI

What is a Data Warehouse (DW)?
• A Data Warehouse is a repository of integrated enterprise data. A
DW is used specifically for decision support, and there is typically (or
ideally) only one DW in an enterprise.
• DW technologies have been successfully deployed in many industries:
• Manufacturing for order shipment and customer support
• Retail for user profiling to target grocery coupons during checkout
• Financial services for claims analysis and fraud detection
• Transportation for fleet management
• Telecommunications for identifying reasons for customer churn
• Health care for outcomes analysis

Inmon’s definition – A Data Warehouse is:
• Subject-oriented: all entities and events relating to a specific subject (e.g.,
“sales”) are linked together.
  • Example subjects: customer, product, transaction or activity, policy, claim, account
• Time-variant: all changes to the data are tracked to enable reporting that
shows changes over time.
• Non-volatile: once data is entered into a DW, it is never overwritten or
deleted.
• Integrated: the DW contains data from multiple source systems after it has been
cleaned and conformed.
All authors agree on the reasoning behind a separate data store for business
analysis and reporting, as originally defined by Devlin & Murphy.
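
The time-variant and non-volatile properties can be illustrated with a minimal sketch. It assumes a hypothetical `customer_history` table in an in-memory SQLite database; the table, column names, and sample cities are made up for illustration and do not come from the lecture. Changes are appended as new dated rows instead of overwriting old ones, so the state on any past date can still be reported.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer_history (
        cust_id    INTEGER,
        city       TEXT,
        valid_from TEXT   -- ISO date on which this version became current
    )
""")

def record_change(cust_id, city, valid_from):
    """Non-volatile: every change is appended; nothing is updated or deleted."""
    conn.execute("INSERT INTO customer_history VALUES (?, ?, ?)",
                 (cust_id, city, valid_from))

def city_as_of(cust_id, day):
    """Time-variant: return the version that was current on the given date."""
    row = conn.execute(
        """SELECT city FROM customer_history
           WHERE cust_id = ? AND valid_from <= ?
           ORDER BY valid_from DESC LIMIT 1""",
        (cust_id, day)).fetchone()
    return row[0] if row else None

# Hypothetical sample data: the customer moves in mid-2013.
record_change(1, "Kabul", "2012-01-01")
record_change(1, "Herat", "2013-06-15")

print(city_as_of(1, "2012-12-31"))  # version valid before the move
print(city_as_of(1, "2014-01-01"))  # version valid after the move
```

An operational system would typically run a single UPDATE here and lose the earlier city; the append-only design trades storage for the ability to report changes over time.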

Inmon’s definition – Data Warehouse characteristics

Why DW? --Motivation
• Decision Making Hierarchy and Typical Decisions
  • Top (strategic): identify new markets, choose store locations
  • Middle (tactical): choose suppliers, forecast sales
  • Lower (operational): resolve order delays, schedule employees

Why DW? --Motivation
• Technology and Deployment Limitations
[Figure: lack of integration, missing DBMS features, and performance limitations surrounding data warehouse technology and deployments]

Why DW? --Motivation
• Data Warehouse Characteristics
• Essential part of infrastructure for business intelligence
• Logically centralized repository for decision making
• Populated from operational databases and external data
sources
• Integrated and transformed data
• Optimized for reporting and periodic integration

Why DW? --Motivation
• Comparison of Processing Environments

Transaction processing
• Primary data from transactions
• Daily operations and short-term decisions

Business intelligence processing
• Transformed secondary data
• Medium and long-term decisions

Why DW? --Motivation
• Data Comparison

Why DW? --Motivation
• Schema Comparison
[Figure: two schema diagrams side by side]
  • Operational database: Employee (EmpNo, EmpFirstName, EmpLastName, ...), Customer (CustNo, CustFirstName, CustLastName, ...), Order (OrdNo, OrdDate, ...), and Product (ProdNo, ProdName, ProdQOH, ...), connected by the relationships Manages, Takes, Places, and Contains (Qty)
  • Data warehouse: a central Sales table (SalesNo, SalesUnits, SalesDollar, SalesCost) connected through the relationships StoreSales, ItemSales, CustSales, and TimeSales to:
    • Store (StoreId, StoreManager, StoreStreet, StoreCity, StoreState, StoreZip, StoreNation, DivId, DivName, DivManager)
    • Item (ItemId, ItemName, ItemUnitPrice, ItemBrand, ItemCategory)
    • Customer (CustId, CustName, CustPhone, CustStreet, CustCity, CustState, CustZip, CustNation)
    • TimeDim (TimeNo, TimeDay, TimeMonth, TimeQuarter, TimeYear, TimeDayOfWeek, TimeFiscalYear)

Why is a DW necessary?
• Example: Book retailer

Why is a DW necessary?
• Example: extract information from a relational database:

Why is a DW necessary?
• Question: How many finished orders do we have before Christmas,
organized by product group and discount type?

Why is a DW necessary?
-- Finished orders placed before 24 December, per year, product group, and discount.
-- "order" is quoted because ORDER is a reserved word in SQL.
SELECT Y.year, PG.name, DI.disc, count(*)
FROM year Y, month M, day D, session S, line_item I, "order" O, product P,
     productgroup PG, discount DI, order_status OS
WHERE M.year_id = Y.id AND
      D.month_id = M.id AND
      S.day_id = D.id AND
      O.session_id = S.id AND
      I.order_id = O.id AND
      I.product_id = P.id AND
      P.productgroup_id = PG.id AND
      DI.productgroup_id = PG.id AND
      O.id = OS.order_id AND
      D.day < 24 AND
      M.month = 12 AND
      OS.status = 'FINISHED'
GROUP BY Y.year, PG.name, DI.disc
ORDER BY Y.year, DI.disc

Why is a DW necessary?
• Result: 9 joins over the following tables
  year: 10 records             line_item: 72,000,000 records
  month: 120 records           order_status: 37,000,000 records
  day: 3,650 records           product: 200,000 records
  session: 36,000,000 records  productgroup: 100 records
  order: 37,000,000 records    discount: 50 records

• Problem!
  • The join order is difficult to optimize
  • Depending on the execution plan, many large intermediate results are produced
  • Similar questions produce the same large intermediate results again

Why is a DW necessary? – In the real world
• There are more DBs:
  • Amazon.de
  • Amazon.fr
  • Amazon.en
  • ...
• More possibilities
  • Count over the union of the results of similar questions in different databases

Why is a DW necessary? – In the real world:

Why is a DW necessary? – Technically:
• Create a VIEW
CREATE VIEW christmas AS
SELECT Y.year, PG.name, DI.disc, count(*) AS o_count
FROM FR.year Y, FR.month M, FR.day D, FR.session S, ...
WHERE M.year_id = Y.id AND
...
GROUP BY Y.year, PG.name, DI.disc
UNION ALL
SELECT Y.year, PG.name, DI.disc, count(*) AS o_count
FROM EN.year Y, EN.month M, EN.day D, EN.session S, ...
WHERE M.year_id = Y.id AND
...
GROUP BY Y.year, PG.name, DI.disc
• Use the predefined VIEW
SELECT year, name, disc, sum(o_count)
FROM christmas
GROUP BY year, name, disc
ORDER BY year, disc
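
The view-over-union idea can be tried end to end with a small sketch. It assumes that each country database is reduced to a single pre-joined table (`fr_orders`, `en_orders`) inside one in-memory SQLite database; these table names and the sample rows are hypothetical stand-ins for the multi-table schemas above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Each country database is simulated by one pre-joined table (hypothetical).
for country in ("fr", "en"):
    conn.execute(
        f"CREATE TABLE {country}_orders (year INTEGER, name TEXT, disc TEXT)")

conn.executemany("INSERT INTO fr_orders VALUES (?, ?, ?)", [
    (2013, "books", "none"), (2013, "books", "none"), (2013, "music", "coupon")])
conn.executemany("INSERT INTO en_orders VALUES (?, ?, ?)", [
    (2013, "books", "none"), (2013, "music", "coupon")])

# Each branch counts locally; UNION ALL keeps both partial counts.
conn.execute("""
    CREATE VIEW christmas AS
    SELECT year, name, disc, count(*) AS o_count
    FROM fr_orders GROUP BY year, name, disc
    UNION ALL
    SELECT year, name, disc, count(*) AS o_count
    FROM en_orders GROUP BY year, name, disc
""")

# The partial counts are summed, not counted again.
totals = conn.execute("""
    SELECT year, name, disc, sum(o_count)
    FROM christmas
    GROUP BY year, name, disc
    ORDER BY year, name, disc
""").fetchall()
print(totals)
```

Note that the outer query uses `sum(o_count)`: applying `count(*)` to the view would count the partial result rows, not the orders.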

Why is a DW necessary? – additional problems
• Count over the union result of distributed databases
• Heterogeneity problem
  • Sources might change their schemas
  • Country-specific units (VAT, shipment costs, special offers)
  • Hidden changes in the semantics of the data
• Calculation of intermediate results with each question
• Data size problem
  • Requires transport of huge amounts of data over the network
  • Historical view: the size of the data grows continuously
  • The operational (transaction) system does not need historical data. Goal: early deletion (closed orders)
  • Managers do not need many of the operational (transaction) details. Goal: keeping everything
Why is a DW necessary? – solution to the problem of heterogeneity?

• Problem
• Several branches write over the network
• Large response time in productive operations
• Data size problem remains

Why is a DW necessary? – solution to the problem of response time?

• Problem
• Fast local questions
• Large response time for strategic questions
• Heterogeneity problem remains

Why is a DW necessary? – solution to the large response time for strategic questions?

• Problem
• Local questions executed on huge tables
• Delay in productive operations
• Data size problem remains

Why is a DW necessary? – solution: a Data Warehouse

• Redundant data storage
• Special modelling
• Transform and select data
• Asynchronous updates
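
A minimal sketch of these four points, under stated assumptions: the source extracts, the VAT rates, and the `sales` table are hypothetical and chosen only to show the mechanics. Data is copied (redundant storage) from country sources into a separate warehouse table with its own layout (special modelling), converted to a common net price on the way in (transform and select), and loaded by a refresh step that runs independently of the source systems (asynchronous update).

```python
import sqlite3

# Hypothetical source extracts: (day, product, gross price incl. local VAT).
SOURCES = {
    "de": [("2013-12-01", "book", 11.90)],
    "fr": [("2013-12-01", "book", 12.00), ("2013-12-02", "cd", 24.00)],
}
# Assumed VAT rates, for illustration only.
VAT = {"de": 0.19, "fr": 0.20}

def transform(country, row):
    """Conform a source row: strip country-specific VAT, keep selected fields."""
    day, product, gross = row
    net = round(gross / (1 + VAT[country]), 2)
    return (country, day, product, net)

def refresh(conn):
    """One periodic load: extract from every source, transform, and append."""
    rows = [transform(c, r) for c, extract in SOURCES.items() for r in extract]
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return len(rows)

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE sales (country TEXT, day TEXT, product TEXT, net_price REAL)")
loaded = refresh(warehouse)

# Strategic question answered locally, without touching the sources.
by_product = warehouse.execute("""
    SELECT product, count(*), round(sum(net_price), 2)
    FROM sales GROUP BY product ORDER BY product
""").fetchall()
print(loaded, by_product)
```

Because the strategic query runs against the warehouse copy, it neither slows down the productive systems nor has to deal with their differing units at query time.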

Why is a DW necessary?
• A DW is necessary because
• All information is in one place
• Up-to-date Information
• Quick access
• No size limits
• All history available
• Easy to understand
• Clear and uniform definitions
• Standardized data
Assumption: the DW is designed and built properly!

Knowledge Discovery in Databases and data mining
• At an abstract level, the KDD field is concerned with the development
of methods and techniques for making sense of data.
• Basic problem addressed by the KDD process: mapping low-level
data (typically too voluminous) into other forms that are more:
  • compact, e.g., a short report,
  • abstract, e.g., a descriptive approximation or model of the process that
  generated the data, or
  • useful, e.g., a predictive model for estimating the value of future cases.
• At the core of the process is the application of specific data-mining
methods for pattern discovery and extraction.

What is Knowledge Discovery in Databases (KDD)?

Knowledge Discovery in Databases and data mining
• In business, the main KDD application areas include:
  • marketing,
  • finance (especially investment),
  • fraud detection,
  • manufacturing,
  • telecommunications ...
• KDD is interdisciplinary by nature; it continues to evolve from the
intersection of research fields such as:
  • machine learning,
  • pattern recognition,
  • databases,
  • statistics,
  • artificial intelligence (AI) ...
• The unifying goal is extracting high-level knowledge from low-level data in the context
of large data sets.

How are BI, KDD, Data Mining, and DW related?
• KDD refers to the overall process of discovering useful knowledge from
data.
• Data mining is a particular step in the KDD process: the application of
specific algorithms for extracting patterns from data.
• Most data mining algorithms from statistics, pattern recognition, and
machine learning assume the data are in main memory.
• DB techniques for efficient data access (grouping, ordering
operations, query optimization, etc.) constitute the basis for scaling
algorithms to larger data sets.
• Data warehousing helps to set the stage for KDD in two important ways:
data cleaning and data access.

How are BI, KDD, Data Mining, and DW related?

Learning Effects for DW Development
• Challenges in Data Warehouse Projects
• Substantial coordination across organizational units
• Uncertain data quality in data sources
• Difficult to scale data warehouse

Learning Effects for DW Development
• Intangible Benefits
• Not easily quantified but important for an organization’s success
• Increased data quality
• Fewer missing values
• More matched entities
• More data availability
• Higher levels of compliance with data standards
• May become tangible over time

Learning Effects for DW Development
• Learning Curve for Skills

Learning Effects for DW Development
• Learning Curve for Production
[Figure: Learning Curve for Production; y-axis: Effort (1 to 21), x-axis: Units (0 to 11)]

Learning Effects for DW Development
• Maturity Relationships
[Figure, two panels: "Business Value Learning Curve" (Business value, 0 to 1.2, versus Time, 0 to 70) and "Data Transformation Learning Curve" (Transformation Cost, 0 to 25, versus Time, 0 to 12)]

Learning Effects for DW Development
• Project Relationships
[Figure, two panels: "Potential Value" (Business value, 0 to 1.2, versus Scope, 0 to 80) and "Project Risk" (Risk, 0 to 1.2, versus Scope, 0 to 70)]

Market Trends and Applications
• Data Mining
• Discover significant, implicit patterns
• Target promotions
• Change mix and collocation of items
• Requires large volumes of transaction data including sensor data and social
media interactions
• Important tools for business intelligence

Market Trends and Applications
• Market Shares and Trends
• Major vendors: Teradata, Oracle, IBM, Microsoft, SAP
• Large projected market growth
• Trends
• Real time load and analysis
• Increased storage and analysis of social interactions
• Increased usage of cloud services and appliances

Market Trends and Applications
• Cloud Influence
  • Reduces the local expertise needed to procure technology and manage a data
  warehouse
  • Economies of scale
  • Improved scalability
  • Higher variable costs but lower fixed costs
[Figure: several server and database pairs]

Market Trends and Applications
• Cloud Service Models
[Figure: service layers Application (SaaS), Development Platform (PaaS), and Infrastructure (IaaS), divided between the user organization and the cloud vendor]

How is it Different?
• Starts with a 6x12 availability requirement ... but 7x24 usually
becomes the goal.
  • Decision makers typically don’t work 24 hours a day, 7 days a week.
  • Once decision makers start using the DWH and reaping the benefits, they start liking it.
  • They use the DWH more and more often, until they want it available 100% of the time.
  • For a business that operates across the globe, 50% of the world may be asleep at any one time, but the business is up 100% of the time.
  • 100% availability is not a trivial task; loading strategies, refresh rates, etc. must be taken into account.
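
The gap between the two availability targets is easy to quantify. The sketch below reads "6x12" as 6 days a week, 12 hours a day, and "7x24" as 7 days a week, 24 hours a day; the function name is made up for illustration.

```python
HOURS_PER_WEEK = 7 * 24

def weekly_coverage(days, hours_per_day):
    """Fraction of the week covered by a days-x-hours availability window."""
    return days * hours_per_day / HOURS_PER_WEEK

initial = weekly_coverage(6, 12)   # the "6x12" starting requirement
goal = weekly_coverage(7, 24)      # the "7x24" goal
offline_hours = HOURS_PER_WEEK - 6 * 12

print(f"6x12 covers {initial:.1%} of the week; 7x24 covers {goal:.0%}")
print(f"Hours per week left for loading and refresh under 6x12: {offline_hours}")
```

Under 6x12, more than half of the week remains free for loading and refreshing the warehouse; under 7x24 that window disappears, which is why loading strategies and refresh rates become a design concern.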

How is it Different?
• Fundamentally different
[Figure: request cycle. Business user needs info → user requests IT people → IT people do system analysis and design → IT people create reports → IT people send reports to business user → business user may get answers → answers result in more questions]

How is it Different?
• Combines operational and historical data
  • OLTP systems don’t keep history; e.g., one can’t get a balance statement more
  than a year old.
  • A DWH keeps historical data, even of bygone customers. Why?
  • In the context of a bank: why did the customer leave?
  • What were the events that led to his/her leaving? Why?
  • Customer retention.

How much history?
• Depends on:
  • Industry.
  • Cost of storing historical data.
  • Economic value of historical data.
• Industries and history
  • Telecom: call records are far more numerous than bank transactions, so history is limited to 18 months.
  • Retailers are interested in analyzing yearly seasonal patterns: 65 weeks.
  • Insurance companies want to do actuarial analysis, using historical data to
  predict risk: 7 years.
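
As a rough sketch, the retention windows quoted above can be turned into a purge cutoff per industry. The dictionary, the function name, and the day-based conversions (30-day months, 365-day years) are simplifying assumptions, not part of the lecture.

```python
from datetime import date, timedelta

# Retention windows quoted above, expressed in days (approximate conversions).
RETENTION_DAYS = {
    "telecom": 18 * 30,      # 18 months, assuming 30-day months
    "retail": 65 * 7,        # 65 weeks
    "insurance": 7 * 365,    # 7 years, ignoring leap days
}

def cutoff(industry, today):
    """Oldest date whose data is still kept for the given industry."""
    return today - timedelta(days=RETENTION_DAYS[industry])

today = date(2014, 1, 1)
for industry in RETENTION_DAYS:
    print(industry, cutoff(industry, today))
```

Note that 65 weeks is one year (52 weeks) plus one quarter (13 weeks), which lets a retailer compare any recent quarter with the same quarter a year earlier.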

How much history?
• Economic value of data vs. storage cost
• Is a Data Warehouse a complete repository of data?
How is it Different?
• Does not follow the traditional development model

Classical SDLC (from requirements to program)
• Requirements gathering
• Analysis
• Design
• Programming
• Testing
• Integration
• Implementation

How is it Different?
• Does not follow the traditional development model

DWH SDLC (CLDS) (from DWH to program to requirements)
• Implement warehouse
• Integrate data
• Test for bias
• Program w.r.t. data
• Design DSS system
• Analyze results
• Understand requirements
What is (and is not) a Data Warehouse (DW)
• Purpose
  • Operational database: day-to-day operations of an organization (OLTP).
  • Data warehouse: decision support, On-line Analytical Processing (OLAP).
• Tasks
  • Operational database: structured and repetitive; consist of short, atomic, isolated transactions.
  • Data warehouse: e.g., extract revenue/loss of the last 3 years per product, per month. Understand trends, make predictions.
• Data
  • Operational database: the transactions require detailed, up-to-date data, and read or update a few (tens of) records, typically accessed on their primary keys.
  • Data warehouse: historical, summarized, and consolidated data is more important than detailed, individual records.
• History
  • Operational database: does not maintain history, but updates data to reflect the current state.
  • Data warehouse: needs historical context to be preserved to accurately evaluate the organization’s performance over time.
• Size
  • Operational database: hundreds of megabytes to gigabytes.
  • Data warehouse: tends to be orders of magnitude larger than operational databases.
• Performance
  • Operational database: consistency and recoverability of the database are critical. Designed to minimize concurrency conflicts.
  • Data warehouse: the workloads are query intensive, with mostly ad hoc, complex queries that can access millions of records and perform a lot of scans, joins, and aggregates.
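
The difference in workload can be made concrete with a small sketch on an in-memory SQLite database. The `sales` table and its rows are hypothetical. The first statement is a typical operational access (one record, located by primary key, inside a short transaction); the second is a typical analytical access (scan and aggregate across all records).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE sales (
    sale_id INTEGER PRIMARY KEY, product TEXT, sale_month TEXT, revenue REAL)""")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    (1, "book", "2013-11", 10.0),
    (2, "book", "2013-12", 15.0),
    (3, "cd",   "2013-12", 20.0),
    (4, "book", "2013-12", 5.0),
])

# OLTP-style: short transaction touching one record via its primary key.
with conn:
    conn.execute("UPDATE sales SET revenue = 12.0 WHERE sale_id = 1")

# OLAP-style: scan and aggregate many records, here revenue per product per month.
report = conn.execute("""
    SELECT product, sale_month, sum(revenue)
    FROM sales
    GROUP BY product, sale_month
    ORDER BY product, sale_month
""").fetchall()
print(report)
```

On four rows both statements are instant; on the table sizes listed earlier in the lecture, the aggregate query is the one that competes with productive operations for resources.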
OLTP vs. OLAP

Comparison of Response Times
• On-line analytical processing (OLAP) queries must be executed in a
small number of seconds.
• Often requires de-normalization and/or sampling.
• Complex query scripts and large list selections can generally be
executed in a small number of minutes.
• Sophisticated clustering algorithms (e.g., data mining) can generally
be executed in a small number of hours (even for hundreds of
thousands of customers).

Typical Applications
• The impact on an organization’s core business is to streamline operations and maximize
profitability.
• Fraud detection.
• Profitability analysis.
• Direct mail/database marketing.
• Credit risk prediction.
• Customer retention modeling.
• Yield management.
• Route Assessment.
• Risk Assessment.
• Inventory management.
• ROI on any one of these applications can justify HW/SW & consultancy
costs in most organizations.

Typical Applications
• Fraud detection
• By observing data usage patterns.
• People have typical purchase patterns.
• Deviation from patterns.
• Certain cities notorious for fraud.
• Certain items are typically bought with stolen cards.
• Similar behavior for stolen phone cards.
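
One simple way to express "deviation from patterns" is a z-score test against a customer's own purchase history. This is only a sketch of the idea, not the method used by any particular fraud system; the function name, the threshold of 3 standard deviations, and the sample amounts are assumptions.

```python
from statistics import mean, stdev

def is_suspicious(history, amount, threshold=3.0):
    """Flag a purchase that deviates strongly from the customer's usual amounts."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # All past purchases identical: any different amount is a deviation.
        return amount != mu
    return abs(amount - mu) / sigma > threshold

# Hypothetical purchase history of one customer.
usual = [20.0, 25.0, 22.0, 18.0, 24.0, 21.0]

print(is_suspicious(usual, 23.0))    # within the usual range
print(is_suspicious(usual, 400.0))   # far outside the usual range
```

A data warehouse makes such checks possible because it keeps the purchase history per customer; a real system would add further signals such as city and item type, as listed above.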

Typical Applications
• Profitability Analysis
• Banks know whether they are profitable or not.
• They don’t know which customers are profitable.
• Typically more than 50% are NOT profitable.
• But the bank doesn’t know which ones.
• Balance is not enough, transactional behavior is the key.
• Restructure products and pricing strategies.
• Life-time profitability models (next 3-5 years).
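
A minimal sketch of why transactional behavior, rather than balance, identifies unprofitable customers. The per-transaction fee and cost figures and the customer names are entirely hypothetical.

```python
from collections import defaultdict

# Hypothetical per-transaction economics: (customer, fee earned, cost to serve).
transactions = [
    ("alice", 2.00, 0.50), ("alice", 2.00, 0.50),
    ("bob", 0.00, 0.75), ("bob", 0.00, 0.75), ("bob", 1.00, 0.75),
]

def profit_by_customer(rows):
    """Sum fee minus cost over each customer's transactions."""
    profit = defaultdict(float)
    for customer, fee, cost in rows:
        profit[customer] += fee - cost
    return dict(profit)

profits = profit_by_customer(transactions)
unprofitable = sorted(c for c, p in profits.items() if p < 0)
print(profits, unprofitable)
```

In this made-up data the customer with more transactions is the unprofitable one, because most of those transactions earn no fee but still cost money to process.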

Typical Applications
• Direct mail marketing
• Targeted marketing.
• Offer the high-bandwidth package only to selected users, NOT to all users.
• Know the likely candidates from the call detail records of web surfing.
• Saves marketing expense (saving pennies).
• Know your customers better.

Typical Applications
• Credit risk prediction
• Who should get a loan?
• Customer segregation i.e. stable vs. rolling.
• Quantitative, data-driven decision making, NOT subjective.
• Different interest rates for different customers.
• Do not subsidize bad customers at the expense of good ones.

Typical Applications
• Yield Management
• Works for fixed inventory businesses.
• The price of an item suddenly goes to zero (e.g., an unsold seat once the flight has departed).
• Item prices vary for different customers.
• Examples: airlines, hotels, etc.
• The price of (say) an air ticket depends on:
  • How far in advance was the ticket bought?
  • How many vacant seats were there?
  • How profitable is the customer?
  • Is the ticket one-way or return?

Typical Applications – Some Issues
• Financial
  • Often the first data warehouse that an organization builds. This is appealing
  because:
    • It is the nerve center, so it is easy to get attention.
    • In most organizations, it is the smallest data set.
    • It touches all aspects of an organization, with a common denominator, i.e., money.
    • The inherent structure of the data is directly influenced by the day-to-day activities of
    financial processing.

Typical Applications
• Telecommunication
• Dominated by sheer volume of data.
• Many ways to accommodate call-level detail:
  • Only a few months of call-level detail,
  • Storing lots of call-level detail scattered over different storage media,
  • Storing only selective call-level detail, etc.
• Unfortunately, for many kinds of processing, working at an aggregate level is simply not
possible.

Typical Applications
• Insurance
• Insurance data warehouses are similar to other data warehouses, BUT with a
few exceptions.
  • Long operational business cycles, measured in years; processing time measured in months. Thus the operating
  speed is different.
  • Transactions are not gathered and processed, but are kept in a kind of “frozen” state.
  • Thus a unique approach to design and implementation is required.

Job Opportunities – Market Value
• DW Analyst
  • Recommend technology solutions
  • Define user interfaces
  • Collaborate with business analysts and DW managers
• DW Manager
  • Design, develop, and maintain data warehouses
  • Ensure conformance to enterprise standards
  • Develop and implement data integration procedures
• BI Analyst
  • Develop data analysis and reporting solutions
  • Mine and analyze data from multiple sources
  • Communicate results to management
  • Prepare data (reduction and missing values)
• Data Analyst
  • Document data elements
  • Use reporting tools
  • Collaborate with business analysts and data architects
  • Develop data extraction procedures

Job Opportunities – Market Value
Competency by position (columns: DW Manager, DW Analyst, BI Analyst)
Communication ▄ █ █
Data cube tools ▄ █ █
Dashboards ▄ █
Data mining ▄ █
Data integration tools █ █
DW schema design █ ▄
Performance analysis █
Quantitative modeling █
SQL extensions █ █ ▄

Job Opportunities – Market Value
• Competency Acquisition
[Figure: career progression through courses/degrees, internship, certification, and experience]

Job Opportunities – Market Value
• Salary Trends (USA)

Job Title      2013                   2014                   % Change
DB manager     $101,750 – $140,750    $107,750 – $149,000    5.9%
DB developer   $80,500 – $128,250     $92,000 – $134,500     5.5%
Data analyst   $64,250 – $96,000      $67,750 – $101,000     5.3%
DW manager     $108,750 – $145,750    $115,250 – $154,250    5.9%
DW analyst     $93,500 – $126,500     $99,000 – $133,750     5.8%
BI analyst     $94,250 – $132,500     $101,250 – $142,250    7.4%

Job Opportunities – Market Value
• Salary Trends (Europe)

Job Title            Country      2013                  2014
DBA                  Germany      €40,000 – €55,000     €40,000 – €60,000
Business Analyst     Germany      €55,000 – €85,000     €55,000 – €85,000
DBA                  London       £55,000 – £85,000     £55,000 – £80,000
Database developer                £55,000 – £85,000     £60,000 – £85,000
DBA                  France       €50,000 – €90,000     €50,000 – €70,000
DBA                  Australia    $75,000 – $125,000    $75,000 – $125,000
