
DATA WAREHOUSING

ARCHITECTURES
Basic information system architectures
• Client/server
• n-tier
• One-tier
• Two-tier
• Three-tier
• Capabilities of n-tier architectures

Three components of a data warehouse
• The data warehouse itself
• Data acquisition (back-end) software
• Front-end (client) software
Three-tier architecture (a minimal sketch follows below)
• Client workstation
• DSS/BI/BA engine (application software)
• Database software and the data warehouse itself, fed by data acquisition
Advantages of 2-tier
• Economical
• Simple, easy to build
Disadvantage of 2-tier
• Performance problems, because the BI engine and the data warehouse share the same platform
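
A minimal sketch of the three-tier flow described above, in Python with the built-in sqlite3 module standing in for the warehouse database; the sales_fact table and its columns are hypothetical.

```python
import sqlite3

# Tier 3: database software holding a toy data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [("North", 120.0), ("South", 80.0), ("North", 45.5)])

# Tier 2: the DSS/BI/BA engine - application logic that turns a request into a query.
def total_sales_by_region():
    cur = conn.execute("SELECT region, SUM(amount) FROM sales_fact GROUP BY region")
    return dict(cur.fetchall())

# Tier 1: the client workstation - only presents results, holds no business logic.
print(total_sales_by_region())  # {'North': 165.5, 'South': 80.0}
```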

Web-based data warehousing


Several issues must be considered when deciding which architecture to use:
• Which database management system (DBMS) should be used?
• Will parallel processing and/or partitioning be used?
• Which data migration tools will be used to load the warehouse?
• Which tools will be used for data retrieval and analysis?


The five architectures by Ariyachandra and Watson
1. Independent data marts
2. Data mart bus architecture
3. Hub-and-spoke architecture
4. Centralized data warehouse
5. Federated data warehouse

Ariyachandra and Watson (2005) identified 10 factors that potentially affect the architecture selection decision:

• Information interdependence
• Upper management’s information needs
• Urgency of need for a data warehouse
• Nature of end-user tasks
• Constraints on resources
Selection Decision continued…
• Strategic view of the data warehouse prior to implementation

• Compatibility with existing systems

• Perceived ability of the in-house IT staff

• Technical issues
Quiz
A star schema has what type of relationship between a dimension and
fact table?
1. Many-to-many
2. One-to-one
3. One-to-many
4. All of the above
What is data scrubbing?
1. A process to reject data from the data warehouse and to create the
necessary indexes
2. A process to load the data in the data warehouse and to create the
necessary indexes
3. A process to upgrade the quality of data after it is moved into a data
warehouse
4. A process to upgrade the quality of data before it is moved into a data
warehouse
Fact tables are
1. Completely denormalized
2. Partially denormalized
3. Completely normalized
4. Partially normalized
The attempt to find a function that models the data with the least error is known as

1. Clustering
2. Regression
3. Association rule
4. Classification
The active data warehouse architecture includes which of the
following?

1. At least one data mart
2. Data that can be extracted from numerous internal and external sources
3. Near real-time updates
4. All of the above
Which of the following statements is/are true about Data Warehouse?
Options
- Can be updated by end users

- Contains numerous naming conventions and formats

- Organized around important subject areas

- Contains only current data


What are the challenges to developing BI with semi-structured or
unstructured data?
Options
- Unstructured data is stored in a huge variety of formats
- There is a need to develop a standardized terminology
- Both a and b
What makes BI 2.0 different?
Options
- Dynamic querying of real-time corporate data

- Unstructured data is taken care of.

- Both a and b

- Semi-structured data is taken care of.


________________ in business intelligence allows large volumes of data and reports to be read in a single graphical interface
Options
- Reports

- OLAP

- Dashboard

- Warehouse
Which of the following BI techniques can predict a value for a specific data item attribute?
Options
- Classification

- Clustering

- Regression

- Association rule mining


Which of the following are BI analytics tools?
• RStudio
• Redash
• StreamSets
• QuickSight
Data Integration
Three major processes:
• Data access
• Data federation
• Change capture (see the sketch after this list)
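
One common way to implement change capture is to pull only the rows modified since the previous run. A minimal sketch, assuming a hypothetical orders table with a last_updated timestamp column:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_updated TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, 50.0, "2024-01-01 08:00:00"),   # captured in a prior run
                  (2, 75.0, "2024-01-02 09:30:00")])  # changed since the last run

def capture_changes(last_run):
    """Return only the rows that changed after the previous extraction."""
    cur = conn.execute(
        "SELECT id, amount, last_updated FROM orders WHERE last_updated > ?",
        (last_run.strftime("%Y-%m-%d %H:%M:%S"),))
    return cur.fetchall()

last_run = datetime(2024, 1, 2) - timedelta(hours=12)   # 2024-01-01 12:00:00
print(capture_changes(last_run))  # only order 2 is returned
```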

Strong data integration tools are provided by vendors such as:
• SAS Institute
• Oracle
Various integration technologies enable data and metadata integration:
• Enterprise application integration (EAI)
• Service-oriented architecture (SOA)
• Enterprise information integration (EII)
• Extraction, transformation, and load (ETL)
Enterprise application integration (EAI)
• A process that uses technologies and services to integrate software applications across the enterprise
• Provides the benefits of modularization
• Enterprise applications (CRM, SCM, inventory) often cannot communicate with one another
• EAI links enterprise applications to simplify and automate business processes
Benefits of EAI
• Better sharing of vital data across an organization’s
applications
• Access information in real time
• Data freely available to those within the organization
who need access to this information.
• Streamlined processes
Enterprise information integration (EII)
• Real-time data integration from a variety of sources
• A pull engine: a web service satisfies each user request on demand
• EII tools use predefined metadata to populate views that make integrated data appear relational to end users (see the sketch after this list)
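
A minimal sketch of the view idea, assuming two hypothetical source tables (crm_customers and billing_accounts) exposed to users through a single relational view:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two source tables that originate in different applications.
conn.execute("CREATE TABLE crm_customers (cust_id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing_accounts (cust_id INTEGER, balance REAL)")
conn.execute("INSERT INTO crm_customers VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO billing_accounts VALUES (1, 250.0)")

# The EII layer publishes a predefined view; end users query it as one
# relational table while the data stays in its sources.
conn.execute("""
    CREATE VIEW customer_360 AS
    SELECT c.cust_id, c.name, b.balance
    FROM crm_customers c JOIN billing_accounts b ON c.cust_id = b.cust_id
""")

print(conn.execute("SELECT * FROM customer_360").fetchall())  # [(1, 'Acme Ltd', 250.0)]
```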
Extraction, Transformation, and Load (ETL)
ETL tools perform the following:
• Transport data
• Document change in data elements
• Exchange metadata with other applications as
needed
• Administer all runtime processes and operations
(e.g., scheduling, error management, audit logs,
statistics).
Extract
• Data is extracted from the source system into a staging area
• Why a staging area? Transformations and validations are applied there; loading directly from source to the data warehouse risks loading corrupted data and makes recovery challenging
Three data extraction methods:
• Full extraction
• Partial extraction without update notification
• Partial extraction with update notification
Some validations done during extraction (see the sketch after this list):
• Reconcile records with the source data
• Make sure that no spam/unwanted data is loaded
• Data type checks
• Remove all types of duplicate/fragmented data
• Check whether all the keys are in place
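
A minimal sketch of staging extracted rows with some of the validations listed above (key check, duplicate removal, data type check); the field names are hypothetical.

```python
def extract_to_staging(source_rows):
    """Pull raw rows into a staging list, applying basic extraction checks."""
    staged, seen_keys = [], set()
    for row in source_rows:
        if row.get("id") is None:                            # key must be in place
            continue
        if row["id"] in seen_keys:                           # remove duplicates
            continue
        if not isinstance(row.get("amount"), (int, float)):  # data type check
            continue
        seen_keys.add(row["id"])
        staged.append(row)
    return staged

raw = [{"id": 1, "amount": 10.0},
       {"id": 1, "amount": 10.0},    # duplicate -> dropped
       {"id": None, "amount": 5.0},  # missing key -> dropped
       {"id": 2, "amount": "oops"}]  # bad data type -> dropped
print(extract_to_staging(raw))       # [{'id': 1, 'amount': 10.0}]
```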
Transformation
• The key step for producing BI insights
• Some data requires no transformation and is moved directly (pass-through data)
• Situations that call for transformation:
• Different spellings of the same name, such as Jon and John
• Different account numbers generated by various applications for the same customer
• Derived measures such as sum-of-sales revenue
Some validations and rules (see the sketch after this list):
• Complex data validation (e.g., reject empty rows)
• Selection/rejection of rows and columns
• Data threshold validation checks; for example, age cannot be more than two digits
• Conversion of units of measurement (date/time conversion, currency conversion, numerical conversion, etc.)
• Rules and lookup tables for data standardization
• Cleaning (for example, mapping NULL to 0, or gender "Male" to "M" and "Female" to "F")
• Required fields should not be left blank
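
A minimal sketch of a few of the transformation and validation rules listed above (required-field check, age threshold, NULL-to-0 mapping, gender standardization); the record layout is hypothetical.

```python
def transform(record):
    """Apply cleaning and standardization rules; return None to reject the row."""
    if not record.get("customer_id"):           # required field must not be blank
        return None
    age = record.get("age")
    if age is not None and age > 99:            # threshold: age at most two digits
        return None
    record["sales"] = record.get("sales") or 0  # cleaning: map NULL to 0
    record["gender"] = {"Male": "M", "Female": "F"}.get(  # standardize via lookup
        record.get("gender"), record.get("gender"))
    return record

rows = [{"customer_id": 7, "age": 34, "sales": None, "gender": "Female"},
        {"customer_id": 8, "age": 120, "sales": 10, "gender": "Male"}]
print([transform(r) for r in rows])
# [{'customer_id': 7, 'age': 34, 'sales': 0, 'gender': 'F'}, None]
```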
Loading

Challenges
• A huge volume of data must be loaded in a short time window
• Recovery must be possible in case of failure
Types of loading
• Initial load
• Incremental load
• Full refresh
Load verification (see the sketch after this list)
• Ensure that the key field data is neither missing nor null
• Run data checks on dimension tables as well as history tables
• Check calculated measures
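
A minimal sketch of an incremental load followed by load verification, using sqlite3 as a stand-in warehouse; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER PRIMARY KEY, amount REAL)")

def incremental_load(batch):
    """Insert only new rows; already-loaded keys are left untouched."""
    conn.executemany("INSERT OR IGNORE INTO sales_fact VALUES (:sale_id, :amount)", batch)
    conn.commit()

def verify_load():
    """Load verification: key fields are present and the calculated measure exists."""
    nulls = conn.execute("SELECT COUNT(*) FROM sales_fact "
                         "WHERE sale_id IS NULL OR amount IS NULL").fetchone()[0]
    total = conn.execute("SELECT SUM(amount) FROM sales_fact").fetchone()[0]
    return nulls == 0 and total is not None

incremental_load([{"sale_id": 1, "amount": 99.0}, {"sale_id": 2, "amount": 35.0}])
incremental_load([{"sale_id": 2, "amount": 35.0}])   # retried batch: no duplicate rows
print(verify_load(), conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # True 2
```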
VARIATIONS OF OLAP
• ROLAP - Relational Online Analytical Processing
• MOLAP - Multidimensional OLAP
• HOLAP - Hybrid OLAP
ROLAP - Relational Online Analytical Processing
• Data is stored in and fetched from the main data warehouse
• Data is stored in the form of relational tables
• Handles large data volumes
• Uses complex SQL queries to fetch data from the main warehouse
• ROLAP creates a multidimensional view of the data dynamically (see the sketch below)
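
A minimal sketch of the ROLAP idea: each multidimensional slice is produced on the fly by an SQL aggregation over relational tables, here against a hypothetical sales_fact table in sqlite3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product TEXT, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)",
                 [("Widget", "North", 10.0), ("Widget", "South", 20.0),
                  ("Gadget", "North", 5.0)])

# ROLAP: no precomputed cube; each slice is answered by an SQL aggregation
# against the relational warehouse at request time.
def slice_by(dimension):
    return conn.execute(
        f"SELECT {dimension}, SUM(amount) FROM sales_fact GROUP BY {dimension}"
    ).fetchall()

print(slice_by("region"))   # [('North', 15.0), ('South', 20.0)]
print(slice_by("product"))  # [('Gadget', 5.0), ('Widget', 30.0)]
```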
MOLAP - Multidimensional OLAP
• Data is stored in and fetched from multidimensional databases (MDDBs)
• Limited summary data is kept in the MDDBs
• The MOLAP engine creates precalculated, prefabricated data cubes for multidimensional data views (see the sketch below)
• Faster access
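
A minimal sketch of the MOLAP idea: cube cells are precalculated once and queries read the cells instead of the detail rows; a plain dictionary stands in for the MDDB.

```python
from collections import defaultdict

detail_rows = [("Widget", "North", 10.0), ("Widget", "South", 20.0),
               ("Gadget", "North", 5.0)]

# MOLAP: precalculate every cube cell (product x region, plus rollups) up front.
cube = defaultdict(float)
for product, region, amount in detail_rows:
    cube[(product, region)] += amount
    cube[(product, "ALL")] += amount   # rollup over region
    cube[("ALL", region)] += amount    # rollup over product
    cube[("ALL", "ALL")] += amount     # grand total

# Queries read a precomputed cell instead of scanning the detail rows.
print(cube[("Widget", "ALL")])  # 30.0
print(cube[("ALL", "North")])   # 15.0
```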
HOLAP - Hybrid Online Analytical Processing
• Combination of ROLAP and MOLAP
• Stores data in both relational and multidimensional databases
• Uses whichever one is best suited to the type of processing desired
• Materializes cells: stores the results of very frequent queries (see the sketch below)
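
A minimal sketch of materializing cells: results of frequent queries are kept in a MOLAP-style store, with ROLAP-style fallback to the relational detail; the sales_fact layout is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",
                 [("North", 10.0), ("South", 20.0), ("North", 5.0)])

materialized = {}   # cells kept for very frequent queries

def total_for_region(region):
    if region in materialized:                 # HOLAP: serve frequent cells directly
        return materialized[region]
    total = conn.execute("SELECT SUM(amount) FROM sales_fact WHERE region = ?",
                         (region,)).fetchone()[0]   # fall back to relational detail
    materialized[region] = total               # materialize the cell for next time
    return total

print(total_for_region("North"))  # 15.0, computed relationally then materialized
print(total_for_region("North"))  # 15.0, served from the materialized cell
```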
DATA WAREHOUSING
IMPLEMENTATION ISSUES
• End-user support
• Establishment of service-level agreements
• Identification of data sources
• Data quality planning
• Data model design
• ETL tool selection
• Relational database software and platform selection
• Purge and archive planning
Implementation guidelines
• The DW project must fit with the corporate strategy and business objectives
• Secure complete buy-in to the project
• Build the data warehouse incrementally
• Build in adaptability and scalability from the start
• The project should be managed by both IT and business professionals
• Load only data that supports decision analysis
• Choose proven tools and methodologies
Avoid the following issues during a DW project:
• Starting with the wrong sponsorship chain.

• Setting expectations that you cannot meet.

• Loading the data warehouse with information just because it is available.

• Believing that data warehousing database design is the same as transactional database
design.

• Choosing a data warehouse manager who is technology oriented rather than user
oriented.
Contd..
• Focusing on traditional internal record-oriented data and ignoring the
value of external data and of text, images, and, perhaps, sound and video.

• Delivering data with overlapping and confusing definitions.

• Believing promises of performance, capacity, and scalability.

• Believing that your problems are over when the data warehouse is up and
running.
Real-time data warehousing
Why real-time data warehousing?
• OLTP systems are updated continuously
• Traditional data warehouses are not treated as business critical; updates are typically weekly
• Real-time data warehousing is the process of loading and providing data via the data warehouse as the data become available
Different names used in practice to describe the same concept:
• Real-time data warehousing
• Near-real-time data warehousing
• Zero-latency warehousing
• Active data warehousing
Comparison Between Traditional and Active Data Warehousing Environments

Traditional Data Warehousing Environment vs. Active Data Warehousing Environment
• Strategic decisions only vs. strategic and tactical decisions
• Daily, weekly, or monthly data is acceptable vs. only comprehensive, detailed data available within minutes is acceptable
• Moderate user concurrency vs. a high number (1,000 or more) of users accessing and querying the system simultaneously
• Often uses predeveloped summary tables or data marts vs. flexible ad hoc reporting as well as machine-assisted modeling (e.g., data mining) to discover new hypotheses and relationships
• Power users, knowledge workers, and internal users vs. operational staff, call centers, and external users
Issues with real-time data warehousing
• Reporting on data that is changing continuously
• Not all fields need real-time updates
• Enabling real-time ETL (see the micro-batch sketch after this list)
• No system downtime is available for loading
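
One way to enable near-real-time ETL is micro-batching: poll a change feed frequently and push small batches through the same extract-transform-load steps sketched earlier. A minimal sketch with a queue standing in for the change feed:

```python
import time
from collections import deque

pending_changes = deque()   # stand-in for a change feed (CDC log or message queue)
warehouse = []              # stand-in for the target fact table

def micro_batch_cycle():
    """Drain whatever arrived since the last cycle and load it immediately."""
    batch = []
    while pending_changes:
        batch.append(pending_changes.popleft())
    warehouse.extend(batch)   # the transform/load steps sketched earlier go here
    return len(batch)

pending_changes.append({"sale_id": 1, "amount": 9.5})   # a change arrives
for _ in range(2):                                      # bounded polling loop
    loaded = micro_batch_cycle()
    print(f"loaded {loaded} change(s); warehouse has {len(warehouse)} row(s)")
    time.sleep(1)                                       # near-real-time latency ~1s
```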
Security in the data warehouse
• Effective corporate security policies and procedures
• Implementing logical security procedures
• Limiting physical access to the data center environment
• An internal control review process focused on security and privacy