
Enterprise Systems

Session 1&2: MBA – Full Time (Trim III)

IS Stream

Jan-18
Table of Contents

1. Introduction
2. DW Basics
3. Business Performance Management
4. Data Mining
5. Text Mining fundamentals
6. Big Data and Analytics
7. Future Trends in Analytics

2
Confidential |
Why DW

Information usage

 Before the start of the information age (late 20th century)


– Businesses collected information from non-automated sources
– Absence of computing resources for proper analysis
 Commercial decisions made based primarily on intuition
 Automation in businesses → more applications → more data/information
– Collection of data remained a challenge:
• Lack of infrastructure for information exchange
• Incompatibilities between systems
– Reports across systems took days/months to generate
• Such reports could inform long-term strategic decision making
• Short-term tactical decision making still relied on intuition

Problems in Information usage

 Decision makers require concise, dependable, timely information


– About current operations, trends, and changes
 Problem of fragmented data:
– Large businesses have multiple applications to run their operations
• Information is scattered across multiple platforms and variations of technology
– Information from multiple operational applications is not available in one place
• Almost impossible for any one individual to peruse information from multiple sources
– Needed to submit a request for a report to the IT team
– Completing reporting requests across operational systems could take days or weeks
– Operational information:
• Is mainly current
– Does not include history that is required to make good decisions
– Without information history, it is difficult to tell how and why things change over time
• Frequently has quality issues
• Is not designed for analysis and decision support

 Partial information forms the basis of decisions


 Data Rich, Information poor
Data Warehousing

 Solution: Data Warehousing


– Take data from multiple platforms/technologies
– Place them in a common location that uses a common querying tool
– Standardizes information across applications
– Helps in accessing, integrating, and organizing key operational data
– In a form that is consistent, reliable, timely, and readily available
– Wherever and whenever needed
 Decision support more readily available without affecting day-to-day operations
– Operational databases could be held on whatever system was most efficient for the operational business
– Reporting/Strategic information could be held in a separate common location
 Data Warehouse
– Pool of data set up to support decision making: NOT the storehouse for ALL business information
– Repository of current and historical data
– Structured in a form ready for analytical processing activities
• Reporting, OLAP, Data Mining, ad-hoc querying
 Business intelligence: Art of
– Sifting through large amounts of data
– Extracting information
– Turning that information into actionable knowledge

Fundamental Characteristics of a DW

 Subject oriented
– Data organized around “Subjects”: sales, products, customers
– Information collected about subjects from all relevant systems
– Difference with operational database
 Integrated
– Closely related to subject orientation
– Data from different sources, consistent format
– Get rid of conflicts and discrepancies
 Time variant
– Maintenance of historical data
– Current status reporting is optional
– Single most important dimension
– Support for multiple time points
 Non-Volatile
– No update of data once written into the DW
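
The time-variant and non-volatile properties can be illustrated with a minimal Python sketch. It assumes a toy in-memory table in which a change is recorded as a new row with a load date rather than as an update. The class name `AppendOnlyStore` and the sample customer data are hypothetical and serve only as an illustration.

```python
from datetime import date


class AppendOnlyStore:
    """Toy warehouse table: rows are appended with a load date, never updated."""

    def __init__(self):
        self._rows = []  # list of (key, value, load_date) tuples

    def append(self, key, value, load_date):
        # Non-volatile: a change is recorded as a new row, not an overwrite.
        self._rows.append((key, value, load_date))

    def as_of(self, key, when):
        # Time variant: return the latest value loaded on or before `when`.
        candidates = [(d, v) for k, v, d in self._rows if k == key and d <= when]
        return max(candidates, key=lambda c: c[0])[1] if candidates else None

    def history(self, key):
        # Every version ever loaded for this key, in load order.
        return [(d, v) for k, v, d in self._rows if k == key]


store = AppendOnlyStore()
store.append("cust-1", "Pune", date(2017, 1, 10))
store.append("cust-1", "Mumbai", date(2017, 9, 1))
city_mid_2017 = store.as_of("cust-1", date(2017, 6, 30))
city_2018 = store.as_of("cust-1", date(2018, 1, 1))
```

Because the earlier row is kept, the same question can be answered for any point in time, which an operational table holding only the current city could not do.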

DW Types

 Data Marts
– Same as EDW, but smaller in scale
– Focuses on a particular subject or department
– Can be Dependent (gets its feeds from the DW) or Independent (sourced directly from operational systems rather than from the DW)
 Operational Data Stores
– Integrate corporate data from different heterogeneous data sources
– Facilitate operational reporting in real-time or near real-time
– Structured similar to the source systems: not optimized for historical and trend analysis
– During integration the data can be cleaned, denormalized, and business rules applied to
ensure data integrity
– Data at the lowest granular level
– Integration with source systems on a regular basis
– Frequently used as a data source for the data warehouse
 Enterprise Data Warehouse
– Store data from multiple data sources
– To be used for historical and trend analysis reporting
– Acts as a central repository for many subject areas
– Contains the “single version of truth”

DW Goals

 Information accessibility
 Information credibility
 Flexible to change
 Support for more fact-based decision making
 Support for data security
 Information consistency

DW Benefits

 Direct Benefits:
– End users performing analyses in different ways
– Consolidated view of the corporate data is possible
– Availability of better and more-timely information
– Enhanced system performance
– Simplified data access
 Indirect benefits:
– Enhance business knowledge
– Boost competitive advantage
– Improve customer service and satisfaction
– Facilitate decision making
– Reforming business processes
 Costs:
– Hardware, software, network bandwidth
– Internal development, internal support, training, external consulting
 Returns:
– Money saved by improving traditional decision support functions
– Money saved due to automated collection and dissemination of information
– Money saved or gained from decisions made using DW
 Success Factors:
– Clearly defining the business objectives
– Support from top management
– Setting reasonable time frames and budgets
– Managing expectations

DW Components

 Operational source systems


 Data staging area
– To gather data from different sources ready to be processed at
different times
– To quickly load information from the operational database
– To find changes against current DW/DM values
– To cleanse data
– To pre-calculate aggregates
 Data presentation area
 Data access tools

DW Architecture

Data Integration

 Comprises three major processes:


– Data access: Ability to access and extract data from any data source
– Data federation: Integration of business views across multiple
data stores
– Change capture: Identification, capture, and delivery of the
changes made to enterprise data sources
 Integration technologies that enable data and metadata integration:
– Enterprise application integration (EAI)
– Service-oriented architecture (SOA)
– Enterprise information integration (EII)
– Extraction, transformation, and load (ETL)
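
One simple way to picture change capture is to compare two snapshots of a source table. The Python sketch below assumes each snapshot is a dict keyed by primary key. The function name `capture_changes` and the sample rows are hypothetical, and snapshot comparison is only one possible technique (production tools often read database logs instead).

```python
def capture_changes(previous, current):
    """Compare two snapshots (dicts keyed by primary key) and classify changes."""
    inserted = {k: current[k] for k in current.keys() - previous.keys()}
    deleted = {k: previous[k] for k in previous.keys() - current.keys()}
    updated = {
        k: (previous[k], current[k])  # keep both the old and the new version
        for k in current.keys() & previous.keys()
        if previous[k] != current[k]
    }
    return {"inserted": inserted, "updated": updated, "deleted": deleted}


# Made-up snapshots of a customer table on two consecutive days.
yesterday = {1: {"name": "Asha", "city": "Pune"}, 2: {"name": "Ravi", "city": "Delhi"}}
today = {1: {"name": "Asha", "city": "Mumbai"}, 3: {"name": "Meera", "city": "Chennai"}}
changes = capture_changes(yesterday, today)
```

Only the three changed rows would then need to be delivered downstream, instead of reloading the whole table.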

Data Integration Technologies (EAI)

 Pushing data from source systems into the DW


 Integrates application functionality
 Focused on sharing functionality (rather than data) across systems,
thereby enabling flexibility and reuse
 Focus on enabling application reuse at the API level

 EAI is accomplished by using SOA coarse-grained services (a collection of business processes or functions)
 Using Web services is a specialized way of implementing an SOA
 EAI can be used to facilitate data acquisition directly into a near
real-time DW or to deliver decisions to the OLTP systems
 Different approaches to and tools for EAI implementation

Data Integration Technologies (EII)

 Promises real-time data integration from a variety of sources (relational DB, Web services, and multidimensional DB)
 Mechanism for pulling data from source systems
 Use predefined metadata to populate views that make integrated
data appear relational to end users
 Usage of XML to tag data
 Tags can be extended and modified to accommodate almost any
area of knowledge
 With EII new virtual data integration patterns are feasible
– Physical data integration has conventionally been the main mechanism
for creating an integrated view with DW and data marts
– New data integration patterns that can expand traditional physical
methodologies to present a comprehensive view for the enterprise

Data Integration

 Data Integration:
– Integration of data present in different sources for providing a unified view of the data
– Ability to consolidate data from several different sources while maintaining the integrity
and reliability of the data
 Approaches:
– Schema Integration
• Developing a unified representation of semantically similar information, structured and stored
differently in the individual databases
• Done using various mapping rules to handle structural differences
– Instance Integration
• Information is retrieved directly from the data
• Identify and integrate all instances of the data items that represent the same real-world entity
 Mechanisms
– Federated (Virtual) Database
• Fully integrated, logical composite result of all of its constituent databases
• Query submitted to constituent DBs in parts, results consolidated
– Data Warehousing
• Analyse and qualify source data
• Data profiling
• Source-to-target mapping
• Data cleansing and transformation
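
The two integration approaches can be sketched in a few lines of Python. The example assumes two source systems (a CRM and a billing system) that store the same customer under different field names. All field names, mapping rules, and records are hypothetical, and matching on a normalized e-mail address is just one simple choice of matching rule.

```python
# Hypothetical mapping rules: source field name -> unified field name.
CRM_MAPPING = {"cust_id": "customer_id", "full_name": "name", "email_addr": "email"}
BILLING_MAPPING = {"CustomerNo": "customer_id", "CustomerName": "name", "Email": "email"}


def to_unified(record, mapping):
    """Schema integration: rename source fields to the unified representation."""
    return {unified: record[source] for source, unified in mapping.items()}


def integrate_instances(records):
    """Instance integration: merge records that share a normalized e-mail."""
    merged = {}
    for rec in records:
        key = rec["email"].strip().lower()
        # The first record seen for an entity supplies its display name.
        entity = merged.setdefault(key, {"email": key, "name": rec["name"], "source_ids": []})
        entity["source_ids"].append(rec["customer_id"])
    return list(merged.values())


crm_rows = [{"cust_id": "C-17", "full_name": "Asha Rao", "email_addr": "Asha.Rao@example.com"}]
billing_rows = [
    {"CustomerNo": 9001, "CustomerName": "A. Rao", "Email": "asha.rao@example.com "},
    {"CustomerNo": 9002, "CustomerName": "Ravi Shah", "Email": "ravi@example.com"},
]
unified = [to_unified(r, CRM_MAPPING) for r in crm_rows] + [
    to_unified(r, BILLING_MAPPING) for r in billing_rows
]
entities = integrate_instances(unified)
```

Here three source records collapse into two real-world customers, and each merged entity keeps the identifiers it had in the source systems.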
ETL

 Extract: reading data from sources into staging area or ODS
 Transformation: Change and clean - make it fit for upload
– Rules, Lookup Tables, combining data: integrate, cleanse
– Cleanse
• Missing records or attributes
• Redundant records
• Missing keys or other required data
• Erroneous relationships
• Inaccurate data
 Sample transformations
– Selecting only certain columns to load
– Translating a few coded values
– Encoding some free-form values
– Deriving a new calculated value
– Joining together data derived from multiple sources
– Summarizing multiple rows of data
– Splitting a column into multiple columns
 Load: upload the data into the data warehouse
– Transport data between sources and targets
– Document how data elements change as they move
– Exchange metadata with other applications as needed
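
The extract, transform, and load steps can be sketched as a small Python script that uses the standard-library `sqlite3` module as a stand-in for the warehouse. The table name `orders`, the column names, the gender code lookup, and the sample rows are all hypothetical. A real ETL tool would add scheduling, error handling, and audit logging.

```python
import sqlite3

# Extract: rows as they might arrive from a source system (made-up sample data).
source_rows = [
    {"order_id": 1, "customer": "Asha Rao", "gender": "F", "qty": 2, "unit_price": 50.0},
    {"order_id": 2, "customer": "Ravi Shah", "gender": "M", "qty": 1, "unit_price": 20.0},
    {"order_id": 2, "customer": "Ravi Shah", "gender": "M", "qty": 1, "unit_price": 20.0},
    {"order_id": 3, "customer": None, "gender": "M", "qty": 4, "unit_price": 5.0},
]

GENDER_CODES = {"F": "Female", "M": "Male"}  # lookup table for coded values


def transform(rows):
    seen, clean = set(), []
    for row in rows:
        if row["customer"] is None or row["order_id"] in seen:
            continue  # cleanse: drop records with missing attributes and duplicates
        seen.add(row["order_id"])
        first, last = row["customer"].split(" ", 1)  # split one column into two
        clean.append((
            row["order_id"], first, last,
            GENDER_CODES[row["gender"]],     # translate a coded value
            row["qty"] * row["unit_price"],  # derive a new calculated value
        ))
    return clean


def load(conn, rows):
    # Only the selected and derived columns are loaded; qty and unit_price are not.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, gender TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(source_rows))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Of the four source rows, the duplicate of order 2 and the order with no customer name are dropped during cleansing, so two rows reach the target table.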

ETL

 Administration: scheduling, error management, audit logs, statistics


 Make or buy decision involved: cost, learning curve, quality
 Four categories: sophisticated, enabler, simple, and rudimentary
 Sophisticated: better documented and managed
 Nowadays often bundled with the database platform
 Complexity dependent on data discipline
 Criteria for selection:
– Ability to read from and write to different data source architectures
– Automatic capturing and delivery of metadata
– Open standards
– An easy-to-use interface

Data Quality

 Concept has a wider scope; rooted in the business


 Measured with reference to
– Appropriateness for purpose
• As defined by the business users
– Conformance to enterprise data quality standards
• As formulated by systems architects and administrators
 Goes beyond Data Integrity
– Primary, Foreign Keys; Not Null; Check Constraints
 Dimensions:
– Correctness/Accuracy
– Consistency
– Completeness
– Timeliness
– Metadata
 Data Governance and Security

Metadata

 Data about data


 Describe the structure of, and some meaning about data
 Contributes to the effective use of data
 Comprises information that increases our understanding of
traditional data: “releases potential”
 Purpose: provide context to the reported data: enrich info
 Assists conversion of data and information into knowledge
 Organizations need to understand importance of metadata, and
how to design and implement a metadata strategy
– Foundation to the “metabusiness architecture”
 Five maturity levels of management of metadata:
ad hoc, discovered, managed, optimized, automated
 Ethical considerations in ownership of metadata information
(privacy, IP)
Metadata (Contd.)

 Types of metadata:
– (Usage):
• Technical or business metadata
– (Pattern):
• Syntactic: describing the syntax of data
• Structural: describing the structure of data
• Semantic: describing the meaning of the data in a specific domain
 Successful metadata driven enterprise satisfies the following
requirements:
– Extensibility
– Reusability
– Versatility
– Versioning
– Interoperability
– Evolution
– Low maintenance cost
– Segregation
– Effectiveness
– Efficiency and performance
– Flexibility
– Entitlement
Data Profiling

 Automated study and analysis of data in the source


– Identifying data quality issues with the source
– Looking out for errors that crept in during data integration
 Helps us make a thorough assessment of data quality
– Discovery of anomalies in data
– Understand content, structure, relationships, etc. about the data
– Assess and validate metadata
 Analysis may be performed against:
– Specified business rules or requirements (DQ Profiling)
– The database schema: structure, relationships between tables, columns used,
data types of columns, keys of the tables, etc. (DB Profiling)
 Other aspects (Reading Material: Chapter 6: pg. 184 P&A)
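
Both kinds of profiling can be sketched in Python over a list of rows. `profile_column` gathers structural facts about one column (in the spirit of DB profiling), and `rule_violations` checks a business rule (in the spirit of DQ profiling). The function names, the sample customer rows, and the age rule of 0 to 120 are hypothetical.

```python
def profile_column(rows, column):
    """Basic structural profile of one column in a list of dict rows."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "types": sorted({type(v).__name__ for v in present}),
        # A column with no nulls and no repeated values could serve as a key.
        "is_candidate_key": len(present) == len(values) and len(set(present)) == len(values),
    }


def rule_violations(rows, rule):
    """DQ profiling: return the rows that break a business rule (a predicate)."""
    return [row for row in rows if not rule(row)]


# Made-up source rows with typical problems: a null, a wrong type, an outlier.
customers = [
    {"id": 1, "age": 34, "country": "IN"},
    {"id": 2, "age": None, "country": "IN"},
    {"id": 3, "age": "41", "country": "US"},
    {"id": 4, "age": 230, "country": "US"},
]
age_profile = profile_column(customers, "age")
id_profile = profile_column(customers, "id")
bad_ages = rule_violations(
    customers, lambda r: isinstance(r["age"], int) and 0 <= r["age"] <= 120
)
```

The profile of `age` reveals one null and a mix of integer and string values, while the rule check flags the three rows that a cleansing step would need to handle.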

DW Development Approaches

 (Bill) Inmon Model: EDW Approach


– Top-down development approach
– Does not preclude the creation of data marts
– Provides a consistent and comprehensive view of the enterprise
– More expensive and more time-consuming process, with several complexities
– Able to achieve the "single version of truth"
 (Ralph) Kimball Model: Data Mart Approach
– Bottom-up approach: "plan big, build small"
– Data Mart: single subject/department
– Development involves building one data mart at a time
– Faster, cheaper, and less complex approach
– "Single version of truth" might be compromised
Comparison of DW Dev Approaches

Effort | Data Mart Approach | EDW Approach
Scope | One subject area | Several subject areas
Development time | Months | Years
Development cost | $10,000 to $100,000+ | $1,000,000+
Development difficulty | Low to medium | High
Data prerequisite for sharing | Common (within business area) | Common (across enterprise)
Sources | Only some operational and external systems | Many operational and external systems
Size | Megabytes to several gigabytes | Gigabytes to petabytes
Time horizon | Near-current and historical data | Historical data
Data transformations | Low to medium | High
Update frequency | Hourly, daily, weekly | Weekly, monthly
Hardware | Workstations and departmental servers | Enterprise servers and mainframe computers
Operating system | Windows and Linux | Unix, Z/OS, OS/390
Databases | Workgroup or standard database servers | Enterprise database servers
Number of simultaneous users | 10s | 100s to 1,000s
User types | Business area analysts and managers | Enterprise analysts and senior executives
Business spotlight | Optimizing activities within the business area | Cross-functional optimization and decision making
Data Representation in DW

 Relational database: Information contained in a series of two-dimensional tables


 Data Warehouse: Design based on Dimensional modelling
– Information is contained in layers of columns and rows
– A dimension is a particular attribute of information
– Each layer in a data warehouse represents information pertaining to an additional
dimension
– A cube is the common term for the representation of multi-dimensional information
 Aimed to provide fast query-response time, simplicity, and ease of maintenance for
read-only database structures
 Supports complex multidimensional queries and analysis
– Users can analyse information in a number of different ways and with any number of
different dimensions
– Allows users to gain insights into their information
 Meant to accommodate and speed up query processing
 Implementation through star and snowflake schemas

Data Representation in DW

 Fact Table:
– Central fact table: decision analysis attribute
– Performance measures, operational metrics, aggregated measures
– Contains the descriptive attributes needed to perform decision
analysis and query reporting
– Connected to several dimension tables through foreign keys
 Dimension tables:
– Contain classification and aggregation information about the central
fact rows
– Address how data will be analysed and summarized
– Used to slice and dice the numerical values in the fact table
 Star schema: total de-normalization
 Snowflake schema: some level of normalization of dimensions
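
A star schema can be sketched with Python's standard-library `sqlite3` module: one fact table holding the measures, linked by foreign keys to two denormalized dimension tables. The table names (`fact_sales`, `dim_product`, `dim_date`), columns, and sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used to slice and dice.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- Fact table: measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    units INTEGER,
    revenue REAL
);
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'), (3, 'Antivirus', 'Software');
INSERT INTO dim_date VALUES (20170101, 2017, 'Q1'), (20170401, 2017, 'Q2');
INSERT INTO fact_sales VALUES
    (1, 20170101, 2, 2000.0), (2, 20170101, 10, 150.0),
    (3, 20170101, 5, 250.0), (1, 20170401, 1, 1000.0);
""")

# Slice the revenue measure by two dimension attributes.
revenue_by_category_quarter = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    JOIN dim_date AS d ON d.date_key = f.date_key
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
```

In a snowflake variant of the same design, `category` would move out of `dim_product` into its own table referenced by a key, at the cost of one more join per query.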

Star and Snowflake

Storage of OLAP Cubes

 Three different modes of storage:


 MOLAP: Multidimensional Online Analytical processing
– Traditional mode in OLAP analysis
– Data is stored in form of multidimensional cubes
– Calculations are pre-generated
– Cubes are built for fast data retrieval
– Provides excellent query performance
– Can handle only a limited amount of data
– Cube technology is proprietary
 ROLAP: Relational Online Analytical Processing
– Underlying data is stored in relational databases
– Can handle a large amount of data
– Can leverage all the functionalities of the relational database
– Performance can be slow; queries are limited by what SQL can express
 HOLAP: Hybrid Online Analytical Processing
– Tries to combine the strengths of the other two models
– For summary type information HOLAP leverages cube technology and for drilling down into
details it uses the ROLAP model
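
The trade-off between the storage modes can be illustrated with a toy Python sketch. `build_cube` pre-computes every aggregate up front, in the MOLAP spirit, while `rolap_total` scans the detail rows at query time, in the ROLAP spirit. The function names and fact rows are hypothetical, and real OLAP engines use far more compact storage than a dict.

```python
from collections import defaultdict
from itertools import combinations

# Made-up fact rows: (region, product, units).
FACTS = [("North", "Laptop", 3), ("North", "Mouse", 7), ("South", "Laptop", 2), ("South", "Mouse", 5)]


def build_cube(facts):
    """MOLAP-style: pre-compute totals for every combination of dimensions."""
    cube = defaultdict(int)
    for *coords, measure in facts:
        for size in range(len(coords) + 1):
            for kept in combinations(range(len(coords)), size):
                # "*" marks a dimension that has been aggregated away.
                key = tuple(coords[i] if i in kept else "*" for i in range(len(coords)))
                cube[key] += measure
    return dict(cube)


def rolap_total(facts, region=None, product=None):
    """ROLAP-style: scan the detail rows at query time (like a SQL GROUP BY)."""
    return sum(
        units for r, p, units in facts
        if (region is None or r == region) and (product is None or p == product)
    )


cube = build_cube(FACTS)
north_from_cube = cube[("North", "*")]            # a single lookup
north_from_rows = rolap_total(FACTS, region="North")  # a full scan
```

Both paths give the same total, but the cube stores nine cells for only four fact rows. That growth is why pre-computed cubes answer quickly yet handle limited data volumes, and why a hybrid design keeps summaries in the cube and leaves detail in relational tables.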

DW Implementation: major tasks

 Establishment of service-level agreements and data-refresh requirements
 Identification of data sources and their governance policies
 Data quality planning
 Data model design
 ETL tool selection
 Relational database software and platform selection
 Data transfer
 Data conversion
 Reconciliation process
 Purge and archive planning
 End-user support
DW Implementation best practices

 Alignment with corporate strategy and business objectives


 Stakeholders' buy-in to the project
 Manage user expectations about the project
 Incremental build
 Incorporate adaptability and scalability
 Joint ownership by both IT and business
 Focus on data quality, relevance and necessity
 Training for users
 Choice of tools and methodologies and service providers
 Organizational culture

DW Implementation Risks

 Executive sponsorship
 Unmet expectations
 Effective engagement of stakeholders
 Choice of information to be loaded onto the DW
 Specialized design of the DW
 Involvement of business users
 Quality of data
 Performance, capacity and scalability
 Continuous governance, maintenance

Thank you
