Business Intelligence and Tools


Unit 3 Architecting the Data
Structure:
3.1 Introduction
Objectives
3.2 Types of Data
3.2.1 Real-time Data
3.2.2 Derived Data
3.2.3 Reconciled Data
Self Assessment Question(s) (SAQs)
3.3 Enterprise Data Model
3.3.1 Phased Enterprise Data Modeling
3.3.2 A Simple Enterprise Data Model
3.3.3 Advantages and Limitations of EDM
Self Assessment Question(s) (SAQs)
3.4 Granularity of the Data
Self Assessment Question(s) (SAQs)
3.5 Data Partitioning
3.5.1 Objectives of partitioning
3.5.2 Criteria of Partitioning
3.5.3 Logical Data Partitioning
3.6 Meta Data
3.6.1 Data warehouse Metadata
3.6.2 Use of Metadata
3.6.3 Managing of the Metadata
Self Assessment Question(s) (SAQs)
3.7 Total Data Quality Management (TDQM)
3.7.1 Data Quality Types
3.7.2 Concept of TDQM
Self Assessment Question(s) (SAQs)
3.8 Summary
3.9 Terminal Questions (TQs)
3.10 Multiple Choice Questions (MCQs)
3.11 Answers to SAQs, TQs, and MCQs
3.11.1 Answers to Self Assessment Questions (SAQs)
3.11.2 Answers to Terminal Questions (TQs)
3.11.3 Answers to Multiple Choice Questions (MCQs)
3.1 Introduction
The basic concept of data warehousing is to clean, filter, transform, summarize, and aggregate the data and then keep it in a structured form so that it is easily accessible for making business decisions. Defining an appropriate structure for the data is therefore crucial. By architecting the data, you can structure and locate it so that the data loaded into the data warehouse best supports your decision-making process. In this Unit, we discuss the various types of data used in data warehousing and some key considerations for architecting that data.
Objectives:
In this Unit, you will learn about:
The various types of data used in an organization
The Enterprise Data Model (EDM) approach to developing a data model
Granularity of the data
Physical and logical partitioning of the data
The significance and use of metadata
The practice of Total Data Quality Management (TDQM)
3.2 Types of Data
Before structuring the data in a data warehouse, you can divide the data into three categories according to usage, scope, and currency, as listed below.
Real-time data
Derived data
Reconciled data
3.2.1 Real-time Data
This data represents the current status of the business. Typically, real-time
data is used by operational applications to run the business and the data
constantly changes as operational transactions are processed. This data is
at a detailed level and is usually accessed in read/write mode by the
operational transactions. The real-time data is extracted and distributed to informational systems (data warehouses) throughout the organization for tactical decision-making; in the banking industry, for instance, real-time data plays an important role in tactical decisions. The real-time data is delivered to the data warehouse from the operational systems through an independent mechanism, often called a deferred or delayed delivery system.
Before using the real-time data in a data warehouse, you have to cleanse it to ensure that its quality is appropriate; that is, it has to be summarized and transformed into a format that is easily understood. This is because real-time data consists of all the individual, transactional, and detailed data values that are specific to the respective operational systems. In addition, the data may not be consistent in representation and meaning, because it may come from several different systems; for example, the units of measure, currencies, and exchange rates may differ among the systems. These anomalies must therefore be reconciled before the real-time data is loaded into a data warehouse.
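As a minimal sketch of the kind of reconciliation described above, the example below normalizes records from two hypothetical operational systems to a common currency and unit of measure before loading; the source fields, exchange rates, and conversion factors are assumptions made for the illustration, not part of any particular system.

```python
from decimal import Decimal

# Hypothetical exchange rates and unit factors; in practice these would come
# from a reference (reconciliation) table maintained alongside the warehouse.
TO_USD = {"USD": Decimal("1.0"), "EUR": Decimal("1.08"), "INR": Decimal("0.012")}
TO_KG = {"kg": Decimal("1.0"), "lb": Decimal("0.4536")}

def reconcile(record: dict) -> dict:
    """Cleanse one operational record into the warehouse's common format."""
    amount = Decimal(str(record["amount"])) * TO_USD[record["currency"]]
    weight = Decimal(str(record["weight"])) * TO_KG[record["unit"]]
    return {
        "customer_id": record["customer_id"].strip().upper(),  # consistent key format
        "amount_usd": amount.quantize(Decimal("0.01")),
        "weight_kg": weight.quantize(Decimal("0.001")),
    }

# Records arriving from two different operational systems
raw = [
    {"customer_id": " c101 ", "amount": 2500, "currency": "INR", "weight": 4, "unit": "kg"},
    {"customer_id": "C102", "amount": 99.5, "currency": "EUR", "weight": 10, "unit": "lb"},
]
cleansed = [reconcile(r) for r in raw]
print(cleansed)
```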
3.2.2 Derived Data
This is data that has been derived or created, for example by aggregating or averaging the real-time data through a defined process. It can represent a view of the business at a specific point in time or a historical record of the business over a period of time. Based on the requirements, the derived data can be either detailed or summarized.
Traditionally, derived data is used by data analysts for data analysis and decision-making, because they do not require large volumes of detailed data; summarized data is far more useful for analysis and manipulation. To improve query processing, an efficient approach followed by business analysts is to anticipate the derived data elements that will be needed and to summarize the detailed data in advance to meet the specific business requirements.
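As a small, hypothetical illustration of deriving data through a defined process, the sketch below totals and averages real-time transaction amounts per product to produce a summarized view of the business at a point in time; the field names and figures are invented for the example.

```python
from collections import defaultdict
from statistics import mean

# Real-time (detailed) transactions as they would appear in the operational system
transactions = [
    {"product": "savings", "amount": 1200.0},
    {"product": "savings", "amount": 800.0},
    {"product": "loans", "amount": 5000.0},
]

# Defined derivation process: group by product, then total and average the amounts
by_product = defaultdict(list)
for t in transactions:
    by_product[t["product"]].append(t["amount"])

derived = {
    product: {"total": sum(amounts), "average": mean(amounts), "count": len(amounts)}
    for product, amounts in by_product.items()
}
print(derived)  # a summarized view of the business at this point in time
```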
3.2.3 Reconciled Data
Reconciled data is real-time data that has been cleansed, modified, or enhanced. It provides an integrated source of quality data for data analysts to use in their analysis. Because historical data can be created and maintained while the data is being reconciled, reconciled data is also regarded as a type of derived data.
You can configure an appropriate warehouse on the basis of the above data
types considering the requirements for any particular implementation effort.
Depending on the nature of the operational systems, the type of business,
the number of users that access the warehouse and the frequency of
access, you can combine these types of data to develop the most
appropriate warehouse architecture.
Self Assessment Question(s) (SAQs)
For Section 3.2
1. What is reconciled data, and how is it different from the other types of data, real-time data and derived data?
3.3 Enterprise Data Model
The Enterprise Data Model (EDM) is an approach to develop a data
warehouse data model. An EDM is a consistent definition of all the data
elements common to the business, from a high-level business view to a
generic logical data design. Using this model, you can derive the general
scope and understanding of the business requirements and the model also
includes links to the physical data designs of the individual applications.
A typical EDM can be of either of the following types:
Phased Enterprise Data Modeling
A Simple Enterprise Data Model
3.3.1 Phased Enterprise Data Modeling
There are several methodologies for Enterprise Data Modeling in practice.
Some authors propose a 3-tiered methodology (such as conceptual, logical
and physical data model) or a 5-tiered model. According to the Worldwide
Solution Design and Delivery Method proposed by IBM, a phased enterprise
data modeling includes the following tiers:
ISP: Information System Planning
BAA: Business Area Analysis
BSD: Business System Design
BSI: Business System Implementation
BSM: Business System Maintenance
Whatever methodology is used, it follows a phased or layered approach, and the phases typically include the following tasks:
Information Planning
Business Analyzing
Logical data modeling
Physical data design
Figure 3.1 represents the tasks or phases in a phased enterprise data model. The size of each phase represents the amount of information to be included in that phase. The figure shows that the amount of information is minimal in the Information Planning phase and increases in the later phases.
The Information Planning phase provides a highly consolidated view of the business, in which you can view the business concepts. These business concepts can be categorized as business entities, super entities, or subject areas, and each of these items maintains related data elements. For example, the business concepts in an organization can be customer, product, policy, schedule, and so on. This phase provides the scope and architecture of a data warehouse and acts as a single, comprehensive view for the remaining phases.
The Business Analyzing phase further defines the concepts identified in the previous phase. It is expressed in business terms so that business people can understand the data details without any special training. The purpose of this phase is to gather and arrange the business requirements and define the business terms.
The Logical Data Modeling phase is organization-wide in scope and generic to all the applications addressed in the later phases. The model typically consists of several hundred entities and contains the identification and definition of all entities, relationships, and attributes. The entities of the logical data model can be further partitioned into views by subject area or by application. This phase can produce two types of model: a generic logical data model at the organizational level and a logical application model at the application level.
The Physical Data Design phase designs the actual physical implementation and applies physical constraints, such as space, performance, and the physical distribution of the data.
3.3.2 A Simple Enterprise Data Model
A simple Enterprise Data Model is a data model in which some of the core components required for data warehouse modeling are extracted and grouped for use in a phased approach. For a simple EDM, you can make a list of around 20 subject areas and then draw a subject area model of the organization by defining the business relationships among those subject areas. You are then required to define the contents of each of these subject areas. For example, when you define a customer, you can also mention whether the customer is a prospective customer or an ex-customer.
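Such a subject area model can also be recorded in a simple, machine-readable form. The sketch below is a hypothetical example, with invented subject areas and relationships, of how the contents of each subject area and the business relationships among them might be captured as a starting point for a simple EDM.

```python
# Hypothetical subject areas with a short definition of their contents
subject_areas = {
    "Customer": "Prospective customers, current customers, and ex-customers",
    "Product": "All products and services offered by the organization",
    "Policy": "Agreements between the organization and a customer",
    "Schedule": "Planned events such as payments and renewals",
}

# Business relationships among subject areas (source, relationship, target)
relationships = [
    ("Customer", "holds", "Policy"),
    ("Policy", "covers", "Product"),
    ("Policy", "follows", "Schedule"),
]

for source, verb, target in relationships:
    print(f"{source} {verb} {target}: {subject_areas[target]}")
```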
3.3.3 Advantages and Limitations of EDM
An EDM offers the following advantages to an organization compared to an application or department model:
EDM stands as a single development base for the applications and
thereby promotes the integration of the existing applications
This model provides a single set of consistent data definitions
The model supports sharing of the data among various business areas
However, organizations face the following challenges while implementing an EDM:
All business areas of the organization need to be involved concurrently
to deliver the best possible value of EDM to the business, which is an
unrealistic expectation
As the scope of the EDM tends to cover the entire organization, the size of the project becomes so large that it seldom delivers results in a reasonable time period
The people involved in the EDM project are expected to have both a broad understanding of the business and a detailed knowledge of a specific business area.
Self Assessment Question(s) (SAQs)
For Section 3.3
1. Explain the various phases involved in an Enterprise Data Model (EDM)
and discuss the type of data to be stored in each of these phases?
3.4 Granularity of the Data
Grain is a term that originates in photography, where it refers to the degree of detail and precision contained in an image, such as the pixels or dots per inch preserved and communicated by a given medium. Thus granularity refers to the level of detail. A typical data warehouse has some tables that contain a great deal of detail and other tables that are summarized or aggregated, which means less detail. The more detailed the data, the lower the level of granularity; conversely, the less detailed the data, the higher the level of granularity. Data at or near the transaction level therefore represents the lowest level of granularity (such data is usually referred to as atomic-level data).
The concept of granularity is very important because it offers the opportunity to trade off between important issues in data warehousing. For example, there may be a trade-off between the ability to access data at a very detailed level on the one hand, and performance and the cost of storing and accessing larger amounts of data on the other. Selecting an appropriate level of granularity significantly affects the volume of data in a data warehouse and determines the capability of the data warehouse to answer different types of queries. Even where the granularity does not affect the ability to answer a specific query, the amount of resources required to execute that query can still differ considerably.
For example, suppose you have two tables with different levels of granularity: a monthly account summary and transaction details. To answer a query about monthly channel utilization, you can use either of the two tables, regardless of the granularity level. However, using the detailed transaction table requires a significantly higher volume of disk activity to scan all the data, as well as additional processing power to calculate the results, whereas using the monthly account summary table requires far fewer resources. Therefore, in deciding the level of granularity, you should always consider the trade-off between the cost of the volume of data and the ability to answer queries.
To manage this trade-off, organizations that deal with large amounts of data can also opt for multiple levels of granularity. For example, the data in a data warehouse can be divided into detailed raw data and summarized data.
Detailed raw data is the lowest level of detailed transaction data, without any aggregation or summarization. As the volume of data at this level can be extremely large, you can move it to a separate storage medium, such as magnetic tape or an optical disk device, when it is not being used, and retrieve it as and when required.
Summarized data is the transaction data aggregated at the level required by the most frequently used queries. The volume of data required for the summarized data sources is therefore lower than that of the detailed raw data, but there is a limit to the number of queries and the level of detail that can be served from the summarized data.
Thus you can overcome the trade-off between the volume of data and query capability by creating two levels of granularity in a data warehouse. The detailed raw data supports the limited number of queries requiring a detailed level of data, and the summarized data supports all other queries with a reduced amount of resources.
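The trade-off discussed in this section can be sketched as follows. In this hypothetical example (the channels and transaction rows are invented), both a detailed transaction table and a summary aggregated by month and channel are kept; the monthly channel-utilization question can be answered from either, but the summary requires scanning far fewer rows.

```python
from collections import Counter

# Detailed raw data: one row per transaction (lowest level of granularity)
transactions = [
    {"month": "2024-01", "channel": "ATM"},
    {"month": "2024-01", "channel": "ATM"},
    {"month": "2024-01", "channel": "Branch"},
    {"month": "2024-02", "channel": "Online"},
]

# Summarized data: transactions aggregated to (month, channel) counts
summary = Counter((t["month"], t["channel"]) for t in transactions)

# "Channel utilization for 2024-01" answered from the summary table --
# only the pre-aggregated rows are scanned, not every transaction
january = {channel: count for (month, channel), count in summary.items() if month == "2024-01"}
print(january)  # {'ATM': 2, 'Branch': 1}
```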
Self Assessment Question(s) (SAQs)
For Section 3.4
1. Define the following terms: granularity of the data, detailed raw data, and summarized data. How can a project manager choose an appropriate level of granularity of the data for a new BI project?
3.5 Data Partitioning
The term partition refers to the physical state of a data structure that has been divided into two or more separate structures. However, logical partitioning of the data is also required in order to understand and use the data better; in such a case, the logical partitioning overlaps with the physical partitioning.
3.5.1 Objectives of partitioning
The partitioning of the data in a data warehouse enables the
accomplishment of the following objectives:
Flexible access to the data
Easy and efficient data management services
Easy to restructure, index, sequentially scan, recover, reorganize and
monitor
Ensure scalability of the data warehouse
Enable the data warehouse elements to be portable
3.5.2 Criteria of Partitioning
There are a number of important criteria to be considered for partitioning of
the data and some of them include:
Product (according to the line of business)
Time period (date, week, or month)
Geography (location)
Department or organizational unit-wise
A combination of the above
However, the choice of criteria is based on the business requirements and physical database constraints, and the time period must always be considered when partitioning the data.
Every database management system has its own way of implementing physical partitioning, but an important consideration when selecting the database management system on which the data resides is its support for partition indexing. Instead of system-level partitioning, you can also consider partitioning by application, as this provides flexibility in defining data over time and portability in moving the data to other data warehouses. Keep in mind, however, that the issue of partitioning is closely related to data granularity modeling, multi-dimensional modeling, and the ability of the specific database management system to support data warehousing.
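Application-level partitioning by time period might look like the following sketch: rather than relying on DBMS-level partitioning, the application routes each record to a partition keyed by month and product line, so that older partitions can be archived or moved to another warehouse independently. The record layout and partition key are assumptions made for the illustration.

```python
from collections import defaultdict

def partition_key(record: dict) -> str:
    """Partition criterion: time period (year-month), combined with product line."""
    return f"{record['date'][:7]}/{record['product_line']}"

records = [
    {"date": "2024-01-15", "product_line": "retail", "amount": 120.0},
    {"date": "2024-01-20", "product_line": "corporate", "amount": 9800.0},
    {"date": "2024-02-03", "product_line": "retail", "amount": 75.0},
]

# Route each record into its partition; each partition could be a separate
# physical table or file, which can be scanned, archived, or moved independently
partitions = defaultdict(list)
for r in records:
    partitions[partition_key(r)].append(r)

for key, rows in sorted(partitions.items()):
    print(key, len(rows), "row(s)")
```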
3.5.3 Logical Data Partitioning
The logical and physical partitioning of the data play an important role in understanding, maintaining, and navigating the data warehouse. The physical partitioning is designed according to the physical implementation requirements and constraints. The logical partitioning of the data is, however, even more important, because it affects the physical partitioning not only of the overall structure but also of the detailed tables.
The subject area is the most common criterion for determining the overall logical data partitioning. A subject area can be defined as a portion of a data warehouse that is classified by a specific, consistent perspective, usually based on the characteristics of the data, such as product, customer, time period, account, or geographical location.
Partitioning of the data in a data warehouse environment is different from partitioning in an operational environment. In an operational environment, the partitioning follows applications or functions, because that environment is built around transaction-oriented applications that perform a specific set of functions, and the objective is to perform those functions as quickly as possible. The queries performed in an operational environment are tactical in nature and are concerned with a specific point in time. The queries in a data warehouse environment, by contrast, are strategic in nature and are concerned with a larger scope. Therefore, the data in a data warehouse should be structured and oriented to subject areas, such as product or organizational unit. As you will remember, a data warehouse is by definition subject-oriented, and so its data is oriented to specific subject areas.
In data warehouse modeling, the identification of subject areas becomes useful in defining the following:
Unit of the data model
Unit of an executing project
Unit of management of the data
Basis for integration of the multiple implementations
3.6 Meta Data
In simple terms, metadata refers to "data about data": information that describes or supplements the main data. For example, the metadata of a digital camera image includes the settings used for the picture, such as the exposure value or flash intensity. Here, the metadata is additional information and is not critical to the functions of the main data. In other cases, such as a Zip disk, the metadata might record the write-protected status of the disk; in such a case, the metadata is essential for the proper functioning of the main product. So the value of metadata depends on the context in which it is provided and the ways that contextual information can be used. When data is made available, the potential user (human or computer) must fit the data into an existing model of knowledge and may ask questions to do so. For example, in the case of an image, metadata provides answers to questions such as "When was the image taken?" and "Who is in the image?" In sophisticated data systems, the metadata includes the contextual information surrounding the data and can itself be very sophisticated, capable of answering many questions that help in understanding the data. To sum up, metadata can be defined as "structured, encoded data that describe characteristics of information-bearing entities to aid in the identification, discovery, assessment, and management of the described entities."
In general, there are two distinct classes of metadata: structural (or control)
metadata and guide metadata.
Structural metadata is used to describe the structure of computer systems, such as tables, columns, and indexes.
Guide metadata is used to help people find specific items and is usually
expressed as a set of keywords in a natural language.
It is difficult to distinguish metadata from data because:
Some items can be data and metadata at the same time. For example, the headline of an article is both its title (metadata) and part of its text (data).
In some cases, data and metadata can change roles. For example, a poem, as such, would be regarded as data; if the same poem is attached as lyrics to an audio file of a song, it becomes metadata.
Metadata can be categorized on the basis of the following characteristics:
Content: Metadata can describe the resource itself (the name and size of a file) or the content of the resource (for example, "This video file shows a boy playing cricket").
Mutability: With respect to the whole resource, metadata can be either immutable (for example, the video title does not change while the video is being played) or mutable (the scene description changes as the video plays).
Logical function: There are three layers of logical function: at the bottom, the sub-symbolic layer, which contains the raw data; above it, the symbolic layer, with metadata that describes the raw data; and on top, the logical layer, with metadata that allows logical reasoning using the symbolic layer.
3.6.1 Data warehouse Metadata
In general, data warehouse metadata systems are divided into two sections:
back room metadata, which is used by the extract, transform, and load (ETL) functions to get OLTP data into the data warehouse
front room metadata, which is used to label screens and create reports
Kimball, a renowned author in the area of business intelligence and data warehousing, lists the following types of metadata in a typical data warehouse:
Source system metadata
Source specifications, such as repositories and source schemas
Source descriptive information, such as ownership descriptions, update frequencies, legal limitations, and access methods
Process information, such as job schedules and extraction code
Data staging metadata
DBMS metadata, such as DBMS system table contents
Data acquisition information, such as data transmission scheduling and results, and file usage
Dimension table management, such as definitions of dimensions and surrogate key assignments
Transformation and aggregation, such as data enhancement and mapping, DBMS load scripts, and aggregate definitions
Audit, job logs, and documentation, such as data lineage records and data transform logs
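As a hypothetical sketch of how such back room metadata might be recorded programmatically, the example below defines a small structure describing how one warehouse table is populated, covering the source specification, update frequency, and transformation lineage; the field names are assumptions, not a standard.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class TableMetadata:
    """Back room metadata describing how one warehouse table is populated."""
    table_name: str
    source_system: str          # source specification
    update_frequency: str       # source descriptive information
    extraction_job: str         # process information (job schedule / extraction code)
    transformations: List[str] = field(default_factory=list)  # lineage / mapping

account_summary = TableMetadata(
    table_name="monthly_account_summary",
    source_system="core_banking_oltp",
    update_frequency="daily",
    extraction_job="etl_accounts_0300",
    transformations=["convert currency to USD", "aggregate to month level"],
)
print(asdict(account_summary))
```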
3.6.2 Use of Metadata
The applications of metadata are discussed below.
Metadata provides additional information to users of the data it describes; the information can be either descriptive or algorithmic.
Metadata speeds up and enriches searching for resources. Search queries using metadata save users from performing more complex filter operations manually. Also, web browsers, P2P applications, and media management software automatically download and locally cache metadata to improve the speed at which files can be accessed and searched.
Metadata plays a vital role on the World Wide Web in finding useful information within the large amount of information available.
Metadata is an important part of electronic discovery. Application and file
system metadata derived from electronic documents and files acts as
important evidence.
Some metadata is intended to enable variable content presentation. For example, if a picture has metadata that indicates its most important region, the user can narrow the picture to that region and thus obtain the detail required.
Metadata can also be used to automate workflows. For example, if a software tool knows the content and structure of some data, it can convert the data automatically and pass it to another tool as input, so that users need not perform copy-and-paste operations.
Metadata helps to bridge the semantic gap by explaining how data items are related and how those relations can be evaluated automatically. For example, if a search engine understands that "Aditya Kaashyap" was an "Indian Engineer", it can answer a search query on "Indian Engineers" with a link to a web page about Aditya Kaashyap, even though the exact words "Indian Engineers" never occur on that page. This approach (called knowledge representation) is of special interest to the semantic web and artificial intelligence.
3.6.3 Managing of the Metadata
To develop and use metadata successfully, you need to understand the following important issues, which should be treated with care:
You need to keep track of all the metadata created, even in the early phases of planning and designing; it is not economical to start attaching metadata once the production process has been completed.
Metadata must adapt if the resource it describes changes. It should be
merged when two resources are merged.
It can be useful to keep metadata even after the resource it describes
has been removed.
Metadata can be stored either internally (in the same file as the data) or externally (in a separate file). Internal storage allows the metadata to be transferred together with the data it describes, so the metadata is at hand and can be easily manipulated; however, this method creates high redundancy and does not allow the metadata to be held together. External storage allows the metadata to be bundled, for example in a database, for more efficient searching; there is no redundancy, and the metadata can be transferred simultaneously with the data when streaming is used.
Storing metadata in a human-readable format (such as XML) can be useful because users can understand and edit it without specialized tools, but such formats are not optimized for storage capacity. It may instead be useful to store metadata in a binary, non-human-readable format to speed up transfer and save storage.
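For instance, externally stored, human-readable metadata could be kept as a small XML sidecar file, as in the hypothetical sketch below (the element names are invented); a binary format would trade this readability for smaller size and faster transfer.

```python
import xml.etree.ElementTree as ET

# Metadata kept externally (separate from the data file it describes)
meta = ET.Element("metadata", attrib={"describes": "sales_2024.csv"})
ET.SubElement(meta, "owner").text = "finance"
ET.SubElement(meta, "update_frequency").text = "monthly"
ET.SubElement(meta, "description").text = "Reconciled sales figures in USD"

# Human-readable: can be inspected and edited without specialized tools
ET.indent(meta)  # available in Python 3.9+
print(ET.tostring(meta, encoding="unicode"))
```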
Although the majority of computer professionals see metadata as an opportunity for better interoperability, it has some demerits, as detailed below:
Creating and maintaining metadata is expensive, time-consuming, and complicated.
Metadata is subjective and depends on context. Two people may attach different metadata to the same resource because of their different points of view. Moreover, metadata can be misinterpreted because of its dependency on context.
There is no end to metadata. For example, when annotating a soccer match with metadata, one person may describe only the players and their actions, while others may also describe the advertisements in the background and the clothes the players wear. So even for a simple resource, the amount of possible metadata can be enormous.
There is no real need for metadata as most of today's search engines
allow finding text very efficiently.
Self Assessment Question(s) (SAQs)
For Section 3.6
1. Discuss the features of metadata in the context of a data warehouse?
3.7 Total Data Quality Management (TDQM)
One of the main reasons for the failure of a data warehouse deployment is the poor quality of the data loaded into the warehouse. Managers therefore need to be careful and take the precautions required to ensure that the quality of the data loaded into the warehouse is appropriate.
3.7.1 Data Quality Types
There are two significant dimensions to understanding the quality of the data: intrinsic quality and realistic quality. Intrinsic data quality is the correctness or accuracy of the data, and realistic data quality is the value that the correct data has in supporting the work of the business or organization.
To state simply, the intrinsic data quality is the accuracy of the data. It is
the degree to which data accurately reflects the real-world object that the
data represents. If all facts that an organization needs to know about an
entity are accurate, then that data has intrinsic quality.
Data that does not enable the organization to accomplish its mission has no
quality, no matter how accurate it is. Thus realistic data quality comes into
the picture. Realistic data quality is the degree of utility and value the data
has to support the organizational processes to accomplish the
organizational objectives. Fundamentally, realistic data quality is the degree
of customer satisfaction that the knowledge workers derive out of the use of
the data.
3.7.2 Concept of TDQM
Many business intelligence projects do not deliver to their full potential for one reason: people tend to see data quality as a one-time undertaking carried out as part of user acceptance testing (UAT). It is very important, however, that data quality management be undertaken as a continuous improvement process. You can use an iterative approach, as detailed below, to achieve data quality:
Step 1: To establish Data Quality Management environment
A commitment to the Data Quality Management process is made by establishing the data quality management environment and the conditions that encourage coordination between functional and information system development professionals. Functional users of legacy information systems know the data quality problems of the existing systems but rarely know how to improve the quality of the existing data systematically. Information system developers, on the other hand, know how to identify data quality problems but rarely know how to change the functional requirements that drive the systematic improvement of the data. Given the existing barriers to communication, establishing the data quality environment requires the participation of both functional users and information system administrators.
Step 2: To draft the Project scope
For each data quality analysis project selected, you may have to draft an
initial plan that addresses the following items:
Task Summary
Task Description
Project Approach
Schedule
Resources
Step 3: To implement the Data Quality Projects
A data quality project consists of four activities:
Define
Measure
Analyze
Improve
The data quality project manager performs these activities with input from
the functional users of the data, system developers, and database
administrators of the legacy and target database systems.
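The Measure and Analyze activities can be sketched as a rule-driven check that produces an exception report, as in the hypothetical example below; the business rules and record layout are invented for the illustration.

```python
# Hypothetical business rules the records must conform to
rules = {
    "customer_id is present": lambda r: bool(r.get("customer_id")),
    "amount is non-negative": lambda r: r.get("amount", 0) >= 0,
    "currency is known":      lambda r: r.get("currency") in {"USD", "EUR", "INR"},
}

records = [
    {"customer_id": "C101", "amount": 250.0, "currency": "USD"},
    {"customer_id": "",     "amount": -40.0, "currency": "USD"},
    {"customer_id": "C103", "amount": 90.0,  "currency": "XYZ"},
]

# Measure: conformance to the rules; Analyze: exception report of failing records
exceptions = []
for i, record in enumerate(records):
    failed = [name for name, check in rules.items() if not check(record)]
    if failed:
        exceptions.append({"record": i, "failed_rules": failed})

conformance = 1 - len(exceptions) / len(records)
print(f"Conformance: {conformance:.0%}")  # e.g. 33%
print(exceptions)
```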
Step 4: To evaluate Data Quality Management Methods
The objective of this step is to evaluate and assess the progress made in
implementing data quality initiatives. All stakeholders in the Data Quality
Management process (functional users, program managers, developers,
and the Office of Data Management) are required to review the progress to
determine whether data quality projects have helped to achieve goals and
benefits.
Self Assessment Question(s) (SAQs)
For Section 3.7
1. Drafting a project plan plays a crucial role in the context of Total Data Quality Management (TDQM). In your view, what details does a typical project plan consist of?
3.8 Summary
The data to be stored in a data warehouse can be divided into three
categories; real-time data, derived data and reconciled data. Real-time data
represents the current status of the business. Derived data is the data
created by aggregating the real-time data through a defined process.
Reconciled data is the real-time data that has been cleansed, modified, or
enhanced. The Enterprise Data Model (EDM) is an approach to develop a
data warehouse data model. A typical EDM can be of either of two types: phased enterprise data modeling or a simple enterprise data model.
Granularity refers to the level of detail of data provided in a warehouse.
Selection of an appropriate level of granularity determines the capability of
the data warehouse to enable answers to different types of queries. Also,
logical partitioning of the data is required in order to better understand and use the data. The criteria for partitioning the data include product (as per the line of business), time period, geographical location, department or organizational unit, or a combination of these.
In simple terms, metadata refers to "data about data". Data warehouse metadata systems are divided into two sections: back room metadata, used by the extract, transform, and load functions to get OLTP data into the data warehouse, and front room metadata, used to label screens and create the business reports. There are two important dimensions to understanding data quality: intrinsic quality and realistic quality. Intrinsic data quality is the correctness or accuracy of the data, and realistic data quality is the value that the correct data has in supporting the work of the business or organization. The concept of Total Data Quality Management (TDQM) aims at taking the precautions required to ensure that the quality of the data loaded into the warehouse is appropriate.
3.9 Terminal Questions (TQs)
1. Before architecting the data, how can you divide the data into various
categories according to its characteristics?
2. Explain the concept of Physical partitioning and Logical partitioning of
the data?
3. Elaborate the activities involved in a data quality project and explain how the project manager can ensure the quality of the project?
4. Discuss the intrinsic and realistic types of data quality and analyze which areas you need to concentrate on to improve the quality of the data being used in a data warehouse?
3.10 Multiple Choice Questions (MCQs)
1. Which of the following represents the data that has been prepared by
aggregating or averaging the real-time data through a defined
process?
a. Reconciled data
b. Granular data
c. Derived data
d. Non-real time data
2. Which of the following statements is False?
a. Real-time data is a detailed level data and is usually accessed in
read/write mode in the operational transactions.
b. Reconciled data is the real-time data that has been cleansed,
modified, or enhanced.
c. Derived data is used for data analysis and decision-making as it is
the summarized data that is useful for use and manipulation.
d. Real-time data can be directly loaded into a data warehouse as it
does not require any reconciliation of anomalies.
3. Which of the following phases of an Enterprise Data Model (EDM)
provides the further defining of the concepts provided in the phase,
Information Planning and describes the business terms to make
business people understand the data details without any special
training?
a. Business Designing
b. Business Analyzing
c. Business Decision-making
d. Business Implementing
4. The data that is at the transaction level is usually referred to as __________.
a. Meta data
b. atomic level data
c. derived data
d. reconciled data
5. The partitioning of the data in a data warehouse is oriented towards
__________.
a. Subject areas
b. Units
c. Models
d. Data types
6. The backroom metadata in a data warehouse is used to
_______________.
a. extract, transform, load functions to get OLTP data into a data
warehouse
b. label screens and create reports
c. plan and implement the data warehouse project
d. determine the resources required for the data warehouse project
implementation
7. TDQM stands for _____________.
a. Total Data Quantity Management
b. Tentative Data Quality Management
c. Total Data Quality Management
d. Tentative Design Quality Management
8. If all the observations that an organization collected about an entity are
accurate, then ________ quality of the data is achieved?
a. Intrinsic
b. Extrinsic
c. Realistic
d. Experimental
9. Which of the following statements is False with regard to metadata?
a. Metadata can be stored internally (in the same file) or externally (in
a separate file)
b. Metadata speeds up and enriches searching for resources.
c. Some items of data act as data and metadata at the same time.
d. None of the above.
10. Which of the following best describes metadata?
a. Quality of the data
b. Data about the data
c. Volume of data
d. Design of the data
11. Which of the following statements is not correct?
a. Realistic data quality is the degree of utility and value the data has
to support the organizational processes to accomplish the
organizational objectives.
b. Front room metadata in a data warehouse metadata system is
used to label screens and create the business reports.
c. The implementation of a data quality project consists of four
activities; Define, Measure, Analyze, and Improve.
d. None of the above
12. Data that does not enable the organization to accomplish its mission
has no quality, no matter how accurate it is. Therefore, ___________
quality of data plays a vital role in business decision making?
a. Intrinsic
b. Extrinsic
c. Realistic
d. Experimental
3.11 Answers to SAQs, TQs, and MCQs
3.11.1 Answers to Self Assessment Questions (SAQs)
Section 3.2
1. Reconciled data is the real-time data that has been cleansed, modified,
or enhanced. Real-time data is used by operational applications to run
the business and this data constantly changes as operational
transactions are processed. Derived data is the data that has been
derived by aggregating the real-time data through a defined process.
Thus this data can represent a view of the business at a specific point of
time. Since historical data can also be created and maintained while the data is being reconciled, reconciled data is also termed a type of derived data. Thus reconciled data exhibits the characteristics of both real-time data and derived data.
Section 3.3
1. An Enterprise Data Model (EDM) provides all the data elements from a
high-level business view to a generic logical data design. Generally, the
phased enterprise data modeling follows a 3-tiered methodology: conceptual, logical, and physical data models. IBM proposed a 5-tiered model with the following tiers: Information System Planning (ISP), Business Area Analysis (BAA), Business System Design (BSD), Business System Implementation (BSI), and Business System Maintenance (BSM). Most organizations today follow a methodology with the following phases: Information Planning, Business Analyzing, Logical Data Modeling, and Physical Data Design.
Section 3.4
1. Granularity refers to the level of detail of the data. The more detailed the data, the lower the level of granularity; data at or near the transaction level represents the lowest level of granularity (atomic-level data). Detailed raw data is the lowest level of detailed transaction data, without any aggregation or summarization. Summarized data is the transaction data that has been aggregated at the level required by the most frequently used queries.
The project manager has to choose an appropriate granularity level by assessing the decision-making requirements, because the detailed raw data supports the limited number of queries requiring a detailed level of data, while the summarized data supports all other queries with a reduced amount of resources.
Section 3.6
1. Metadata refers to the "data about data." Here, structural metadata
describes the structure of the computer systems like tables, columns,
indexes, etc. and the guide metadata helps people in finding specific
items.
There are two types of Metadata; back room metadata that is used to
extract, transform, and load the OLTP data into a data warehouse and
front room metadata that is used to label screens and create reports.
Metadata provides descriptive or algorithmic additional information to its users, and it speeds up and enriches the searching functionality. However, maintaining metadata is complex, expensive, and time-consuming.
Section 3.7
1. The details of items to be provided while drafting an initial plan of a data
quality analysis project are as follows:
Document the Project goals, scope, and potential benefits under
Task Summary
Describe the data quality analysis tasks under Task Description
Summarize tasks and tools used to provide a baseline of existing
data quality under Project Approach
Identify the start and completion dates of the tasks and the project
milestones under Schedule
Identify the resources required to complete the data quality
assessment including the costs connected with tools acquisition,
labor hours (by labor category), training, travel, and other direct and
indirect costs, etc. under Resources
3.11.2 Answers to Terminal Questions (TQs)
1. Before structuring the data in a data warehouse, the data can be categorized into three types according to usage, scope, and currency: real-time data, derived data, and reconciled data.
Real-time data represents the current status of the business and is used
by operational applications to run the business and so this data
constantly changes. Derived Data is the data that has been developed
by aggregating the real-time data through a defined process. So, this
data represents a view of the business at a specific point of time.
Reconciled Data is the real-time data that has been cleansed, modified,
or enhanced. As this data enables the creation and maintenance of the
historical data, it can also be termed as a type of derived data.
2. The physical partitioning is done according to the physical implementation requirements and constraints, while the logical partitioning of the data is done according to subjects. The subject area is the most important criterion for determining the overall logical data partitioning; it can be defined as a portion of a data warehouse that is classified by a specific, consistent perspective. This perspective is usually based on the characteristics of the data, such as product, customer, time period, account, or geographical location. The logical partitioning of the data is even more important than the physical partitioning, as it affects the physical partitioning not only of the overall structure but also of the detailed tables. Together, the logical and physical partitioning of the data play an important role in understanding, maintaining, and navigating the data warehouse.
3. The description of the activities involved in a data quality project is as
follows:
Define: Identifying the functional user data quality requirements and
establishing the data quality metrics.
Measure: Measuring the conformance to current business rules and
developing the exception reports.
Analyze: Verifying, validating, and assessing the poor data quality
causes and defining the improvement opportunities.
Improve: Selecting and prioritizing the data quality improvement
opportunities.
Improving data quality may lead to changing data entry procedures,
updating data validation rules, and/or use of company data
standards to prescribe a uniform representation of data used
throughout the company.
4. The intrinsic data quality is the accuracy of the data. It is the degree to
which data accurately reflects the real-world object that the data
represents. The realistic data quality is the degree of utility and value
the data has to support the organizational processes to accomplish its
objectives.
The important areas that one has to concentrate on to achieve data quality improvement are as follows:
a. Process Improvement: This area covers the improvement of the functional processes that are used to create, manage, access, and use data. The functional process changes encourage centralized data entry, the elimination of non-value-added activities, and placing the data quality responsibilities where the data is entered into the data set (for example, certification of data).
b. System Improvement: This area covers software, hardware, and telecommunication changes that improve data quality. For instance, security software can minimize the damage done by malicious updates to databases by unauthorized users. Hardware improvements may make batch loads faster and thereby make it unnecessary to turn off edit and validation constraints while loading the data into a database. Telecommunications improvements (for example, increasing the bandwidth) may provide easier access to data and provide both accurate and timely data. Other system improvements may include updating end-user, operational, and maintenance manuals, providing the required user training, and so on.
c. Policy & Procedure Improvement: This area covers resolving conflicts in existing policies and procedures to institutionalize the behaviors that promote good data quality. One has to develop standard operating procedures for the information system to document the data quality rule sets and filters and to measure the data quality accordingly. Periodic data quality checks also need to be performed as part of the standard operating procedures to improve the data quality.
d. Data Design Improvement: This area covers improving the overall data design and the use of data standards. Activities in this area include adding primary key constraints, indexes, unique key constraints, referential integrity constraints, triggers, and stored functions and procedures; controlling the administration of user privileges; and enforcing security features, all of which can improve the database design.
3.11.3 Answers to Multiple Choice Questions (MCQs)
1. Ans: c
2. Ans: d
3. Ans: b
4. Ans: b
5. Ans: a
6. Ans: a
7. Ans: c
8. Ans: a
9. Ans: d
10. Ans: b
11. Ans: d
12. Ans: c
