Data Stage

ETL Concepts:
1. What is the flow of loading data into fact & dimensional tables?
A)
Fact table - Table with a collection of foreign keys corresponding to the primary
keys in the dimension tables. It also holds the fields with numeric values (measures).
Dimension table - Table with a unique primary key.
Load - Data should be loaded into the dimension tables first. The fact table is then
loaded based on the primary key values in the dimension tables.
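The load order above can be sketched in a few lines of Python. This is only an illustration under assumed data: the source rows, the column names and the helpers load_dimension and load_fact are hypothetical, not part of any ETL tool.

```python
# Hypothetical source rows; column names are illustrative only.
source_rows = [
    {"customer": "Acme", "product": "Widget", "amount": 120.0},
    {"customer": "Acme", "product": "Gadget", "amount": 80.0},
    {"customer": "Zenith", "product": "Widget", "amount": 45.5},
]


def load_dimension(rows, column):
    """Assign one primary key per distinct value, in first-seen order."""
    dimension = {}
    for row in rows:
        value = row[column]
        if value not in dimension:
            dimension[value] = len(dimension) + 1
    return dimension


def load_fact(rows, customer_dim, product_dim):
    """Build fact rows by looking up the dimension keys loaded earlier."""
    return [
        (customer_dim[r["customer"]], product_dim[r["product"]], r["amount"])
        for r in rows
    ]


# Step 1: dimensions first, so every key exists before the fact load.
customer_dim = load_dimension(source_rows, "customer")
product_dim = load_dimension(source_rows, "product")

# Step 2: facts, resolved against the dimension primary keys.
sales_fact = load_fact(source_rows, customer_dim, product_dim)
```

Loading the fact table first would fail at the lookup step, because the keys it refers to would not exist yet.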
2. Types of Parallel Processing?
A) Parallel processing is broadly classified into 2 types:
a) SMP - Symmetric Multiprocessing.
b) MPP - Massively Parallel Processing.
3. Importance of Surrogate Key in Data warehousing?
A) A surrogate key is the primary key of a dimension table. Its main advantage is that
it is independent of the underlying database, i.e. the surrogate key is not affected by the
changes going on in that database.
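A minimal sketch of surrogate key assignment, assuming keys are plain integers handed out in arrival order. The class SurrogateKeyMap and the source system names are hypothetical. The point shown is that the warehouse key does not depend on how each source database numbers its records.

```python
from itertools import count


class SurrogateKeyMap:
    """Hands out warehouse keys that are independent of source-system keys."""

    def __init__(self):
        self._next_key = count(1)
        self._keys = {}

    def key_for(self, source_system, natural_key):
        # The same natural key from two different sources gets two different
        # surrogate keys; the same (source, key) pair always gets the same one.
        identity = (source_system, natural_key)
        if identity not in self._keys:
            self._keys[identity] = next(self._next_key)
        return self._keys[identity]


keys = SurrogateKeyMap()
crm_key = keys.key_for("crm", "C-100")
billing_key = keys.key_for("billing", "C-100")
crm_again = keys.key_for("crm", "C-100")
```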
4. Types of Dimensional Modeling?
A) Dimensional modeling is subdivided into 2 types:
a) Star Schema - Simple and much faster. Denormalized form.
b) Snowflake Schema - Complex, with more granularity. More normalized form.
5. Differentiate Primary Key and Partition Key?
A) A primary key is a combination of unique and not null. It can be a collection of columns,
called a composite primary key. A partition key is the column (or columns) used to decide which
partition a row is sent to; it is often just a part of the primary key. There are
several partitioning methods, such as Hash, DB2 and Random. When using Hash partitioning
we specify the partition key.
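As a rough illustration of hash partitioning on a partition key: the sketch below uses CRC32 as the hash purely as an assumption (it is not the function any particular ETL engine uses), and the row layout is hypothetical. What matters is that rows with the same key value always land in the same partition.

```python
import zlib


def hash_partition(rows, key, num_partitions):
    """Send each row to the partition chosen by hashing its partition key."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))
        partitions[digest % num_partitions].append(row)
    return partitions


# Hypothetical rows; cust_id is the partition key.
rows = [
    {"cust_id": 101, "amount": 10.0},
    {"cust_id": 102, "amount": 7.5},
    {"cust_id": 101, "amount": 3.0},
    {"cust_id": 103, "amount": 1.0},
]
partitions = hash_partition(rows, "cust_id", 4)
```

Because both cust_id 101 rows go to the same partition, a later per-customer aggregation can run on each partition independently.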
6. Differentiate Database data and Data warehouse data?
A) Data in a Database is
a) Detailed or Transactional.
b) Both Readable and Writable.
c) Current.
7. What is Meta Data Repository?
A) Meta data is data about the data. The repository also contains
a) Query statistics
b) ETL statistics
c) Business subject area
d) Source information
e) Target information
f) Source to target mapping information.
8. Explain Types of Fact Tables?
A)
Factless Fact : It contains only foreign keys to the dimension tables.
Additive Fact : Measures can be added across any dimensions.
Semi-Additive : Measures can be added across some dimensions but not others. Eg, an account
balance (can be added across accounts but not across time).
Non-Additive : Measures cannot be added across any dimensions. Eg, average, % age, discount rate.
Conformed Fact : The equation (definition) of the measure is the same in the two fact tables, so
the facts are measured across the dimensions with the same set of measures.
9. Explain the Types of Dimension Tables?
A)
Conformed Dimension : If a dimension table is connected to more than one fact table,
the granularity that is defined in the dimension table is common across the fact tables.
Junk Dimension : A dimension table that contains only flags.
Monster Dimension : A dimension in which changes occur rapidly is known as a Monster
Dimension.
Degenerate Dimension : It occurs in line item-oriented fact table designs.
10. What index is created on Data Warehouse?
A) A bitmap index is typically created in a data warehouse.
Conformed dimension:
A dimension table that connects to more than one fact table. We present this same
dimension table in both schemas and refer to it as a conformed dimension.
Or
If the primary key of one dimension is referenced by two fact tables, it is called a conformed
dimension.
Conformed fact:
When the definitions of measurements (facts) are highly consistent across fact tables, we call
them conformed facts.
Junk dimension:
It is a convenient grouping of random flags and aggregates to get them out of a fact
table and into a useful dimensional framework.
Degenerate dimension:
Usually occurs in line item oriented fact table designs. Degenerate dimensions are
normal, expected and useful.
The degenerate dimension key should be the actual production order number and
should sit in the fact table without a join to anything.
Or
A Degenerate dimension is a Dimension which has only a single attribute. This dimension is
typically represented as a single field in a fact table. Degenerate Dimensions are the fastest way
to group similar transactions. Degenerate Dimensions are used when fact tables represent
transactional data. They can be used as primary key for the fact table but they cannot act as
foreign keys.
Or
Degenerate dimension: A column in the key section of the fact table that has no
associated dimension table but is used for reporting and analysis. Such a column is called a
degenerate dimension or line item dimension. For example, we have a fact table with customer_id,
product_id, branch_id, employee_id, bill_no and date in the key section, and price, quantity and
amount in the measure section. In this fact table bill_no is a single value with no associated
dimension table. Instead of creating a separate dimension table for that single value, we can
include it in the fact table to improve performance. So here the column bill_no is a degenerate
dimension or line item dimension.
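To make the bill_no example concrete, here is a small sketch with made-up fact rows (all values are hypothetical). bill_no is stored directly in the fact row and used for grouping, with no dimension table to join to.

```python
from collections import defaultdict

# Hypothetical line-item fact rows; bill_no has no dimension table behind it.
fact_rows = [
    {"customer_id": 1, "product_id": 10, "bill_no": "B-001", "amount": 20.0},
    {"customer_id": 1, "product_id": 11, "bill_no": "B-001", "amount": 5.0},
    {"customer_id": 2, "product_id": 10, "bill_no": "B-002", "amount": 20.0},
]

# Group line items into bills using the degenerate dimension alone.
bill_totals = defaultdict(float)
for row in fact_rows:
    bill_totals[row["bill_no"]] += row["amount"]
```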
Time dimension:
It contains a number of useful attributes for describing calendars and navigating them.
An exclusive time dimension is required because SQL date semantics and
functions cannot generate several important attributes required for analytical
purposes.
Attributes like week days, week ends, holidays and fiscal periods cannot all be generated
by SQL date functions alone.
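A time dimension is usually generated once and stored as a table. The sketch below is one possible way to build such rows in Python. The function build_time_dimension, the column names, the holiday list and the April fiscal year start are all assumptions for illustration; real calendars come from the business.

```python
from datetime import date, timedelta

DAY_NAMES = ("Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday")


def build_time_dimension(start, end, holidays=frozenset(),
                         fiscal_year_start_month=4):
    """Return one row per calendar day with attributes a date function lacks."""
    rows = []
    day = start
    while day <= end:
        # Assumed convention: the fiscal year is named after the calendar
        # year in which it starts.
        if day.month >= fiscal_year_start_month:
            fiscal_year = day.year
        else:
            fiscal_year = day.year - 1
        rows.append({
            "date_key": day.year * 10000 + day.month * 100 + day.day,
            "date": day,
            "day_name": DAY_NAMES[day.weekday()],
            "is_weekend": day.weekday() >= 5,
            "is_holiday": day in holidays,
            "fiscal_year": fiscal_year,
        })
        day += timedelta(days=1)
    return rows


time_dim = build_time_dimension(
    date(2024, 3, 30), date(2024, 4, 1), holidays={date(2024, 4, 1)}
)
```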
Factless fact table:
Fact tables which do not have any facts are called factless fact tables.
They consist only of keys. There are two kinds of factless fact tables.
The first type of factless fact table records an event.
Many event tracking tables in dimensional data warehouses turn out to be factless.
Ex: A student tracking system that details each student attendance event each day.
The second type of factless fact table is coverage. Coverage tables are frequently
needed when a primary fact table in a dimensional DWH is sparse.
Ex: A sales fact table records the sales of products in stores on particular days
under each promotion condition, but only for products that sold. A coverage table records
which products were on promotion, whether or not they sold.
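Both kinds of factless fact table can be sketched with plain tuples of keys. The keys and values below are invented for illustration. The event table is analyzed by counting rows, and the coverage table answers "what was promoted but did not sell" by comparison with the sparse sales table.

```python
from collections import Counter

# Event-style factless fact: one row of keys per attendance event.
# Hypothetical layout: (date_key, student_key).
attendance_fact = [(20240108, 1), (20240108, 2), (20240109, 1)]
attendance_per_student = Counter(student for _, student in attendance_fact)

# Coverage-style factless fact: everything that was on promotion.
# Hypothetical layout: (date_key, store, product).
coverage = {(20240108, "store1", "soap"), (20240108, "store1", "tea")}
# Keys present in the (sparse) sales fact table for the same day.
sales = {(20240108, "store1", "soap")}

promoted_not_sold = coverage - sales
```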
Types of facts:
Additive: facts that can be summed across all dimensions to derive summarized data.
Semi additive: facts that can be summed only in a particular context, e.g. across some
dimensions but not across time.
Non additive: facts that cannot be summed across any dimension.
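The three kinds of facts can be told apart with a little arithmetic. The rows below are hypothetical: deposits are additive, balance is semi-additive, and a margin percentage is non-additive.

```python
# Hypothetical daily snapshot rows: (day, account, balance, deposits).
snapshots = [
    ("2024-01-01", "A", 100.0, 100.0),
    ("2024-01-01", "B", 50.0, 50.0),
    ("2024-01-02", "A", 130.0, 30.0),
    ("2024-01-02", "B", 50.0, 0.0),
]

# Additive: deposits can be summed across days and accounts.
total_deposits = sum(row[3] for row in snapshots)

# Semi-additive: balance sums correctly across accounts for one day...
balance_on_day_2 = sum(row[2] for row in snapshots if row[0] == "2024-01-02")
# ...but summing it across days counts the same money twice.
misleading_balance = sum(row[2] for row in snapshots)

# Non-additive: a percentage must be recomputed from additive parts.
# Hypothetical sales rows: (revenue, cost).
sales = [(100.0, 50.0), (300.0, 270.0)]
sum_of_margins = sum((rev - cost) / rev for rev, cost in sales)
total_revenue = sum(rev for rev, _ in sales)
total_cost = sum(cost for _, cost in sales)
overall_margin = (total_revenue - total_cost) / total_revenue
```

Here total_deposits and balance_on_day_2 agree at 180, while misleading_balance is 330, and the overall margin (0.2) is not the sum of the per-row margins.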
What is Data Warehousing?
A subject oriented, integrated, time variant, non volatile collection of data in support of
management / business user decisions. It encompasses not just the data in the warehouse, but also
the architecture and tools to collect, query and analyze the information.
Subject Oriented : All relevant data about a subject area is gathered and stored as a
single set in a useful format.
Integrated : Data is stored in a globally accepted fashion, with consistent naming
conventions, measurements, physical attributes etc., even when the underlying source systems
store the data differently.
Non Volatile : Implies the data warehouse is read only.
Time Variant : Implies data gets added on as time goes by, time being the most
important dimension.
Operational Data : Data used to run your business, used by OLTP systems.
Informational Data : Created from the wealth of operational data and some
external data, useful to analyze your business.
Operational Data Store : A staging area where you store and integrate
operational data before loading it into the warehouse. It is subject oriented, volatile,
CURRENT DATA (not historical). ODS data is used for analysis, is collected within a
few days or months and is updated every time the underlying detail data changes.
Data Mart : A subset of a warehouse that enables certain targeted business
user groups to access functionally departmentalized data (business area WH).
It contains a significantly smaller amount of data. It reduces demand on the EDW, is localized,
gives faster access and reduces network traffic.
Dependent Data mart: Built from a warehouse.
Independent Data mart: Built from independent operational sources.
Drivers for Data warehousing
- Seek competitive advantage, better understand market trends and make better
forecasting decisions.
- Help react quickly and effectively to changes, eg: bring better products to market in a
more timely manner. Product life cycles are becoming shorter and customers expect better
service in a shorter time frame.
- Analyze daily sales information and make quick decisions that can significantly affect
your company's performance. (Needs to adapt to regulatory requirements and internal
organization pressures.)
- Average 3 year ROI is about 400 %.
OLTP vs OLAP
Data Content
OLTP: Transactional information necessary to run business operations.
OLAP: Calculated, archived and aggregated data for the decision making process.
Organization of data
OLTP: Application specific.
OLAP: Enterprise wide.
Refresh frequency
OLTP: Dynamic.
OLAP: Static until refreshed.
Data Model
OLTP: Normalized, complex data model. Suitable for operational computation.
OLAP: Denormalized or partially denormalized (star schema) to optimize query performance.
Probability of Access
OLTP: High.
OLAP: Moderate to low.
Response Time
OLTP: Microseconds to a few seconds.
OLAP: Seconds to minutes.
Usage
OLTP: Highly structured and repetitive processing.
OLAP: Unstructured analytical processing.
Dimensional Modeling
1. Data modeling means structuring and organizing data and these data structures are
typically implemented in a database. In addition to defining and organizing the data,
data modeling will impose (implicitly or explicitly) constraints or limitations on the data
placed within the structure.
2. Dimensional modeling (DM) is the name of a logical design / model technique often
used for data warehouses.
3. It is different from, and contrasts with, entity-relationship modeling (ER).
4. Intended to support end-user queries in a data warehouse with good performance. Easy to
understand, fewer joins, denormalized.
5. It is oriented around business user understandability, as opposed to ER modeling and 3NF,
where the goal is to remove redundancy.
6. Eg : Joining hundreds of thousands of rows across several tables will seriously
compromise performance.
7. Usage / access path is fixed in OLTP, while in DSS it is ad hoc.
Dimensional Modeling - Facts and Dimensions
Fact Table : Contains fact data or measures. For eg, a SALES FACT table will have metrics for
sales data at a STORE, for a given PRODUCT, by a given CUSTOMER, for a given TIME.
Fact tables comprise the core and bulk of the data in a warehouse, often millions of rows.
Highly denormalized in structure.
Contains a multi part primary key, and each part references a dimension by which the fact
is accessed and analyzed.
Could contain summary / precalculated results.
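The fact table description above can be sketched as a tiny star schema using Python's built-in sqlite3 module. The table names, columns and rows are assumptions made up for this example. The fact table has a multi part primary key whose parts reference the dimensions, and the query analyzes the measure by a dimension attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
-- Each part of the fact table's primary key references one dimension.
CREATE TABLE fact_sales (
    store_key INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity INTEGER,
    amount REAL,
    PRIMARY KEY (store_key, product_key)
);
INSERT INTO dim_store VALUES (1, 'Pune'), (2, 'Delhi');
INSERT INTO dim_product VALUES (10, 'Soap'), (11, 'Tea');
INSERT INTO fact_sales VALUES
    (1, 10, 3, 30.0), (1, 11, 1, 12.0), (2, 10, 2, 20.0);
""")

# Analyze the measure (amount) by a dimension attribute (city).
sales_by_city = conn.execute("""
    SELECT s.city, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_key = f.store_key
    GROUP BY s.city
    ORDER BY s.city
""").fetchall()
conn.close()
```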
Top down vs. Bottom up Approach to DW Design
Ralph Kimball proposes a Bottom Up approach in which INDEPENDENT data marts are
created with a view to integrating them into an EDW at some time. He believes a data warehouse is
nothing more than the union of all the data marts.
Bill Inmon advocates a Top Down approach in which companies build an Enterprise DW in an
iterative manner, business area by business area. The DEPENDENT data marts and data
structures are created based on the content inside the EDW. He believes "you can catch all the
minnows in the ocean and stack them together and they still do not make a whale".
Both approaches converge on maintaining a single version of truth and an ability to adapt to
changing business user needs.
Challenges / Issues in DW Projects
Bringing in data from multiple source systems and integrating them. Some projects fail when
they attempt to synchronize volatile data.
Logical Transformation of operational data may require considerable analysis and design effort.
Architecture, Data model and partitioning strategy greatly impacts the success of the projects
Data Cleansing / Scrubbing may be needed.
Data coming in may be incomplete or contains values that cannot be transformed properly.
Business rules to generate summary views can be complex database operations such as multi
table joins and sub queries.
Choice of Reporting / ETL Tool for most active users.
Performance and Storage (Aggregate Explosion) for loading and retrieval of data.
Risks in DW Projects
Definitions of data are inconsistent across user types and upstream data systems during
Requirement Analysis.
Data ownership in the data warehouse is weak, and hence so is the sanctity of the data.
High dependency on upstream systems. Delays in upstream interfaces impact design and
development.
DQ issues originating in the source systems result in end user dissatisfaction.
Obtaining test data for validation, if real time data is confidential.
Squeezed development cycle due to high visibility of end user reporting.
Data processing time needs to be minimized to ensure up-to-date information.
What is the difference between Data modelling and Dimensional modelling?
Modelling data storage optimized for Transactional Processing is Data Modelling.
Modelling data storage optimized for Analytical Processing is Dimension Modelling.
Or
In data modeling we are structuring and organizing data. These data structures are then
typically implemented in a database management system. In addition to defining and
organizing the data, data modeling will impose (implicitly or explicitly) constraints or limitations
on the data placed within the structure.
Managing large quantities of structured and unstructured data is a primary function of
information systems. Data models describe structured data for storage in data management
systems such as relational databases.
Data warehouses are typically developed using dimensional models rather than the traditional
entity/relationship models associated with conventional relational databases.
Dimensional Modelling is a design concept used by many data warehouse designers to build
their datawarehouse. In this design model all the data is stored in two types of tables - Facts
table and Dimension table. Fact table contains the facts/measurements of the business and the
dimension table contains the context of measurements ie the dimensions on which the facts
are calculated.
Data is modeled as a hypercube and the schema is a so-called star schema with a centralized
fact table surrounded by smaller dimensional tables representing key scientific objects.
Dimensional database systems allow multidimensional data to be modeled natively. Or they can
be modeled using the star schema or snowflake schema.
Can a dimension table contain numeric values?
Yes. Dimension tables can contain numeric values like "Code", "ID" or anything else. There are no
restrictions on the datatype. It depends upon the business logic.
What is important is that on the reporting side you can't treat it as a MEASURE.
Dimensional data should always be treated as a DIMENSION irrespective of its DATATYPE.
Eg, you must have customer codes to identify your customers, but obviously you neither
require nor put SUM () on Cust_Code.
So it's not about the datatype, it's about the functionality you want to achieve.
What is the main difference between schema in RDBMS and schemas in Data Warehouse?
Difference between OLTP and OLAP:
OLTP Schema:
* Normalized
* More transactions
* Less time for query execution
* More users
* Has insert, delete and update transactions.
OLAP (DWH) Schema:
* Denormalized
* Fewer transactions
* Fewer users
* More time for query execution
* Will not have many inserts, deletes and updates.
What type of Indexing mechanism do we need to use for a typical data warehouse?
It generally depends on the data in the table. If a particular column has few distinct
values, a bitmap index is usually built on it rather than another index type. Dimension
tables generally have indexes as well.
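The reason bitmap indexes suit columns with few distinct values can be shown with a toy version in Python, using integers as bit strings. This is a sketch of the idea only, not how any database implements it, and the column data is invented.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose set bits mark matching rows."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """List the row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical low-cardinality columns of a five-row table.
gender = ["M", "F", "F", "M", "F"]
region = ["N", "N", "S", "S", "N"]

gender_index = build_bitmap_index(gender)
region_index = build_bitmap_index(region)

# A two-column filter becomes a single bitwise AND of two bitmaps.
rows = matching_rows(gender_index["F"] & region_index["N"])
```

With only a handful of distinct values there are only a handful of bitmaps, which is why this kind of index stays small on such columns.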
What is the difference between ODS and Staging?
An Operational Data Store (ODS) is a type of database often used as an interim area for a data
warehouse. Unlike a data warehouse, which contains static data, the contents of the ODS are
updated through the course of business operations. An ODS is designed to quickly perform
relatively simple queries on small amounts of data, rather than the complex queries on large
amounts of data typical of the data warehouse. An ODS is similar to your short term memory in
that it stores only recent information; in comparison, the data warehouse is more like long term
memory in that it stores relatively permanent information. In staging, however, we store current
as well as historic data. This data might be raw and may need cleansing and transformation before
loading into the data warehouse.
What is the main functional difference between ROLAP, MOLAP and HOLAP?
The FUNCTIONAL difference between these is how the information is stored. In all cases, the
users see the data as a cube of dimensions and facts.
ROLAP - detailed data is stored in a relational database in 3NF, star, or snowflake form. Queries
must summarize data on the fly.
MOLAP - data is stored in multidimensional form, with dimensions and facts stored together. You
can think of this as a persistent cube. Level of detail is determined by the intersection of the
dimension hierarchies.
HOLAP - data is stored using a combination of relational and multi-dimensional storage.
Summary data might persist as a cube, while detail data is stored relationally, but transitioning
between the two is invisible to the end-user.
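The storage difference can be imitated in a few lines of Python. This is a toy model with invented data: rolap_total summarizes detail rows at query time, while build_cube precomputes every (year, region) cell, with None standing for "all", so a query becomes a lookup.

```python
from collections import defaultdict

# Hypothetical detail rows: (year, region, amount).
detail = [("2024", "East", 10.0), ("2024", "West", 5.0), ("2023", "East", 7.0)]


def rolap_total(rows, year=None, region=None):
    """ROLAP style: scan the detail rows and summarize on the fly."""
    return sum(
        amount
        for row_year, row_region, amount in rows
        if year in (None, row_year) and region in (None, row_region)
    )


def build_cube(rows):
    """MOLAP style: precompute every cell, using None for the 'all' level."""
    cube = defaultdict(float)
    for year, region, amount in rows:
        for cell in ((year, region), (year, None), (None, region), (None, None)):
            cube[cell] += amount
    return cube


cube = build_cube(detail)
on_the_fly = rolap_total(detail, year="2024")
precomputed = cube[("2024", None)]
```

A HOLAP-like arrangement would keep only the summary cells in the cube and fall back to the detail rows for anything finer.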
