
ETL Specific

1) What is ETL?

ETL stands for Extract-Transform-Load and it is a process of how data is loaded from the source system
to the data warehouse. Data is extracted from an OLTP database, transformed to match the data
warehouse schema and loaded into the data warehouse database. Many data warehouses also incorporate
data from non-OLTP systems such as text files, legacy systems and spreadsheets.
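As a rough illustration of the load step, here is a minimal SQL sketch, assuming a hypothetical staging table stg_orders and a warehouse fact table fact_orders (all table names, columns and transformation rules are invented for illustration):

-- Load transformed rows from a staging table into a warehouse fact table.
INSERT INTO fact_orders (order_id, customer_id, order_date, order_amount_usd)
SELECT s.order_id,
       s.customer_id,
       s.order_date,
       ROUND(s.order_amount * s.exchange_rate, 2)   -- transform: convert amount to a common currency
FROM   stg_orders s
WHERE  s.order_amount IS NOT NULL;                  -- simple cleansing rule: drop rows with no amount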

2) Why ETL testing is required?

Due to the high volume of transactions, organizations move their transactional data into a data warehouse, and the correctness of that movement needs to be verified.

ETL testing includes:

1. Verify that the data is transformed correctly according to business requirements.
2. Verify that the projected data is loaded into the data warehouse without any truncation or data loss.
3. Make sure that the ETL application reports invalid data and replaces it with default values.
4. Make sure that data is loaded within the expected time frame to ensure scalability and performance.

3) What is Data warehouse?

Bill Inmon is known as the father of the data warehouse.

A Data warehouse is a subject oriented, integrated, time variant, non-volatile collection of data in support
of management’s decision making process.

A data warehouse is a database that is designed for query and analysis rather than for transaction
processing. The data warehouse is constructed by integrating the data from multiple heterogeneous
sources. It enables the company or organization to consolidate data from several sources and separates
analysis workload from transaction workload. Data is turned into high quality information to meet all
enterprise reporting requirements for all levels of users.

4) What are the characteristics of a Data Warehouse?

Subject-Oriented:

A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a
particular subject.

Integrated:

A data warehouse integrates data from multiple data sources. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse, there will be only a single way of
identifying a product.

Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months,
12 months, or even older data from a data warehouse. This contrasts with a transaction system, where
often only the most recent data is kept. For example, a transaction system may hold the most recent
address of a customer, whereas a data warehouse can hold all addresses associated with a customer.

Non-volatile:

Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should
never be altered.

5) What is the difference between Data Mining and Data Warehousing?

1. Data mining - analyzing data from different perspectives and summarizing it into useful decision-making
information. It can be used to increase revenue, cut costs, increase productivity or improve any
business process. There are a lot of tools available in the market for various industries to do data mining.
Basically, it is all about finding correlations or patterns in large relational databases.

2. Data warehousing comes before data mining. It is the process of compiling and organizing data into
one database from various source systems, whereas data mining is the process of extracting meaningful
data from that database (the data warehouse).

Or

1. Data mining is the process of finding patterns in a given data set. These patterns can often provide
meaningful and insightful information to whoever is interested in that data. Data mining is used today in a wide
variety of contexts – in fraud detection, as an aid in marketing campaigns, and even by supermarkets to
study their consumers.

2. Data warehousing can be said to be the process of centralizing or aggregating data from multiple
sources into one common repository.

6) What are the main stages of Business Intelligence?

The five key stages of Business Intelligence:

Data sourcing:

Business Intelligence is about extracting information from multiple sources of data. The data might be:
text documents - e.g. memos or reports or email messages; photographs and images; sounds; formatted
tables; web pages and URL lists. The key to data sourcing is to obtain the information in electronic form.
So typical sources of data might include: scanners; digital cameras; database queries; web searches;
computer file access etc.

Data analysis:
Business Intelligence is about synthesizing useful knowledge from collections of data. It is about
estimating current trends, integrating and summarizing disparate information, validating models of
understanding and predicting missing information or future trends. This process of data analysis is also
called data mining or knowledge discovery.

Situation awareness:

Business Intelligence is about filtering out irrelevant information, and setting the remaining information
in the context of the business and its environment. The user needs the key items of information relevant to
his or her needs, and summaries that are syntheses of all the relevant data (market forces, government
policy etc.). Situation awareness is the grasp of the context in which to understand and make
decisions. Algorithms for situation assessment provide such syntheses automatically.

Risk assessment:

Business Intelligence is about discovering what plausible actions might be taken, or decisions made, at
different times. It is about helping you weigh up the current and future risk, cost or benefit of taking one
action over another, or making one decision versus another. It is about inferring and summarizing your
best options or choices.

Decision support:

Business Intelligence is about using information wisely. It aims to warn you of important events, such as
takeovers, market changes, and poor staff performance, so that you can take preventive steps. It seeks to
help you analyze and make better business decisions, to improve sales, customer satisfaction or staff
morale. It presents the information you need, when you need it.

7) What tools you have used for ETL testing?

1. Data access tools, e.g., TOAD, WinSQL, AQT (used to analyze the content of tables)
2. ETL tools, e.g., Informatica, DataStage
3. Test management tools, e.g., Test Director, Quality Center (used to maintain requirements, test cases,
defects and the traceability matrix)

8) What is Grain of Fact?

Grain of fact can be defined as the level at which the fact information is stored. It is also known as fact
granularity.

9) What is a factless fact table, and what are measures?

A fact table without measures is known as a factless fact table. It can be used to view the number of
occurring events; for example, it can record an event such as the employee count in a company.

The numeric data based on columns in a fact table is known as measures.


10) What is partitioning, hash partitioning and round robin partitioning?

To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables the
Informatica Server to create multiple connections to various sources.

The two types of partitioning are:

Round-Robin Partitioning:

In round-robin partitioning, data is evenly distributed among all the partitions so that the number of rows
in each partition is approximately the same.

Hash Partitioning:

In hash partitioning, the server applies a hash function to create partition keys and group the data.
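As a reference point, here is a hedged, Oracle-style sketch of a hash-partitioned table; the table and column names are invented, and the exact syntax varies by database and by how the ETL tool configures partitioning:

-- Rows are spread across 4 partitions based on a hash of customer_id,
-- so each partition holds roughly the same number of rows.
CREATE TABLE sales_fact (
    sale_id      NUMBER,
    customer_id  NUMBER,
    sale_date    DATE,
    amount       NUMBER(12,2)
)
PARTITION BY HASH (customer_id)
PARTITIONS 4;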

11) Name the three approaches that can be followed for system integration.

The three approaches are − top-down, bottom-up, and hybrid.

12) What are the different ETL Testing categories as per their function?

ETL testing can be divided into the following categories based on their function −

Source to Target Count Testing − It involves matching the counts of records in the source and target
systems (a sample count-check query is sketched after this list).

Source to Target Data Testing − It involves data validation between the source and target systems. It also
covers data integration, threshold value checks and duplicate data checks in the target system.

Data Mapping or Transformation Testing − It confirms the mapping of objects in the source and target
systems. It also involves checking the functionality of the data in the target system.

End-User Testing − It involves generating reports for end users to verify that the data in the reports is as
expected. It involves finding deviations in the reports and cross-checking the data in the target system for
report validation.

Retesting − It involves fixing the bugs and defects in the data in the target system and running the reports
again for data validation.

System Integration Testing − It involves testing all the individual systems and later combining the results
to find whether there is any deviation.
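For the Source to Target Count Testing category referenced above, a minimal SQL sketch follows; the table names src_customer and tgt_dim_customer are illustrative:

-- Both counts should match; any difference indicates missing or extra rows.
SELECT COUNT(*) AS src_count FROM src_customer;
SELECT COUNT(*) AS tgt_count FROM tgt_dim_customer;

-- Single-query variant when both tables are reachable from one connection (Oracle-style FROM dual):
SELECT (SELECT COUNT(*) FROM src_customer)     AS src_count,
       (SELECT COUNT(*) FROM tgt_dim_customer) AS tgt_count
FROM   dual;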

13) Explain the key challenges that you face while performing ETL Testing.

Data loss during the ETL process.

Incorrect, incomplete or duplicate data.

The DW system contains historical data, so the data volume is very large and genuinely complex to test in
the target system.

ETL testers are normally not given access to job schedules in the ETL tool. They rarely have access to BI
reporting tools to see the final layout of the reports and the data inside them.

It is tough to generate and build test cases because the data volume is very high and complex.

ETL testers normally don't have a clear idea of the end-user report requirements or the business flow of
the information.

ETL testing involves various complex SQL concepts for data validation in the target system.

Sometimes testers are not provided with source-to-target mapping information.

An unstable testing environment delays the development and testing process.

14) What are your responsibilities as an ETL Tester?

The key responsibilities of an ETL tester include −

Verifying the tables in the source system − count check, data type check, keys not missing, duplicate
data.

Applying the transformation logic before loading the data − data threshold validation, surrogate key
check, etc.

Data loading from the staging area to the target system − aggregate values and calculated measures, key
fields not missing, count check in the target table, BI report validation, etc.

Testing of the ETL tool and its components, test cases − create, design and execute test plans and test
cases, test the ETL tool and its functions, test the DW system, etc.

15) What do you understand by Threshold value validation Testing? Explain with an example.

In this testing, a tester validates the range of data. All the threshold values in the target system are to be
checked to ensure they are as per the expected result.

Example − the Age attribute shouldn't have a value greater than 100, and in a date column in DD/MM/YY
format, the month field shouldn't have a value greater than 12.

16) How does duplicate data appear in a target system?

When no primary key is defined, duplicate values may appear.

Data duplication may also arise due to incorrect mapping and manual errors while transferring data from
the source to the target system.
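A simple SQL sketch for detecting such duplicates, assuming an illustrative table tgt_dim_customer keyed by customer_id:

-- Business keys that occur more than once in the target table.
SELECT   customer_id, COUNT(*) AS occurrences
FROM     tgt_dim_customer
GROUP BY customer_id
HAVING   COUNT(*) > 1;
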
17) What is Regression testing?

Regression testing is performed when changes are made to the data transformation and aggregation rules
to add new functionality; it helps the tester find new errors introduced by those changes. The bugs that
appear in the data as a result of such changes are called regressions.

18) What are the common ETL Testing scenarios?

The most common ETL testing scenarios are −

Structure validation
Validating Mapping document
Validate Constraints
Data Consistency check
Data Completeness Validation
Data Correctness Validation
Data Transform validation
Data Quality Validation
Null Validation
Duplicate Validation
Date Validation check
Full Data Validation using minus query
Other Test Scenarios
Data Cleaning

19) What do you understand by a cosmetic bug in ETL testing?

A cosmetic bug is related to the GUI of an application. It can involve font style, font size, colors,
alignment, spelling mistakes, navigation, etc.

20) What do you call the testing bug that comes while performing threshold validation testing?

It is called a Boundary Value Analysis (BVA) related bug.

21) Name a few checks that can be performed to achieve ETL Testing Data accuracy.

Value comparison − It involves comparing the data in the source and the target systems with minimum or
no transformation. It can be done using various ETL Testing tools such as Source Qualifier
Transformation in Informatica.

Critical data columns can be checked by comparing distinct values in source and target systems.

22) What do you understand by fact-less fact table?

A fact-less fact table is a fact table that does not have any measures. It is essentially an intersection of
dimensions. There are two types of fact-less tables: One is for capturing an event, and the other is for
describing conditions.

23) What is a slowly changing dimension and what are its types?
Dimensions that change over time are called Slowly Changing Dimensions (SCDs).

SCDs are of three types − Type 1, Type 2, and Type 3.

SCD Type 1: has only current records.

SCD Type 2: has current records + historical records.

SCD Type 3: has current records + one previous record.
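A hedged sketch of how an SCD Type 2 dimension is commonly modelled; the column names (customer_sk, eff_from_dt, eff_to_dt, curr_flag) are common conventions used here for illustration, not a fixed standard:

-- Every change to a customer creates a new row; history is kept via effective dates.
CREATE TABLE dim_customer (
    customer_sk    INTEGER      NOT NULL,   -- surrogate key, one per version of the customer
    customer_id    INTEGER      NOT NULL,   -- natural / business key
    customer_name  VARCHAR(100),
    address        VARCHAR(200),
    eff_from_dt    DATE         NOT NULL,   -- when this version became effective
    eff_to_dt      DATE,                    -- NULL (or a high date) for the open version
    curr_flag      CHAR(1)      NOT NULL    -- 'Y' for the current version, 'N' for history
);

-- Current version of every customer:
SELECT * FROM dim_customer WHERE curr_flag = 'Y';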

24) What is RTM?

It is a document that captures, maps and traces user requirements to test cases.

The main purpose of the Requirement Traceability Matrix is to verify that all test cases are covered so that
no functionality is missed during software testing.

25) ETL Test Scenarios and Test Cases

The common test scenarios and their corresponding test cases are listed below.

Mapping doc validation:

1. Verify the mapping doc to check whether the corresponding ETL information is provided. A change log
should be maintained in every mapping doc.

2. Define the default test strategy for cases where the mapping docs omit some optional information, e.g.,
data type lengths.

Structure validation:

1. Validate the source and target table structures against the corresponding mapping doc (a metadata-query
sketch follows this list).

2. The source data type and the target data type should be the same.

3. The length of data types in both source and target should be equal.

4. Verify that data field types and formats are specified.

5. The source data type length should not be less than the target data type length.

6. Validate the names of the columns in the table against the mapping doc.
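Where the source and target databases expose an ANSI information schema, parts of this structure validation can be automated; a hedged sketch with illustrative table names (column-name casing and catalog views differ by platform):

-- Any row returned indicates a mismatch in data type or length between source and target.
SELECT s.column_name,
       s.data_type  AS src_type, s.character_maximum_length AS src_length,
       t.data_type  AS tgt_type, t.character_maximum_length AS tgt_length
FROM   information_schema.columns s
JOIN   information_schema.columns t
       ON t.column_name = s.column_name
      AND t.table_name  = 'tgt_customer'
WHERE  s.table_name = 'src_customer'
  AND (s.data_type <> t.data_type
       OR s.character_maximum_length <> t.character_maximum_length);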

Constraint Validation:

Ensure the constraints are defined for the specific tables as expected.

Data Consistency Issues:

1. The data type and length for a particular attribute may vary in files or tables even though the semantic
definition is the same.
Example: an account number may be defined as NUMBER(9) in one field or table and VARCHAR2(11) in
another table.

2. Misuse of integrity constraints: when referential integrity constraints are misused, foreign key values
may be left "dangling" or inadvertently deleted.
Example: an account record is missing but its dependent records are not deleted.

Data Completeness Issues:

Ensure that all expected data is loaded into the target table.

1. Compare record counts between source and target. Check for any rejected records.

2. Check that data is not truncated in the columns of the target table.

3. Check boundary values (e.g., only data for year >= 2008 has to be loaded into the target); a sketch of
this check follows this list.

4. Compare unique values of key fields between the source data and the data loaded into the warehouse.
This is a valuable technique that points out a variety of possible data errors without doing a full validation
on all fields.
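A hedged sketch of the boundary/completeness check mentioned in point 3, with illustrative table names and the 2008 cut-off taken from the example above:

-- No pre-2008 rows should reach the target; the expected result is 0.
SELECT COUNT(*) AS out_of_range_rows
FROM   tgt_sales
WHERE  EXTRACT(YEAR FROM sale_date) < 2008;

-- The filtered source count should match the target count.
SELECT (SELECT COUNT(*) FROM src_sales WHERE EXTRACT(YEAR FROM sale_date) >= 2008) AS src_count,
       (SELECT COUNT(*) FROM tgt_sales)                                            AS tgt_count
FROM   dual;   -- Oracle-style; drop FROM dual on databases that allow SELECT without FROM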

Data Correctness Issues:

1. Data that is misspelled or inaccurately recorded.

2. Null, non-unique, or out-of-range data may be stored when the integrity constraints are disabled.
Example: the primary key constraint is disabled during an import, and data is inserted alongside the
existing data with null unique identifiers.
Data Transformation:

1. Create a spreadsheet of scenarios of input data and expected results and validate these with the
business customer. This is an excellent requirements elicitation step during design and can also be used
as part of testing.

2. Create test data that includes all scenarios. Utilize an ETL developer to automate the process of
populating data sets from the scenario spreadsheet, to allow for versatility and reuse, because scenarios
are likely to change.

3. Utilize data profiling results to compare the range and distribution of values in each field between the
target and source data.

4. Validate accurate processing of ETL-generated fields, for example surrogate keys.

5. Validate that the data types within the warehouse are the same as specified in the data model or
design.

6. Create data scenarios between tables that test referential integrity.

7. Validate parent-to-child relationships in the data. Create data scenarios that test the handling of
orphaned child records.

Data Quality:

1. Number check: if the source uses a numbering format such as xx_30 but the target should hold only 30,
then the value has to be loaded without the prefix (xx_); we need to validate this.

2. Date check: dates have to follow the agreed date format, and it should be the same across all the
records, e.g., the standard format yyyy-mm-dd.

3. Precision check: the precision value should be displayed as expected in the target table.
Example: the source holds 19.123456, but the target should display it as 19.123 or rounded to 20.

4. Data check: based on business logic, records which do not meet certain criteria should be filtered out
(see the SQL sketch after this list).
Example: only records whose date_sid >= 2008 and GLAccount != 'CM001' should be loaded into the
target table.

5. Null check: a few columns should display "Null" based on the business requirement.
Example: the Termination Date column should display null unless the "Active status" column is "T" or
"Deceased".

Note: data cleanness will be decided during the design phase only.
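A hedged SQL sketch of the data check and precision check above, reusing the illustrative date_sid/GLAccount example and an invented table name tgt_gl_fact:

-- Data check: rows violating the business filter should not exist in the target; the expected result is 0.
SELECT COUNT(*) AS unexpected_rows
FROM   tgt_gl_fact
WHERE  date_sid < 2008
   OR  gl_account = 'CM001';

-- Precision check: amounts should be stored with at most 3 decimal places.
SELECT *
FROM   tgt_gl_fact
WHERE  amount <> ROUND(amount, 3);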

Null Validation:

Verify that there are no null values in columns where "Not Null" is specified.

Duplicate check:

1. Validate that the unique key, the primary key and any other columns that should be unique as per the
business requirements do not have any duplicate rows.

2. Check whether any duplicate values exist in a column that is extracted from multiple columns in the
source and combined into one column.

3. Sometimes, as per the client requirements, we need to ensure that there are no duplicates in a
combination of multiple columns within the target only.

Example: one policy holder can take multiple policies and have multiple claims. In this case we need to
verify the combination of CLAIM_NO, CLAIMANT_NO, COVEREGE_NAME, EXPOSURE_TYPE,
EXPOSURE_OPEN_DATE, EXPOSURE_CLOSED_DATE, EXPOSURE_STATUS, PAYMENT.

DATE Validation:

Date values are used in many areas of ETL development:

1. To know the row creation date, e.g., CRT_TS.

2. To identify active records from the ETL development perspective, e.g., VLD_FROM, VLD_TO.

3. To identify active records from the business requirements perspective, e.g., CLM_EFCTV_T_TS,
CLM_EFCTV_FROM_TS.

4. Sometimes the updates and inserts are generated based on the date values.

Possible test scenarios to validate the date values (a sketch follows this list):

a. From_Date should not be greater than To_Date.
b. The format of the date values should be proper.
c. Date values should not contain any junk values or null values.
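A hedged sketch of scenarios (a) and (c) above, using the VLD_FROM/VLD_TO columns mentioned earlier as illustrative names:

-- Any rows returned are violations of scenarios (a) and (c).
SELECT *
FROM   tgt_claim
WHERE  vld_from > vld_to        -- From_Date must not be greater than To_Date
   OR  vld_from IS NULL
   OR  vld_to   IS NULL;        -- adjust if open-ended records legitimately carry NULL end dates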

Complete Data Validation:

1. To validate the complete data set between the source and target tables, a minus query (using MINUS
and INTERSECT) is the best solution (sketched after this list).

2. We need to run source minus target and target minus source.

3. If the minus query returns any rows, those should be considered mismatching rows.

4. We also need to check the matching rows between source and target using an INTERSECT statement.

5. The count returned by INTERSECT should match the individual counts of the source and target tables.

6. If the minus query returns 0 rows and the INTERSECT count is less than the source count or the target
table count, then we can consider that duplicate rows exist.
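A hedged sketch of the MINUS/INTERSECT validation described above (MINUS is Oracle-style; some databases use EXCEPT instead), with illustrative table names:

-- Rows present in source but missing or different in target:
SELECT * FROM src_policy
MINUS
SELECT * FROM tgt_policy;

-- Rows present in target but not in source:
SELECT * FROM tgt_policy
MINUS
SELECT * FROM src_policy;

-- Matching rows; this count should equal the source and target counts
-- when there are no mismatches and no duplicates:
SELECT COUNT(*)
FROM  (SELECT * FROM src_policy
       INTERSECT
       SELECT * FROM tgt_policy);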

Some useful test scenarios:

1. Verify that the extraction process did not extract duplicate data from the source (usually this happens
in repeatable processes where, at point zero, we need to extract all data from the source file, but during
the next intervals we only need to capture the modified and new rows).

2. The QA team will maintain a set of SQL statements that are automatically run at this stage to validate
that no duplicate data has been extracted from the source systems.

Data cleanness:

Unnecessary columns should be deleted before loading into the staging area.

Example 1: if a column holds a name but the value carries extra spaces, we have to "trim" the spaces
before loading into the staging area; with the help of an expression transformation, the spaces will be
trimmed.

Example 2: suppose the telephone number and the STD code are in different columns and the requirement
says they should be in one column; then, with the help of an expression transformation, we will
concatenate the values into one column.
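A hedged SQL sketch of both examples, expressed as a staging-load query with invented table and column names (the same logic would normally live in an expression transformation inside the ETL tool):

-- Trim stray spaces and combine STD code + phone number into one column
-- while loading from a raw staging table into a cleansed staging table.
INSERT INTO stg_customer_clean (customer_id, customer_name, phone_full)
SELECT customer_id,
       TRIM(customer_name),                -- remove leading/trailing spaces
       std_code || '-' || phone_number     -- concatenate STD code and number into one column
FROM   stg_customer_raw;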
