1) What is ETL?
ETL stands for Extract, Transform, Load; it is the process by which data is loaded from the source system
into the data warehouse. Data is extracted from an OLTP database, transformed to match the data
warehouse schema, and loaded into the data warehouse database. Many data warehouses also incorporate
data from non-OLTP systems such as text files, legacy systems, and spreadsheets.
Due to the high volume of transactions, organizations move all their transactional data into a data warehouse.
A Data warehouse is a subject oriented, integrated, time variant, non-volatile collection of data in support
of management’s decision making process.
A data warehouse is a database that is designed for query and analysis rather than for transaction
processing. The data warehouse is constructed by integrating the data from multiple heterogeneous
sources. It enables a company or organization to consolidate data from several sources and separates
the analysis workload from the transaction workload. Data is turned into high-quality information to meet all
enterprise reporting requirements for all levels of users.
Subject-Oriented:
A data warehouse can be used to analyze a particular subject area. For example, “sales” can be a
particular subject.
Integrated:
A data warehouse integrates data from multiple data sources. For example, source A and source B may
have different ways of identifying a product, but in a data warehouse, there will be only a single way of
identifying a product.
Time-Variant:
Historical data is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months,
12 months, or even older data from a data warehouse. This contrasts with a transaction system, where
often only the most recent data is kept. For example, a transaction system may hold the most recent
address of a customer, whereas a data warehouse can hold all addresses associated with a customer.
Non-volatile:
Once data is in the data warehouse, it will not change. So, historical data in a data warehouse should
never be altered.
1. Data mining - analyzing data from different perspectives and condensing it into useful decision-making
information. It can be used to increase revenue, cut costs, increase productivity, or improve any
business process. There are many tools available on the market for data mining across various industries.
Fundamentally, it is about finding correlations or patterns in large relational databases.
2. Data warehousing comes before data mining. It is the process of compiling and organizing data into
one database from various source systems, whereas data mining is the process of extracting meaningful
data from that database (the data warehouse).
Or
1. Data mining is the process of finding patterns in a given data set. These patterns can often provide
meaningful and insightful data to whoever is interested in that data. Data mining is used today in a wide
variety of contexts - in fraud detection, as an aid in marketing campaigns, and even by supermarkets to
study their consumers.
2. Data warehousing can be said to be the process of centralizing or aggregating data from multiple
sources into one common repository.
Data sourcing:
Business Intelligence is about extracting information from multiple sources of data. The data might be:
text documents, e.g. memos, reports, or email messages; photographs and images; sounds; formatted
tables; web pages and URL lists. The key to data sourcing is to obtain the information in electronic form.
So typical sources of data might include: scanners; digital cameras; database queries; web searches;
computer file access etc.
Data analysis:
Business Intelligence is about synthesizing useful knowledge from collections of data. It is about
estimating current trends, integrating and summarizing disparate information, validating models of
understanding and predicting missing information or future trends. This process of data analysis is also
called data mining or knowledge discovery.
Situation awareness:
Business Intelligence is about filtering out irrelevant information, and setting the remaining information
in the context of the business and its environment. The user needs the key items of information relevant to
his or her needs, and summaries that are syntheses of all the relevant data (market forces, government
policy etc.). Situation awareness is the grasp of the context in which to understand and make
decisions. Algorithms for situation assessment provide such syntheses automatically.
Risk assessment:
Business Intelligence is about discovering what plausible actions might be taken, or decisions made, at
different times. It is about helping you weigh up the current and future risk, cost or benefit of taking one
action over another, or making one decision versus another. It is about inferring and summarizing your
best options or choices.
Decision support
Business Intelligence is about using information wisely. It aims to warn you of important
events, such as takeovers, market changes, and poor staff performance, so that you can take preventative
steps. It seeks to help you analyze and make better business decisions, to improve sales or customer
satisfaction or staff morale. It presents the information you need, when you need it.
1. Data access tools e.g., TOAD, WinSQL, AQT etc. (used to analyze content of tables)
2. ETL Tools e.g. Informatica, DataStage
3. Test management tool e.g. Test Director, Quality Center etc. ( to maintain requirements, test cases,
defects and traceability matrix)
Grain fact can be defined as the level at which the fact information is stored. It is also known as fact
granularity.
A fact table without measures is known as a factless fact table. It can be used to count occurring events.
For example, it can record an event such as the employee count in a company.
To improve performance, transactions are subdivided; this is called partitioning. Partitioning enables
the Informatica Server to create multiple connections to various sources.
Round-Robin Partitioning:
In round-robin partitioning, data is distributed evenly among all the partitions, so the number of rows in
each partition is approximately the same.
Hash Partitioning:
In hash partitioning, the server uses a hash function on the partition keys to group the data into
partitions.
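The two partitioning schemes above can be sketched in a few lines of Python. This is an illustrative model only, not how the Informatica Server actually implements partitioning; the partition count, key name, and sample rows are invented.

```python
# Illustrative sketch of round-robin vs. hash partitioning.
# Partition count, key name, and sample rows are invented; this is not
# how the Informatica Server actually implements partitioning.

def round_robin_partition(rows, n_partitions):
    """Deal rows out one at a time so partition sizes stay nearly equal."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return partitions

def hash_partition(rows, key, n_partitions):
    """Route each row by hashing its partition key; equal keys co-locate."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        partitions[hash(row[key]) % n_partitions].append(row)
    return partitions

rows = [{"id": i, "region": r} for i, r in enumerate("NNSSEEWW")]
rr = round_robin_partition(rows, 3)     # sizes 3, 3, 2 - nearly even
hp = hash_partition(rows, "region", 3)  # all rows of a region together
```

Round-robin balances load; hash partitioning trades perfect balance for the guarantee that rows sharing a key end up in the same partition.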
11) Name the three approaches that can be followed for system integration.
12) What are the different ETL Testing categories as per their function?
ETL testing can be divided into the following categories based on their function −
Source to Target Count Testing − It involves matching the count of records in the source and target systems.
Source to Target Data Testing − It involves data validation between the source and target systems. It also
involves data integration, threshold value checks, and duplicate data checks in the target system.
Data Mapping or Transformation Testing − It confirms the mapping of objects in the source and target
systems. It also involves checking the functionality of data in the target system.
End-User Testing − It involves generating reports for end users to verify that the data in the reports is as
expected. It involves finding deviations in reports and cross-checking the data in the target system for
report validation.
Retesting − It involves fixing the bugs and defects in data in the target system and running the reports again
for data validation.
System Integration Testing − It involves testing all the individual systems and later combining the results to
find if there is any deviation.
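The first category, source-to-target count testing, amounts to a pair of COUNT(*) queries. A minimal sketch using Python's built-in sqlite3 module; the table names, columns, and rows are hypothetical:

```python
# Minimal source-to-target count test using Python's built-in sqlite3.
# The table names, columns, and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, name TEXT)")
cur.execute("CREATE TABLE tgt (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO src VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])
cur.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, "a"), (2, "b"), (3, "c")])

# The actual test: record counts must match between source and target.
src_count = cur.execute("SELECT COUNT(*) FROM src").fetchone()[0]
tgt_count = cur.execute("SELECT COUNT(*) FROM tgt").fetchone()[0]
count_match = src_count == tgt_count
```

In a real DW environment the two queries would run against the source OLTP database and the target warehouse, but the assertion is the same.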
13) Explain the key challenges that you face while performing ETL Testing.
A DW system contains historical data, so the data volume is very large, which makes ETL testing in the
target system really complex.
ETL testers are normally not provided with access to see job schedules in the ETL tool. They hardly have
access to BI reporting tools to see the final layout of reports and the data inside the reports.
It is tough to generate and build test cases, as the data volume is too high and complex.
ETL testers normally don't have an idea of the end-user report requirements and the business flow of the
information.
ETL testing involves various complex SQL concepts for data validation in the target system.
Sometimes testers are not provided with the source-to-target mapping information.
An unstable testing environment results in delays in development and the testing process.
Verifying the tables in the source system − count check, data type check, keys are not missing, duplicate
data.
Applying the transformation logic before loading the data − data threshold validation, surrogate key check,
etc.
Data loading from the staging area to the target system − aggregate values and calculated measures, key
fields are not missing, count check in the target table, BI report validation, etc.
Testing of the ETL tool and its components − create, design and execute test plans and test cases, test the
ETL tool and its functions, test the DW system, etc.
15) What do you understand by Threshold value validation Testing? Explain with an example.
In this testing, a tester validates the range of data. All the threshold values in the target system are to be
checked to ensure they are as per the expected result.
Example − An Age attribute shouldn't have a value greater than 100. In a date column with format
DD/MM/YY, the month field shouldn't have a value greater than 12.
Data duplication may also arise due to incorrect mapping and manual errors while transferring data from
the source to the target system.
17) What is Regression testing?
Regression testing is performed when we make changes to the data transformation and aggregation rules
to add new functionality; it helps the tester find new errors. The bugs that appear in data during
regression testing are called regressions.
Structure validation
Validating Mapping document
Validate Constraints
Data Consistency check
Data Completeness Validation
Data Correctness Validation
Data Transform validation
Data Quality Validation
Null Validation
Duplicate Validation
Date Validation check
Full Data Validation using minus query
Other Test Scenarios
Data Cleaning
A cosmetic bug is related to the GUI of an application. It can be related to font style, font size, colors,
alignment, spelling mistakes, navigation, etc.
20) What do you call the testing bug that comes while performing threshold validation testing?
21) Name a few checks that can be performed to achieve ETL Testing Data accuracy.
Value comparison − It involves comparing the data in the source and target systems with minimum or
no transformation. It can be done using various ETL testing tools, such as the Source Qualifier
Transformation in Informatica.
Critical data columns can be checked by comparing distinct values in the source and target systems.
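The distinct-value comparison can be sketched with Python sets. In practice the two sets would come from SELECT DISTINCT queries against the source and target; here they are hard-coded for illustration:

```python
# Distinct-value comparison for a critical column (e.g. country code).
# In practice these sets would come from SELECT DISTINCT queries against
# the source and target systems; the values here are invented.
source_values = {"US", "UK", "DE", "FR"}
target_values = {"US", "UK", "DE"}

missing_in_target = source_values - target_values  # dropped during load
extra_in_target = target_values - source_values    # unexpected values
```

A non-empty difference in either direction flags a data accuracy problem worth investigating.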
A factless fact table is a fact table that does not have any measures. It is essentially an intersection of
dimensions. There are two types of factless fact tables: one captures an event, and the other describes
conditions.
23) What is a slowly changing dimension and what are its types?
Dimensions that change over time are called Slowly Changing Dimensions (SCDs). The commonly used
types are: Type 1 − overwrite the old value; Type 2 − add a new row and preserve the full history; Type 3 −
add a new column that holds the previous value.
The main purpose of a Requirement Traceability Matrix is to verify that all test cases are covered, so that
no functionality is missed during software testing.
Mapping doc validation: Verify the mapping doc to check whether the corresponding ETL information is
provided. A change log should be maintained in every mapping doc. Define a default test strategy in case
the mapping docs leave out some optional information (e.g. data types, lengths, etc.).
Structure validation:
1. Validate the source and target table structure against the corresponding mapping doc.
5. The source data type length should not be less than the target data type length.
Constraint validation: Ensure the constraints are defined for the specific table as expected.
Data consistency issues:
1. The data type and length for a particular attribute may vary in files or tables though the semantic
definition is the same.
Example: An account number may be defined as Number(9) in one field or table and Varchar2(11) in
another table.
2. Null, non-unique, or out-of-range data may be stored when the integrity constraints are disabled.
Example: The primary key constraint is disabled during an import function, and data is entered into the
existing data with null unique identifiers.
Data completeness issues: Ensure that all expected data is loaded into the target table.
1. Compare record counts between source and target. Check for any rejected records.
3. Check boundary value analysis (e.g. only data from year >= 2008 has to load into the target).
4. Compare unique values of key fields between the source data and the data loaded into the warehouse.
This is a valuable technique that points out a variety of possible data errors without doing a full
validation on all fields.
Data transformation:
1. Create a spreadsheet of scenarios of input data and expected results and validate these with the
business customer. This is an excellent requirements elicitation step during design and could also be used
as part of testing.
2. Create test data that includes all scenarios. Utilize an ETL developer to automate the entire process of
populating data sets with the scenario spreadsheet, to permit versatility and mobility, because scenarios
are likely to change.
5. Validate that the data types within the warehouse are the same as specified in the data model or
design.
Data quality:
1. Number check: if in the source the columns are numbered in the format xx_30 but the target expects
only 30, then the prefix (xx_) must not be loaded; we need to validate this.
2. Date check: dates have to follow the agreed date format, and it should be the same across all records
(standard format: yyyy-mm-dd, etc.).
4. Data check: based on business logic, records that do not meet certain criteria should be filtered out.
Example: only records whose date_sid >= 2008 and GLAccount != 'CM001' should be loaded into the
target table.
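The date check above (every record in a single yyyy-mm-dd format) can be sketched with a regular expression in Python; the sample dates are invented:

```python
# Date-format quality check: every record must use yyyy-mm-dd.
# The sample dates are invented for illustration.
import re

# Year, month 01-12, day 01-31.
DATE_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])$")

dates = ["2008-01-15", "2009-12-31", "15/01/2008"]
bad_dates = [d for d in dates if not DATE_RE.match(d)]
```

Rows collected in `bad_dates` would be reported as quality failures; a stricter check could also reject impossible calendar dates such as `2009-02-30`.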
Null validation: Verify the null values where "Not Null" is specified for a given column.
Duplicate check:
1. Validate that the unique key, primary key, and any other column that should be unique as per the
business requirements do not have any duplicate rows.
2. Check whether any duplicate values exist in any column that is extracted from multiple columns in the
source and combined into one column.
Example: One policy holder can take multiple policies and have multiple claims. In this case we need to
verify CLAIM_NO, CLAIMANT_NO, COVEREGE_NAME, EXPOSURE_TYPE,
EXPOSURE_OPEN_DATE, EXPOSURE_CLOSED_DATE, EXPOSURE_STATUS, PAYMENT_DATE.
Date validation: Date values are used in many areas of ETL development:
4. Sometimes the updates and inserts are generated based on the date values.
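The duplicate check described above is typically a GROUP BY ... HAVING COUNT(*) > 1 query. A self-contained sketch with sqlite3, using a simplified, hypothetical claims table:

```python
# Duplicate check via GROUP BY ... HAVING COUNT(*) > 1.
# The claims table is a simplified, hypothetical stand-in for the
# multi-column claim example in the text.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE claims (claim_no TEXT, claimant_no TEXT)")
cur.executemany("INSERT INTO claims VALUES (?, ?)",
                [("C1", "P1"), ("C2", "P1"), ("C1", "P1")])  # C1/P1 loaded twice

dupes = cur.execute(
    "SELECT claim_no, claimant_no, COUNT(*) "
    "FROM claims GROUP BY claim_no, claimant_no HAVING COUNT(*) > 1"
).fetchall()
```

A real check would group by every column that forms the business key; any row returned is a duplicate to investigate.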
Complete data validation:
1. To validate the complete data set in the source and target tables, a minus query (using MINUS and
INTERSECT) is the best solution.
2. We need source minus target and target minus source.
4. We also need to match rows between source and target using an INTERSECT statement.
5. The count returned by INTERSECT should match the individual counts of the source and target
tables.
6. If the minus query returns 0 rows and the intersect count is less than the source count or the target
table count, then we can consider that duplicate rows exist.
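The minus and intersect validation can be demonstrated with sqlite3. Note that SQLite spells the minus operator EXCEPT (MINUS is Oracle's keyword); the tables and rows here are hypothetical:

```python
# Full data validation with set queries: source minus target, target
# minus source, and the intersection. SQLite spells MINUS as EXCEPT
# (MINUS is Oracle's keyword). Tables and rows are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE src (id INTEGER, amount INTEGER)")
cur.execute("CREATE TABLE tgt (id INTEGER, amount INTEGER)")
cur.executemany("INSERT INTO src VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])
cur.executemany("INSERT INTO tgt VALUES (?, ?)", [(1, 100), (2, 200), (3, 300)])

src_minus_tgt = cur.execute("SELECT * FROM src EXCEPT SELECT * FROM tgt").fetchall()
tgt_minus_src = cur.execute("SELECT * FROM tgt EXCEPT SELECT * FROM src").fetchall()
matching = cur.execute("SELECT * FROM src INTERSECT SELECT * FROM tgt").fetchall()
```

Both differences empty plus an intersect count equal to both table counts means the load is complete and duplicate-free, exactly as items 5 and 6 describe.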
Some useful test scenarios:
1. Verify that the extraction process did not extract duplicate data from the source (usually this happens
in repeatable processes where at point zero we need to extract all data from the source file, but during
the next intervals we only need to capture the modified and new rows).
2. The QA team will maintain a set of SQL statements that are automatically run at this stage to validate
that no duplicate data has been extracted from the source systems.
Data cleanness: Unnecessary columns should be deleted before loading into the staging area.