1. Explain the Data Warehouse Development Life Cycle, with a necessary diagram.
Ans :-
The data warehouse life cycle covers two vital areas: warehouse management and data management. The former deals with defining the project activities and gathering requirements. Life Cycle of Data Warehouse Development:
[Diagram: life cycle phases — Define the …, Gather Requirements, …, Implementation]
The metadata framework consists of the following components:
- Meta Data Capture: initial capture of metadata from a variety of sources.
- Meta Data Synchronizer: processes to keep metadata up to date.
- Meta Data Search Engine: the front end for users to search and access metadata.
- Meta Data Results Manager: processes the results of a metadata search and allows the user to make an appropriate selection.
- Meta Data Alerter: as part of a push technology, notifies subscribers about any new changes to the metadata contents, depending on the user profile.
- Meta Data Query Trigger: triggers an appropriate query tool to get data from a data warehouse or any other source, based on the selection made by the user in the Meta Data Results Manager.
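The components above can be sketched in code. The following is a minimal, hypothetical illustration (the class and method names are invented for this example, not a real API): a repository that captures entries, serves searches, and alerts subscribers on changes.

```python
# Hypothetical sketch of the metadata framework components above.
# Names (MetadataRepository, capture, search) are illustrative only.

class MetadataRepository:
    """Holds metadata entries captured from a variety of sources."""

    def __init__(self):
        self.entries = {}          # name -> description (Meta Data Capture)
        self.subscribers = []      # callbacks notified on change (Alerter)

    def capture(self, name, description):
        """Initial capture / synchronizer: add or refresh an entry."""
        changed = self.entries.get(name) != description
        self.entries[name] = description
        if changed:
            for notify in self.subscribers:   # Meta Data Alerter role
                notify(name)

    def search(self, keyword):
        """Meta Data Search Engine: front end for users to find entries."""
        return [n for n, d in self.entries.items()
                if keyword.lower() in (n + " " + d).lower()]

repo = MetadataRepository()
alerts = []
repo.subscribers.append(alerts.append)
repo.capture("sales_fact", "Daily sales measures by product and store")
repo.capture("customer_dim", "Customer dimension with demographics")
print(repo.search("sales"))   # ['sales_fact']
print(alerts)                 # ['sales_fact', 'customer_dim']
```

The Results Manager and Query Trigger would sit on top of `search`, letting the user pick one result and dispatching a query tool for it.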
3. Write briefly about any four ETL tools. What is transformation? Briefly explain the basic transformation types.
Ans: Although an ETL process can be created using almost any programming language, creating one from scratch is quite complex, so companies buy ETL tools to help in the creation of ETL processes. A good ETL tool must be able to communicate with the many different relational databases and read the various file formats used throughout an organization. ETL tools have started to migrate into enterprise application integration, or even enterprise service bus, systems that now cover much more than just the extraction, transformation and loading of data. Many ETL vendors now have data profiling, data quality and metadata capabilities.
ETL Tools:
- PL/SQL
- SAS Data Integrator / SAS Integration Studio
- Ascential DataStage
- Cognos DecisionStream
- Microsoft DTS
- Business Objects Data Integrator
Transformation: Data transformations are often the most complex and, in terms of processing time, the most costly part of the ETL process. They can range from simple data conversions to extremely complex data-scrubbing techniques. The most common transformation types are:
- Format revisions: you will come across these quite often. These revisions include changes to the data types and lengths of individual fields. In your source systems, product package types may be indicated by codes and names in which the fields are numeric and text data types, and the lengths of the package types may vary among the different source systems.
- Decoding of fields: a common type of data transformation. When you deal with multiple source systems, you are bound to have the same data items described by a plethora of field values, e.g. the coding for gender, with one source system using 1 and 2 for Male and Female and another system using M and F.
- Calculated and derived values: the extracted data from the sales system contains sales amounts, sales units and operating cost estimates by product. You will have to calculate the total cost and the profit margin before the data can be stored in the data warehouse.
- Splitting of a single field: earlier legacy systems stored names and addresses of customers and employees in large text fields; the first name, middle initials and last name were stored together in a single field and must be split into separate fields.
- Merging of information: this type of data transformation does not literally mean the merging of several fields to create a single field of data; rather, it refers to combining related pieces of information about an entity from different sources into a single record.
- Character set conversion: the conversion of character sets to an agreed standard character set for textual data in the data warehouse. If you have mainframe legacy systems as source systems, the source data from those systems will be in EBCDIC characters.
- Conversion of units of measurement
- Date/time conversion
- Summarization
- De-duplication
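Three of the transformation types above can be illustrated in a few lines. This is a minimal sketch with invented field names and codes; real ETL transformations would handle nulls, unknown codes, and locale issues.

```python
# Sketch of three transformation types: decoding of fields,
# calculated/derived values, and splitting of a single field.
# Field names and code tables are illustrative.

GENDER_CODES = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}

def transform(record):
    out = {}
    # Decoding of fields: unify gender codes from different sources
    out["gender"] = GENDER_CODES[record["gender"]]
    # Calculated and derived values: profit margin from sales and cost
    out["profit_margin"] = record["sales_amount"] - record["cost"]
    # Splitting of a single field: one large name field into parts
    first, *rest = record["name"].split()
    out["first_name"] = first
    out["last_name"] = rest[-1] if rest else ""
    return out

row = {"gender": "1", "sales_amount": 120.0, "cost": 85.0,
       "name": "John Q Public"}
print(transform(row))
# {'gender': 'Male', 'profit_margin': 35.0,
#  'first_name': 'John', 'last_name': 'Public'}
```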
Multidimensional analysis: It is a data analysis process that groups data into two or more categories: data dimensions and measurements. For example, a data set consisting of the number of wins for a single football team in each of several years is a single-dimensional (in this case, longitudinal) data set. A data set consisting of the number of wins for several football teams in a single year is also a single-dimensional (in this case, cross-sectional) data set. A data set consisting of the number of wins for several football teams over several years is a two-dimensional data set. Two-dimensional data sets are also called panel data. While, strictly speaking, two- and higher-dimensional data sets are multi-dimensional, the term multidimensional tends to be applied only to data sets with three or more dimensions.
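The football example above can be shown concretely: keying wins by (team, year) gives a two-dimensional (panel) data set, and fixing either dimension yields the single-dimensional views. The team names and win counts below are made up.

```python
# Wins per (team, year): a two-dimensional (panel) data set.
# Team names and numbers are invented for illustration.

wins = {
    ("Lions", 2021): 9,  ("Lions", 2022): 11,
    ("Bears", 2021): 6,  ("Bears", 2022): 7,
}

# Fix the team: longitudinal (one team over several years)
longitudinal = {y: w for (t, y), w in wins.items() if t == "Lions"}

# Fix the year: cross-sectional (several teams in one year)
cross_section = {t: w for (t, y), w in wins.items() if y == 2021}

print(longitudinal)    # {2021: 9, 2022: 11}
print(cross_section)   # {'Lions': 9, 'Bears': 6}
```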
code being tested. Integration testing shows how the application fits into the overall flow of all upstream and downstream applications. When creating integration test scenarios, consider how the overall process can break, and focus on touch points between applications rather than within one application. Integration testing will involve the following:
- Sequence of ETL jobs in the batch
- Dependency and sequencing
- Job restart ability
- Initial loading of records, and loading at a later date to verify the newly inserted or updated data
- Testing of rejected records that don't fulfil transformation rules
- Error log generation
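The "dependency and sequencing" check above amounts to ordering the batch so each job runs after the jobs it depends on. A minimal sketch using a topological sort (job names are hypothetical; Python 3.9+ for `graphlib`):

```python
# Order ETL jobs in a batch so dependencies run first.
# Job names are invented; maps each job to the jobs it depends on.
from graphlib import TopologicalSorter

deps = {
    "load_fact":         {"load_dim_product", "load_dim_customer"},
    "load_dim_product":  {"extract_source"},
    "load_dim_customer": {"extract_source"},
    "extract_source":    set(),
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # extract_source comes first, load_fact comes last
```

A scheduler built this way also makes the "job restart" scenario testable: rerunning from a failed job means replaying the suffix of this order.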
[Diagram: ETL testing flow — Business Requirements Testing → QA Team builds Test Plan → Review of HLD → Test Execution]
Here are some main areas of testing that should be done for the ETL process:
- Making sure that all the records in the source system that should be brought into the data warehouse actually are extracted into the data warehouse: no more, no less.
- Making sure that all of the components of the ETL process complete successfully.
- All of the extracted source data is correctly transformed into dimension tables and fact tables.
- All of the extracted and transformed data is successfully loaded into the data warehouse.
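The first check above (every extracted record is loaded, "no more, no less") is typically a reconciliation of row counts between source and warehouse. A sketch using an in-memory SQLite database as a stand-in for both systems; table names are illustrative.

```python
# Reconciliation test: source row count must equal warehouse row count.
# Uses in-memory sqlite as a stand-in; table names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE source_orders (id INTEGER, amount REAL)")
con.execute("CREATE TABLE dw_fact_orders (id INTEGER, amount REAL)")
con.executemany("INSERT INTO source_orders VALUES (?, ?)",
                [(1, 10.0), (2, 20.0), (3, 30.0)])

# Simulate the ETL load step
con.execute("INSERT INTO dw_fact_orders SELECT * FROM source_orders")

src = con.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
dw  = con.execute("SELECT COUNT(*) FROM dw_fact_orders").fetchone()[0]
print(src, dw, src == dw)  # counts match: no more, no less
```

In practice the same idea extends to checksums or sums of key measures, which also catch the transformation and loading errors named in the other checks.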