What is ETL?

ETL stands for Extract, Transform and Load, a process used to collect data from
various sources, transform the data according to business rules and needs, and load the data
into a destination database. The need for ETL arises from the fact that in modern computing
business data resides in multiple locations and in many incompatible formats. For example,
business data might be stored on the file system in various formats (Word documents, PDFs,
spreadsheets, plain text, etc.), as email files, or in various database servers such as
MS SQL Server, Oracle and MySQL. Handling all this business information efficiently is a great
challenge, and ETL plays an important role in solving this problem.
Extract, Transform and Load
The ETL process has three main steps: Extract, Transform and Load.

Extract: The first step in the ETL process is extracting the data from various sources. Each of
the source systems may store its data in a completely different format from the rest. The sources
are usually flat files or RDBMSs, but almost any data storage can be used as a source for an
ETL process.

Transform: Once the data has been extracted and converted into the expected format, it's time
for the next step in the ETL process, which is transforming the data according to a set of
business rules. The data transformation may include various operations, including but not
limited to filtering, sorting, aggregating, joining data, cleaning data, generating calculated
data based on existing values, validating data, etc.

Load: The final ETL step involves loading the transformed data into the destination target,
which might be a database or data warehouse.
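
The three steps can be sketched end to end as follows. This is only a minimal illustration, assuming a small in-memory CSV as the source and an in-memory SQLite database as the target; the data, the table name supplier_sales and the cleaning rules are hypothetical, not taken from any particular ETL tool.

```python
import csv
import io
import sqlite3

# Hypothetical source data: a flat file with inconsistent formatting.
SOURCE_CSV = """name,city,sales
ibm, newark ,100
HP,Palo Alto,250
ibm,Newark,50
"""

def extract(text):
    """Extract: read rows from a flat-file source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: clean the values, then aggregate sales per supplier."""
    totals = {}
    for row in rows:
        key = (row["name"].strip().upper(), row["city"].strip().title())
        totals[key] = totals.get(key, 0) + int(row["sales"])
    return [(name, city, total) for (name, city), total in sorted(totals.items())]

def load(conn, records):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE supplier_sales (name TEXT, city TEXT, total INTEGER)")
    conn.executemany("INSERT INTO supplier_sales VALUES (?, ?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT name, city, total FROM supplier_sales ORDER BY name"
).fetchall()
print(loaded)
```

Keeping each step in its own function makes it possible to test the transformation rules separately from the source and the target.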
ETL Tools
Many of the biggest software players produce ETL tools, including IBM (IBM InfoSphere
DataStage), Oracle (Oracle Warehouse Builder) and of course Microsoft with its SQL Server
Integration Services (SSIS), included in certain editions of Microsoft SQL Server 2005 and
2008.





Date: 10-23-2011
Source (Marketing, Website, Customer Care) -> ETL -> Target
Extract, transform and load is the process in which you extract the data from source
tables, transform it into the desired format based on certain rules, and finally load it into
target tables.
Initial load: it will only happen once in every 24 hours.
Delta load (incremental load):
1. Insert a record
2. Update a record
3. Delete a record
Batch processing: runs every 24 hours.
There's a source database, and the load happens after a certain time interval.
QA database (QA environment) -> staging database -> production database (production
environment)
Backend Testing
The goals of testing an ETL application are:
1. Data completeness: ensures that all the expected data is loaded properly.
2. Data transformation: makes sure that all the data is transformed correctly according
to business rules.
3. Data quality: ensures that the ETL application correctly rejects invalid data, substitutes
default values for it, corrects it, or ignores and reports it.
4. Performance and scalability: ensures that data loads and queries perform within
expected time frames.
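
A data completeness check can be as simple as comparing row counts and listing the rows that never reached the target. The sketch below assumes two hypothetical tables, source_orders and target_orders, held in an in-memory SQLite database; in a real test they would live in the source and target systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_orders (orderid INTEGER, amount INTEGER);
    CREATE TABLE target_orders (orderid INTEGER, amount INTEGER);
    INSERT INTO source_orders VALUES (1000, 10), (1100, 20), (1200, 30);
    INSERT INTO target_orders VALUES (1000, 10), (1100, 20);
""")

def row_count(table):
    # Table names are fixed in this script; they are not user input.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

# Completeness: the row counts should match between source and target.
counts_match = row_count("source_orders") == row_count("target_orders")

# Rows present in the source but missing from the target.
missing = conn.execute("""
    SELECT s.orderid
    FROM source_orders s
    LEFT JOIN target_orders t ON t.orderid = s.orderid
    WHERE t.orderid IS NULL
""").fetchall()
print(counts_match, missing)
```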
The documents needed to start testing:
1. Source-to-target level mapping documents
2. Field-level mapping document (how the fields are mapped): each table has many fields,
and this document shows how each field is mapped from source to target
3. BRDs (business requirements documents)


SSIS - SQL Server Integration Services (Packages UI)
SSAS - SQL Server Analysis Services
SSRS - SQL Server Reporting Services

Advantage of using select queries
The select statement allows you to retrieve records from one or more tables in your database.
Select * From students; (all the columns from the students table)
Select columns From tablename Where condition;

Distinct clause
The distinct clause allows you to remove duplicates from the result set.
Select DISTINCT column
From tablename
Where (condition);
Select DISTINCT city, state
From suppliers;
Where clause
The where clause allows you to filter the results of any SQL statement.
Select columnname
From tablename
Where (condition);
Select supplierid
From supplier
Where suppliername = 'IBM' and suppliercity = 'Newark';
SUM function
The SUM function returns the summed value of an expression.
Select SUM(expression)
From table
Where (condition);
Aggregate functions (SUM, MIN, MAX, COUNT)
Select sum(salary) as 'total salary'
From employees
Where salary > 25000;

Select sum(DISTINCT salary) as 'unique salary'
From employees
Where salary > 25000;
Select min(salary)
From employees
Where salary < 1000;
Select max(salary)
From employees
Where salary > 2500;
Select count(salary)
From employees
Where salary > 1000;
The count function returns the number of rows in a query.
Select count(supplierid)
From supplier;
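
The aggregate functions can be tried out on a small table. The sketch below assumes a hypothetical employees table in an in-memory SQLite database; the names and salaries are made up. It also shows that count(column) skips NULL values while count(*) counts every row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Ann", 30000), ("Bob", 30000), ("Cy", 40000), ("Di", 20000), ("Ed", None)],
)

# All aggregates over the whole table.
overall = conn.execute("""
    SELECT SUM(salary), MIN(salary), MAX(salary), COUNT(salary), COUNT(*)
    FROM employees
""").fetchone()

# SUM versus SUM(DISTINCT ...) with a where clause applied first.
high = conn.execute("""
    SELECT SUM(salary), SUM(DISTINCT salary)
    FROM employees
    WHERE salary > 25000
""").fetchone()
print(overall, high)
```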
Between condition
The between condition allows you to retrieve the values within a range. Both end values are
included in the range.
Select columnname
From tablename
Where column between value1 and value2;
Select *
From supplier
Where supplierid between 5000 and 10000;
This is equivalent to:
Select *
From supplier
Where supplierid >= 5000 and supplierid <= 10000;
NOT Between
Select *
From supplier
Where supplierid NOT between 5000 and 10000;
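
The boundary behaviour is easy to verify. This is a small sketch on a hypothetical supplier table in SQLite, with ids chosen to sit just inside and just outside the range; the helper ids() exists only for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE supplier (supplierid INTEGER)")
conn.executemany(
    "INSERT INTO supplier VALUES (?)",
    [(4999,), (5000,), (7500,), (10000,), (10001,)],
)

def ids(where):
    # The where text is fixed in this script; it is not user input.
    sql = f"SELECT supplierid FROM supplier WHERE {where} ORDER BY supplierid"
    return [row[0] for row in conn.execute(sql)]

between = ids("supplierid BETWEEN 5000 AND 10000")
explicit = ids("supplierid >= 5000 AND supplierid <= 10000")
outside = ids("supplierid NOT BETWEEN 5000 AND 10000")
print(between, explicit, outside)
```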

IN Function
The IN function reduces the need for multiple OR conditions.
Select column
From table
Where column1 in (value1, value2, ...);

Select *
From supplier
Where suppliername in ('IBM', 'Microsoft', 'HP', 'TCS');
Select *
From orders
Where orderid in (1000, 1100, 1200, 1300);
Wildcards
LIKE condition
The like condition allows you to use wildcards in the where clause of a SQL statement.
The like condition can be used in any valid SQL statement.
Select *
From suppliers
Where suppliername like 'Harry%'; (starts with Harry, with the % after the name)
To match a name that contains Harry anywhere, including at the start or the end, use '%Harry%'.
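
The placement of the % wildcard decides what matches. The sketch below uses a hypothetical suppliers table in SQLite with made-up names; it also shows the _ wildcard, which matches exactly one character.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (suppliername TEXT)")
conn.executemany(
    "INSERT INTO suppliers VALUES (?)",
    [("Harry Ltd",), ("Old Harry",), ("Big Harry Co",), ("IBM",)],
)

def names_like(pattern):
    sql = ("SELECT suppliername FROM suppliers "
           "WHERE suppliername LIKE ? ORDER BY suppliername")
    return [row[0] for row in conn.execute(sql, (pattern,))]

starts = names_like("Harry%")     # starts with Harry
ends = names_like("%Harry")       # ends with Harry
contains = names_like("%Harry%")  # Harry anywhere in the name
one_char = names_like("IB_")      # IB followed by exactly one character
print(starts, ends, contains, one_char)
```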
Group by clause
The group by clause can be used in a select statement to collect data across multiple records
and group the results by one or more columns.
Select column1, column2, aggregatefunction(expression)
From tables
Where (condition)
Group by column1, column2;
Select deptname, deptaddress, sum(totalsales)
From department
Where (condition)
Group by deptname, deptaddress;
Select deptno, deptname, min(salary)
From department
Where (condition)
Group by deptno, deptname;
Select max(salary)
From department
Group by department;

Having clause
The having clause is used in combination with the group by clause. It can be used in a select
statement to filter the records that a group by returns. Having should always come after the
group by clause.
Select dept, sum(sales)
From suppliers
Group by dept
Having sum(sales) > 30000;
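
Grouping and then filtering the groups can be seen side by side in the sketch below. It assumes a hypothetical suppliers table with dept and sales columns in SQLite; the figures are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suppliers (dept TEXT, sales INTEGER)")
conn.executemany(
    "INSERT INTO suppliers VALUES (?, ?)",
    [("A", 20000), ("A", 15000), ("B", 10000), ("B", 5000), ("C", 40000)],
)

# Group by: one result row per department.
grouped = conn.execute("""
    SELECT dept, SUM(sales) FROM suppliers
    GROUP BY dept ORDER BY dept
""").fetchall()

# Having: keep only the groups whose total exceeds 30000.
filtered = conn.execute("""
    SELECT dept, SUM(sales) FROM suppliers
    GROUP BY dept
    HAVING SUM(sales) > 30000
    ORDER BY dept
""").fetchall()
print(grouped, filtered)
```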
Order by clause
The order by clause allows you to sort the records in a result set.
Select column
From table
Where condition
Order by column;

Select suppliercity
From supplier
Where suppliername = 'IBM'
Order by suppliercity DESC;


JOINS
A join is used to combine rows from multiple tables. A join is performed whenever two or more
tables are listed in the FROM clause of a SQL statement.
Inner joins
Select supplier.supplierid, supplier.suppliername, orders.orderdate
From supplier, orders
Where supplier.supplierid = orders.supplierid;
Outer joins
This type of join returns all rows from one table and only those rows from a secondary table
where the joined fields are equal.
Left outer join
Select supplier.supplierid, supplier.suppliername, orders.orderdate
From supplier, orders
Where supplier.supplierid = orders.supplierid(+);
This SQL statement would return all the rows from the supplier table and only those rows from
the orders table where the joined fields are equal. (The (+) marker is Oracle's outer join
operator. The table is named orders here because ORDER is a reserved word.)
Right outer join
Select supplier.supplierid, supplier.suppliername, orders.orderdate
From supplier, orders
Where supplier.supplierid(+) = orders.supplierid;
This SQL statement would return all the rows from the orders table and only those rows from the
supplier table where the joined fields are equal.
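
The same joins can be written with the ANSI JOIN ... ON syntax, which most databases accept. The sketch below is an illustration on SQLite with made-up rows; the table name orders and the sample data are assumptions. Because older SQLite versions lack RIGHT JOIN, the right outer join is expressed here as a left outer join with the two tables swapped, which returns the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplierid INTEGER, suppliername TEXT);
    CREATE TABLE orders (orderid INTEGER, supplierid INTEGER);
    INSERT INTO supplier VALUES (1, 'IBM'), (2, 'HP'), (3, 'TCS');
    INSERT INTO orders VALUES (100, 1), (101, 1), (102, 2), (103, 9);
""")

# Inner join: only rows with a match in both tables.
inner = conn.execute("""
    SELECT s.suppliername, o.orderid
    FROM supplier s INNER JOIN orders o ON s.supplierid = o.supplierid
    ORDER BY o.orderid
""").fetchall()

# Left outer join: every supplier, with NULL where no order matches.
left = conn.execute("""
    SELECT s.suppliername, o.orderid
    FROM supplier s LEFT OUTER JOIN orders o ON s.supplierid = o.supplierid
    ORDER BY s.supplierid, o.orderid
""").fetchall()

# Right outer join, written as a left join with the tables swapped:
# every order, with NULL where no supplier matches.
right = conn.execute("""
    SELECT s.suppliername, o.orderid
    FROM orders o LEFT OUTER JOIN supplier s ON s.supplierid = o.supplierid
    ORDER BY o.orderid
""").fetchall()
print(inner, left, right)
```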





Insert statement
The insert statement allows you to insert a single record or multiple records into a table.
Insert into tablename
(column1, column2, column3)
Values
(value1, value2, value3);
Insert into suppliers
(supplierid, suppliername, supplierdpt)
Values
(112345, 'IBM', 231245);
Update statement
The update statement allows you to update a single record or multiple records in a table.
Update tablename
Set column = expression
Where (condition);
Update supplier
Set suppliername = 'IBM'
Where suppliername = 'HP';
Delete statement
The delete statement allows you to delete a single record or multiple records from a table.
The three data modification statements start with these keywords:
Insert into
Update ... set
Delete from

Delete from tablename
Where suppliername = 'IBM';


Union query
The union query allows you to combine the result sets of two or more select queries. It removes
the duplicate rows between the various select statements.
Select field1, field2
From tableA
Union
Select field1, field2
From tableB;
Union all
Union all keeps every row from each select statement, including the duplicates.
Select field1, field2
From tableA
Union all
Select field1, field2
From tableB;
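
The difference shows up as soon as the two tables share a row. A short sketch on SQLite, assuming two hypothetical tables, tableA and tableB, that have one row in common:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tableA (field1 INTEGER, field2 TEXT);
    CREATE TABLE tableB (field1 INTEGER, field2 TEXT);
    INSERT INTO tableA VALUES (1, 'x'), (2, 'y');
    INSERT INTO tableB VALUES (2, 'y'), (3, 'z');
""")

# Union: the shared row (2, 'y') appears once.
union = conn.execute("""
    SELECT field1, field2 FROM tableA
    UNION
    SELECT field1, field2 FROM tableB
    ORDER BY field1
""").fetchall()

# Union all: the shared row appears twice.
union_all = conn.execute("""
    SELECT field1, field2 FROM tableA
    UNION ALL
    SELECT field1, field2 FROM tableB
    ORDER BY field1
""").fetchall()
print(union, union_all)
```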







Normalization
Normalization is the process of organizing the data in a relational database in order to
minimize redundancy. It involves dividing a database into two or more tables and defining the
relationships between the tables.
De-normalization:
It is the process of attempting to optimize the performance of a database by adding redundant
data. De-normalization is a technique for moving from higher to lower normal forms of a
database in order to speed up database access.

1. Star Schema:
2. Vault Methodology:

What is a stored procedure?
A stored procedure is a group of SQL statements that have been previously created and stored in
the server database. Stored procedures accept input parameters, so a single procedure can be
used over the network by several clients using different input data.
Stored procedures reduce network traffic and improve performance. They can also be used to help
ensure the integrity of the database.
What is a trigger?
A trigger is a SQL procedure that initiates an action when an event (insert, update, delete)
occurs. A trigger cannot be called or executed directly; the DBMS automatically fires the
trigger as a result of a data modification to the associated table. Triggers can be viewed as
similar to stored procedures. The difference is that stored procedures are explicitly executed
by invoking a call to the procedure, while triggers are implicitly executed.
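
The implicit execution of a trigger can be observed with a small audit example. The sketch below uses SQLite's trigger syntax, which differs in detail from SQL Server's; the tables supplier and audit_log and the trigger name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplierid INTEGER PRIMARY KEY, suppliername TEXT);
    CREATE TABLE audit_log (supplierid INTEGER, action TEXT);

    -- Fired automatically by the database after each insert on supplier.
    CREATE TRIGGER log_supplier_insert AFTER INSERT ON supplier
    BEGIN
        INSERT INTO audit_log VALUES (NEW.supplierid, 'insert');
    END;
""")

# Only the supplier table is touched here; the trigger writes the audit row.
conn.execute("INSERT INTO supplier VALUES (1, 'IBM')")
audit = conn.execute("SELECT supplierid, action FROM audit_log").fetchall()
print(audit)
```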

What is a view?
A simple view can be thought of as a subset of a table. It can be used for retrieving data as
well as updating or deleting rows. The results of using a view are not permanently stored in the
database.

What is an index?
An index is a physical structure containing pointers to the data. Indexes are created to locate
rows more quickly and efficiently. Users cannot see indexes; they are just used to speed up
queries. Effective indexes are one of the best ways to improve performance in a database
application.

Cursor
It is a database object used by applications to manipulate data on a row-by-row basis,
instead of the typical SQL commands that operate on all the rows in the set at one time.
Steps
1. Declare the cursor
2. Open the cursor
3. Fetch the row from the cursor
4. Process that fetched row
5. Close the cursor
6. Deallocate the cursor
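
The row-by-row pattern can be sketched with the cursor object in Python's sqlite3 module. This is only a loose analogue of a server-side SQL cursor: creating the cursor and executing the query stand in for declare and open, and there is no separate deallocate step. The employees table and the 10% raise are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)", [("Ann", 1000), ("Bob", 2000)]
)

cur = conn.cursor()                                               # declare
cur.execute("SELECT name, salary FROM employees ORDER BY name")   # open

raises = []
while True:
    row = cur.fetchone()                                          # fetch one row
    if row is None:
        break                                                     # no rows left
    name, salary = row
    raises.append((name, salary * 110 // 100))                    # process the row

cur.close()                                                       # close
print(raises)
```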
What is the difference between a function and a stored procedure?
User-defined functions (UDFs) can be used in SQL statements anywhere in the
WHERE/HAVING/SELECT sections, whereas stored procedures cannot.
What is a subquery? What is the advantage of using a subquery?
Subqueries are often referred to as sub-select statements, as they allow SELECT statements to
be executed within the body of another SQL statement. Subqueries are generally used to return a
single value, though they may also be used to compare against multiple values.
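
Both uses of a subquery can be shown on two small tables. The sketch below runs on SQLite; the supplier and orders tables and their rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplierid INTEGER, suppliername TEXT);
    CREATE TABLE orders (orderid INTEGER, supplierid INTEGER, amount INTEGER);
    INSERT INTO supplier VALUES (1, 'IBM'), (2, 'HP'), (3, 'TCS');
    INSERT INTO orders VALUES (100, 1, 500), (101, 2, 900), (102, 1, 700);
""")

# Subquery returning a single value: orders above the average amount.
above_avg = conn.execute("""
    SELECT orderid FROM orders
    WHERE amount > (SELECT AVG(amount) FROM orders)
""").fetchall()

# Subquery returning multiple values: suppliers with at least one order.
with_orders = conn.execute("""
    SELECT suppliername FROM supplier
    WHERE supplierid IN (SELECT supplierid FROM orders)
    ORDER BY suppliername
""").fetchall()
print(above_avg, with_orders)
```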
What are primary keys and foreign keys?
Primary keys are unique identifiers. For each row they must contain unique values and cannot be
null. The primary key is the most fundamental key, and a table can have only one primary key. A
foreign key ensures referential integrity and maintains the relationship between tables. A
table can have more than one foreign key.
Inner join: a join that displays only the rows that have a match in both joined tables is known
as an inner join.
Left outer join: in a left outer join, all the rows in the left table appear; unmatched rows in
the right table do not appear.
Right outer join: in a right outer join, all the rows in the right table appear; unmatched rows
in the left table do not appear.
Self join: this is a particular case in which one table joins to itself. A self join is unique
in that it involves a relationship within only one table. Example: a company has a hierarchical
reporting structure in which one member of the staff reports to another member.
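
The reporting-structure example can be sketched as a self join in which the same table is given two aliases, one for the employee and one for the manager. The staff table, its columns and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staff (id INTEGER, name TEXT, managerid INTEGER);
    INSERT INTO staff VALUES
        (1, 'Ann', NULL), (2, 'Bob', 1), (3, 'Cy', 1), (4, 'Di', 2);
""")

# e is the employee row and m is the manager row of the same table.
reports = conn.execute("""
    SELECT e.name, m.name
    FROM staff e JOIN staff m ON e.managerid = m.id
    ORDER BY e.id
""").fetchall()
print(reports)
```

Ann has no manager, so the inner join leaves her out of the result; a left outer join would keep her with a NULL manager.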
The difference between a clustered and a nonclustered index.
Clustered index: a special type of index that reorders the way records in the table are
physically stored. Therefore a table can have only one clustered index.
The physical order is always fixed; it remains constant.
The logical order is given by a logical id such as 1000, 1001. The physical id stays the same,
but the logical id can be changed.
A nonclustered index is a special type of index in which the logical order of the index doesn't
match the physical order in which the rows are stored in the database.
Difference between a primary key and a unique key.
Both primary and unique keys enforce the uniqueness of the column on which they are defined,
but by default a primary key creates a clustered index on the column, whereas a unique key
creates a nonclustered index by default. A primary key doesn't allow NULLs, but a unique key
allows one NULL value.

Difference between delete and truncate.
The delete command removes rows from a table based on the conditions that we provide in a
where clause. Truncate removes all the rows from a table, and there will be no data left
after we run the truncate command. Truncate is faster and delete is a little slower, because
delete removes the rows one at a time and records an entry in the transaction log for each.
Truncate is a DDL command. Delete is a DML command.
Truncate resets the identity of the table, but delete does not reset the identity of the table.

What is the difference between having and where?
Having is typically used with a group by clause. When group by is not used, having behaves like
a where clause.
Automated (scheduled) jobs are known as cron jobs.
SQL Server Agent plays an important role in the day-to-day tasks of a database administrator. It
allows you to schedule jobs and run scripts.
Bulk Copy is a tool used to copy huge amounts of data from tables.
BULK INSERT: the bulk insert command imports a data file into a database table in a
user-specified format.
Copying data from one table to another table.
There are two ways to do it.
1. INSERT INTO SELECT
This method is used when the target table has already been created in the database and the data
is to be inserted into it from another table.
2. SELECT INTO
This method is used when the target table has not been created yet and needs to be created as
the data from the other table is inserted into it.
The new table is created with the same data types as the selected columns.
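
Both ways of copying can be sketched as follows. The example runs on SQLite, which does not support SELECT INTO; CREATE TABLE ... AS SELECT is used here as the closest counterpart, since it also creates the new table from the selected columns. The table names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE supplier (supplierid INTEGER, suppliername TEXT);
    INSERT INTO supplier VALUES (1, 'IBM'), (2, 'HP');

    -- 1. INSERT INTO SELECT: the target table already exists.
    CREATE TABLE supplier_archive (supplierid INTEGER, suppliername TEXT);
    INSERT INTO supplier_archive (supplierid, suppliername)
        SELECT supplierid, suppliername FROM supplier;

    -- 2. Stand-in for SELECT INTO: the target table is created by the copy.
    CREATE TABLE supplier_copy AS
        SELECT supplierid, suppliername FROM supplier;
""")

archive = conn.execute(
    "SELECT * FROM supplier_archive ORDER BY supplierid"
).fetchall()
copy = conn.execute(
    "SELECT * FROM supplier_copy ORDER BY supplierid"
).fetchall()
print(archive, copy)
```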
Difference between UNION and UNION ALL
