
SQL Server Integration Services: An Introduction - Part 1
Introduction
SQL Server Integration Services (SSIS) is a platform for building high-performance data
integration and workflow solutions. It lets you create packages (SSIS packages) made up of
tasks that can move data from source to destination and, if necessary, alter it on the way.

SSIS is primarily an ETL (Extraction, Transformation, and Load) tool whose main purpose is the
extraction, transformation, and loading of data, but it can be used for several other purposes
as well, for example to automate maintenance of SQL Server databases or to update
multidimensional cube data.

SSIS is a component of SQL Server 2005/2008 and is the successor to DTS (Data Transformation
Services), which shipped with SQL Server 7.0/2000. Though from an end-user perspective DTS and
SSIS look similar to some extent, in reality they are quite different: SSIS has been completely
rewritten from scratch and hence overcomes several limitations of DTS. The list of differences
between DTS and SSIS is quite large, but one thing to note here is that the internal
architecture of SSIS is completely different. It segregates the Data Flow Engine from the
Control Flow Engine (the SSIS runtime engine) and hence improves performance severalfold. The
architecture and internals of SSIS are beyond the scope of this introductory article; I will
cover them in my next article series, "SQL Server Integration Services - An Inside View".

Note:
In this article, "SSIS 2008" refers to the SSIS version that ships with SQL Server 2008,
whereas "SSIS 2005" refers to the SSIS version that ships with SQL Server 2005.

Creating an SSIS Package


There are three ways to create SSIS packages:

• The Import and Export Wizard – Though this is one of the simplest ways to create an SSIS
package, it has very limited capability: you cannot define any transformations along the way
(though with SSIS 2008, you can choose to include a Data Conversion Transformation if there
is a data type mismatch between source and destination). It is used mainly for simple data
transfers from source to destination. For more details, refer to the "Import and Export"
section later in this article.
• The SSIS Designer – The SSIS Designer is hosted inside the Business Intelligence
Development Studio (BIDS) as part of an Integration Services project. It is a graphical
tool that you can use to create and maintain Integration Services packages. It has a
toolbox containing the various items needed for the control flow and the Data Flow Task, as
well as tasks needed for maintenance plans. The number of tasks in SSIS is much larger
than what was available in DTS. For more details, refer to the "SSIS Designer" section
later in this article.
• SSIS API Programming – SSIS provides an API object model, which you can use from your
programming language of choice to create SSIS packages programmatically. For more
details, refer to the "SSIS API Programming" section later in this article.
SSIS Components
As we learned in the introduction, SSIS allows creating packages which are made up of tasks
that can move data from source to destination and, if necessary, alter it on the way. Within an
SSIS package we define the workflow, and the SSIS runtime engine ensures that the tasks inside
the package are executed in the order defined; in other words, it maintains the workflow of
tasks inside the package. So it's now time to have a look at the different
tasks/components/executables of a package.

Package
A package is a collection of tasks executed in an orderly fashion by the SSIS runtime
engine. It is an XML file, which can be saved to SQL Server or to the file system. A package
can be executed by a SQL Server Agent job, by the DTEXEC command-line utility bundled with
SSIS (DTEXECUI is a similar utility with a GUI), from the BIDS environment, or by calling one
package from another package (which achieves a modular approach). You can use the DTUTIL
utility to move a package from the file system to SQL Server or vice versa, or you can use the
undocumented sp_dts_getpackage/sp_ssis_getpackage and sp_dts_putpackage/sp_ssis_putpackage
stored procedures, which reside in the msdb system database.
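For example, a minimal sketch of these utilities from a command prompt; the package name,
server, and paths below are hypothetical:

:: Run a package saved on the file system
dtexec /F "C:\Packages\MyPackage.dtsx"

:: Run a package stored in msdb on a given server
dtexec /SQL "MyPackage" /SERVER "localhost"

:: Copy a package from the file system into msdb
dtutil /FILE "C:\Packages\MyPackage.dtsx" /COPY SQL;MyPackage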

Control Flow
It handles the main workflow of the package and determines the processing sequence within the
package. It consists of containers, different kinds of workflow tasks, and precedence constraints.

Control Flow Tasks


A task is an individual unit of work. SSIS provides several inbuilt control flow tasks that
perform a variety of workflow actions. They provide functionality to the package in much the
same way that a method does in a programming language. All the inbuilt tasks are operational
tasks except the Data Flow Task. There are several dozen inbuilt tasks to use, but if required
you can extend them and write your own custom tasks using VB.NET, C#, etc., as sketched below.
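For illustration, here is a minimal sketch of a custom task in C#. The class name, display
name, and do-nothing logic are hypothetical, and a real custom task must also be strong-named
and deployed where SSIS can find it:

using Microsoft.SqlServer.Dts.Runtime;

//A hypothetical custom task that simply reports success
[DtsTask(DisplayName = "My Custom Task", Description = "A sample custom control flow task")]
public class MyCustomTask : Task
{
    public override DTSExecResult Validate(Connections connections,
        VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
        IDTSLogging log)
    {
        //Verify the task's configuration here; report problems through componentEvents
        return DTSExecResult.Success;
    }

    public override DTSExecResult Execute(Connections connections,
        VariableDispenser variableDispenser, IDTSComponentEvents componentEvents,
        IDTSLogging log, object transaction)
    {
        //The individual unit of work goes here
        return DTSExecResult.Success;
    }
}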

Containers
A container groups a variety of package components (including other containers), affecting
their scope, sequence of execution, and mutual interaction. Containers are used to create
logical groups of tasks. There are basically four types of containers in SSIS, as given below:
• Task Host Container - The default container; every task falls into it.
• Sequence Container - Defines a subset of the overall package control flow.
• For Loop Container - Defines a repeating control flow in a package.
• ForEach Loop Container - Enumerates through a collection; for example, you would use it
to process each record of a record-set.
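For illustration, a minimal sketch of creating a Sequence container programmatically and
nesting a task inside it. This is a fragment assuming a using Microsoft.SqlServer.Dts.Runtime;
directive; the names are hypothetical, and STOCK monikers are explained in the SSIS API
section later:

Package pkg = new Package();
//"STOCK:SEQUENCE" is the moniker for the Sequence container (names here are hypothetical)
Sequence seq = (Sequence)pkg.Executables.Add("STOCK:SEQUENCE");
seq.Name = "My Sequence";
//Tasks added to the container's own Executables collection execute within its scope
Executable child = seq.Executables.Add("STOCK:ScriptTask");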
Precedence Constraints
Precedence constraints link the items in your package into a logical flow and specify the
conditions upon which the items are executed; they define the ordinal relationship between the
items in the package and thereby control the order in which tasks execute.
In other words, they define links among containers and tasks and evaluate conditions that
determine the sequence in which they are processed. More specifically, they provide the
transition from one task or container to another.

The condition can be a constraint, an expression, or both. The constraint can be
Success (green line), Failure (red line), or Completion (blue line). For example, in the
package image below, Script Task 1 will be executed only if the Execute SQL Task completed
successfully. Script Task 2 will be executed irrespective of whether the Execute SQL Task
succeeded or failed. Script Task 3 will be executed only if the Execute SQL Task failed.

Apart from the constraints discussed above, you can also define an expression as a condition
on a precedence constraint; it is evaluated at runtime, and the transition is decided based
on its value. For example, in the image below, after Task A, Task B will be executed if
X >= Z, or Task C will be executed if X < Z. You can also combine a constraint and an
expression in a single condition with either an AND or an OR operator, as sketched below.
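For illustration, a minimal sketch of wiring two tasks with such a combined condition
programmatically. This is a fragment assuming a using Microsoft.SqlServer.Dts.Runtime;
directive; the two Script Tasks and the user variables @X and @Z are hypothetical:

Package pkg = new Package();
Executable taskA = pkg.Executables.Add("STOCK:ScriptTask");
Executable taskB = pkg.Executables.Add("STOCK:ScriptTask");

//taskB runs only if taskA succeeds AND the expression is true (@X and @Z are hypothetical variables)
PrecedenceConstraint pc = pkg.PrecedenceConstraints.Add(taskA, taskB);
pc.Value = DTSExecResult.Success;                        //the Success (green line) constraint
pc.EvalOp = DTSPrecedenceEvalOp.ExpressionAndConstraint; //combine constraint AND expression
pc.Expression = "@X >= @Z";                              //evaluated at runtime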

Variables
The concept of a variable in SSIS is the same as in any other programming language. Variables
provide temporary storage for parameters whose values can change from one package execution to
another, accommodating package reusability; in other words, they are used to dynamically
configure a package at runtime, for example when you want to execute the same T-SQL statement
or script against a different set of connections.
Depending on where a variable is defined, its scope varies: variables can be declared at the
package level, container level, task level, event handler level, etc.

In SSIS, there are two types of variables: system (pre-defined) variables, whose values are set
by SSIS (ErrorCode, ErrorDescription, MachineName, PackageName, StartTime, etc.) and cannot be
changed, and user variables, which are created as needed during package development and can be
assigned values of the corresponding type.
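For illustration, a minimal sketch of creating and reading a user variable programmatically.
This is a fragment assuming a using Microsoft.SqlServer.Dts.Runtime; directive; the variable
name and value are hypothetical:

Package pkg = new Package();
//Add a read-write variable in the User namespace with an initial value (hypothetical name/value)
pkg.Variables.Add("MyConnectionString", false, "User",
    "Data Source=localhost;Initial Catalog=AdventureWorks;Integrated Security=SSPI;");
//Read it back using its qualified name
string value = pkg.Variables["User::MyConnectionString"].Value.ToString();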

Note:
An exception applies here: there is a system variable called "Propagate" whose value you can
change from its default of TRUE to FALSE to stop an event from bubbling from a task to its
parent and grandparent. Refer to the next article in this series, "SQL Server Integration
Services - Features and Properties", which discusses this in more detail.

Connection Managers
A connection manager is a logical representation of a connection. SSIS provides different
types of connection managers, which use different data providers and enable packages to
connect to a variety of data sources and servers.

A package can have multiple instances of connection managers, and one connection manager can
be used by multiple tasks in the package.

A few examples of connection managers are:

• ADO Connection Manager – Connects to ActiveX Data Objects (ADO) objects.
• ADO.NET Connection Manager – Connects to a data source by using a .NET provider.
• OLE DB Connection Manager – Connects to a data source by using an OLE DB provider.
• Flat File Connection Manager – Connects to data in a single flat file.
• FTP Connection Manager – Connects to an FTP server.
By default, every task that uses a connection manager opens a connection during execution,
performs its operation, and closes the connection before the next task runs; that is, each
task gets its own connection. Consider a scenario where three tasks in a package use the same
connection manager: at runtime, three connections would be opened and closed at the source.
If instead you want all three tasks to execute over a single connection to the source,
irrespective of how many tasks use the connection manager, one of its properties comes to
your rescue: RetainSameConnection. Setting RetainSameConnection to TRUE on, for example, an
OLE DB connection manager enables multiple tasks to run over a single connection, as sketched
below.
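For illustration, a minimal sketch of adding an OLE DB connection manager and enabling
RetainSameConnection programmatically. This is a fragment assuming a using
Microsoft.SqlServer.Dts.Runtime; directive; the connection string is hypothetical:

Package pkg = new Package();
ConnectionManager cm = pkg.Connections.Add("OLEDB");
cm.Name = "MyOleDbConnection";
//A hypothetical connection string; adjust provider/server/database for your environment
cm.ConnectionString =
    "Provider=SQLNCLI10;Data Source=localhost;Initial Catalog=AdventureWorks;Integrated Security=SSPI;";
//RetainSameConnection is exposed through the connection manager's properties collection
cm.Properties["RetainSameConnection"].SetValue(cm, true);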
In the next article in this series we will look at Data Flow in SQL Server Integration Services.
SQL Server Integration Services: An Introduction - Part 2
Data Flow
The Data Flow Task (DFT), using the SSIS pipeline engine, manages the flow of data from the
data source adapters to the data destination adapters and lets the user transform, clean, and
modify data on the way.

Note:
The Data Flow Task (DFT) is not a separate component outside the control flow; it is placed
inside the control flow. I have given it its own heading/section to emphasize it, as it is the
most important task in SSIS.

A DFT can include multiple data flows. If a task copies several sets of data, and the order in
which the data is copied is not significant, it can be more convenient to include multiple data
flows in a single DFT. In the first image below, the DFT has one data flow with two
transformations on the way before writing to the destination, whereas in the second image the
DFT has two data flows: the first has two transformations and the second has three.

Transformations
A transformation changes the data to a desired format. It performs modifications to data
through a variety of operations, such as aggregation (e.g. averages or sums), merging (of
multiple input data sets), distribution (to different outputs), data type conversion, or
reference table lookups (using exact or fuzzy comparisons). Below are some inbuilt
transformations:

• Derived Column Transformation – It creates new column values by applying expressions
to transformation input columns. The result can be added as a new column or inserted
into an existing column as a replacement value.
• Lookup Transformation – It performs lookups by joining data in input columns with
columns in a reference dataset. It is usually used in a scenario where you have a subset of
a master data set and you want to pull related transaction records.
• Union All Transformation – It combines multiple inputs and produces the UNION ALL of these
result-sets.
• Merge Transformation – It combines two sorted datasets into a single sorted dataset and
is similar to the Union All transformation. Use the Union All transformation instead of
the Merge transformation if the inputs are not sorted, the combined output does not need
to be sorted, or the transformation has more than two inputs.
• Merge Join Transformation – It provides an output that is generated by joining two sorted
datasets using a FULL, LEFT, or INNER join.
• Conditional Split Transformation – It can route data rows to different outputs depending
on the content of the data. The implementation of the Conditional Split transformation is
similar to a CASE decision structure in a programming language. The transformation
evaluates expressions, and based on the results, directs the data row to the specified
output. This transformation also provides a default output, so that if a row matches no
expression it is directed to the default output.
• Multicast Transformation – It distributes its input to one or more outputs. This
transformation is similar to the Conditional Split transformation. Both transformations
direct an input to multiple outputs. The difference between the two is that the Multicast
transformation directs every row to every output, and the Conditional Split directs a row
to a single output.
Several inbuilt transformations are available inside SSIS Designer, but if required you can
extend them and write your own custom transformations, as sketched below.
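For illustration, a minimal sketch of a custom transformation in C#. The class name, display
name, and upper-casing logic are hypothetical, and a real component must also be strong-named
and deployed where SSIS can find it:

using Microsoft.SqlServer.Dts.Pipeline;

//A hypothetical transformation that upper-cases the first string column of each row
[DtsPipelineComponent(DisplayName = "Upper Case Transform",
    ComponentType = ComponentType.Transformation)]
public class UpperCaseTransform : PipelineComponent
{
    public override void ProcessInput(int inputID, PipelineBuffer buffer)
    {
        while (buffer.NextRow())
        {
            buffer.SetString(0, buffer.GetString(0).ToUpper());
        }
    }
}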
Data Paths
A Data Path connects data flow components inside a DFT. Though it looks like a precedence
constraint in the control flow, it is not the same: a data path shows the flow of data from one
DFT component to another, whereas a precedence constraint shows the control flow, or ordinal
relationship, between control flow tasks.

A Data Path carries the metadata of the data flowing through it, for example the column names,
types, sizes, etc.

While debugging you can attach a data viewer to a Data Path to see the data flowing through that
data path.

Note:
The data viewer shows the data of one buffer at a time; you can click the next button to see
the data from the next buffer. (I will discuss SSIS buffer management in my next article,
"SQL Server Integration Services - An Inside View".)

Data Source Adapters


Data source adapters, or simply source adapters, facilitate the retrieval of data from various
data sources. They use connection managers, which in turn use different data providers, to
connect to heterogeneous sources, for example flat files, OLE DB sources, .NET Framework data
providers, etc.
Data Destination Adapters
Data destination adapters, or simply destination adapters, load the output of the data flow
into target stores, such as flat files, databases, or in-memory ADODB record-sets. Similar to
source adapters, they use connection managers, which in turn use different data providers, to
connect to heterogeneous destinations. Now let's talk about some of the properties/settings in
detail; I am taking the OLE DB destination as an example, as it is one of the most used.

• Data Access Mode – It defines the method used to upload data into the destination. The
fast load option uses a BULK INSERT statement instead of the row-by-row INSERT statements
used when it is not specified.
• Keep Identity – If selected, the identity values from the source are preserved and loaded
into the destination table; otherwise the destination table generates its own identity
values if there is a column of identity type.
• Keep Nulls – If selected, the NULL values from the source are preserved and loaded into
the destination table; otherwise, if a destination column has a default constraint defined
and NULL comes from the source for that column, the default value is inserted into the
destination table.
• Table Lock – If selected, a TABLOCK is acquired on the table during the data upload. This
is the recommended option if the table is not being used by any other application at the
time of the upload, as it removes the overhead of lock escalation.
• Check Constraints – If selected, the pipeline engine checks the table constraints for
incoming data and fails if they are violated. The recommendation is to uncheck this setting
when constraint checking is not required, as that improves performance.
• Rows per batch – A blank text box indicates the default value of -1, which means all
incoming rows are treated as one batch. You can specify a positive integer N to direct the
pipeline engine to break the incoming rows into batches of N rows. In other words, it
specifies the number of rows in a batch.
• Maximum insert commit size – You can specify the batch size that the OLE DB destination
tries to commit during fast load operations; it splits up the chunks of data as they are
inserted into the destination. If you provide a value for this property, the destination
commits rows in batches that are the smaller of (a) the maximum insert commit size, or
(b) the remaining rows in the buffer that is currently being processed. For example, if a
buffer holds 50,000 rows and the maximum insert commit size is 10,000, that buffer is
committed in five batches of 10,000 rows each.

Note:
It's good practice to set values for the above two settings, because a very large batch, or
leaving the defaults, can negatively affect memory and especially tempdb. The recommendation
is to test your scenario and specify optimum values for these settings depending on your
environment and workload.

In the next article in this series we will look at the Import and Export Wizard in SQL Server
Integration Services.
SQL Server Integration Services: An Introduction - Part 3
This article is part 3 of a 4-part series that introduces SQL Server Integration Services
(SSIS). It shows how to use the Import and Export Wizard.

The Import and Export Wizard provides the simplest method of copying data between data
sources and of constructing basic packages. But it has a major limitation: it does not allow
transformations along the way (though with SSIS 2008, you can choose to include a Data
Conversion Transformation if there is a data type mismatch between source and destination).
As a workaround you can pull data from the source to a staging server, do the transformation
there, and then transfer the data from staging to the production server, but that is a lot of
work and takes a lot of resources.
Let's run through an example:

Launch the Import and Export Wizard

• On the Start menu, point to All Programs, point to Microsoft SQL Server 2005/2008
(depending on the SQL Server version you have installed), and then click Import and
Export Data, or
• In SQL Server Management Studio, connect to the Database Engine server type, expand
Databases, right-click a database, point to Tasks, and then click Import Data or Export
Data. (If you click Import Data, the details of the server on which you are performing
the operation come up by default as the destination; likewise, if you click Export Data,
those details come up by default as the source.) Or,
• In a command prompt window, run DTSWizard.exe.
After the welcome screen, the next screen asks you to specify the source server name, the
credentials to use, and the database name, as given below:
After clicking Next, you will be prompted to enter the destination server name, the
credentials to use, and the database name (if you want to create a new database, you can do so
by clicking the New button), as given below:
After clicking Next, you will be prompted to choose whether you want data from tables or
views, or whether you want to write a query to pull the data. The next screen varies depending
on your choice.

After clicking Next (I chose the first option, to pull data from tables or views), you will be
given a list of all the available tables and views at the source. Choose the tables or views
you want to pull data from. The Edit Mappings button allows you to change the mapping of
columns between source and destination, while the Preview button shows the top 100 records of
the selected table or view.
After clicking Next, you will be presented with two options: whether to run the package
immediately, and whether to save the package for later use. If you choose to save it, you pick
either the file system or SQL Server, and the next screen asks for the location or server
details accordingly.
After clicking Next and finally the Finish button, the Import and Export Wizard starts
transferring the data and shows the status as below; on completion you can click the Close
button:
SSIS Designer
SSIS Designer is a very rich graphical tool that you can use to create and maintain
Integration Services packages. Using this designer you can construct the control flow and data
flows of a package and add event handlers to the package and its objects. It also shows
execution progress at run-time. It has four permanent tabs, and one additional tab pops up
during execution to show the package progress, as given below:

Control Flow Tab – You construct the control flow of a package on the design surface of the
Control Flow tab. Drag items from the Toolbox to the design surface and connect them into a
control flow by clicking the icon for an item and then dragging the arrow from one item to
another.

Data Flow Tab – This is used when a package contains a Data Flow Task; you construct the data
flows of a package on the design surface of the Data Flow tab. When you double-click a Data
Flow Task on the Control Flow tab, the details of that Data Flow Task open on the Data Flow
design surface, where you can define its data flows.

Event Handlers Tab – A package and its components expose various events during their execution
life-cycle. You can create event handlers for these events on the Event Handlers design
surface. (I will discuss event handlers in more detail in my next article, "SQL Server
Integration Services - Features and Properties".)

Package Explorer Tab – The Package Explorer tab displays the contents of the package. Packages
can be complex, including many tasks, connection managers, variables, and other elements; the
explorer view gives you a complete list of the package's elements.

Progress Tab – This tab appears when you execute a package in the designer and shows the
execution progress. It changes to Execution Results once you stop executing the package, and
it retains the results of the last execution until you close the package or re-run it.

At the bottom there is a Connection Managers tray, which displays all the connection managers
used by and available to the package.

While a package executes in the designer, every task changes its color as given below:

• No color/white – The execution of the task has not started yet.
• Yellow – The execution of the task is in progress.
• Green – The execution of the task has completed successfully.
• Red – The execution of the task has completed but failed.

Now let's run through some examples of creating SSIS packages:

Note:
In the examples below I have used one task per package for simplicity; in real life a package
might contain several tasks.

Scenario 1
In this example I will create a very simple package; it will use the Execute Process Task to
launch Notepad.

Open a new package and drag an Execute Process Task from the Toolbox to the Control Flow.
Right-click the task, click Edit, and set the relevant properties as shown below:

Now your package is all set to run. Press F5 and the package will start executing; click the
Progress tab to see the execution progress.
Scenario 2
In this example I will create a very simple package; it will use a Data Flow Task to pull data
from source to destination.

Open a new package and drag a Data Flow Task from the Toolbox to the Control Flow.

Double-click the Data Flow Task; its details open on the Data Flow designer tab. Drag a source
and a destination from the Toolbox (the Toolbox on the Data Flow tab changes its content; it
now shows only source, transformation, and destination components) to the designer and specify
the source and destination details.

While configuring the source, you will have to select one of the available connection
managers, the data access method, and the columns you want to pass through the data path.

While configuring the destination, you will have to select one of the available connection
managers, the data access method, and the mapping between source and destination columns.
SSIS is smart enough to do this mapping on its own, based on the similarity of column names
and types between source and destination, but if required you can change it.

Scenario 3
In this example I will take the same package created in Scenario 2 and add a Derived Column
transformation (VendorDetails = AccountNumber + ":" + Name) and a Sort transformation (to sort
the incoming record-set on the Name column before uploading) along the way. (You can apply one
or more transformations between source and destination as your needs require.)
In the next article in this series we will look at how to use the SSIS API Model.
SQL Server Integration Services: An Introduction - Part 4

SSIS provides an API object model, which you can use from your programming language of
choice to create SSIS packages programmatically. So what is the need for creating a package
programmatically if you can do that using SSIS Designer? Let's consider a
scenario:

Though you can create a package with multiple Data Flow Tasks, and multiple data flows
in a single Data Flow Task, you cannot change the mapping between source and destination at
runtime; that is, you cannot change the source, destination, and column-mapping metadata while
a package is executing. So now the problem is: what if you want to build a generic loading
package that can load data from any data source to any destination as long as the metadata is
known? What if you want to create a self-modifying package?

In a scenario like this, you can use the SSIS API object model to write code in C#, VB.NET,
etc. to create a package programmatically on the fly and execute it.

Here are some namespaces/assemblies you will use frequently when creating packages
programmatically: Microsoft.SqlServer.Dts.Runtime (in the Microsoft.SqlServer.ManagedDTS
assembly) for packages, tasks, connections, and variables, plus task-specific namespaces such
as Microsoft.SqlServer.Dts.Tasks.ExecuteProcess, which is used in the example below.
Now let's run through an example of using the SSIS API. In this example you will write code
using the SSIS API to create a package, add a task to it, and save it to the file system (the
same package created in the Scenario 1 example of the SSIS Designer section above). Later you
will load that package from the file system and execute it, everything programmatically this
time.

using System;
using Microsoft.SqlServer.Dts.Runtime;
using Microsoft.SqlServer.Dts.Tasks.ExecuteProcess;

namespace SSIS_API_Programming
{
    class Program
    {
        static void Main(string[] args)
        {
            CreateAndSavePackage();
            LoadAndExecutePackage();
        }

        private static void CreateAndSavePackage()
        {
            Package pkg = new Package();
            pkg.Name = "MySSISAPIExamplePackage";

            //Adding ExecuteProcessTask to the package.
            //STOCK is the moniker most often used in the
            //Microsoft.SqlServer.Dts.Runtime.Executables.Add(System.String) method,
            //though you can also specify a task by name or by ID.
            Executable exec = pkg.Executables.Add("STOCK:ExecuteProcessTask");

            //The TaskHost class is a wrapper for every task.
            TaskHost thExecuteProcessTask = exec as TaskHost;
            thExecuteProcessTask.Name = "Execute Process Task";

            //Set the relevant properties of the task.
            ExecuteProcess execPro = (ExecuteProcess)thExecuteProcessTask.InnerObject;
            execPro.Executable = @"C:\Windows\System32\notepad.exe";
            execPro.WorkingDirectory = @"C:\Windows\System32";

            Application app = new Application();
            //Save the package to the file system; you can choose to save to SQL Server as well.
            app.SaveToXml(@"D:\ExecuteProcess.dtsx", pkg, null);
        }

        private static void LoadAndExecutePackage()
        {
            Package pkg;
            Application app;
            DTSExecResult pkgResults;

            app = new Application();
            //Load the package from the file system; you can choose to load from SQL Server as well.
            pkg = app.LoadPackage(@"D:\ExecuteProcess.dtsx", null);

            //Execute the package.
            pkgResults = pkg.Execute();
            Console.WriteLine(pkgResults.ToString());
            Console.ReadKey();
        }
    }
}
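A note on building this: to compile the example above you need references to the SSIS
assemblies that expose these namespaces (Microsoft.SqlServer.ManagedDTS for the runtime
namespace, plus the assembly containing the Execute Process task). The saved
ExecuteProcess.dtsx is an ordinary package, so it can also be run with DTEXEC or scheduled
through a SQL Server Agent job, as described in Part 1.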

Conclusion
In this article series I discussed SSIS, a platform for building high-performance data
integration and workflow solutions. To achieve this, it uses a data flow engine separate
from the runtime engine and does multithreading (allowing multiple executables/data flows
to run in parallel). I then talked about the different ways to create SSIS packages and the
different kinds of SSIS components. In the next article, I am going to write more about the
features and properties of SSIS: event logging, event handlers, transaction support,
checkpoint restart-ability, and the SSIS validation process. So stay tuned to see the power
and capabilities of SSIS.
