O F F I C I A L   M I C R O S O F T   L E A R N I N G   P R O D U C T
10777A
Implementing a Data Warehouse with Microsoft SQL Server 2012
Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement of Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not
responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement of Microsoft of the site or the products contained
therein.
© 2012 Microsoft Corporation. All rights reserved.
These license terms are an agreement between Microsoft Corporation and you. Please read them. They apply to
the Licensed Content named above, which includes the media on which you received it, if any. These license
terms also apply to any updates, supplements, internet based services and support services for the Licensed
Content, unless other terms accompany those items. If so, those terms apply.
BY DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS. IF YOU DO NOT ACCEPT
THEM, DO NOT DOWNLOAD OR USE THE LICENSED CONTENT.
If you comply with these license terms, you have the rights below.
1. DEFINITIONS.
a. "Authorized Learning Center" means a Microsoft Learning Competency Member, Microsoft IT Academy
Program Member, or such other entity as Microsoft may designate from time to time.
b. "Authorized Training Session" means the Microsoft-authorized instructor-led training class using only
MOC Courses that are conducted by a MCT at or through an Authorized Learning Center.
c. "Classroom Device" means one (1) dedicated, secure computer that you own or control that meets or
exceeds the hardware level specified for the particular MOC Course located at your training facilities or
primary business location.
d. "End User" means an individual who is (i) duly enrolled for an Authorized Training Session or Private
Training Session, (ii) an employee of a MPN Member, or (iii) a Microsoft full-time employee.
e. "Licensed Content" means the MOC Course and any other content accompanying this agreement.
Licensed Content may include (i) Trainer Content, (ii) sample code, and (iii) associated media.
f. "Microsoft Certified Trainer" or "MCT" means an individual who is (i) engaged to teach a training session
to End Users on behalf of an Authorized Learning Center or MPN Member, (ii) currently certified as a
Microsoft Certified Trainer under the Microsoft Certification Program, and (iii) holds a Microsoft
Certification in the technology that is the subject of the training session.
g. "Microsoft IT Academy Member" means a current, active member of the Microsoft IT Academy
Program.
h. "Microsoft Learning Competency Member" means a Microsoft Partner Network Program Member in
good standing that currently holds the Learning Competency status.
i. "Microsoft Official Course" or "MOC Course" means the Official Microsoft Learning Product instructor-led
courseware that educates IT professionals or developers on Microsoft technologies.
j. "Microsoft Partner Network Member" or "MPN Member" means a silver or gold-level Microsoft Partner
Network program member in good standing.
k. "Personal Device" means one (1) device, workstation or other digital electronic device that you
personally own or control that meets or exceeds the hardware level specified for the particular MOC
Course.
l. "Private Training Session" means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective. These classes are not advertised or
promoted to the general public and class attendance is restricted to individuals employed by or
contracted by the corporate customer.
m. "Trainer Content" means the trainer version of the MOC Course and additional content designated
solely for trainers to use to teach a training session using a MOC Course. Trainer Content may include
Microsoft PowerPoint presentations, instructor notes, lab setup guide, demonstration guides, beta
feedback form and trainer preparation guide for the MOC Course. To clarify, Trainer Content does not
include virtual hard disks or virtual machines.
2. INSTALLATION AND USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is
licensed on a one copy per user basis, such that you must acquire a license for each individual that
accesses or uses the Licensed Content.
2.1 Below are four separate sets of installation and use rights. Only one set of rights applies to you.
5. you will remove and irretrievably delete all Licensed Content from all Classroom Devices and
servers at the end of the Authorized Training Session,
6. you will only provide access to the Licensed Content to End Users and MCTs,
7. you will only provide access to the Trainer Content to MCTs, and
8. any Licensed Content installed for use during a training session will be done in accordance
with the applicable classroom set-up guide.
Use of Instructional Components in Trainer Content. You may customize, in accordance with the
most recent version of the MCT Agreement, those portions of the Trainer Content that are logically
associated with instruction of a training session. If you elect to exercise the foregoing rights, you
agree: (a) that any of these customizations will only be used for providing a training session, (b) any
customizations will comply with the terms and conditions for Modified Training Sessions and
Supplemental Materials in the most recent version of the MCT agreement and with this agreement.
For clarity, any use of "customize" refers only to changing the order of slides and content, and/or
not using all the slides or content; it does not mean changing or modifying any slide or content.
2.2 Separation of Components. The Licensed Content components are licensed as a single unit and you
may not separate the components and install them on different devices.
2.4 Third Party Programs. The Licensed Content may contain third party programs or services. These
license terms will apply to your use of those third party programs or services, unless other terms accompany
those programs and services.
2.5 Additional Terms. Some Licensed Content may contain components with additional terms,
conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also
apply to that respective component and supplement the terms described in this Agreement.
3. PRE-RELEASE VERSIONS. If the Licensed Content is a pre-release ("beta") version, in addition to the other
provisions in this agreement, then these terms also apply:
a. Pre-Release Licensed Content. This Licensed Content is a pre-release version. It may not contain the
same information and/or work the way a final version of the Licensed Content will. We may change it
for the final version. We also may not release a final version. Microsoft is under no obligation to
provide you with any further content, including the final release version of the Licensed Content.
b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly or
through its third party designee, you give to Microsoft without charge, the right to use, share and
commercialize your feedback in any way and for any purpose. You also give to third parties, without
charge, any patent rights needed for their products, technologies and services to use or interface with
any specific parts of a Microsoft software, Microsoft product, or service that includes the feedback. You
will not give feedback that is subject to a license that requires Microsoft to license its software,
technologies, or products to third parties because we include your feedback in them. These rights
survive this agreement.
c. Term. If you are an Authorized Training Center, MCT or MPN, you agree to cease using all copies of the
beta version of the Licensed Content upon (i) the date which Microsoft informs you is the end date for
using the beta version, or (ii) sixty (60) days after the commercial release of the Licensed Content,
whichever is earliest ("beta term"). Upon expiration or termination of the beta term, you will
irretrievably delete and destroy all copies of same in your possession or under your control.
4. INTERNET-BASED SERVICES. Classroom Devices located at an Authorized Learning Center's physical location
may contain virtual machines and virtual hard disks for use while attending an Authorized Training
Session. You may only use the software on the virtual machines and virtual hard disks on a Classroom
Device solely to perform the virtual lab activities included in the MOC Course while attending the
Authorized Training Session. Microsoft may provide Internet-based services with the software included
with the virtual machines and virtual hard disks. It may change or cancel them at any time. If the
software is pre-release versions of software, some of its Internet-based services may be turned on by
default. The default settings in these versions of the software do not necessarily reflect how the features
will be configured in the commercially released versions. If Internet-based services are included with the
software, they are typically simulated for demonstration purposes in the software and no transmission
over the Internet takes place. However, should the software be configured to transmit over the Internet,
the following terms apply:
a. Consent for Internet-Based Services. The software features described below connect to Microsoft or
service provider computer systems over the Internet. In some cases, you will not receive a separate
notice when they connect. You may switch off these features or not use them. By using these features,
you consent to the transmission of this information. Microsoft does not use the information to identify
or contact you.
b. Computer Information. The following features use Internet protocols, which send to the appropriate
systems computer information, such as your Internet protocol address, the type of operating system,
browser and name and version of the software you are using, and the language code of the device
where you installed the software. Microsoft uses this information to make the Internet-based services
available to you.
Accelerators. When you click on or move your mouse over an Accelerator, the title and full web
address or URL of the current webpage, as well as standard computer information, and any content
you have selected, might be sent to the service provider. If you use an Accelerator provided by
Microsoft, the information sent is subject to the Microsoft Online Privacy Statement, which is
available at go.microsoft.com/fwlink/?linkid=31493. If you use an Accelerator provided by a third
party, use of the information sent will be subject to the third party's privacy practices.
Automatic Updates. This software contains an Automatic Update feature that is on by default. For
more information about this feature, including instructions for turning it off, see
go.microsoft.com/fwlink/?LinkId=178857. You may turn off this feature while the software is
running (opt out). Unless you expressly opt out of this feature, this feature will (a) connect to
Microsoft or service provider computer systems over the Internet, (b) use Internet protocols to send
to the appropriate systems standard computer information, such as your computer's Internet
protocol address, the type of operating system, browser and name and version of the software you
are using, and the language code of the device where you installed the software, and (c)
automatically download and install, or prompt you to download and/or install, current Updates to
the software. In some cases, you will not receive a separate notice before this feature takes effect.
By installing the software, you consent to the transmission of standard computer information and
the automatic downloading and installation of updates.
Auto Root Update. The Auto Root Update feature updates the list of trusted certificate authorities.
You can switch off the Auto Root Update feature.
Customer Experience Improvement Program (CEIP), Error and Usage Reporting; Error Reports. This
software uses CEIP and Error and Usage Reporting components enabled by default that
automatically send to Microsoft information about your hardware and how you use this software.
This software also automatically sends error reports to Microsoft that describe which software
components had errors and may also include memory dumps. You may choose not to use these
software components. For more information, please go to
go.microsoft.com/fwlink/?LinkID=196910.
Digital Certificates. The software uses digital certificates. These digital certificates confirm the
identity of Internet users sending X.509 standard encrypted information. They also can be used to
digitally sign files and macros, to verify the integrity and origin of the file contents. The software
retrieves certificates and updates certificate revocation lists. These security features operate only
when you use the Internet.
Extension Manager. The Extension Manager can retrieve other software through the internet from
the Visual Studio Gallery website. To provide this other software, the Extension Manager sends to
Microsoft the name and version of the software you are using and language code of the device
where you installed the software. This other software is provided by third parties to Visual Studio
Gallery. It is licensed to users under terms provided by the third parties, not from Microsoft. Read
the Visual Studio Gallery terms of use for more information.
IPv6 Network Address Translation (NAT) Traversal service (Teredo). This feature helps existing
home Internet gateway devices transition to IPv6. IPv6 is a next generation Internet protocol. It
helps enable end-to-end connectivity often needed by peer-to-peer applications. To do so, each
time you start up the software the Teredo client service will attempt to locate a public Teredo
Internet service. It does so by sending a query over the Internet. This query only transfers standard
Domain Name Service information to determine if your computer is connected to the Internet and
can locate a public Teredo service. If you use an application that needs IPv6 connectivity or if you
configure your firewall to always enable IPv6 connectivity, then by default standard Internet Protocol
information will be sent to the Teredo service at Microsoft at regular intervals. No other information is
sent to Microsoft. You can change this default to use non-Microsoft servers. You can also switch off this
feature using a command line utility named netsh.
Malicious Software Removal. During setup, if you select "Get important updates for installation",
the software may check and remove certain malware from your device. Malware is malicious
software. If the software runs, it will remove the Malware listed and updated at
www.support.microsoft.com/?kbid=890830. During a Malware check, a report will be sent to
Microsoft with specific information about Malware detected, errors, and other information about
your device. This information is used to improve the software and other Microsoft products and
services. No information included in these reports will be used to identify or contact you. You may
disable the software's reporting functionality by following the instructions found at
Microsoft Digital Rights Management. If you use the software to access content that has been
protected with Microsoft Digital Rights Management (DRM), then, in order to let you play the
content, the software may automatically request media usage rights from a rights server on the
Internet and download and install available DRM updates. For more information, see
go.microsoft.com/fwlink/?LinkId=178857.
Microsoft Update Feature. To help keep the software up-to-date, from time to time, the software
connects to Microsoft or service provider computer systems over the Internet. In some cases, you
will not receive a separate notice when they connect. When the software does so, we check your
version of the software and recommend or download updates to your devices. You may not receive
notice when we download the update. You may switch off this feature.
Network Awareness. This feature determines whether a system is connected to a network by either
passive monitoring of network traffic or active DNS or HTTP queries. The query only transfers
standard TCP/IP or DNS information for routing purposes. You can switch off the active query
feature through a registry setting.
Plug and Play and Plug and Play Extensions. You may connect new hardware to your device, either
directly or over a network. Your device may not have the drivers needed to communicate with that
hardware. If so, the update feature of the software can obtain the correct driver from Microsoft and
install it on your device. An administrator can disable this update feature.
Real Simple Syndication (RSS) Feed. This software start page contains updated content that is
supplied by means of an RSS feed online from Microsoft.
Search Suggestions Service. When you type a search query in Internet Explorer by using the Instant
Search box or by typing a question mark (?) before your search term in the Address bar, you will see
search suggestions as you type (if supported by your search provider). Everything you type in the
Instant Search box or in the Address bar when preceded by a question mark (?) is sent to your
search provider as you type it. In addition, when you press Enter or click the Search button, all the
text that is in the search box or Address bar is sent to the search provider. If you use a Microsoft
search provider, the information you send is subject to the Microsoft Online Privacy Statement,
which is available at go.microsoft.com/fwlink/?linkid=31493. If you use a third-party search
provider, use of the information sent will be subject to the third party's privacy practices. You can
turn search suggestions off at any time in Internet Explorer by using Manage Add-ons under the
Tools button. For more information about the search suggestions service, see
go.microsoft.com/fwlink/?linkid=128106.
SQL Server Reporting Services Map Report Item. The software may include features that retrieve
content such as maps, images and other data through the Bing Maps (or successor branded)
application programming interface (the "Bing Maps APIs"). The purpose of these features is to
create reports displaying data on top of maps, aerial and hybrid imagery. If these features are
included, you may use them to create and view dynamic or static documents. This may be done only
in conjunction with and through methods and means of access integrated in the software. You may
not otherwise copy, store, archive, or create a database of the content available through the Bing
Maps APIs. You may not use the following for any purpose even if they are available through the
Bing Maps APIs:
Any Road Traffic Data or Bird's Eye Imagery (or associated metadata).
Your use of the Bing Maps APIs and associated content is also subject to the additional terms and
conditions at http://www.microsoft.com/maps/product/terms.html.
URL Filtering. The URL Filtering feature identifies certain types of web sites based upon predefined
URL categories, and allows you to deny access to such web sites, such as known malicious sites and
sites displaying inappropriate or pornographic materials. To apply URL filtering, Microsoft queries
the online Microsoft Reputation Service for URL categorization. You can switch off URL filtering. For
more information on this feature, see http://go.microsoft.com/fwlink/?LinkId=130980
Web Content Features. Features in the software can retrieve related content from Microsoft and
provide it to you. To provide the content, these features send to Microsoft the type of operating
system, name and version of the software you are using, type of browser and language code of the
device where you run the software. Examples of these features are clip art, templates, online
training, online assistance and Appshelp. You may choose not to use these web content features.
Windows Media Digital Rights Management. Content owners use Windows Media digital rights
management technology (WMDRM) to protect their intellectual property, including copyrights. This
software and third party software use WMDRM to play and copy WMDRM-protected content. If the
software fails to protect the content, content owners may ask Microsoft to revoke the software's
ability to use WMDRM to play or copy protected content. Revocation does not affect other content.
When you download licenses for protected content, you agree that Microsoft may include a
revocation list with the licenses. Content owners may require you to upgrade WMDRM to access
their content. Microsoft software that includes WMDRM will ask for your consent prior to the
upgrade. If you decline an upgrade, you will not be able to access content that requires the upgrade.
You may switch off WMDRM features that access the Internet. When these features are off, you can
still play content for which you have a valid license.
Windows Media Player. When you use Windows Media Player, it checks with Microsoft for:
compatible online music services in your region; new versions of the player; and codecs if your
device does not have the correct ones for playing content. You can switch off this last feature. For
more information, go to
www.microsoft.com/windows/windowsmedia/player/11/privacy.aspx.
Windows Rights Management Services. The software contains a feature that allows you to create
content that cannot be printed, copied or sent to others without your permission. For more
information, go to www.microsoft.com/rms. You may choose not to use this feature.
Windows Time Service. This service synchronizes with time.windows.com once a week to provide
your computer with the correct time. You can turn this feature off or choose your preferred time
source within the Date and Time Control Panel applet. The connection uses standard NTP protocol.
Windows Update Feature. You may connect new hardware to the device where you run the
software. Your device may not have the drivers needed to communicate with that hardware. If so,
the update feature of the software can obtain the correct driver from Microsoft and run it on your
device. You can switch off this update feature.
c. Use of Information. Microsoft may use the device information, error reports, and malware reports to
improve our software and services. We may also share it with others, such as hardware and software
vendors. They may use the information to improve how their products run with Microsoft software.
d. Misuse of Internet-based Services. You may not use any Internet-based service in any way that could
harm it or impair anyone else's use of it. You may not use the service to try to gain unauthorized access
to any service, data, account or network by any means.
5. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some rights
to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you more
rights despite this limitation, you may use the Licensed Content only as expressly permitted in this
agreement. In doing so, you must comply with any technical limitations in the Licensed Content that only
allow you to use it in certain ways. Except as expressly permitted in this agreement, you may not:
install more copies of the Licensed Content on devices than the number of licenses you acquired;
allow more individuals to access the Licensed Content than the number of licenses you acquired;
publicly display, or make the Licensed Content available for others to access or use;
install, sell, publish, transmit, encumber, pledge, lend, copy, adapt, link to, post, rent, lease or lend,
make available or distribute the Licensed Content to any third party, except as expressly permitted
by this Agreement.
reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation;
access or use any Licensed Content for which you are not providing a training session to End Users
using the Licensed Content;
access or use any Licensed Content that you have not been authorized by Microsoft to access and
use; or
transfer the Licensed Content, in whole or in part, or assign this agreement to any third party.
6. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to you in
this agreement. The Licensed Content is protected by copyright and other intellectual property laws and
treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in the
Licensed Content. You may not remove or obscure any copyright, trademark or patent notices that
appear on the Licensed Content or any components thereof, as delivered to you.
7. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations. You
must comply with all domestic and international export laws and regulations that apply to the Licensed
Content. These laws include restrictions on destinations, End Users and end use. For additional
information, see www.microsoft.com/exporting.
8. LIMITATIONS ON SALE, RENTAL, ETC. AND CERTAIN ASSIGNMENTS. You may not sell, rent, lease, lend or
sublicense the Licensed Content or any portion thereof, or transfer or assign this agreement.
9. SUPPORT SERVICES. Because the Licensed Content is "as is", we may not provide support services for it.
10. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you fail
to comply with the terms and conditions of this agreement. Upon any termination of this agreement, you
agree to immediately stop all use of and to irretrievably delete and destroy all copies of the Licensed
Content in your possession or under your control.
11. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed Content.
The third party sites are not under the control of Microsoft, and Microsoft is not responsible for the
contents of any third party sites, any links contained in third party sites, or any changes or updates to third
party sites. Microsoft is not responsible for webcasting or any other form of transmission received from
any third party sites. Microsoft is providing these links to third party sites to you only as a convenience,
and the inclusion of any link does not imply an endorsement by Microsoft of the third party site.
12. ENTIRE AGREEMENT. This agreement, and the terms for supplements, updates and support services are
the entire agreement for the Licensed Content.
13. APPLICABLE LAW.
a. United States. If you acquired the Licensed Content in the United States, Washington state law governs
the interpretation of this agreement and applies to claims for breach of it, regardless of conflict of laws
principles. The laws of the state where you live govern all other claims, including claims under state
consumer protection laws, unfair competition laws, and in tort.
b. Outside the United States. If you acquired the Licensed Content in any other country, the laws of that
country apply.
14. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws of
your country. You may also have rights with respect to the party from whom you acquired the Licensed
Content. This agreement does not change your rights under the laws of your country if the laws of your
country do not permit it to do so.
15. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS," "WITH ALL FAULTS," AND "AS
AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT CORPORATION AND ITS RESPECTIVE
AFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS UNDER OR IN RELATION TO
THE LICENSED CONTENT. YOU MAY HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS
WHICH THIS AGREEMENT CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS,
MICROSOFT CORPORATION AND ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES OR
CONDITIONS, INCLUDING THOSE OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NON-INFRINGEMENT.
16. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. TO THE EXTENT NOT PROHIBITED BY
LAW, YOU CAN RECOVER FROM MICROSOFT CORPORATION AND ITS SUPPLIERS ONLY DIRECT
DAMAGES UP TO USD$5.00. YOU AGREE NOT TO SEEK TO RECOVER ANY OTHER DAMAGES, INCLUDING
CONSEQUENTIAL, LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES FROM MICROSOFT
CORPORATION AND ITS RESPECTIVE SUPPLIERS.
Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this agreement
are provided below in French.
Remarque : Ce contenu sous licence étant distribué au Québec, Canada, certaines des clauses dans ce
contrat sont fournies ci-dessous en français.
EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à votre seule risque et péril. Microsoft n'accorde aucune autre garantie
expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection des
consommateurs, que ce contrat ne peut modifier. Là où elles sont permises par le droit local, les garanties
implicites de qualité marchande, d'adéquation à un usage particulier et d'absence de contrefaçon sont exclues.
LIMITATION DES DOMMAGES-INTÉRÊTS ET EXCLUSION DE RESPONSABILITÉ POUR LES DOMMAGES. Vous
pouvez obtenir de Microsoft et de ses fournisseurs une indemnisation en cas de dommages directs uniquement
à hauteur de 5,00 $ US. Vous ne pouvez prétendre à aucune indemnisation pour les autres dommages, y
compris les dommages spéciaux, indirects ou accessoires et pertes de bénéfices.
Cette limitation concerne :
tout ce qui est relié au contenu sous licence, aux services ou au contenu (y compris le code)
figurant sur des sites Internet tiers ou dans des programmes tiers ; et
les réclamations au titre de violation de contrat ou de garantie, ou au titre de responsabilité
stricte, de négligence ou d'une autre faute dans la limite autorisée par la loi en vigueur.
Elle s'applique également, même si Microsoft connaissait ou devrait connaître l'éventualité d'un tel dommage.
Si votre pays n'autorise pas l'exclusion ou la limitation de responsabilité pour les dommages indirects,
accessoires ou de quelque nature que ce soit, il se peut que la limitation ou l'exclusion ci-dessus ne s'appliquera
pas à votre égard.
EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d'autres droits prévus
par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre pays
si celles-ci ne le permettent pas.
Revised March 2012
Acknowledgments
Microsoft Learning would like to acknowledge and thank the following for their contribution towards
developing this title. Their effort at various stages in the development has ensured that you have a good
classroom experience.
Graeme Malcolm is a Microsoft SQL Server subject matter expert and professional content developer at
Content Master, a division of CM Group Ltd. As a Microsoft Certified Trainer, Graeme has delivered
training courses on SQL Server since version 4.2; as an author, Graeme has written numerous books,
articles, and training courses on SQL Server; and as a consultant, Graeme has designed and implemented
business solutions based on SQL Server for customers all over the world.
Geoff Allix is a Microsoft SQL Server subject matter expert and professional content developer at Content
Master, a division of CM Group Ltd. Geoff is a Microsoft Certified IT Professional for SQL Server with
extensive experience in designing and implementing database and BI solutions on SQL Server
technologies, and has provided consultancy services to organizations seeking to implement and optimize
data warehousing and OLAP solutions.
Martin Ellis is a Microsoft SQL Server subject matter expert and professional content developer at Content
Master, a division of CM Group Ltd. Martin is a Microsoft Certified Technical Specialist on SQL Server and
an MCSE. He has been working with SQL Server since version 7.0, as a DBA, consultant and Microsoft
Certified Trainer, and has developed a wide range of technical collateral for Microsoft Corp. and other
technology enterprises.
Chris Testa-O'Neil is a Senior Consultant at Coeo (www.coeo.com), a leading provider of SQL Server
Managed Support and Consulting in the UK and Europe. He is also a Microsoft Certified Trainer, Microsoft
Most Valuable Professional for SQL Server, and lead author of Microsoft E-Learning MCTS courses for SQL
Server 2008. Chris has spoken at a range of SQL Server events in the UK, Europe, Australia and the United
States. He is also one of the organizers of SQLBits, SQLServerFAQ and a UK Regional Mentor for SQLPASS.
You can contact Chris at chris@coeo.com, @ctesta_oneill or through his blog at
http://www.coeo.com/sql-server-events/sql-events-and-blogs.aspx.
Contents
Module 1: Introduction to Data Warehousing
Lesson 1: Overview of Data Warehousing
Lesson 2: Considerations for a Data Warehouse Solution
Lab 1: Exploring a Data Warehousing Solution
About This Course
This section provides you with a brief description of the course, audience, suggested prerequisites, and
course objectives.
Course Description
This course describes how to implement a BI platform to support information worker analytics. Students
will learn how to create a data warehouse with Microsoft SQL Server 2012, implement ETL with SQL
Server Integration Services, and validate and cleanse data with SQL Server Data Quality Services and SQL
Server Master Data Services.
Audience
This course is intended for database professionals who need to fulfill a Business Intelligence (BI) Developer
role. They will need to focus on hands-on work creating BI solutions including Data Warehouse
implementation, ETL, and data cleansing. Primary responsibilities include:
Student Prerequisites
This course requires that you meet the following prerequisites:
An awareness of key business priorities such as revenue, profitability, and financial accounting is
desirable.
Course Objectives
After completing this course, students will be able to:
Implement an SSIS solution that supports incremental data warehouse loads and changing data.
Describe how information workers can consume data from the data warehouse.
Course Outline
This section provides an outline of the course:
Module 1, "Introduction to Data Warehousing"
Module 2, "Data Warehouse Hardware"
Module 3, "Designing and Implementing a Data Warehouse"
Module 4, "Creating an ETL Solution with SSIS"
Module 5, "Implementing Control Flow in an SSIS Package"
Module 6, "Debugging and Troubleshooting SSIS Packages"
Module 7, "Implementing an Incremental ETL Process"
Module 8, "Incorporating Data from the Cloud into a Data Warehouse"
Module 9, "Enforcing Data Quality"
Module 10, "Using Master Data Services"
Module 11, "Extending SQL Server Integration Services"
Module 12, "Deploying and Configuring SSIS Packages"
Module 13, "Consuming Data in a Data Warehouse"
Course Materials
The following materials are included with your kit:
Course Handbook: A succinct classroom learning guide that provides all the critical technical
information in a crisp, tightly-focused format, which is just right for an effective in-class learning
experience.
Lessons: Guide you through the learning objectives and provide the key points that are critical to
the success of the in-class learning experience.
Labs: Provide a real-world, hands-on platform for you to apply the knowledge and skills learned
in the module.
Module Reviews and Takeaways: Provide improved on-the-job reference material to boost
knowledge and skills retention.
Lab Answer Keys: Provide step-by-step lab solution guidance at your fingertips when it's needed.
Modules: Include companion content, such as questions and answers, detailed demo steps and
additional reading links, for each lesson. Additionally, they include Lab Review questions and answers
and Module Reviews and Takeaways sections, which contain the review questions and answers, best
practices, common issues and troubleshooting tips with answers, and real-world issues and scenarios
with answers.
Resources: Include well-categorized additional resources that give you immediate access to the most
up-to-date premium content on TechNet, MSDN, and Microsoft Press.
Course evaluation: At the end of the course, you will have the opportunity to complete an online
evaluation to provide feedback on the course, training facility, and instructor.
Virtual Machine          Role
10777-8A-MIA-SQLBI       Application Server
10777-8-MIA-DC1          Domain Controller
MT11-MSL-TMG1            Internet Gateway
Software Configuration
The following software is installed on each VM:
Course Files
There are files associated with the labs in this course. The lab files are located in the folder
D:\10777A\Labfiles\LabXX on the 10777-8A-MIA-SQLBI VM.
Classroom Setup
Each classroom computer will have the same virtual machine configured in the same way.
Course Hardware Level 6+
To ensure a satisfactory student experience, Microsoft Learning requires a minimum equipment
configuration for trainer and student computers in all Microsoft Certified Partner for Learning Solutions
(CPLS) classrooms in which Official Microsoft Learning Product courseware is taught.
Module 1
Introduction to Data Warehousing
Contents:
Lesson 1: Overview of Data Warehousing
Lesson 2: Considerations for a Data Warehouse Solution
Lab 1: Exploring a Data Warehousing Solution
Module Overview
Data warehousing is a solution that organizations can use to centralize business data for reporting and
analysis. Implementing a data warehouse solution can provide a business or other organization with
significant benefits, including:
The foundation for an enterprise business intelligence (BI) solution.
Lesson 1
Overview of Data Warehousing
After completing this lesson, you will be able to:
Describe the business problem that data warehouses address.
Define a data warehouse.
Describe the commonly used data warehouse architectures.
Identify the components of a data warehousing solution.
Describe the components and features of Microsoft SQL Server and other Microsoft products that
you can use in a data warehousing solution.
Running a business effectively can present a significant challenge, particularly as the business grows or is
affected by trends in the business's target market or the global economy. To be successful, a business
must adapt to changing conditions, which requires individuals within the organization to make good
strategic and tactical business decisions. However, the following business problems can often make
effective business decision making difficult:
Finding the information required for business decision making is time-consuming and error-prone. The
need to gather and reconcile data from multiple sources results in slow, inefficient decision making
processes that can be further undermined through inconsistencies between duplicate, contradictory
sources of the same information.
By resolving these problems, it is possible to make effective decisions that will help the business to be
more successful, both at the strategic, executive level and during day-to-day business operations.
What Is a Data Warehouse?
A data warehouse provides a solution to the problem of distributed data that prevents effective business
decision making. There are many definitions for the term "data warehouse", and disagreements over
specific implementation details, but it is generally agreed that a data warehouse is a centralized store of
business data that can be used for reporting and analysis to inform business decisions.
Typically, a data warehouse:
Is optimized for read operations that support querying the data. This is in contrast to a typical online
transaction processing (OLTP) database that is designed to support data insert, update, and delete
operations, too.
Is loaded with new or updated data at regular intervals.
Data Warehouse Architectures
Creating a single, central enterprise data warehouse for all business units.
Creating a hub-and-spoke architecture that synchronizes a central enterprise data warehouse with
departmental data marts that contain a subset of the data warehouse data.
Components of a Data Warehousing Solution
A data warehousing solution usually consists of the following elements:
Data staging areas. Intermediary locations where the data that is being transferred to the data
warehouse is stored to prepare it for import into the data warehouse and synchronize data
warehouse loads.
In addition, many data warehousing solutions also include:
Master data management (MDM). A solution that provides an authoritative data definition for
business entities that multiple systems across the organization use.
Data Warehousing Projects
A data warehousing project has a great deal in common with any other IT implementation project, so it is
possible to apply most commonly used methodologies, such as Agile or Microsoft Solutions Framework
(MSF). However, a data warehousing project often requires a deeper understanding of the key business
objectives and metrics that are used to drive decision making than other software development or
infrastructure projects.
A high-level approach to implementing a data warehousing project usually includes the following steps:
1. Work with business stakeholders and information workers to determine the business questions to
which the data warehouse must provide answers. They may include questions such as:
What are our most profitable products or services?
Which sales employees are meeting their sales targets?
2. Identify data sources that contain the data that is required to answer the business questions. These
are commonly relational databases that existing line-of-business applications use, but they can also
include:
Flat files or XML documents that have been extracted from proprietary systems.
Commercially available data that has been purchased from a data supplier such as the Microsoft
Windows Azure Marketplace.
3. Prioritize the business questions that the solution will address based on factors such as:
The importance of answering the question in relation to driving key business objectives.
The feasibility of answering the question with the data that is available in the identified data sources.
A common approach to prioritizing the business questions that you will address in the data
warehousing solution is to work with key business stakeholders and plot each question on a
quadrant-based matrix like the one shown below. The position of the questions in the matrix helps
you to agree the scope of the data warehousing project.
High importance, low feasibility        High importance, high feasibility
Low importance, low feasibility         Low importance, high feasibility
If a large number of questions fall into the high importance, high feasibility category, you may want to
consider taking an incremental approach to the project in which you break down the challenge into a
number of sub-projects. Each sub-project tackles the problem of implementing the data warehouse
schema, ETL solution, and data quality procedures for a specific area of the business, starting with the
highest-priority business questions. If you take this incremental approach, you should take care to create
an overall design for dimension and fact tables in early iterations of the solution so that subsequent
additions to the solution can reuse them.
Data Warehousing Project Roles
A data warehousing project typically involves several roles. These roles include:
A project manager. Coordinates project tasks and schedules and ensures that the project is completed
on time and within budget.
A database administrator. Designs the physical architecture and configuration of the data warehouse
database. In addition, database administrators who have responsibility for data sources that are used
in the data warehousing solution must be involved in the project to provide access to the data
sources that the ETL process uses.
An ETL developer. Builds the ETL workflow for the data warehousing solution.
Data stewards for each key subject area in the data warehousing solution. Determine data quality rules
and validate data before it enters the data warehouse. Data stewards are sometimes referred to as
data governors.
In addition to ensuring the appropriate assignment of these roles, you should also consider the
importance of executive-level sponsorship of the data warehousing project. The project is significantly
more likely to succeed if a high-profile executive sponsor is seen to actively support the creation of the
data warehousing solution.
SQL Server As a Data Warehousing Platform
SQL Server includes components and features that you can use to implement various architectural
elements of a data warehousing solution. These components and features include:
In addition, you can use some SQL Server components and other Microsoft products to build an
enterprise BI solution that extends the value of your data warehouse significantly. These components and
products include:
SQL Server Analysis Services. A service for creating multidimensional and tabular analytical data
models for so-called slice and dice analysis, and for implementing data mining models that you can
use to identify trends and patterns in your data.
SQL Server Reporting Services. A solution for creating and distributing reports in a variety of formats
for online viewing or printing.
Microsoft SharePoint Server. A web-based portal through which information workers can consume
reports and other BI deliverables.
Microsoft Excel. The world's most commonly used spreadsheet and data analysis tool.
Microsoft PowerPivot technologies. A powerful analytical engine that enables analysis of large volumes
of data in Excel and sharing of tabular data models in SharePoint Server.
Microsoft Power View. A data visualization tool that provides an intuitive, interactive experience for
users who need to perform unstructured analysis of data in a BI semantic model.
Lesson 2
Considerations for a Data Warehouse Solution
Before starting a data warehousing project, there are several considerations of which you should be
aware. Understanding these considerations will help you to create a data warehousing solution that
addresses your specific needs and constraints.
This lesson describes some of the key considerations for planning a data warehousing solution. After
completing this lesson, you will be able to:
Data Warehouse Database and Storage
A data warehouse is a relational database that is optimized for reading data for analysis and reporting.
When you are planning a data warehouse, you should take the following considerations into account.
Database Schema
The logical schema of a data warehouse is typically designed to denormalize the data into a structure that
minimizes the number of JOIN operations that are required in the queries that are used to retrieve and
aggregate data. A common approach is to design a star schema in which numerical measures are stored
in fact tables that have foreign keys to multiple dimension tables that contain the business entities by
which the measures can be aggregated. Before you design your data warehouse, you must know which
dimensions your business users need to use when aggregating data, which measures need to be analyzed
and at what granularity, and which facts include those measures. You must also plan the keys that will be
used to link facts to dimensions carefully, and consider whether your data warehouse must support the
use of dimensions that change over time (for example, handling dimension records for customers who
change their address).
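To make the star schema structure concrete, the following Transact-SQL sketch shows a minimal dimension table and fact table. All object and column names (DimCustomer, FactSales, and so on) are illustrative assumptions rather than objects defined in this course.

-- A dimension table with a surrogate key and a business (alternate) key.
CREATE TABLE dbo.DimCustomer
(
    CustomerKey     int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerAltKey  nvarchar(20) NOT NULL,                  -- business key from the source system
    CustomerName    nvarchar(100) NOT NULL,
    City            nvarchar(50) NULL
);

-- A fact table that stores numeric measures and foreign keys to dimensions.
CREATE TABLE dbo.FactSales
(
    OrderDateKey    int NOT NULL,  -- key to a date dimension (not shown)
    CustomerKey     int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    SalesAmount     money NOT NULL,
    OrderQuantity   int NOT NULL
);

-- Queries aggregate the measures by dimension attributes with a single join.
SELECT c.City, SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactSales AS f
JOIN dbo.DimCustomer AS c ON f.CustomerKey = c.CustomerKey
GROUP BY c.City;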
You must also consider the physical implementation of the database, because this will affect the
performance and manageability of the data warehouse. It is common to use table partitioning to
distribute large fact data across multiple filegroups, each on a different physical disk. This can increase
query performance and enables you to implement a filegroup-based backup strategy that can help
reduce downtime in the event of a single-disk failure. You should also consider the appropriate indexing
strategy for your data, and whether to use data compression when storing the data.
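As a sketch of the partitioning approach described above, the following Transact-SQL distributes fact rows across filegroups by a date key. The filegroup names, boundary values, and table and database names are assumptions for the example, not a prescribed design.

-- Assumes filegroups FG2010, FG2011, and FG2012 already exist in the database.
CREATE PARTITION FUNCTION pfOrderDate (int)
AS RANGE RIGHT FOR VALUES (20110101, 20120101);   -- boundaries expressed as date keys

CREATE PARTITION SCHEME psOrderDate
AS PARTITION pfOrderDate TO (FG2010, FG2011, FG2012);

CREATE TABLE dbo.FactSalesPartitioned
(
    OrderDateKey  int NOT NULL,
    CustomerKey   int NOT NULL,
    SalesAmount   money NOT NULL
)
ON psOrderDate (OrderDateKey);   -- each row is stored in the filegroup for its date range

-- A filegroup-based backup of the kind mentioned above.
BACKUP DATABASE DW FILEGROUP = 'FG2012' TO DISK = 'D:\Backups\DW_FG2012.bak';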
Note: Designing a data warehouse schema is covered in more detail in Module 3, "Designing and
Implementing a Data Warehouse".
Hardware
The choice of hardware for your data warehouse solution can make a significant difference to the
performance, manageability, and cost of your data warehouse. The hardware considerations for a data
warehouse include:
You can choose to build your own data warehouse solution by purchasing and assembling individual
components, use a pretested reference architecture, or purchase a hardware appliance that includes
preconfigured components in a ready-to-use package. Factors that influence your choice of hardware
include:
Budget.
Time to solution.
A data warehouse can very quickly become a business-critical part of your overall application
infrastructure, so it is essential to consider how you will ensure its availability. SQL Server includes support
for several high-availability techniques including database mirroring and server clustering. You must
assess these technologies and choose the best one for your individual solution based on:
In addition to a server-level high-availability solution, you must also consider redundancy at the individual
component level for network interfaces and storage arrays.
The most robust high-availability solution cannot protect your data warehouse from every eventuality, so
you must also plan a suitable disaster recovery solution that includes a comprehensive backup strategy.
Your backup strategy should take into account:
Security
Your data warehouse contains a huge volume of data that is typically commercially sensitive. In addition,
you may want to provide access to some data by all users, but restrict access to some data for a subset of
users.
Considerations for securing your data warehouse include:
The authentication mechanisms that you must support to provide access to the data warehouse.
The permissions that the various users who access the data warehouse will require.
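For example, the second of these considerations might be addressed with role-based permissions, as in the following minimal Transact-SQL sketch. The role, schema, and user names are assumptions for the example.

-- Create a role for analysts and grant read-only access to a single schema.
CREATE ROLE SalesAnalysts;
GRANT SELECT ON SCHEMA::Sales TO SalesAnalysts;

-- Add a user to the role (assumes the Windows login already exists).
CREATE USER [ADVENTUREWORKS\PamelaO] FOR LOGIN [ADVENTUREWORKS\PamelaO];
ALTER ROLE SalesAnalysts ADD MEMBER [ADVENTUREWORKS\PamelaO];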
Data Sources
You must identify the data sources that provide the data for your data warehouse, and consider the
following factors when planning your solution.
Your data warehouse may require data from a variety of data sources. For each source, you must consider
how your ETL process can connect and extract the required data. In many cases, your data sources will be
relational databases for which you can use an OLE DB or Open Database Connectivity (ODBC) provider.
However, some data sources may use proprietary storage that requires a bespoke provider or for which
no provider exists. In this case, you must develop a custom provider or determine whether it is possible to
export data from the data source in a format that the ETL process can easily consume (such as XML or
comma-delimited text).
Credentials and Permissions
Most data sources require secure access in the form of user authentication and potentially individual
permissions on the data. You must work with the owners of the data sources that you use in your data
warehousing solution to establish:
Data Formats
A data source may store data in a different format. Your solution must take into account issues arising
from this, including:
Conversion of data from one data type to another; for example, extracting numeric values from a
text file (see the sketch after this list).
Truncation of data when copying data to a destination that has a limited data length.
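Both issues can be handled explicitly in Transact-SQL, as the following sketch shows. The staging table and column names are assumptions; TRY_CONVERT is available in SQL Server 2012.

-- Conversion: extract a numeric value from text, returning NULL for non-numeric values.
SELECT TRY_CONVERT(decimal(10,2), AmountText) AS Amount
FROM stg.RawSales;

-- Truncation: shorten a value deliberately to fit a destination column of limited length.
SELECT LEFT(ProductDescription, 50) AS ShortDescription  -- destination is nvarchar(50)
FROM stg.RawProducts;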
Depending on the workload patterns of the business, each data source may have time periods where the
data source is unavailable or the level of usage is such that the additional overhead of a data extraction is
undesirable. When you plan a data warehousing solution, you must work with each data source owner to
determine appropriate data acquisition windows based on:
The workload pattern of the data source, and its resource utilization and capacity levels.
The volume of data to be extracted, and the time that it takes to extract it.
The frequency with which you need to update the data warehouse with fresh data.
If applicable, the time zones in which business users are accessing the data.
Staging
In some data warehousing solutions, you can transfer data directly from data sources to the data
warehouse without any intermediary staging. However, in many cases, you should consider staging
data to:
Perform transformations on the data that cannot be performed during the data extraction or data
flow processes.
A relational database.
The decision on format is based on several factors, including:
Finally, if a relational database is used as the staging area, you must decide where this database will reside.
Possible choices include:
A dedicated staging database in the same instance of SQL Server as the data warehouse.
A collection of staging tables (perhaps in a dedicated schema) in the data warehouse database.
Factors that you should consider when deciding the location of the staging database include:
The use of Transact-SQL loading techniques that perform better when the staging data and data
warehouse are co-located on the same SQL Server instance (see the sketch after this list).
The server resource overheads that are associated with the staging and data warehouse load
processes.
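The first factor refers to set-based loads such as the following minimal sketch, which assumes that a staging schema (stg) and the data warehouse tables (dw) reside in the same SQL Server instance; all object names are illustrative.

-- A set-based load from a co-located staging table into the data warehouse.
INSERT INTO dw.FactSales (OrderDateKey, CustomerKey, SalesAmount)
SELECT s.OrderDateKey, s.CustomerKey, s.SalesAmount
FROM stg.Sales AS s;

If the staging database were on a separate instance, a load like this would require a linked server or a data flow, which is typically slower.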
Required Transformations
Most ETL processes require that the data that is being extracted from data sources is modified to match
the schema of the data warehouse. When you plan an ETL process for a data warehousing solution, you
must examine the source data and destination schema, and identify what transformations are required.
Then you must determine the optimal place within the ETL process to perform these transformations.
Choices for implementing data transformations include:
During the data extraction. For example, by concatenating two fields in a SQL Server data source into
a single field in the Transact-SQL query that is used to extract the data (see the sketch after this list).
In the data flow. For example, by using a Derived Column data transformation task in a SQL Server
Integration Services data flow.
In the staging area. For example, by using a Transact-SQL query to apply default values to null fields
in a staging table.
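For example, the extraction-time and staging-area options might look like the following minimal Transact-SQL sketch. The table and column names (dbo.Customers, dbo.StagedCustomers, Region) are hypothetical and not part of the course labs.

-- Transformation during extraction: concatenate two fields into one
-- in the query that extracts the data (hypothetical source table).
SELECT CustomerID,
       FirstName + N' ' + LastName AS CustomerName
FROM dbo.Customers;

-- Transformation in the staging area: apply a default value to null
-- fields in a staging table (hypothetical staging table).
UPDATE dbo.StagedCustomers
SET Region = N'Unknown'
WHERE Region IS NULL;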
Factors to consider when choosing where to perform transformations include:
The performance overhead of the transformation. Typically, it is best to use the approach that has the least performance overhead. Set-based operations that are performed in Transact-SQL queries usually perform better than row-based transformations that are applied in a data flow.
The level of support for querying and updating in the data source or staging area. In cases where you
are extracting data from a comma-delimited file and staging it in a raw file, your options to perform
transformations are limited to row-by-row transformations in the data flow.
Dependencies on data that is required for the transformation. For example, you might need to look up
a value in one data source to obtain additional data from another data source. In this case, you must
perform the data transformation in a location where both data sources are accessible.
The complexity of the logic that is involved in the transformation. In some cases, a transformation may
require multiple steps and branches depending on the presence or value of specific data fields. In this
case, it is often easier to apply the transformation by combining several steps in a data flow than it is
to create a Transact-SQL statement to perform the transformation.
Incremental ETL
After the initial load of the data warehouse, you will usually need to incrementally load new or updated
source data into the data warehouse. When you plan your data warehousing solution, you must consider
the following factors that relate to incremental ETL:
How will you identify new or modified records in the data sources?
Do you need to delete records in the data warehouse when corresponding records in the data
sources are deleted? If so, will you physically delete the records, or simply mark them as inactive
(often referred to as a logical delete)?
How will you determine whether a record that is to be loaded into the data warehouse should be a
new record or an update to an existing record?
Are there records in the data warehouse for which historical values must be preserved by creating a
new version of the record instead of updating the existing record?
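For example, a common way to identify new or modified records is to compare a last-modified column in the source against the most recent extraction recorded in a log table. The following Transact-SQL sketch assumes hypothetical source and log table structures (the lab solution in this course uses a dbo.ExtractLog table for a similar purpose):

-- Determine when the last extraction took place
DECLARE @LastExtract datetime;
SELECT @LastExtract = MAX(ExtractDate) FROM dbo.ExtractLog;

-- Extract only rows that have been added or modified since then
SELECT CustomerID, CustomerName, ModifiedDate
FROM dbo.Customers
WHERE ModifiedDate > @LastExtract;

-- Record this extraction for the next incremental run
INSERT INTO dbo.ExtractLog (ExtractDate) VALUES (GETDATE());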
Note Managing data changes in an incremental ETL process is discussed in more detail in
Module 7, Implementing an Incremental ETL Process.
Data Quality and Master Data Management
The usefulness of a data warehouse is largely determined by the quality of the data that it contains. For this reason, when you plan a data warehousing project, you should determine how you will ensure data quality and you should consider the use of a master data management solution.
Data Quality
To validate and enforce the quality of the data in your data warehouse, it is recommended that business users who have knowledge of each subject area that the data warehouse addresses take on the role of data steward for that area. A data steward is responsible for:
Building and maintaining a knowledge base that identifies common data errors and corrections.
Validating data against the knowledge base.
You can use SQL Server Data Quality Services to provide a data quality solution that helps the data steward to perform these tasks.
Note SQL Server Data Quality Services is discussed in more detail in Module 9, Enforcing Data Quality.
Master Data Management
It is common for large organizations to have multiple business applications, and in many cases, these
systems perform tasks that are related to the same business entities. For example, an organization may
have an e-commerce application that enables customers to purchase products, and a separate inventory
management system that also stores data about products. A record representing a particular product may
exist in both systems. It can be useful in this scenario to implement a master data management system
that provides an authoritative definition of each business entity (in this example, a particular product) that
you can use across multiple applications to ensure consistency.
In a data warehousing scenario, the use of master data management is especially important because it
ensures that the data in the data warehouse conforms to the agreed definition for the business entities
that will be included in any analysis and reporting solutions that it must support.
You can use SQL Server Master Data Services to implement a master data management solution.
Note SQL Server Master Data Services is discussed in more detail in Module 10, Using
Master Data Services.
Lab Scenario
The labs in this course are based on a fictional company named Adventure Works Cycles that manufactures and sells cycles and cycling accessories to customers all over the world. Adventure Works sells direct to customers through an e-commerce website, and also through an international network of resellers.
Throughout this course, you will develop a data warehousing solution for Adventure Works Cycles, including a data warehouse, an ETL process to extract data from source systems and populate the data warehouse, a data quality solution, and a master data management solution.
The lab for this module provides a high-level overview of the solution that you will create in later labs. Use this lab to become familiar with the various elements of the data warehousing solution that you will learn to build in later modules. Don't worry if you do not understand the specific details of how each component of the solution has been built; you will explore each element of the solution in greater depth later in the course, as described in the following table.
Lab    Tasks
1      Explore the complete data warehousing solution that will be developed in this course.
5A     Use SQL Server Integration Services to implement data flows that extract, load, and transform data.
5B     Implement control flow to perform sequential and iterative tasks in an ETL solution.
7A
7B     Modify the ETL process to insert or update data in the data warehouse as appropriate.
9A     Use Data Quality Services to cleanse data before loading it into the data warehouse.
9B     Use Data Quality Services to deduplicate data before loading it into the data warehouse.
10     Use Master Data Services to manage data entity consistency across the enterprise.
11
12
13     Explore business intelligence solutions based on the data warehouse you have created.
The completed lab solution that you will create throughout this course is illustrated in the following
image.
Note The illustration includes a Master Data Services model for product data, a Data
Quality Services task to cleanse data as it is staged, and cloud data sources. These elements
form part of the complete solution for the lab scenario in this course, but they are not
present in this lab.
Lab 1: Exploring a Data Warehousing Solution
Exercise 1: Exploring Data Sources
Scenario
Reseller payments are processed by an accounting application.
This distribution of data has made it difficult for business users to answer key questions about the overall performance of the business.
In this exercise, you will examine some of the data sources within Adventure Works that will be used in the data warehousing solution.
1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab01\Starter folder as Administrator.
3. Use Paint to view the Adventure Works DW Solution.jpg JPEG image in the D:\10777A\Labfiles\Lab01\Starter folder, and note the data sources in the solution architecture.
4. Use Microsoft SQL Server Management Studio to open the View Internet Sales.sql Microsoft SQL Server query file in the D:\10777A\Labfiles\Lab01\Starter folder. Use Windows authentication to connect to the localhost instance of SQL Server.
5. Execute the query and examine the results. Note that this data source contains data about customers and the orders that they have placed through the e-commerce web application.
6. Use SQL Server Management Studio to open the View Reseller Sales.sql Microsoft SQL Server query file in the D:\10777A\Labfiles\Lab01\Starter folder.
7. Execute the query and examine the results. Note that this data source contains data about resellers and the orders that they have placed through Adventure Works reseller account managers.
8. Use SQL Server Management Studio to open the View Products.sql Microsoft SQL Server query file in the D:\10777A\Labfiles\Lab01\Starter folder.
9. Execute the query and examine the results. Note that this database contains data about products that Adventure Works sells, and that products are organized into categories and subcategories.
10. Use SQL Server Management Studio to open the View Employees.sql Microsoft SQL Server query file in the D:\10777A\Labfiles\Lab01\Starter folder.
11. Execute the query and examine the results. Note that this database contains data about employees, including sales representatives.
12. Examine the comma-delimited text files in the D:\10777A\Accounts folder by opening them in Microsoft Excel 2010, and note that they contain details of payments that resellers have made.
13. Use Internet Explorer to view the SharePoint site at http://mia-sqlbi, and examine the Regional Account Managers list. There is a link to the Regional Account Managers list in the Quick Launch area of the SharePoint site home page.
14. In SQL Server Management Studio, in the Object Explorer pane, examine the tables in the Staging database in the localhost instance of SQL Server (ensure you examine the Staging database, not the DQS_STAGING_DATA database). Note that all tables other than dbo.ExtractLog in this database are empty.
Results: After this exercise, you should have viewed data in the InternetSales, ResellerSales, and
Products SQL Server databases; viewed payments data in comma-delimited files; viewed a list of regional
account managers in a SharePoint site; and viewed an empty staging database.
Now that you are familiar with the data sources in the Adventure Works data warehousing solution, you
will examine the ETL process that is used to stage the data, and then load it into the data warehouse.
Adventure Works uses a solution based on SQL Server Integration Services to perform this ETL process.
The main tasks for this exercise are as follows:
1. View the solution architecture.
2. Run the ETL staging process.
3. View the staged data.
4. Run the ETL data warehouse load process.
Task 1: View the solution architecture
Use Paint to view the Adventure Works DW Solution.jpg JPEG image in the D:\10777A\Labfiles\Lab01\Starter folder, and note the ETL processes in the solution architecture.
Task 2: Run the ETL staging process
In the Solution Explorer pane, view the SSIS packages that this solution contains, and then double-click Stage Data.dtsx to open it in the designer. The package should resemble this.
Task 3: View the staged data
Task 4: Run the ETL data warehouse load process
View the control flow of the Load DW.dtsx package, and then run the package by clicking Start Debugging on the Debug menu. The package will run other packages to perform the tasks in the control flow. This may take several minutes.
Results: After this exercise, you should have viewed and run the SQL Server Integration Services packages that perform the ETL process for the Adventure Works data warehousing solution.
1. Use Paint to view the Adventure Works DW Solution.jpg JPEG image in the D:\10777A\Labfiles\Lab01\Starter folder, and note the data warehouse in the solution architecture.
2. Use SQL Server Management Studio to open the Query DW.sql Microsoft SQL Server query file in the D:\10777A\Labfiles\Lab01\Starter folder.
3. Use Windows authentication to connect to the localhost instance of SQL Server, and execute the query in the AWDataWarehouse database.
4. Execute the query and examine the results. Note that the data warehouse contains the data necessary to view key business metrics across multiple aspects of the business.
Results: After this exercise, you should have successfully retrieved business information from the data
warehouse.
Module Review and Takeaways
Review Questions
1. Why might you consider including a staging area in your ETL solution?
2. What options might you consider for performing data transformations in an ETL solution?
3. Why would you assign the data steward role to a business user rather than a database technology specialist?
For More Information For more information about Best Practices for Data Warehousing with SQL Server 2008 R2, see http://go.microsoft.com/fwlink/?LinkID=246719.
Module 2
Data Warehouse Hardware
Contents:
Lesson 1: Considerations for Building a Data Warehouse
Lesson 2: Data Warehouse Reference Architectures and Appliances
Module Overview
Explain how to use reference architectures and data warehouse appliances to create a data warehouse.
Lesson 1
Considerations for Building a Data Warehouse
To build a data warehouse that meets the requirements of your organization, it is important that you understand the characteristics of typical data warehouse workloads, how hardware affects data warehouse performance, and what options are available to you for implementing a data warehouse solution.
This lesson describes data warehouse workloads and explains how they differ from the workloads that transactional databases handle. It also explains how hardware affects data warehouse performance, and describes the choices that you have for building a data warehouse.
After completing this lesson, you will be able to:
Describe data warehouse workloads.
Describe the typical components of a data warehouse system.
Describe the considerations for data warehouse hardware.
Describe the options for implementing data warehouse hardware.
Data Warehouse Workloads
A data warehouse might contain millions of rows of data, and will increase in size with every data load. A typical data warehouse query involves selecting, summarizing, grouping, and filtering rows to return a range of data that might itself consist of a large subset of the rows in the database. For example, a business analyst might issue a query that returns a summary of sales for a particular product between two defined dates. Depending on the dates that the analyst chooses, the query might require Microsoft SQL Server to access hundreds of thousands or even millions of rows. This is quite different to the way that an online transactional processing (OLTP) database is generally used. With OLTP databases, most activity involves the addition of new rows, and the updating or deleting of existing rows. Users usually work with data in OLTP databases a few rows at a time; therefore, administrators must optimize the database for the retrieval of small numbers of rows, for example by creating nonclustered indexes.
The different characteristics of data warehouse queries require a different approach to hardware and software configuration than for OLTP databases. Generally, you should optimize data warehouses for sequential disk input/output (I/O) activity, which involves reading rows from the disk in the order that they are requested. For example, if most queries request data for ranges of dates, then you can store the data in date order, which enables the data to be read from the disk as a sequence. You should also keep the following points in mind when considering data warehouse workloads:
Queries typically scan large numbers of rows. Scanning instead of seeking to retrieve rows is more efficient when a large number of rows is involved, particularly when those rows are stored sequentially on the disk. For example, in a fact table which stores rows ordered by date, it is possible to process queries for date ranges by accessing the data sequentially.
Data warehouses contain relatively static data. The contents of a data warehouse typically remain static between each bulk loading of data because users rarely perform update or delete operations. Consequently, database fragmentation is minimized and data remains in the same sequential order on the disk, which improves scanning performance.
Nonclustered indexes can decrease performance. Although nonclustered indexes can speed up queries that return a small number of rows, for queries that return large datasets nonclustered indexes can increase response times because of the random I/O that their use generates. In addition, nonclustered indexes require maintenance and must be rebuilt every time you load data, which adds considerable management and processing overheads, and which can be problematic when you have very narrow processing windows available.
Partitioning can improve query response times. Partitioning enables faster processing of data because
it reduces contention and can reduce the number of rows included in a table scan. Using partitions
also simplifies management of sets of data in the data warehouse and helps to minimize
fragmentation.
Note Workloads can vary significantly between data warehouses, so it is important to
assess each data warehouse independently and not to assume that the considerations
outlined above will apply in every case.
Data Warehouse System Architecture
Choosing the right components for your data warehouse is not just about purchasing the fastest storage solution or as much memory as possible. To build an effective data warehouse solution, you must balance these components together so that a single component does not become a bottleneck in the system and slow down overall throughput. Additionally, you must balance the hardware specification for your data warehouse against the cost of the components. Over-specifying the hardware configuration for your data warehouse may result in expensive, under-utilized hardware that exceeds the requirements for your data warehouse workload.
Software
Server Hardware
A data warehouse requires appropriate server hardware to manage its workload. In most enterprise
scenarios, a data warehouse is implemented as one or more server nodes in a rack, and includes the
following hardware resources:
Processors. The number of processor cores and processor speed can be limiting factors where the
total processing capacity is not great enough to handle the throughput from the other components
in the system. However, adding more or faster processors will only improve performance if the other
components of the system can pass data to and from the processors at fast enough speeds.
Memory. Memory aids performance in various ways, for example by enabling SQL Server queries to
be answered from cache, or by enabling join and sort operations to be performed more efficiently.
When there is insufficient memory present, sort and join operations can utilize disk space, which
reduces the available disk capacity and can cause fragmentation.
Storage
While it is possible to host a data warehouse on internal hard-disks in a database server, in enterprise
scenarios it is more common to use a dedicated storage subsystem that includes:
Enclosures. Storage enclosures include on-board disk controllers that manage redundant array of
independent disks (RAID) storage across multiple disks. The server is connected to the storage
enclosures through a direct access connection to a host adapter, or more commonly through a
network connection.
Disk arrays. Each enclosure in a data warehouse system contains multiple hard disks, usually
configured as RAID 10 arrays. The number and speed of disks in a storage array can affect the
performance of the data warehouse. You can choose from a number of disk form factors depending
on your requirements for storage capacity, physical size, and read/write performance. Some disk form
factors to consider include:
Serial Attached SCSI (SAS) magnetic disk drives in large form-factor (LFF) or small form-factor
(SFF). SAS disks offer large storage capacities and sufficient read/write performance for data
warehouse workloads.
Solid State Drive (SSD): a storage device that uses solid state memory instead of a spinning disk. The lack of moving parts makes SSDs robust and reduces access time when reading data, increasing overall performance. Additionally, SSDs typically require less power than mechanical disks. However, the cost-per-gigabyte of SSDs is typically higher than that of SAS disks.
Disks are often the cause of bottlenecks in data warehouses because if the storage system does not
have enough drives, or the drives are not fast enough, throughput to the other components in the
system is lower, and performance will suffer. Additionally, storage requirements for a data warehouse
typically grow considerably over time, so you must plan for extensibility.
Networking
When implementing networking for a data warehouse system, you must consider two network
connections:
Storage connectivity. The data warehouse server is typically connected to the storage subsystem
through a network connection. In most enterprise data warehouse systems, a fibre channel switch is
used to provide high-speed connectivity between a host bus adapter (HBA) on the server and the
storage enclosures.
External network connectivity. In addition to the internal connection between the data warehouse
server and its storage subsystem, you must consider how you will connect the data warehouse to an
external network so that client applications can connect and use the data warehouse. The type of
network connectivity used for client access to the data warehouse depends on the network topology of your organization's local area network (LAN), but you should use a networking technology that
provides adequate bandwidth and throughput for the volumes of data that will be loaded into the
data warehouse by the extract, transform, and load (ETL) process, and retrieved by client applications.
Options for Implementing a Data Warehouse
To build a data warehouse that meets the reporting and data analysis demands of your organization requires careful planning. A data warehouse is not simply a modified version of a transactional database. The design considerations for data warehouses are quite different to those for OLTP systems. When deciding how to approach building a data warehouse, you should consider several factors, including the available budget, the planned delivery date for the completed solution, and whether your organization has individuals who have the right skills and experience to design and build a data warehouse.
Custom-Build Solution
Custom-build solutions generally take the greatest amount of time to complete. They also require the organization to design, assess, assemble, and test everything in-house. Therefore, the organization must either already employ individuals with the necessary skills, or hire them. Although the apparent cost of a custom-build solution might be less than for reference architecture or appliance-based solutions, extended development times and hiring skilled individuals can significantly increase those costs. Furthermore, there is a risk that despite the planning and testing that you perform, a self-built system might not be capable of meeting the demands placed on it. This is particularly a risk when the individuals involved have limited experience with data warehouse implementations and limited knowledge of data warehouse architecture.
Reference Architectures
The purpose of data warehouse reference architectures is to minimize the risk of failure, reduce costs, and to speed up the time to delivery for the solution. A reference architecture is essentially a blueprint that enables you to create a data warehouse that is based on a tried and tested design, reducing the design time and level of knowledge and expertise that an organization requires. Microsoft Fast Track Data Warehouse is a set of reference architectures that are based on the SQL Server platform. The Fast Track reference architectures use a range of dedicated hardware configurations that are designed to suit many different requirements, enabling companies to get their data warehouse up and running quickly and in a cost-effective manner.
Appliances
A data warehouse appliance is a pre-built system that is designed and optimized for data warehousing. Appliances include servers, storage hardware, an operating system, and a database management system (DBMS). Data warehouse appliances can be based on symmetric multiprocessing (SMP) hardware architectures or, increasingly, on very powerful massively parallel processing (MPP) systems that are targeted at large organizations. Because every component in an appliance is already built and configured, appliances offer the simplest implementation experience, but can be a less flexible solution than custom-built or reference architecture solutions.
Lesson 2
Data Warehouse Reference Architectures and Appliances
Building a data warehouse by sourcing and testing hardware components yourself can be a complex, expensive, and time-consuming process. Reference architectures and data warehouse appliances simplify the process of choosing data warehouse hardware, helping you to stay on budget and on schedule, and to create a data warehouse that genuinely meets the needs of your company.
After completing this lesson, you will be able to:
Explain the benefits of Fast Track reference architectures for building a data warehouse.
Describe the key features of a Parallel Data Warehouse appliance.
Fast Track Data Warehouse
Validated Hardware Configurations
Balanced Hardware
You can use the Fast Track System Sizing Tool to help you to get a basic understanding of the type of
system that you might require. The Fast Track System Sizing Tool is a Microsoft Excel document into
which you can enter maximum consumption rate (MCR), number of concurrent sessions, and data
capacity requirements values, and it will calculate the approximate number of processor cores and storage
units that are required to satisfy these requirements. MCR is a measure of throughput in MB per second. To calculate MCR, you should execute a predefined, read-only query from the buffer cache and measure the time it takes to execute the query and the amount of data processed.
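For example, using hypothetical figures for illustration, if the benchmark query reads 2,400 MB of data from the buffer cache in 12 seconds, the MCR is 2,400 / 12 = 200 MB per second.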
For More Information For more information about SQL Server Fast Track Data
Warehousing for Microsoft SQL Server 2008 R2, see http://go.microsoft.com/fwlink
/?LinkID=246719. You can also download the Fast Track System Sizing Tool from this
website.
Data Warehouse Appliances
Parallel Data Warehouse Appliances
Fast Track Data Warehouse systems and appliances that are based on them use a symmetric multiprocessing (SMP) architecture. With SMP systems, the system bus is the limiting component that prevents scaling up beyond a certain level. As the number of processors and the data load increases, the bus can become overloaded and becomes a bottleneck. For data warehouses that require greater scalability than an SMP system can provide, you can use an enterprise data warehouse appliance based on Microsoft SQL Server Parallel Data Warehouse.
Massively Parallel Processing
Parallel Data Warehouse uses a shared-nothing, massively parallel processing (MPP) architecture, which delivers improved scalability and performance over SMP systems. MPP systems deliver much better performance than SMP servers for large data loads. MPP systems use multiple servers, called nodes, which process queries independently in parallel. Parallel processing involves distributing queries across the nodes so that each node processes only a part of the query; the results of the partial queries are combined after processing completes to create a single result set.
Shared-Nothing Architecture
A Parallel Data Warehouse appliance consists of a server that acts as the control node, and multiple servers that act as compute nodes and storage nodes. Each compute node has its own dedicated processors and memory, and is associated with a dedicated storage node. A dual InfiniBand network connects
the nodes together, and dual fiber channels link the compute nodes to the storage nodes. The control
node intercepts incoming queries, divides each query into multiple smaller operations, and then passes
these on to the compute nodes to process. Each compute node returns the results of its processing back
to the control node. The control node integrates the data to create a result set, which it then returns to
the client.
Control nodes are housed in a rack called the control rack. There are three other types of nodes that share
this rack with the control node:
Landing Zone nodes, which act as staging areas for data that you load into the data warehouse by using an extract, transform, and load (ETL) tool.
Compute nodes and storage nodes are housed in a separate rack called the data rack. To scale the
application, you can add more racks as required. Hardware components are duplicated, including control
and compute nodes, to provide redundancy.
You can use a Parallel Data Warehouse appliance as the hub in a hub and spoke configuration, and
populate data marts directly from the data warehouse. Using a hub and spoke configuration enables you
to integrate the appliance with existing data marts or to create local data marts as required. If you use
Fast Track Data Warehouse systems to build the data marts, you can achieve very fast transfers of data
between the hub and the spokes.
For More Information For more information about the Parallel Data Warehouse for
Microsoft SQL Server 2008 R2, see http://go.microsoft.com/fwlink/?LinkID=246722.
Module Review
Review Questions
1.
2. What are the advantages of using reference architectures to create a data warehouse?
3. What are the key differences between SMP and MPP systems?
Module 3
Designing and Implementing a Data Warehouse
Contents:
Lesson 1: Logical Design for a Data Warehouse
Module Overview
A data warehouse provides a centralized source of data for reporting and analysis. In most cases, a data warehouse must store extremely large volumes of data and provide users with fast responses to complex queries. It is therefore imperative that you implement a data warehouse by using design principles that optimize data storage efficiency and query performance.
In this module, you will learn how to implement the logical and physical architecture of a data warehouse based on industry-proven design principles.
After completing this module, you will be able to:
Lesson 1
Logical Design for a Data Warehouse
The logical schema of a data warehouse plays an important role in determining its effectiveness as a source of reporting and analytical data. A data warehouse is used primarily to answer questions about the business, and must therefore be optimized for data read operations. Optimizing the data in this way makes data warehouses fundamentally different from online transaction processing (OLTP) databases used by business applications, which usually need to handle a combination of data read and write operations.
Although there are several data warehouse design methodologies, the dimensional modeling approach that this lesson describes is an industry-proven technique for creating effective data warehouses.
After completing this lesson, you will be able to:
Describe the key principles of dimensional modeling.
Design and implement dimension tables.
Design and implement fact tables.
Design a snowflake schema for a data warehouse.
Design and implement a time dimension table.
Introduction to Dimensional Modeling
A data warehouse is designed to support reporting and analysis that answers key questions about the business. In most cases, the questions that business executives and information workers ask are concerned with numerical measures (such as sales revenue, cost, profit, or stock level) aggregated by various key aspects, or dimensions, of the business (such as products, customers, employees, or fiscal time periods). For example, it is common for business executives to request reports that show measures such as:
Sales revenue by salesperson
Profit by product line
Cost by product
Sales revenue by customer
Profit by region
Sales revenue by a time period such as fiscal quarter
In each of these examples, the required information consists of a numerical business measure, aggregated by a different dimension of the business. This approach to modeling a data warehouse is called dimensional modeling.
The first step in designing a dimensional model for a data warehouse is to determine the questions that the business users want the data warehouse to provide answers to. Compiling a list of these questions will help you identify the numerical measures (sometimes referred to as facts) and dimensions that the data warehouse must support.
Star Schemas
When you have identified the measures and dimensions that your data warehouse must support, you can start to design the logical schema of the database. A common technique for data warehouse design is to use a star schema, in which:
Related dimensions are grouped into one or more dimension tables.
Related measures are grouped into one or more fact tables.
Fact tables are generally related to multiple dimension tables, creating a schema that can be visualized with the fact table at the center and each dimension table as the point of a star.
Fact tables generally store rows that contain numerical measures involved in a discrete business event, such as a sales order or an account transaction. The dimension tables store data about the business entities that are involved in these events, such as the customer or salesperson. In addition, most data warehouses include a dimension table that stores temporal (time-based) data, so that you can identify when a particular fact event occurred.
Considerations for Dimension Tables
Dimension tables contain the business attributes by which users may want to aggregate the measures in the fact tables. When implementing dimension tables, you should consider two important aspects of dimension table design: denormalization and keys.
Denormalization
In most data warehouses, the dimension tables are often wide, meaning that they include a potentially large number of columns to support the attributes by which users want to view the data. In some cases, this design can lead to a large amount of duplication in the table. For example, consider a dimension table named DimSalesPerson that stores information about sales employees where each sales employee is based in a particular store. Implementing this dimension as a single dimension table may result in the following data.
SalesPersonKey  EmployeeNo  SalesPersonName  StoreName     StoreCity    StoreRegion
1               S1201       Ellen Adams      West Seattle  Seattle      Washington
2               S1343       Jeff Price       West Seattle  Seattle      Washington
3               L1214       Don Hall         Hollywood     Los Angeles  California
4               L1567       Jane Dow         Hollywood     Los Angeles  California
In an OLTP database, you would typically normalize this table to eliminate redundancy by creating a
separate Store table. You could even further normalize the data by creating separate City and Region
tables. You could then retrieve the salesperson, store name, city, and region in a single query that uses
JOIN clauses to link the tables. The primary advantages to this approach in an OLTP solution include:
Any modifications to a store name, city, or region can be limited to updating a single value in a single
row, rather than updating the same store name or region in multiple rows.
However, join operations can slow query performance, and in a data warehouse solution, query
performance is generally more important than saving disk space. In addition, because data warehouse
workloads generally involve few or no updates, the value of normalizing the data is diminished. For these
reasons, dimension tables in a data warehouse are generally denormalized to optimize query
performance, and therefore usually contain duplicate values.
Keys
Each row in a dimension table is uniquely identified by a primary key. In most data warehousing solutions,
the data for the dimensions in the data warehouse originates in a business application, where it may
already have a key assigned. For example, the sales employee records in the DimSalesPerson table that
was described earlier in this topic may have been extracted from an existing human resources database,
where each employee is identified by a unique employee number. In a data warehouse, keys that are
assigned in the source system are generally referred to as business keys.
It may seem sensible, therefore, to reuse the existing source business key in the data warehouse. However,
the best practice is generally to define a new key, known as a surrogate key, for the rows in the dimension
table. This is for the following reasons:
The dimension table may contain records that originate from multiple source systems. In this case,
there is no guarantee that the source system keys are unique or of compatible data types.
The business key used in the source system could be a complex string or globally unique identifier (GUID) data type. Although it is possible to use such values as primary keys in a data warehouse, simple integer
keys usually result in better query performance when joins must be made between fact and
dimension tables. This generally makes it more effective to create a new integer surrogate key than to
use the source business key.
Data warehouses deal with historical data, and you should anticipate potential changes in dimension
attributes. For example, an employee may transfer from the West Seattle store to the Hollywood
store, but your data warehouse must still reflect sales by that employee prior to the transfer as being
related to the West Seattle store, while sales after the transfer should be related to the Hollywood
store. To accomplish this, you need two versions of the salesperson record in the DimSalesPerson
table, and if the employee number was used as the primary key, this would result in a unique key
violation.
Note A dimension that retains historical versions while reflecting updates in source data
as described above is known as a slowly changing dimension. Slowly changing dimensions
are discussed in Module 7, Implementing an Incremental ETL Process.
In most cases, your dimension tables should include a unique surrogate key as the primary key for the
table, with the original business key retained as a column in the dimension table as a second identifier for
the dimension business entity. For this reason, the business key is sometimes referred to as the alternative
key.
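For example, a dimension table that follows this guidance might be defined as in the following Transact-SQL sketch, which uses the DimSalesPerson example from this topic; the specific column data types are illustrative assumptions.

-- Dimension table with an integer surrogate key as the primary key
-- and the original business key retained as an alternative key.
CREATE TABLE dbo.DimSalesPerson
(
    SalesPersonKey int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    EmployeeNo nvarchar(10) NOT NULL,  -- business (alternative) key from the source system
    SalesPersonName nvarchar(50) NOT NULL,
    StoreName nvarchar(50) NULL,
    StoreCity nvarchar(50) NULL,
    StoreRegion nvarchar(50) NULL
);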
Conformed Dimensions
In most cases, data warehouses include conformed dimensions. A conformed dimension is a dimension
table representing a business entity that has the same meaning for all fact tables. For example, a date
dimension table usually contains values for calendar and fiscal dates that are applicable across the entire
organization.
In some cases, some users of the data warehouse may use different definitions for business entities than other users. For example, the manufacturing division might require a different definition for a product
than all other departments. In this scenario, you can either create a separate non-conformed product
dimension for the manufacturing division to use, or add additional columns to the product dimension to
satisfy the needs of the manufacturing division while maintaining a single, conformed product dimension.
Considerations for Fact Tables
Fact tables contain the business measures that can be aggregated across dimensions. When you implement fact tables, you must consider the level of detail stored in the table, the key columns in the table, the measures in the table, and any additional dimension data that needs to be stored in the fact table.
Grain
One of the most important considerations for a fact table is the granularity, or grain, of the measures that it contains. For example, consider a data warehouse in which sales orders must be stored in a fact table. A single order can include multiple line items. Therefore, you can store the sales order measures at the order level, or at the line item level.
If you choose to define the fact table at the order level, the measures in your table might look like the following.
CustomerKey  SalesPersonKey  OrderNo  SalesAmount  Shipping  Discount
                             1001     $3000        $30       $6
                             1002     $1500        $10       $5
This level of grain enables you to include measures that reflect the total sales amount for the individual items in the order, in addition to measures such as shipping and discount that exist at the order level. However, because the order can include multiple products, you cannot aggregate the measures across a product dimension.
Defining the fact table at the line item level might result in a table like the following.
CustomerKey  SalesPersonKey  ProductKey  OrderNo  ItemNo  Quantity  SalesAmount  Shipping  Discount
                             45          1001     1                 $100         $10       $2
                             106         1001     2                 $200         $20       $4
                             15          1002     1                 $150         $10       $5
Using the line item level of grain enables you to aggregate the sales amount by product, and also include
a quantity measure so that you can aggregate sales volume in terms of the number of units sold. The
order-level measures, such as discount and shipping, are spread proportionally across the line items in the
order.
In practice, many organizations need to analyze data at multiple levels of grain, in which case you should
consider creating multiple fact tables. For example, in the scenario described above, you could create a
detailed sales order fact table, and a sales order summary fact table.
Keys
The primary key of a fact table is usually a composite key that includes the columns containing the
foreign-key references to the dimension tables. In some cases, you may choose to include additional
business key columns in the primary key to ensure uniqueness. For example, in the sales order fact table
discussed previously, the primary key should consist of the foreign-key columns and the OrderNo and
ItemNo columns, because it is possible for a customer to place two identical orders with the same
salesperson.
Measures
The measures in a fact table are usually aggregated across dimensions, and the most common way to
aggregate measures is to use a sum function to add them together. However, when defining a fact table,
you must consider the kinds of measure that it contains and how they can be aggregated. Measures
typically fall into one of three categories:
Additive measures. Measures that can be added together across all dimensions to create a meaningful
summary. For example, in a sales order fact table, a sales amount measure can be totaled across
products, customers, or employees.
Nonadditive measures. Measures that cannot be added together across any dimension. For example, a
sales order fact table might include a measure for profit margin. However, four sales orders that have
a profit margin of 25 percent do not add up to a total profit margin of 100 percent.
Semi-additive measures. Measures that can be summed across some dimensions, but not others. For
example, a bank transactions fact table might contain an account balance measure. The account
balance measure can be added across a customer dimension to calculate the total amount of
customer money deposited, but adding the balances across a time dimension would result in a
meaningless total because the balance for the year would be calculated as the sum of the balances
for January, February, and so on.
Degenerate Dimensions
Sometimes it makes sense for a fact table to contain some dimension attributes. Typically, this is the case
for attributes that would ordinarily belong in a dimension, but no other related attributes exist (so the
dimension would have only one attribute, the business key); or instances where a dimension would have
the same cardinality as the fact (such as the line number of an invoice).
Snowflake Schemas
In most cases, a star schema in which fact tables are related to denormalized dimension tables is the optimal design for the data warehouse. However, in some cases, it can make sense to partially or completely normalize some dimension tables to create what's commonly referred to as a snowflake schema.
You should consider a snowflake schema in the following scenarios:
A subdimension can be shared between multiple dimensions. For example, a data warehouse may contain a dimension table for customers and a dimension table for stores. Both customers and stores include some dimension attributes that relate to their geographical location, such as a street address, a city, a state, a postal code, and a country. You can ensure consistency of geographic hierarchies across both dimensions by creating a separate table for the geography dimension and relating both the customer and store dimension tables to the new geography dimension table.
A sparse dimension has several different subtypes. For example, consider a dimension table for
products in an organization that sells many different kinds of products. Some products will include
an attribute such as size or color that may not be applicable for some other products. This can result
in a table that contains many null values (commonly described as being sparse). You can reduce
sparseness by creating a generic product dimension table that includes the core attributes that all
products share, and then create a type-specific dimension table for each individual kind of product,
with a relationship to the core product table.
Multiple fact tables of varying grain reference different levels in the dimension hierarchy. For example,
consider a data warehouse that includes a dimension table for salesperson. The salesperson table
may include details of the store at which the salesperson is employed. However, you may want to
create a fact table that includes measures at the individual salesperson grain, and a second fact table
that includes measures at the store grain. In this case, it makes sense to use separate tables for the
salesperson and store dimensions, with a relationship from salesperson to store to enable a reporting
hierarchy that includes both store and salesperson levels.
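For example, the shared geography subdimension described in the first scenario might be implemented as in the following Transact-SQL sketch; the table and column definitions are illustrative assumptions rather than part of the course labs.

-- Shared subdimension for geographic attributes
CREATE TABLE dbo.DimGeography
(
    GeographyKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    StreetAddress nvarchar(100) NULL,
    City nvarchar(50) NULL,
    State nvarchar(50) NULL,
    PostalCode nvarchar(20) NULL,
    Country nvarchar(50) NULL
);

-- Both the customer and store dimensions reference the geography subdimension
CREATE TABLE dbo.DimCustomer
(
    CustomerKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerName nvarchar(100) NOT NULL,
    GeographyKey int NOT NULL REFERENCES dbo.DimGeography (GeographyKey)
);

CREATE TABLE dbo.DimStore
(
    StoreKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    StoreName nvarchar(50) NOT NULL,
    GeographyKey int NOT NULL REFERENCES dbo.DimGeography (GeographyKey)
);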
Time Dimensions
Most data reporting and analysis includes a temporal aspect. For example, it is common to aggregate sales over time periods such as months, quarters, and years. To ensure consistency when comparing measures across time, most data warehouses include a time dimension table.
When you create a time dimension table, consider the following guidance:
Include temporal hierarchies. Most business reporting and analysis involves drilling down through summarized levels of data, for example, viewing sales by year, and then drilling into a specific quarter, and then drilling down into an individual month. Your time dimension table should include attributes for each hierarchy level at which your users need to summarize the measures.
Consider how you will populate the time dimension table. Unlike most other tables in a data
warehouse, time dimension tables are not usually populated with data that has been extracted from
a source system. Generally, the data warehouse developer populates the time dimension table with
rows at the appropriate granularity. These rows usually consist of a numeric primary key that is
derived from the temporal value (for example 20110101 for January 1, 2011) and a column for each
dimension attribute (such as the date, day of year, day name, month of year, month name, year,
and so on). To generate the rows for the time dimension table, you can use one of the following
techniques:
Create a Transact-SQL script. Transact-SQL includes many date and time functions that you can
use in a loop construct to generate the required attribute values for a sequence of time intervals.
The following Transact-SQL functions are commonly used to calculate date and time values:
DATEPART (datepart, date) returns the numerical part of a date, such as the weekday
number, day of month, month of year, and so on.
DATENAME (datepart, date) returns the string name of a part of the date, such as the
weekday name or month name.
MONTH (date) returns the month number of the year for a given date.
Use Microsoft Excel. Excel includes several functions that you can use to create formulas for
date and time values. You can then use the auto-fill functionality in Excel to quickly create a large
table of values for a sequence of time intervals.
Use a business intelligence (BI) tool to autogenerate a time dimension table. Some BI tools include
time dimension generation functionality that you can use to quickly create a time dimension
table.
Regardless of the technique that you use to populate the time dimension table, you must choose an
appropriate start and end point for the sequence of time intervals stored in the table. If necessary,
you must also consider how you will extend the range of time values stored in the table in the future.
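For example, the Transact-SQL script technique might use a loop similar to the following sketch. The DimDate table name matches the demonstration in this module, but the column list is an illustrative assumption.

-- Generate one row per day from 1 January 2005 to the current date
DECLARE @Date date = '20050101';
WHILE @Date <= CAST(GETDATE() AS date)
BEGIN
    INSERT INTO dbo.DimDate (DateKey, FullDate, DayName, MonthNumber, MonthName, CalendarYear)
    VALUES (
        CAST(CONVERT(nvarchar(8), @Date, 112) AS int), -- numeric key derived from the date, e.g. 20110101
        @Date,
        DATENAME(weekday, @Date),
        MONTH(@Date),
        DATENAME(month, @Date),
        YEAR(@Date)
    );
    SET @Date = DATEADD(day, 1, @Date);
END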
Demonstration: Implementing a Data Warehouse
This creates a fact table named FactSalesOrders. Note that the primary key of the fact table includes all dimension foreign keys and also the degenerate dimensions for the order number and line item number. Also note that the grain for this table is defined at the line item level.
9. In the database diagram, click each table while holding down the Ctrl key to select them all, and on the toolbar, in the Table View drop-down list, click Standard. Then arrange the tables and adjust the zoom level so that you can see the entire database schema, and then examine the tables, noting the columns that they contain.
1. Return to the CreateDW.sql query tab and select the code under the comment POPULATE THE TIME DIMENSION TABLE, and then click Execute.
This code performs a loop that generates the required dimension attributes for the DimDate table for a range of dates between 2005 and the current date.
2. When the query has completed, in Object Explorer, expand Databases, expand DemoDW, expand Tables, right-click DimDate, and then click Select Top 1000 Rows.
Note that the script has done its work and populated the table with a sequence of dates.
Lesson 2
Physical Design for a Data Warehouse
After completing this lesson, you will be able to:
Describe the best practices for the physical placement of data in a data warehouse.
Design effective indexes for a data warehouse.
Design an effective partitioning strategy for a data warehouse.
Use data compression effectively in a data warehouse.
Physical Data Placement
RAID 0 (disk striping) provides high performance, but no data redundancy, so in the event of a disk failure, the data becomes unavailable.
RAID 1 (disk mirroring) provides a full redundant copy of the data, but does not enable the query processor to read data from multiple disks simultaneously.
RAID 5 (disk striping with parity) can provide a low-cost solution that combines data redundancy with physical distribution across drives.
RAID 10 (a combination of disk striping and mirroring) provides a highly robust, high-performance solution.
In addition to distributing the data across physical disks, you should also consider allocating large fact
tables to dedicated filegroups.
Separate log files from data files. Although data warehouses have fewer logged transactions than an
OLTP database, the recognized best practice for any database is to separate log files from data files
to prevent them from competing for disk resources. You should apply this best practice to your data
warehouse.
Separate workspace objects from data. Some data warehouses include objects that are dropped
and re-created, such as staging tables or temporary workspace tables. You should allocate these
to a separate filegroup from the fact and dimension data to reduce the risk of table and index
fragmentation in the data warehouse, which can cause deterioration in query performance over time.
Preallocate space and disable autogrow. It may seem prudent to allocate only the physical space that
you need when initially creating the data warehouse and have the space allocation grow dynamically
as needed. However, this can lead to index and table fragmentation, which has a negative effect on
query performance. A better approach is to preallocate the space that you think your data warehouse
will need when it is fully populated, and disable the autogrow feature for the database files. This helps
ensure that your data is stored in contiguous blocks on the physical disk, increasing the efficiency
with which the data can be read.
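For example, preallocating space on a dedicated filegroup with autogrow disabled might look like the following Transact-SQL sketch; the database name, filegroup name, file path, and size are illustrative assumptions.

-- Add a dedicated filegroup for fact data
ALTER DATABASE DemoDW ADD FILEGROUP FactData;

-- Preallocate the expected fully populated size and disable autogrow
ALTER DATABASE DemoDW ADD FILE
(
    NAME = 'FactData1',
    FILENAME = 'D:\Data\FactData1.ndf',
    SIZE = 50GB,
    FILEGROWTH = 0
)
TO FILEGROUP FactData;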
Indexing
Dimension Table Indexes
When you are defining indexes for a dimension table, consider the following guidelines:
Create nonclustered indexes on any columns that are frequently included in a query filter.
Fact Table Indexes
Most data warehouse queries involve a time dimension, so you should generally consider creating a clustered index on the most commonly used time dimension key in the fact table. This is especially true if the fact table is partitioned by a time dimension key so that you can align index partitioning with table partitioning.
Create additional nonclustered indexes on individual dimension foreign-key columns that are frequently included in a query.
Note The SQL Server query optimizer can identify star join queries between fact and dimension tables, and use a variety of techniques to increase the performance of this specific kind of query. It is therefore important to have appropriate indexes on dimension key columns for the optimizer to choose when selecting an execution plan.
Columnstore Indexes
SQL Server 2012 supports columnstore indexes that are based on xVelocity in-memory technology.
Columnstore indexes consist of data pages that store data from each column in the index on a dedicated
set of pages. Creating a columnstore index on multiple columns in a fact table (or a large dimension table)
can significantly increase query performance.
To create a columnstore index, use the CREATE COLUMNSTORE INDEX statement as shown in the
following code example.
CREATE COLUMNSTORE INDEX cidx_FactSalesOrder ON FactSalesOrder
(CustomerKey,
SalesPersonKey,
ProductKey,
OrderDateKey,
OrderNo,
ItemNo,
Quantity,
Cost,
SalesAmount,
Shipping,
Discount)
The use of columnstore indexes can have a dramatically positive effect on performance, but you should
be aware of the following considerations:
Columnstore indexes are read-only, so you must drop and rebuild the index if you load the table with
new or updated data.
The base object for a columnstore index must be a table (you cannot use a columnstore index to
create an indexed view).
For More Information For more information on Columnstore Indexes for Fast Data
Warehouse Query Processing in SQL Server 11.0, see http://go.microsoft.com/fwlink
/?LinkID=246723.
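Because a columnstore index is read-only in SQL Server 2012, a common load pattern is to drop the index, load the new data, and then re-create the index. The following code is a minimal sketch of that pattern, based on the index created earlier in this topic.

-- Drop the read-only columnstore index before the load.
DROP INDEX cidx_FactSalesOrder ON FactSalesOrder;

-- (Load new or updated fact rows here.)

-- Re-create the columnstore index after the load completes.
CREATE COLUMNSTORE INDEX cidx_FactSalesOrder ON FactSalesOrder
(CustomerKey, SalesPersonKey, ProductKey, OrderDateKey,
 OrderNo, ItemNo, Quantity, Cost, SalesAmount, Shipping, Discount);

On a partitioned fact table, an alternative is to load new rows into a staging table, build the columnstore index on the staging table, and then switch the staging table into a partition.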
Partitioning

Partitioning is a technique that is used to split the data in a single table or index across multiple filegroups. When you create a data warehouse, you should consider partitioning large fact tables. In most cases, the table should be partitioned based on a date key field, such as an order date in a sales order fact table.

Partitioning a fact table can provide the following benefits:

Improved query performance. SQL Server 2012 supports partitioned table parallelism, enabling the query processor to read data from multiple partitions simultaneously.

Faster data loading and deletion. In data warehouses where a sliding window approach to adding and archiving data is used, you can quickly load new data into a new partition and delete the data in the oldest partition.
To create a partitioned fact table, you must first create a partition function that defines the boundaries of the partitions. For example, the following Transact-SQL code example creates a partition function that defines three partitions: one for values up to and including 20081231, one for values higher than 20081231 up to and including 20091231, and one for values higher than 20091231.

CREATE PARTITION FUNCTION pf_OrderDateKey(int)
AS RANGE LEFT
FOR VALUES (20081231, 20091231)
After creating the partition function, you need to create a partition scheme that assigns each partition to a
filegroup, as shown in the following code example.
CREATE PARTITION SCHEME ps_OrderDateKey
AS PARTITION pf_OrderDateKey
TO (fg2008,fg2009,fg2010,fg2011)
Note that although the partition function defines three partitions, four filegroups are specified. The fourth
filegroup is an optional parameter that specifies the next filegroup to be used when a new partition is
added during a split operation.
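If every filegroup in the scheme has already been used by a split, you can designate the filegroup to be used by the next split by altering the partition scheme, as shown in the following code example. The fg2012 filegroup name is an assumption for illustration.

ALTER PARTITION SCHEME ps_OrderDateKey
NEXT USED fg2012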
Now you can create the partitioned table, as shown in the following code example.
CREATE TABLE [dbo].[FactSalesOrder]
(
[CustomerKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[OrderNo] [int] NOT NULL,
[LineNo] [int] NOT NULL,
[Quantity] [smallint] NULL,
[SalesAmount] [money] NULL,
CONSTRAINT [PK_FactSalesOrder] PRIMARY KEY CLUSTERED
(
[CustomerKey],[ProductKey],[OrderDateKey],[OrderNo],[LineNo]
)
)
ON ps_OrderDateKey(OrderDateKey)
To implement a sliding window solution for adding and removing data from the fact table, you can use the Transact-SQL statements in the following code example.

-- Create a new empty partition for new data
ALTER PARTITION FUNCTION pf_OrderDateKey()
SPLIT RANGE (20101231)
GO
-- Switch the first partition in the fact table into a temporary table
-- on the same filegroup
ALTER TABLE FactSalesOrder SWITCH PARTITION 1
TO FactSalesOrderTemp PARTITION 1
GO
-- Insert the switched-out data into the archive table
INSERT INTO FactSalesOrderArchive
SELECT * FROM FactSalesOrderTemp
GO
-- Merge the now-empty partition to remove it from the partition function
ALTER PARTITION FUNCTION pf_OrderDateKey()
MERGE RANGE (20081231)
GO
-- You can now drop the temp table and remove the fg2008 filegroup
-- from which you archived the data
Data Compression

Enabling Compression
You can enable compression for a table, an index, or a partition by specifying the DATA_COMPRESSION
keyword. For example, the Transact-SQL code in the following code example creates a table on a partition
function, specifying page compression for the first partition, row compression for the second partition,
and no compression for any other partitions.
CREATE TABLE [dbo].[FactSalesOrder]
(
[CustomerKey] [int] NOT NULL,
[ProductKey] [int] NOT NULL,
[OrderDateKey] [int] NOT NULL,
[OrderNo] [int] NOT NULL,
[LineNo] [int] NOT NULL,
[Quantity] [smallint] NULL,
[SalesAmount] [money] NULL,
CONSTRAINT [PK_FactSalesOrder] PRIMARY KEY CLUSTERED
(
[CustomerKey],[ProductKey],[OrderDateKey],[OrderNo],[LineNo]
)
)
ON ps_OrderDateKey(OrderDateKey)
WITH
(
DATA_COMPRESSION = PAGE ON PARTITIONS (1),
DATA_COMPRESSION = ROW ON PARTITIONS (2)
)
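Before you enable compression, you can estimate the space that each setting would save by using the sp_estimate_data_compression_savings system stored procedure. The following call is a minimal sketch; passing NULL for the index and partition parameters evaluates all indexes and partitions.

EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo',
    @object_name = 'FactSalesOrder',
    @index_id = NULL,            -- all indexes
    @partition_number = NULL,    -- all partitions
    @data_compression = 'PAGE'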
Lab Scenario

In this lab, you will start the development of the Adventure Works data warehousing solution by creating the data warehouse itself.

An existing data warehouse contains a fact table for reseller sales, and dimension tables for resellers, employees, and products. This enables users to query the data warehouse and analyze reseller sales by reseller, sales representative, and product. However, Adventure Works also sells products directly to individual customers through an e-commerce Web site, and the company's executives would like to be able to analyze these Internet sales as well as the reseller sales. To enable this analysis, you must add a dimension table for customers and a fact table for Internet sales. The existing product dimension is a conformed dimension that can be used by both the reseller sales and Internet sales fact tables.

After you have added the required tables for Internet sales analysis, you must refactor the data warehouse by normalizing two of the star schema dimensions into snowflake dimensions. Specifically, you must:

Create a hierarchy of tables for the product dimension to separate product subcategories and categories. Making this hierarchy into a snowflake dimension helps isolate any changes to product subcategory or category records from the product dimension table.

Create a subdimension table for geographic data that is shared by the customer and reseller dimension tables.
Lab 3: Implementing a Data Warehouse Schema

Exercise 1: Implementing a Star Schema

Scenario

Adventure Works Cycles requires a data warehouse to enable information workers and executives to create reports and perform analysis of key business measures. The company has identified two sets of related measures that it wants to include in fact tables: sales order measures that relate to sales to resellers, and sales order measures that relate to Internet sales. These measures will be aggregated by product, reseller (in the case of reseller sales), and customer (in the case of Internet sales) dimensions.
The data warehouse has been partially completed, and you must now add the necessary dimension and fact tables to complete a star schema.

The main tasks for this exercise are as follows:

1. Prepare the lab environment.
2. View the AWDataWarehouse database.
3. View the database schema.
4. Create a dimension table for customers.
5. Create a fact table for Internet sales.
Start SQL Server Management Studio and connect to the localhost instance of the SQL Server
database engine by using Windows authentication.
Create a new database diagram in the AWDataWarehouse database (creating the required objects
to support database diagrams if prompted). The diagram should include all of the tables in the
database.
In the database diagram, modify the tables so that they are shown in Standard view, and arrange
them so that you can view the partially complete data warehouse schema, which should look similar
to the following diagram.
Create a new table named DimCustomer in the AWDataWarehouse database. The table should be
based on the following diagram.
Create a new table named FactInternetSales in the AWDataWarehouse database, with foreign-key references to the DimCustomer and DimProduct tables. The table should be based on the following diagram. (A minimal Transact-SQL sketch of both tables appears after this list.)
Add the tables that you have created in this exercise to the database diagram that you created.
Note that when adding tables to a diagram, you need to click Refresh in the Add table dialog
box to see tables you have created or modified since the diagram was initially created.
Keep SQL Server Management Studio open for the next exercise.
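Because the diagrams are not reproduced here, the following Transact-SQL code is only a minimal sketch of the kind of tables that this exercise requires; the column definitions are illustrative assumptions rather than the lab's exact design.

CREATE TABLE dbo.DimCustomer
(
    CustomerKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    CustomerName nvarchar(50) NOT NULL
    -- ...additional customer attribute columns from the lab diagram
);

CREATE TABLE dbo.FactInternetSales
(
    CustomerKey int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
    ProductKey int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    OrderNo int NOT NULL,
    ItemNo int NOT NULL,
    Quantity smallint NULL,
    SalesAmount money NULL,
    CONSTRAINT PK_FactInternetSales PRIMARY KEY (OrderNo, ItemNo)
);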
Results: After this exercise, you should have a database diagram in the AWDataWarehouse database
that shows a star schema that consists of two fact tables (FactResellerSales and FactInternetSales) and
four dimension tables (DimReseller, DimEmployee, DimProduct, and DimCustomer).
Exercise 2: Implementing a Snowflake Schema

Scenario
Having created a star schema, you have identified two dimensions that would benefit from being
normalized to create a snowflake schema. Specifically, you want to create a hierarchy of related tables for
product category, product subcategory, and product, and you want to create a separate geography
dimension table that can be shared between the reseller and customer dimensions.
The main tasks for this exercise are as follows:

1. Create a hierarchy of DimProduct, DimProductSubcategory, and DimProductCategory tables for the product dimension.
2. In the AWDataWarehouse database, create a new table named DimGeography, and modify the existing DimCustomer and DimReseller tables to create a shared subdimension as shown here. (A Transact-SQL sketch of this task appears after this list.)
3. Delete the tables that you modified in the previous two tasks from the AWDataWarehouse Schema diagram (DimProduct, DimReseller, and DimCustomer).
4. Add the new and modified tables that you created in this exercise to the AWDataWarehouse Schema diagram and view the revised data warehouse schema, which now includes some snowflake dimensions. You will need to refresh the list of tables when adding tables and you may be prompted to update the diagram to reflect foreign-key relationships.
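The following Transact-SQL code is a minimal sketch of the shared geography subdimension described in this exercise; the column definitions are illustrative assumptions rather than the lab's exact design.

CREATE TABLE dbo.DimGeography
(
    GeographyKey int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    City nvarchar(50) NULL,
    StateProvince nvarchar(50) NULL,
    CountryRegion nvarchar(50) NULL
);

-- Re-point the customer and reseller dimensions at the subdimension.
ALTER TABLE dbo.DimCustomer ADD GeographyKey int NULL
    REFERENCES dbo.DimGeography (GeographyKey);
ALTER TABLE dbo.DimReseller ADD GeographyKey int NULL
    REFERENCES dbo.DimGeography (GeographyKey);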
Results: After this exercise, you should have a database diagram in the AWDataWarehouse database
that shows a snowflake schema that contains a dimension consisting of a DimProduct,
DimProductSubcategory, and DimProductCategory hierarchy of tables, and a DimGeography
dimension table that is referenced by the DimCustomer and DimReseller dimension tables.
Exercise 3: Implementing a Time Dimension Table

Scenario
The schema for the Adventure Works data warehouse now contains two fact tables and several dimension
tables. However, users need to be able to analyze the measures in the fact table across consistent time
periods. To enable this, you must create a time dimension table.
Users will need to be able to aggregate measures across calendar years (which run from January to
December) and fiscal years (which run from July to June). Your time dimension must include the following
attributes:
Day number of week (for example 1 for Sunday, 2 for Monday, and so on)
Day name of week (for example Sunday, Monday, Tuesday, and so on)
Month number of year (for example, 1 for January, 2 for February, and so on)
Calendar quarter (for example, 1 for dates in January, February, and March)
Calendar year
Calendar semester (for example, 1 for dates between January and June)
Fiscal quarter (for example, 1 for dates in July, August, and September)
Fiscal year
Fiscal semester (for example, 1 for dates between July and December)
The main tasks for this exercise are as follows:

1. In the AWDataWarehouse database, create a new table named DimDate that contains the required time dimension attributes.
2. Add foreign-key columns in the FactInternetSales and FactResellerSales tables to relate sales order dates and sales ship dates to the DimDate table. Leave the existing OrderDate and ShipDate columns in the fact tables as degenerate dimensions.
3. Create nonclustered indexes on the foreign-key columns that you have added to the fact tables. If preferred, you can use the DimDate.sql Transact-SQL query file in D:\10777A\Labfiles\Lab03\Starter to create the time dimension table and modify the fact tables.
4. Delete the tables that you modified in the previous task from the AWDataWarehouse Schema diagram (FactInternetSales and FactResellerSales).
5. Add the new and modified tables that you created in this exercise to the AWDataWarehouse Schema diagram and view the revised data warehouse schema, which now includes a time dimension.
6. Populate the table with appropriate values for a date range spanning from January 1, 2000 to the current date. You can create a Transact-SQL script to do this (a minimal sketch follows this list), or you can use Excel if you wish.
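The following Transact-SQL script is a minimal sketch of one way to populate the table; it assumes a DimDate table whose columns match the attributes listed earlier in this exercise, so the column names are illustrative.

-- 1 = Sunday, matching the day-numbering convention described above.
SET DATEFIRST 7;

DECLARE @d date = '20000101';
WHILE @d <= CAST(GETDATE() AS date)
BEGIN
    INSERT INTO dbo.DimDate (DateKey, FullDate, DayNumberOfWeek, DayNameOfWeek,
        MonthNumberOfYear, CalendarQuarter, CalendarYear, CalendarSemester,
        FiscalQuarter, FiscalYear, FiscalSemester)
    SELECT CONVERT(int, CONVERT(char(8), @d, 112)),  -- surrogate key in yyyymmdd form
           @d,
           DATEPART(weekday, @d),
           DATENAME(weekday, @d),
           MONTH(@d),
           DATEPART(quarter, @d),
           YEAR(@d),
           CASE WHEN MONTH(@d) <= 6 THEN 1 ELSE 2 END,
           CASE WHEN MONTH(@d) BETWEEN 7 AND 9 THEN 1
                WHEN MONTH(@d) BETWEEN 10 AND 12 THEN 2
                WHEN MONTH(@d) BETWEEN 1 AND 3 THEN 3
                ELSE 4 END,                           -- fiscal year runs July to June
           CASE WHEN MONTH(@d) >= 7 THEN YEAR(@d) + 1 ELSE YEAR(@d) END,
           CASE WHEN MONTH(@d) >= 7 THEN 1 ELSE 2 END;
    SET @d = DATEADD(day, 1, @d);
END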
Results: After this exercise, you should have a database that contains a DimDate dimension table that is
populated with date values from January 1, 2000 to the current date.
Module Review and Takeaways

Review Questions
1. Why should you favor a star schema that has denormalized dimension tables over a snowflake schema where the dimensions are implemented as multiple related tables?
2. What is the grain of a fact table, and why is it important?
3. If your data warehouse includes staging tables, why should you allocate them to a different filegroup than the filegroup for data tables?
Module 4
Creating an ETL Solution with SSIS
Contents:
Lesson 1: Introduction to ETL with SSIS 4-3
Lesson 2: Exploring Source Data 4-10
Lesson 3: Implementing Data Flow 4-21
Lab 4: Implementing Data Flow in an SSIS Package 4-38
Module Overview
Lesson 1
Introduction to ETL with SSIS

While you can implement an ETL solution by using several tools and technologies, SSIS is the primary ETL tool for SQL Server. Before you use it to implement an ETL solution, it is important to understand some of its key features and components.

This lesson describes options for implementing an ETL solution, and then introduces SSIS.
After completing this lesson, you will be able to:

Describe the options for ETL.
Describe the key features of SSIS.
Describe the high-level architecture of an SSIS project.
Options for ETL
The Import and Export Data Wizard. This wizard is included with the SQL Server management tools, and provides a simple way to create an SSIS-based data transfer solution. You should consider using the Import and Export Data Wizard when your ETL solution requires only a few, simple data transfers that do not include any complex transformations in the data flow.
What Is SSIS?
SSIS is an extensible platform for building complex ETL solutions. SSIS is included with SQL Server and consists of a Windows service that manages the execution of ETL workflows, and several tools and components for developing those workflows. The SSIS service is installed when you select Integration Services on the Feature Selection page of the SQL Server setup wizard.
Note After you have installed SSIS, you can use the DCOM Configuration tool (Dcomcnfg.exe) to grant permission to specific users to use the SQL Server Integration Services 11.0 service.
The SSIS Windows service is primarily a control flow engine that manages the execution of task workflows. Task workflows are defined in packages, which you can execute on demand or at scheduled times. When you are developing an SSIS package, the task workflow is referred to as the control flow of the package.
SSIS Designer. A graphical design interface for developing SSIS solutions in the Microsoft Visual Studio development environment. Typically, you start the SQL Server Data Tools application to access this environment.
An SSIS solution usually consists of one or more SSIS projects, each containing one or more SSIS packages.

SSIS Projects

SSIS Packages

The SSIS Design Environment
You can use SQL Server Data Tools to develop SSIS projects and packages. SQL Server Data Tools is based on Microsoft Visual Studio and provides a graphical development environment for business intelligence (BI) solutions. When you create an Integration Services project, the design environment includes the following elements:
Solution Explorer. A pane in the SQL Server Data Tools user interface where you can create and view project-level resources, including parameters, packages, data connection managers, and other shared objects. A solution can contain multiple projects, in which case each project is shown in Solution Explorer.
The Variables pane. A list of variables used in a package. You can display this pane by clicking the
Variables button at the upper right of the design surface.
The SSIS Toolbox pane. A collection of components that you can add to a package control flow or
data flow. You can display this pane by clicking the SSIS Toolbox button at the upper right of the
design surface or by clicking SSIS Toolbox on the SSIS menu. Note that this pane is distinct from the
standard Visual Studio Toolbox pane.
Upgrading from Previous Versions
There is no direct upgrade path for DTS packages to SQL Server 2012 SSIS packages, and you cannot run a DTS package in the SQL Server 2012 SSIS runtime engine. To upgrade a DTS-based solution to work with SQL Server 2012, you must re-create the solution by using the latest SSIS tools and components, or use the SSIS Package Migration Wizard in SQL Server 2005 or 2008 to perform an interim upgrade of the DTS package to SQL Server 2005 or 2008 format, and then upgrade to the SQL Server 2012 format.
You can run SSIS packages that were built by using SQL Server 2005 or SQL Server 2008 in the SQL Server 2012 SSIS runtime engine by using the DTExec tool. However, you will not be able to take advantage of the new project-level deployment to the SSIS catalog capabilities of SQL Server 2012. To upgrade SSIS packages that were built by using SQL Server 2005 or SQL Server 2008 to the SQL Server 2012 format, use the SSIS Package Migration Wizard in SQL Server 2012.
Scripts

SSIS packages can include script tasks to perform custom actions. In previous releases of SSIS, you could implement scripted actions by including a Microsoft ActiveX Script task (written in Microsoft Visual Basic Scripting Edition (VBScript)) or a Script task (written for the .NET Visual Studio for Applications, or VSA, runtime) in a control flow. In SQL Server 2012, the ActiveX Script task is no longer supported, and any VBScript-based custom logic must be replaced. In addition, the SQL Server 2012 Script task uses the Visual Studio Tools for Applications (VSTA) runtime, which differs in some details from the VSA runtime that was used in previous releases. When you use the SSIS Package Migration Wizard to upgrade a package that includes a Script component, the script is automatically updated for the VSTA runtime.
Lesson 2
Exploring Source Data
Now that you understand the basic architecture of SSIS, you can start to plan the data flows in your ETL solution. However, before you start implementing an ETL process, you should explore the existing data in the sources that your solution will use. By gaining a thorough knowledge of the source data on which your ETL solution will be based, you can design the most effective SSIS data flows for transferring the data and anticipate data quality issues that you may need to resolve in your SSIS packages.

This lesson discusses the value of exploring source data, and describes techniques for examining and profiling source data.
After completing this lesson, you will be able to:

Examine an extract of data from a data source.
Profile source data by using the Data Profiling SSIS task.
Why Explore Source Data?
The design and integrity of your data warehouse ultimately rely on the data that it contains. Before you can design an appropriate ETL process to populate the data warehouse, you must have a thorough knowledge of the source data that your solution will consume.

Specifically, you need to understand:

How to interpret data values and codes. For example, does a value of 1 in an InStock column in a Products table mean that the company has a single unit in stock, or does 1 simply indicate the value true, meaning that there is an unspecified quantity of units in stock?

The relationships between business entities, and how those relationships are modeled in the data sources.
In addition to understanding the data modeling of the business entities, you also need to examine source data to help identify:

Column data types and lengths for specific attributes that will be included in data flows. For example, what maximum lengths exist for string values? What formats are used to indicate date, time, and numeric values?

Data volume and sparseness. For example, how many rows of sales transactions are typically recorded in a single trading day? Are there any attributes that frequently contain null values?
Examining Source Data

You can explore source data by using several tools and techniques. The following list describes some of the approaches that you can use to extract data to examine:
Creating an SSIS package with a data flow that extracts a sampling of data or a row count for a specific data source.
After extracting the sample data, you need to examine it. One of the most effective ways to do this is to extract the sample data in a format that you can open in Microsoft Excel, such as comma-delimited text, and then use the functionality of Excel to explore the data. Using Excel, you can:

Apply column filters to help identify the range of values used in particular columns.

Search the data for specific string values.
Demonstration: Exploring Source Data

Task 1: Use the Import and Export Data Wizard to extract a sample of data
1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
2. In the D:\10777A\Demofiles\Mod04 folder, right-click Setup.cmd, and then click Run as administrator.
3. When you are prompted to confirm, click Yes, and then wait for the batch file to complete.
4. Start the Import and Export Data Wizard.
5. On the welcome page, click Next.
6. On the Choose a Data Source page, select the following options, and then click Next:
Authentication: Use Windows Authentication
Database: ResellerSales
7.
On the Choose a Destination page, select the following options, and then click Next:
Unicode: Unselected
Format: Delimited
Note The text qualifier is used to enclose text values in the exported data. This is required
because some European address formats include a comma, and these must be distinguished
from the commas that are used to separate each column value in the exported text file.
8.
On the Specify Table Copy or Query page, select Write a query to specify the data to transfer,
and then click Next.
9.
On the Provide a Source Query page, enter the following Transact-SQL code, and then click Next.
SELECT TOP 500 * FROM Resellers
10. On the Configure Flat File Destination page, select the following options, and then click Next:
11. On the Save and Run Package page, select only Run immediately, and then click Next.
12. On the Complete the Wizard page, click Finish.
13. When the data extraction has completed successfully, click Close.
1. Start Excel, and then open the comma-delimited file that you exported in the previous task.
2. Click any cell that contains data, on the Home tab of the ribbon, click Format as Table, and then select a table style for the data.
3.
In the Format As Table dialog box that is displayed, ensure that the range of cells that contain the
data is selected and that the My table has headers check box is selected, and then click OK.
4.
Adjust the column widths so that you can see all of the data.
5.
View the drop-down filter list for the CountryRegionCode column, and note the range of values in
this column. Then select only FR, and then click OK.
6.
Note that the table is filtered to show only the resellers in France. Note also that many of the
addresses include a comma. If no text qualifier had been selected in the Import and Export Data
Wizard, these commas would have created additional columns in these rows, making the data difficult
to examine as a table.
7. Clear the filter on the CountryRegionCode column.
8. In an empty cell, use the MIN function to create a formula that returns the lowest value in the column that contains the year in which each store was opened.
9. Note that this formula shows the earliest year that a store in this sample data was opened.
Profiling Source Data
In addition to examining samples of source data, you can use the Data Profiling task in an SSIS package to obtain statistics about the data. This can help you understand the structure of the data that you will extract and identify columns where null or missing values are likely. Profiling your source data can help you plan effective data flows for your ETL process.
You can specify multiple profile requests in a single instance of the Data Profiling task. The following kinds of profile request are available:
Column Length Distribution reports the range of lengths for string values in a column.

Column Value Distribution reports the groupings of distinct values in a column.

Functional Dependency determines whether the value of a column is dependent on the value of other columns in the same table.
Use the following procedure to collect and view data profile statistics:

1. Create an SSIS package.
2. Add an ADO.NET connection manager for each data source that you want to profile.
3. Add the Data Profiling task to the control flow of the package.
4. Edit the Data Profiling task to specify the profile requests and the file or variable to which the resulting profile statistics should be written.
5. Run the package to generate the profile statistics.
6. View the results in the Data Profile Viewer.
Demonstration: Using the Data Profiling Task
1. Prepare the demonstration environment:
a. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
b. In the D:\10777A\Demofiles\Mod04 folder, right-click Setup.cmd, and then click Run as administrator.
c. When you are prompted to confirm, click Yes, and then wait for the batch file to complete.
2. Start SQL Server Data Tools, and then create a new Integration Services project.
3. In the Solution Explorer pane, create a new ADO.NET connection manager with the following settings:
Server name: localhost
Log on to the server: Use Windows Authentication
Select or enter a database name: ResellerSales
4. In the SSIS Toolbox pane, in the Common section, double-click Data Profiling Task to add it to the Control Flow surface. (Alternatively, you can drag the task icon to the Control Flow surface.)

Note If the SSIS Toolbox pane is not visible, on the SSIS menu, click SSIS Toolbox.

5. Double-click the Data Profiling task to open the Data Profiling Task Editor dialog box.
6. On the General tab, in the Destination property drop-down list, select New File connection.
7.
In the File Connection Manager Editor dialog box, in the Usage type drop-down list, click Create
file.
8.
In the File box, type D:\10777A\Demofiles\Mod04\Reseller Sales Data Profile.xml, and then
click OK.
9.
In the Data Profiling Task Editor dialog box, on the General tab, set OverwriteDestination to
True.
10. In the Data Profiling Task Editor dialog box, on the Profile Requests tab, in the Profile Type drop-down list, select Column Statistics Profile Request, and then click the RequestID column.
11. In the Request Properties pane, set the following property values. Do not click OK when finished:
ConnectionManager: localhost.ResellerSales
TableOrView: [dbo].[SalesOrderHeader]
Column: OrderDate
12. In the row under the Column Statistics Profile Request, add a Column Length Distribution
Profile Request profile type with the following settings:
ConnectionManager: localhost.ResellerSales
TableOrView: [dbo].[Resellers]
Column: AddressLine1
13. Add a Column Null Ratio Profile Request profile type with the following settings:
ConnectionManager: localhost.ResellerSales
TableOrView: [dbo].[Resellers]
Column: AddressLine2
14. Add a Value Inclusion Profile Request profile type with the following settings:
ConnectionManager: localhost.ResellerSales
SubsetTableOrView: [dbo].[SalesOrderHeader]
SupersetTableOrView: [dbo].[PaymentTypes]
InclusionColumns:
Subset side Columns: PaymentType
Superset side Columns: PaymentTypeKey
InclusionThresholdSetting: None
SupersetColumnsKeyThresholdSetting: None
MaxNumberOfViolations: 100
15. In the Data Profiling Task Editor dialog box, click OK.
16. On the Debug menu, click Start Debugging.
1.
When the Data Profiling task has completed, with the package still running, double-click the Data
Profiling task, and then click Open Profile Viewer.
2.
Maximize the Data Profile Viewer window and under the [dbo].[SalesOrderHeader] table, click
Column Statistics Profiles. Then review the minimum and maximum values for the OrderDate
column.
3.
Under the [dbo].[Resellers] table, click Column Length Distribution Profiles and click the
AddressLine1 column to view the statistics. Click the bar chart for any of the column lengths, and
then click the Drill Down button (at the right-edge of the title area for the middle pane) to view the
source data that matches the selected column length.
4.
Close the Data Profile Viewer window, and then in the Data Profiling Task Editor dialog box, click
Cancel.
5.
On the Debug menu, click Stop Debugging, and then close SQL Server Data Tools, saving your
changes if you are prompted.
6.
Click Start, click All Programs, click Microsoft SQL Server 2012, click Integration Services, and
then click Data Profile Viewer to start the stand-alone Data Profile Viewer tool.
7.
Click Open, and open Reseller Sales Data Profile.xml in the D:\10777A\Demofiles\Mod04 folder.
8.
Under the [dbo].[Resellers] table, click Column Null Ratio Profiles and view the null statistics for
the AddressLine2 column. Select the AddressLine2 column, and then click the Drill Down button to
view the source data.
9.
Under the [dbo].[SalesOrderHeader] table, click Inclusion Profiles and review the inclusion
statistics for the PaymentType column. Select the inclusion violation for the payment type value of 0,
and then click the Drill Down button to view the source data.
Note The PaymentTypes table includes two payment types, using the value 1 for invoice-based payments and 2 for credit account payments. The Data Profiling task has revealed that for some sales, the value 0 is used, which may indicate an invalid data entry or may be used to indicate some other kind of payment that does not exist in the PaymentTypes table.
Lesson 3
Implementing Data Flow
After you have thoroughly explored the data sources for your data warehousing solution, you can start to implement an ETL process by using SSIS. This ETL process will consist of one or more SSIS packages, with each package containing one or more Data Flow tasks. Data flow is at the core of any SSIS-based ETL solution, so it is important to understand how you can use the components of an SSIS data flow pipeline to extract, transform, and load data.

This lesson describes the various components that are used to implement a data flow, and provides some guidance for optimizing data flow performance.
After completing this lesson, you will be able to:

Create a connection manager.
Add a Data Flow task to a package control flow.
Add a destination to a data flow.
Add transformations to a data flow.
Connection Managers
To extract or load data, an SSIS package must be able to connect to the data source or destination. In an SSIS solution, you define data connections by creating a connection manager for each data source or destination that is used in the workflow. A connection manager encapsulates the following information, which is used to make a connection to the data source or destination:

The connection string used to locate the data source. For a relational database, the connection string includes the network name of the database server and the name of the database. For a file, the file name and path must be specified.

The credentials used to access the data source.

You can create connection managers at the project level or the package level:
Package-level connection managers exist only within the package in which they are defined. Both
project-level and package-level connection managers that are used by a package are shown in its
Connection Managers pane in the SSIS Designer.
To create a package-level connection manager, right-click in the Connection Managers pane and
choose the kind of connection manager that you want to create. Alternatively, create a new
connection manager in the Properties dialog box of a task, source, destination, or transformation.
Note When you create a new connection manager, SQL Server Data Tools enables you to select connection details that you have created previously, even if the connection managers that relate to those connection details do not exist in the current project or have been deleted.
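For example, an OLE DB connection manager for a SQL Server database that uses Windows authentication encapsulates a connection string similar to the following. The provider, server, and database names shown here are illustrative assumptions.

Provider=SQLNCLI11.1;Data Source=localhost;Initial Catalog=ResellerSales;Integrated Security=SSPI;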
A package defines a control flow for actions that the SSIS runtime engine is to perform. A package control flow can contain several different tasks, and include complex branching and iteration, but the core task in any ETL control flow is the Data Flow task.

To include a data flow in a package control flow, drag the Data Flow task from the SSIS Toolbox pane and drop it on the Control Flow surface. Alternatively, you can double-click the Data Flow task icon in the SSIS Toolbox pane and the task will be added to the design surface. After you have added a Data Flow task to the control flow, you can rename it and set its properties in the Properties pane.
Note This module focuses on the Data Flow task. Other control flow tasks will be discussed in detail in Module 5, Implementing Control Flow in an SSIS Package.
To define the pipeline for the Data Flow task, double-click the task. SSIS Designer will display a design surface onto which you can add data flow components. Alternatively, you can click the Data Flow tab in SSIS Designer and then select the Data Flow task that you want to edit in the drop-down list that is displayed at the top of the design surface.
Data Sources

The starting point for a data flow is a data source. A data source definition includes:
The columns that are included in the output from the data source and passed to the next component in the data flow pipeline.
The following table describes the kinds of data source that SSIS supports.
Databases
ADO.NET. Any database for which an ADO.NET data provider is installed.
OLE DB. Any database for which an OLE DB provider is installed.
CDC Source. A table enabled for change data capture (CDC), from which changed rows are extracted.

Files
Excel. A Microsoft Excel workbook.
Flat file. A text file.
XML. An XML file.
Raw file. An SSIS-specific binary format file.

Other sources
Script component. A custom source implemented by using script.
Custom. A custom-developed source component.
In addition to the sources listed in the table, you can download the following additional sources from the
Microsoft Web site.
Oracle Source
SAP BI Source
Teradata Source
To add a data source for SQL Server, Excel, a flat file, or Oracle to a data flow, drag the Source Assistant
icon from the Favorites section of the SSIS Toolbox pane to the design surface and use the wizard to
select or create a connection manager for the source. For other data sources, drag the appropriate icon
from the Other Sources section of the SSIS Toolbox pane to the design surface and then double-click the
data source on the design surface to define the connection, data, and output columns for the data source.
By default, the output from a data source is represented as an arrow at the bottom of the data source
icon on the design surface. To create a data flow, you simply drag this arrow and connect it to the next
component in the data flow pipeline, which could be a destination or a transformation.
Data Destinations

A destination definition includes:

A connection manager for the data store where the data is to be inserted.

The table or view into which the data must be inserted (where supported).
The following table describes the kinds of destination that SSIS supports.
Databases
ADO.NET. Any database for which an ADO.NET data provider is installed.
OLE DB. Any database for which an OLE DB provider is installed.
SQL Server. A SQL Server database.
SQL Server Compact. A SQL Server Compact database.

Files
Excel. A Microsoft Excel workbook.
Flat file. A text file.
Raw file. An SSIS-specific binary format file.
SQL Server Analysis Services
Dimension processing. Loads data into an Analysis Services dimension for processing.
Partition processing. Loads data into an Analysis Services partition for processing.

Rowsets
DataReader. An in-memory ADO.NET DataReader rowset.
Recordset. An in-memory ADO recordset.

Other destinations
Script component. A custom destination implemented by using script.
Custom. A custom-developed destination component.
To add a SQL Server, Excel, or Oracle destination to a data flow, drag the Destination Assistant icon from
the Favorites section of the SSIS Toolbox pane to the design surface and use the wizard to select or
create a connection manager for the destination. For other kinds of destination, drag the appropriate icon
from the Other Destinations section of the SSIS Toolbox pane to the design surface.
After you have added a destination to the data flow, connect the output from the previous component in
the data flow to the destination, double-click the destination, and then edit it to define:
The connection manager and destination table (if relevant) to be used when loading the data.
The column mappings between the input columns and the columns in the destination.
Data Transformations
Data transformations enable you to perform operations on rows of data as they pass through the data flow pipeline. Transformations have both inputs and outputs.

The following table lists the transformations that SSIS includes.
Row transformations: update column values or create new columns for each row in the data flow

Character Map. Applies string functions to column values, such as conversion from lowercase to uppercase.
Copy Column. Creates a copy of a column and adds it to the data flow.
Data Conversion. Converts data of one type to another, for example, numerical values to strings.
Derived Column. Adds a new column based on an expression. For example, you could use an expression to multiply a Quantity column by a UnitPrice column to create a new TotalPrice column.
Export Column. Writes data from a column in the data flow to a file.
Import Column. Reads data from a file and adds it as a column in the data flow.
OLE DB Command. Runs a Structured Query Language (SQL) command for each row in the data flow.
Rowset transformations: create new rowsets
Aggregate
Sort
Percentage Sampling
Row Sampling
Pivot
Unpivot
Conditional Split
Multicast
Union All
Merge
Merge Join
Joins two sorted inputs to create a single output based on a FULL, LEFT, or
INNER join operation.
Lookup
Looks up columns in a data source by matching key values in the input, and
creates an output for matched rows and a second output for rows with no
matching value in the lookup data source.
Cache. Writes data from the data flow to a cache for use by a Lookup transformation.
CDC Splitter
Splits inserts, updates, and deletes from a CDC source into separate data flows.
CDC is discussed in Module 7: Implementing an Incremental ETL Process.
RowCount
Counts the rows in the data flow and writes the result to a variable.
BI transformations: perform BI tasks

Slowly Changing Dimension
Fuzzy Grouping
Fuzzy Lookup
Term Extraction
Term Lookup
Data Mining Query. Runs a data mining prediction query against the input to predict unknown column values.
Data Cleansing. Applies a Data Quality Services knowledge base to data as it flows through the pipeline.
Custom Component
To add a transformation to a workflow, drag the transformation from the Common or Other Transforms
section of the SSIS Toolbox pane to the design surface, and then connect the required inputs to the
transformation. Double-click the transformation to configure the specific operation that it will perform,
and then define the columns that will be included in the outputs from the transformation.
For More Information For more information about Integration Services Transformations,
see http://go.microsoft.com/fwlink/?LinkID=246724.
Optimizing Data Flow Performance
Optimize queries. Select only the rows and columns that you need to reduce the overall volume of data in the data flow (see the example after this list).

Configure Data Flow task properties. Use the following properties of the Data Flow task to optimize performance:

DefaultBufferSize and DefaultBufferMaxRows. Configuring the size of the buffers that the data flow uses can significantly improve performance. When there is sufficient memory available, you should try to achieve a small number of large buffers without incurring any disk paging. The default values for these properties are 10 MB and 10,000 rows respectively.

EngineThreads. Setting the number of threads available to the Data Flow task can improve execution performance, particularly in packages where the MaxConcurrentExecutables property has been set to enable parallel execution of the package's tasks across multiple processors.

RunInOptimizedMode. Setting a Data Flow task to run in optimized mode increases performance by removing any columns or components that are not required further downstream in the data flow.
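For example, rather than extracting every column with SELECT *, a source query should name only the columns that the data flow uses and filter out rows that the load does not need. The table and column names in this sketch are illustrative assumptions.

SELECT SalesOrderID, OrderDate, CustomerID, TotalDue
FROM SalesOrderHeader
WHERE OrderDate >= '20120101'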
Demonstration: Implementing a Data Flow

Task 1: Configure a data source

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd. In the D:\10777A\Demofiles\Mod04 folder, right-click Setup.cmd, and then click Run as administrator. When you are prompted to confirm, click Yes, and then wait for the batch file to complete.
2. Use SQL Server Management Studio to connect to the localhost database engine instance and view the contents of the Product, ProductSubcategory, and ProductCategory tables in the Products database, and the DimProduct table in the DemoDW database (which should be empty).
3. Start SQL Server Data Tools, and then create a new Integration Services project.
4. In Solution Explorer, create a new OLE DB connection manager.
5. Configure the connection manager with the following settings:
Server name: localhost
Log on to the server: Use Windows Authentication
Select or enter a database name: Products
6.
In the SSIS Toolbox pane, in the Favorites section, double-click Data Flow Task to add it to the
Control Flow surface. (Alternatively, you can drag the task icon to the Control Flow surface.)
Note If the SSIS Toolbox pane is not visible, on the SSIS menu, click SSIS Toolbox.
7.
Rename the Data Flow task to Extract Products, and then double-click it to switch to the Data Flow
tab.
8.
In the SSIS Toolbox pane, in the Favorites section, double-click Source Assistant to add a source to
the Data Flow surface. (Alternatively, you can drag the Source Assistant icon to the Data Flow
surface.)
9.
In the Source Assistant - Add New Source dialog box, in the list of types, click SQL Server; in the
list of connection managers, click localhost.Products; and then click OK.
10. Rename the OLE DB source Products, and then double-click it to edit its settings.
11. In the OLE DB Source Editor dialog box, on the Connection Manager tab, view the list of available
tables and views in the drop-down list.
12. Change the data access mode to SQL Command, and then enter the Transact-SQL in the following
code example.
SELECT ProductKey, ProductName FROM Product
13. Click Build Query to open the Query Builder dialog box.
14. In the Product table, select the ProductSubcategoryKey, StandardCost, and ListPrice columns, and
then click OK.
15. In the OLE DB Source Editor dialog box, click Preview to see a preview of the data, and then click
Close to close the preview.
16. In the OLE DB Source Editor dialog box, click the Columns tab, view the list of external columns that
the query has returned and the output columns that the data source has generated, and then click
OK.
1. In the SSIS Toolbox pane, in the Common section, double-click Derived Column to add a Derived Column transformation to the Data Flow surface, and then position it under the Products source. (Alternatively, you can drag the Derived Column icon to the Data Flow surface.)
2. Rename the Derived Column transformation Calculate Profit.
3. Select the Products source, and then drag the output arrow to the Derived Column transformation.
4.
Double-click the Derived Column transformation to edit its settings, and then in the Derived
Column Name box, type Profit.
5.
Ensure that <add as new column> is selected in the Derived Column box.
6.
Expand the Column folder, and then drag the ListPrice column to the Expression box.
7.
In the Expression box, after [ListPrice], type a minus sign (-), and then drag the StandardCost
column to the Expression box to create the following expression.
[ListPrice]-[StandardCost]
8.
Click the Data Type box, ensure that it is set to currency [DT_CY], and then click OK.
1. In the SSIS Toolbox pane, in the Common section, double-click Lookup to add a Lookup transformation to the Data Flow surface, and then position it under the Calculate Profit transformation. (Alternatively, you can drag the Lookup icon to the Data Flow surface.)
2. Rename the Lookup transformation Lookup Category.
3. Select the Calculate Profit transformation, and then drag the output arrow to the Lookup Category transformation.
4. Double-click the Lookup Category transformation to edit its settings.
5. In the Lookup Transformation Editor dialog box, on the General tab, in the Specify how to handle rows with no matching entries list, select Redirect rows to no match output.
6. In the Lookup Transformation Editor dialog box, on the Connection tab, ensure that the localhost.Products connection manager is selected, and then click Use results of an SQL query.
7. Enter a query that returns the product subcategory and category data, including the ProductSubcategoryKey column.
8. Click Preview to view the product category data, note that it includes a ProductSubcategoryKey column, and then click Close to close the preview.
9.
In the Lookup Transformation Editor dialog box, on the Columns tab, in the Available Input
Columns list, drag ProductSubcategoryKey to ProductSubCategoryKey in the Available Lookup
Columns list.
10. Select the ProductSubcategoryName and ProductCategoryName columns to add them as new
columns to the data flow, and then click OK.
1. In Solution Explorer, create a new OLE DB connection manager with the following settings:
Server name: localhost
Log on to the server: Use Windows Authentication
Select or enter a database name: DemoDW
2.
In the SSIS Toolbox pane, in the Favorites section, double-click Destination Assistant to add a
destination transformation to the Data Flow surface, and then position it under the Lookup Category
transformation. (Alternatively, you can drag the Destination Assistant icon to the Data Flow surface.)
3.
In the Destination Assistant - Add New Destination dialog box, in the list of types, click SQL
Server; in the list of connection managers list, click localhost.DemoDW; and then click OK.
4. Rename the OLE DB destination DemoDW.
5.
Select the Lookup Category transformation, and then drag the output arrow to the DemoDW
destination.
6.
In the Input Output Selection dialog box, in the Output list, click Lookup Match Output, and then
click OK.
7.
Double-click the DemoDW destination to edit its settings, and then in the Name of the table or the
view list, click [dbo].[DimProduct].
8.
In the OLE DB Destination Editor dialog box, on the Mappings tab, note that input columns are
automatically mapped to destination columns with the same name.
9.
In the Available Input Columns list, drag the ProductKey column to the ProductAltKey column in
the Available Destination Columns list, and then click OK.
10. In the SSIS Toolbox pane, in the Other Destinations section, double-click Flat File Destination to
add a destination transformation to the Data Flow surface, and then position it to the right of the
Lookup Category transformation. (Alternatively, you can drag the Flat File Destination icon to the
Data Flow surface.)
11. Rename the flat file destination Uncategorized Products.
12. Select the Lookup Category transformation, and then drag the output arrow to the Uncategorized
Products destination. The Lookup No Match Output output is automatically selected.
13. Double-click the Uncategorized Products destination to edit its settings, and then click New to
create a new flat file connection manager for delimited text values.
14. Name the new connection manager Unmatched Products and specify the file name
D:\10777A\Demofiles\Mod04\UnmatchedProducts.csv.
15. In the Flat File Destination Editor dialog box, click the Mappings tab, note that the input columns
are mapped to destination columns with the same names, and then click OK.
16. Right-click the Data Flow design surface, and then click Execute Task. Observe the data flow as it runs, noting the number of rows transferred along each path.
17. When the data flow has completed, on the Debug menu, click Stop Debugging.
18. Close Visual Studio, saving your changes if you are prompted.
19. In Excel, open the UnmatchedProducts.csv flat file and note that there were no unmatched products.
20. Use SQL Server Management Studio to view the contents of the DimProduct table in the DemoDW
database, and note that the product data has been transferred.
Lab Scenario
In this lab, you will start the development of the ETL solution for the Adventure Works data warehouse. Initially, you will focus on the core task of the ETL solution, which is to extract data from data sources so that it can be loaded into the data warehouse.

In this lab, you will focus on the extraction of customer and sales order data from the InternetSales database used by the company's e-commerce site, which you must load into the Staging database. This database contains customer data (in a table named Customers), and sales order data (in tables named SalesOrderHeader and SalesOrderDetail). You will extract sales order data at the line item level of granularity, and the total sales amount for each sales order line item must be calculated by multiplying the unit price of the product purchased by the quantity ordered. Additionally, the sales order data includes only the ID of the product purchased, so your data flow must look up the details of each product in a separate Products database.
Lab 4: Implementing Data Flow in an SSIS Package

Exercise 1: Exploring Source Data

Scenario
You have designed a data warehouse schema for Adventure Works Cycles, and now you must design an ETL process to populate it with data from various source systems. Before creating the ETL solution, you have decided to examine the source data so that you can understand it better.

Specifically, you want to determine:

The earliest and latest dates on which sales orders were placed.

The range of lengths for customer address values.

The proportion of null values for the second line of a customer's address.

If the sales orders data includes orders with a payment type code that is not present in the table of known payment types.
The main tasks for this exercise are as follows:

1. Prepare the lab environment.
2. Extract and examine a sample of source data.
3. Profile source data.
Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
Run the Setup Windows Command Script file (Setup.cmd) in the D:\10777A\Labfiles\Lab04\Starter
folder as Administrator.
Use the Import and Export Data Wizard to extract a sample of customer data from the InternetSales
database on the localhost instance of SQL Server to a comma-delimited flat file.
Your sample should consist of the first 1,000 records in the Customers table.
You should use a text qualifier because some string values in the table may contain commas.
After you have extracted the sample data, use Excel to view it.
Note You may observe some anomalies in the data, such as invalid gender codes and
multiple values for the same country or region. The purpose of examining the source data is
to identify as many of these problems as possible, so that you can resolve them in the
development of the extract, transform, and load (ETL) solution. You will address the
problems in this data in later labs.
Add an ADO.NET connection manager that uses Windows authentication to connect to the
InternetSales database on the localhost instance of SQL Server.
Use a Data Profiling task to generate the following profile requests for data in the InternetSales
database:
Column statistics for the OrderDate column in the SalesOrderHeader table. You will use this
data to find the earliest and latest dates on which orders have been placed.
Column length distribution for the AddressLine1 column in the Customers table. You will use
this data to determine the appropriate column length to allow for address data.
Column null ratio for the AddressLine2 column in the Customers table. You will use this data to
determine how often the second line of an address is null.
Value inclusion for matches between the PaymentType column in the SalesOrderHeader table
and the PaymentTypeKey column in the PaymentTypes table. Do not apply an inclusion
threshold and set a maximum limit of 100 violations. You will use this data to find out if any
orders have payment types that are not present in the table of known payment types.
View the report that the Data Profiling task generates in the Data Profile Viewer.
Results: After this exercise, you should have a comma-separated text file that contains a sample of
customer data, and a data profile report that shows statistics for data in the InternetSales database.
Now that you have explored the source data in the InternetSales database, you are ready to start
implementing data flows for the ETL process. A colleague has already implemented data flows for reseller
sales data, and you plan to model your Internet sales data flows on those.
The main tasks for this exercise are as follows:

1. Examine an existing data flow.
2. Create a Data Flow task.
3. Add a source to a data flow.
4. Add a destination to a data flow.
5. Test the Data Flow task.
Open the Extract Reseller Data.dtsx package and examine its control flow. Note that it contains two
Data Flow tasks.
On the Data Flow tab, view the Extract Resellers task and note that it contains a source named
Resellers and a destination named Staging DB.
Examine the Resellers source, noting the connection manager that it uses, the source of the data, and
the columns that its output contains.
Examine the Staging DB destination, noting the connection manager that it uses, the destination
table for the data, and the mapping of input columns to destination columns.
Right-click anywhere on the Data Flow design surface, click Execute Task, and then observe the data
flow as it runs, noting the number of rows transferred.
When the data flow has completed, stop the debugging session.
Add a new package to the project and name it Extract Internet Sales Data.dtsx.
Add a Data Flow task named Extract Customers to the new package's control flow.
Create a new project-level OLE DB connection manager that uses Windows authentication to connect
to the InternetSales database on the localhost instance of SQL Server.
In the Extract Customers data flow, add a source that uses the connection manager that you created
for the InternetSales database, and name it Customers.
Configure the Customers source to extract all columns from the Customers table in the
InternetSales database.
4-41
Add a destination that uses the existing localhost.Staging connection manager to the Extract
Customers data flow, and then name it Staging DB.
Connect the output from the Customers source to the input of the Staging DB destination.
Configure the Staging DB destination to load data into the Customers table in the Staging
database.
Ensure that all columns are mapped, and in particular that the CustomerKey input column is mapped
to the CustomerBusinessKey destination column.
Your completed data flow should look like the following image.
Right-click anywhere on the Data Flow design surface, click Execute Task, and then observe the data
flow as it runs, noting the number of rows transferred.
When the data flow has completed, stop the debugging session.
Results: After this exercise, you should have an SSIS package that contains a single Data Flow task, which
extracts customer records from the InternetSales database and inserts them into the Staging database.
You have implemented a simple data flow to transfer customer data to the staging database. Now you
must implement a data flow for Internet sales records. The new data flow must add a new column that
contains the total sales amount for each line item (which is derived by multiplying the unit price by the
quantity of units purchased), and use a product key value to find additional product data in a separate
Products database. Once again, you will model your solution on a data flow that a colleague has already
implemented for reseller sales data.
The main tasks for this exercise are as follows:

1. Examine an existing data flow.
2. Create a Data Flow task.
3. Add a source to a data flow.
4. Add a Derived Column transformation to a data flow.
5. Add a Lookup transformation to a data flow.
6. Add a destination to a data flow.
7. Test the Data Flow task.
Open the Extract Reseller Data.dtsx package and examine its control flow. Note that it contains two
Data Flow tasks.
On the Data Flow tab, view the Extract Reseller Sales task.
Examine the Reseller Sales source, noting the connection manager that it uses, the source of the
data, and the columns that its output contains.
Examine the Calculate Sales Amount transformation, noting the expression that it uses to create
a new derived column.
Examine the Lookup Product Details transformation, noting the connection manager and query
that it uses to look up product data, and the column mappings that it uses to match data and
add rows to the data flow.
Examine the Staging DB destination, noting the connection manager that it uses, the destination
table for the data, and the mapping of input columns to destination columns.
Right-click anywhere on the Data Flow design surface, click Execute Task, and then observe the data
flow as it runs, noting the number of rows transferred.
When the data flow has completed, stop the debugging session.
Open the Extract Internet Sales Data.dtsx package, and then add a new Data Flow task named
Extract Internet Sales to its control flow.
Connect the pre-existing Extract Customers Data Flow task to the new Extract Internet Sales task.
Add a source that uses the existing localhost.InternetSales connection manager to the Extract
Internet Sales data flow, and then name it Internet Sales.
Configure the Internet Sales source to use the InternetSales.sql query in the
D:\10777A\Labfiles\Lab04\Starter\Ex3 folder to extract Internet sales records.
Add a Derived Column transformation named Calculate Sales Amount to the Extract Internet Sales
data flow.
Connect the output from the Internet Sales source to the input of the Calculate Sales Amount
transformation.
Configure the Calculate Sales Amount transformation to create a new column named SalesAmount
that contains the UnitPrice column value multiplied by the OrderQuantity column value (see the
expression sketch after these steps).
Add a Lookup transformation named Lookup Product Details to the Extract Internet Sales data
flow.
Connect the output from the Calculate Sales Amount transformation to the input of the Lookup
Product Details transformation.
Use the localhost.Products connection manager and the Products.sql query in the
D:\10777A\Labfiles\Lab04\Starter\Ex3 folder to retrieve product data.
Add all lookup columns other than ProductKey to the data flow.
Add a flat file destination named Orphaned Sales to the Extract Internet Sales data flow. Then
redirect non-matching rows from the Lookup Product Details transformation to the Orphaned
Sales destination, which should save any orphaned records in a comma-delimited file named
Orphaned Internet Sales.csv in the D:\10777A\ETL folder.
Add a destination that uses the existing localhost.Staging connection manager to the Extract
Internet Sales data flow, and name it Staging DB.
Connect the match output from the Lookup Product Details transformation to the input of the
Staging DB destination.
Configure the Staging DB destination to load data into the InternetSales table in the Staging
database. Ensure that all columns are mapped. In particular, ensure that the *Key input columns are
mapped to the *BusinessKey destination columns.
Your completed data flow should look like the following image.
Right-click anywhere on the Data Flow design surface, click Execute Task, and then observe the data
flow as it runs, noting the number of rows transferred.
When the data flow has completed, stop the debugging session.
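For reference, the derived column expression for SalesAmount is simply the product of the two input columns (the same expression appears again in the Module 5 labs):

UnitPrice * OrderQuantity

The Products.sql script provides the lookup query. A minimal query of this kind would resemble the following sketch; the ProductName and ListPrice column names are illustrative, not taken from the actual lab script.

-- Hypothetical lookup query: returns the ProductKey used for matching,
-- plus the additional product columns to add to the data flow.
SELECT ProductKey, ProductName, ListPrice
FROM dbo.Products;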
Results: After this exercise, you should have a package that contains a Data Flow task that includes a
Derived Column transformation and a Lookup transformation.
Module Review and Takeaways

Review Questions

1. How could you determine the range of OrderDate values in a data source to plan a time dimension
table in a data warehouse?

2.

3.
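One way to approach question 1 is to query the earliest and latest order dates in the source data. The following Transact-SQL is a sketch; the dbo.SalesOrders table name is illustrative and would be replaced with the actual source table.

-- Find the date range that the time dimension table must cover.
SELECT MIN(OrderDate) AS EarliestOrder, MAX(OrderDate) AS LatestOrder
FROM dbo.SalesOrders;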
Module 5
Implementing Control Flow in an SSIS Package

Contents:
Lesson 1: Introduction to Control Flow 5-3
Lesson 2: Creating Dynamic Packages 5-14
Lesson 3: Using Containers 5-21
Lab 5A: Implementing Control Flow in an SSIS Package 5-33
Lesson 4: Managing Consistency 5-41
Lab 5B: Using Transactions and Checkpoints 5-51
Module Overview

Control flow in SQL Server Integration Services (SSIS) packages enables you to implement complex
extract, transform, and load (ETL) solutions that combine multiple tasks and workflow logic. By learning
how to implement control flow, you can design robust ETL processes for a data warehousing solution that
coordinate data flow operations with other automated tasks.

After completing this module, you will be able to:

Implement control flow with tasks and precedence constraints.

Create dynamic packages that include variables and parameters.

Use containers to define iterative processes and execution scopes in a control flow.

Use transactions and checkpoints to keep data consistent when a package fails.
Lesson 1
Introduction to Control Flow

Control flow in an SSIS package consists of one or more tasks, usually executed as a sequence based on
precedence constraints that define a workflow. Before you can implement a control flow, you need to
know what tasks are available and how to define a workflow sequence using precedence constraints. You
also need to understand how you can use multiple packages to create complex ETL solutions.

After completing this lesson, you will be able to:

Describe the control flow tasks provided by SSIS.

Define a workflow for tasks by using precedence constraints.

Use multiple packages to implement complex ETL solutions.
Control Flow Tasks

A control flow consists of one or more tasks. SSIS includes the following control flow tasks that you can
use in a package.

Data Flow Tasks
Data Flow: Encapsulates a data flow that transfers data from a source to a destination.

Database Tasks
Data Profiling: Generates statistical reports based on a data source.
Bulk Insert: Inserts data into a data destination in a bulk load operation.
Execute SQL: Runs SQL statements or stored procedures in a relational database.
Execute T-SQL: Runs Transact-SQL statements in a SQL Server database.
CDC Control: Controls the life cycle of change data capture (CDC) processing for CDC-enabled data sources.

File and Internet Tasks
FTP: Uploads and downloads files and manages directories on an FTP server.
XML: Performs operations on XML documents, such as validation and XSLT transformations.
Web Service: Calls a method on a specific web service.
Send Mail: Sends an email message.

Process Execution Tasks
Execute Package: Runs another SSIS package as part of the workflow.
Execute Process: Runs an application or batch file as part of the workflow.

WMI Tasks
WMI Data Reader: Runs a WQL query to retrieve information about a computer by using Windows Management Instrumentation (WMI).
WMI Event Watcher: Monitors a specific WMI event.

Custom Logic Tasks
Script: Runs custom code written in Microsoft Visual Basic or Microsoft Visual C#.
Custom Task: Runs a custom task that you develop by using the SSIS object model.

Database Transfer Tasks
Transfer Database: Transfers a database from one SQL Server instance to another.
Transfer Error Messages: Transfers custom error messages from one SQL Server instance to another.
Transfer Jobs: Transfers SQL Agent jobs from one SQL Server instance to another.
Transfer Logins: Transfers logins from one SQL Server instance to another.
Transfer Master Stored Procedures: Transfers stored procedures in the master database from one SQL Server instance to another.
Transfer SQL Server Objects: Transfers database objects such as tables and views from one SQL Server instance to another.

Analysis Services Tasks
Analysis Services Processing: Processes SQL Server Analysis Services objects such as cubes, dimensions, and mining models.

Maintenance Tasks
Check Database Integrity: Checks the allocation and structural integrity of database objects.
History Cleanup: Deletes out of date history data for SQL Server maintenance operations.
Maintenance Cleanup: Removes files left over from maintenance operations, such as old backup files and reports.
Notify Operator: Sends notification messages to SQL Server Agent operators.
Rebuild Index: Rebuilds indexes in SQL Server database tables and views.
Reorganize Index: Reorganizes indexes in SQL Server database tables and views.
Shrink Database: Reduces the size of SQL Server database data and log files.
Update Statistics: Updates value distribution statistics for tables and views in a SQL Server database.
To add a task to a control flow, drag it from the SSIS Toolbox to the control flow design surface. Then
double-click the task on the design surface to configure its settings.
Precedence Constraints

A control flow usually defines a sequence of tasks to be executed. You define the sequence by connecting
tasks with precedence constraints. These precedence constraints evaluate the outcome of a task to
determine the flow of execution.

Control Flow Conditions

You can define precedence constraints for one of three conditions:

Success: The execution flow to be followed when a task completes successfully. In the control flow
designer, success constraints are shown as green arrows.

Failure: The execution flow to be followed when a task fails. In the control flow designer, failure
constraints are shown as red arrows.

Completion: The execution flow to be followed when a task finishes, regardless of whether it succeeds
or fails. In the control flow designer, completion constraints are shown as blue arrows.

By using these conditional precedence constraints, you can define a control flow that executes tasks based
on conditional logic. For example, you could create a control flow with the following steps:

1.

2.

3.
You can connect multiple precedence constraints to a single task. For example, a control flow might
include two Data Flow tasks, and a Send Mail task that you want to use to notify an administrator if
something goes wrong. To accomplish this, you could connect a failure precedence constraint from each
of the Data Flow tasks to the Send Mail task. However, you need to determine whether the notification
should be sent if either one of the Data Flow tasks fails, or only if both Data Flow tasks fail.
By default, when multiple precedence constraints are connected to a single task, a logical AND operation
is applied to the precedence condition, meaning that all of the precedence constraints must evaluate to
True in order to execute the connected task. In the example above, this means that the Send Mail task
would only be executed if both Data Flow tasks failed. In the control flow designer, logical AND
constraints are shown as solid arrows.
You can double-click a precedence constraint to edit it and configure it to use a logical OR operation, in
which case the connected task is executed if any of the connections evaluates to True. Setting the
constraints in the example above to use a logical OR operation would result in the Send Mail task being
executed if either (or both) of the Data Flow tasks failed. In the control flow designer, logical OR
constraints are shown as dotted arrows.
Grouping and Annotations

As your control flows become more complex, it can become difficult to interpret the control flow surface.
The SSIS Designer includes two features that can help SSIS developers work more efficiently.

Grouping Tasks

You can group multiple tasks on the design surface in order to manage them as a single unit. A task
grouping is a design time only feature and has no effect on runtime behavior. With a grouped set of tasks,
you can collapse and expand the group to simplify the design surface, and move all of the tasks in the
group as a single unit.

To create a group of tasks, select the tasks you want to group by dragging around them or clicking them
while holding the CTRL key, and then right-click any of the selected tasks and click Group.

Adding Annotations

You can add annotations to the design surface to document your workflow. An annotation is a text-based
note that you can use to describe important features of your package design. To add an annotation,
right-click the design surface and click Add Annotation. Then type the annotation text.

Note You can add annotations to the Control Flow design surface, the Data Flow design
surface, and the Event Handler design surface.
Demonstration: Implementing Control Flow

Task 1: Add tasks to a control flow

1. In the D:\10777A\Demofiles\Mod05 folder, run Setup.cmd as Administrator.

2. Double-click ControlFlowDemo.sln to open the solution in SQL Server Data Tools.

3.

4.

5.

SourceConnection: A new connection with a Usage type of Create folder, and a Folder value
of D:\10777A\Demofiles\Mod05\Demo.

SourceConnection: Demo

6. From the SSIS Toolbox pane, drag a File System Task to the control flow surface. Then double-click
the File System Task and configure the following settings:

UseDirectoryIfExists: True
SourceConnection: Demo

7. From the SSIS Toolbox pane, drag a File System Task to the control flow surface. Then double-click
the File System Task and configure the following settings:

DestinationConnection: Demo
OverwriteDestination: True
SourceConnection: A new connection with a Usage type of Existing file, and a File value of
D:\10777A\Demofiles\Mod05\Demo.txt.

8. From the SSIS Toolbox pane, drag a Send Mail Task to the control flow surface. Then double-click
the Send Mail Task and configure the following settings:

SmtpConnection (on the Mail tab): Create a new SMTP connection manager with a Name
property of Local SMTP Server and an SMTP Server property of localhost. Use the default
values for all other settings.
Task 2: Use precedence constraints to define a control flow

1.
Select the Delete Files task and drag its green arrow to the Delete Folder task. Then connect the
Delete Folder task to the Create Folder task and the Create Folder task to the Copy File task.
2.
Connect each of the file system tasks to the Send Failure Notification task.
3.
Right-click the connection between Delete Files and Delete Folder and click Completion.
4.
Right-click the connection between Delete Folder and Create Folder and click Completion.
5.
Click each of the connections between the file system tasks and the Send Failure Notification task
while holding the Ctrl key and press F4. Then in the Properties pane, set the Value property to
Failure.
6.
Click anywhere on the control flow surface to clear the current selection, and then double-click any of
the red constraints connected to the Send Failure Notification task. Then in the Precedence
Constraint Editor dialog box, in the Multiple constraints section, select Logical OR. One constraint
must evaluate to True, and click OK. Note that all connections to the Send Failure Notification
task are now dotted to indicate that a logical OR operation is applied.
7.
Right-click the control flow surface next to the Send Failure Notification task and click Add
Annotation. Then type Send an email message if a task fails.
8.
Select the Delete Files and Delete Folder tasks, then right-click either of them and click Group. Drag
the group to rearrange the control flow so you can see that the Delete Folder task is still connected
to the Create Folder task.
9.
On the Debug menu, click Start Debugging to run the package, and note that the Delete Files and
Delete Folder tasks failed because the specified folder did not previously exist. This caused the Send
Failure Notification task to be executed.
10. You can view the email message that was sent by the Send Failure Notification task in the
C:\inetpub\mailroot\Drop folder. Double-click it to open it with Microsoft Outlook.
11. In SQL Server Data Tools, on the Debug menu, click Stop Debugging, and then run the package
again. This time all of the file system tasks should succeed because the folder was created during the
previous execution. Consequently, the Send Failure Notification task is not executed.
Using Multiple Packages

Dividing a complex ETL solution into multiple packages enables you to:

Create reusable units of workflow that can be used multiple times in a single ETL process.

Separate data extraction workflows to suit data acquisition windows.

You can execute each package independently, and you can also use the Execute Package task to run one
package from another.
Creating a Package Template

SSIS developers often need to create multiple similar packages. To make the development process more
efficient, you can use the following procedure to create a package template, which you can reuse to
create multiple packages with pre-defined objects and settings.

1. Create a package that includes the elements you want to reuse. These elements can include:

Connection Managers
Tasks
Event Handlers
Parameters and Variables

2. Save the package to the DataTransformationItems folder on your development workstation.
By default, this folder is located at C:\Program Files (x86)\Microsoft Visual Studio 10.0
\Common7\IDE\PrivateAssemblies\ProjectItems\DataTransformationProject.

3.

4. Change the Name and ID properties of the new package to avoid naming conflicts.
Lesson 2
Creating Dynamic Packages

You can use variables, parameters, and expressions to make your SSIS packages more dynamic. For
example, rather than hard coding a database connection string or file path in a data source, you can
create a package that sets the value dynamically at run time. This produces a more flexible and reusable
solution and helps mitigate differences between the development and production environments.

This lesson describes how you can create variables and parameters, and use them in expressions.

After completing this lesson, you will be able to:

Create variables in an SSIS solution.

Create parameters in an SSIS solution.

Use expressions in an SSIS solution.
Variables

You can use variables to store values that a control flow uses at run time. Variable values can change as
the package is executed to reflect run time conditions. For example, a variable used to store a file path
might change depending on the specific server on which the package is running. You can use variables to:

Store an iterator or enumerator value for a loop.

Set input and output parameters for a SQL query (see the sketch at the end of this topic).

Implement conditional logic in an expression.

SSIS packages can contain two kinds of variables: user variables and system variables.

User Variables

You can define user variables to store dynamic values that your control flow uses. To create a variable,
view the Variables pane in SSIS Designer and click the Add Variable button. For each user variable, you
can specify the following properties:

Scope: The scope of the variable. Variables can be accessible throughout the whole package, or scoped
to a particular container or task. You cannot set the scope in the Variables pane; it is determined by
the object that is selected when you create the variable.

Raise Change Event: Causes an event to be raised when the variable value changes. You can then
implement an event handler to perform some custom logic.

System Variables

System variables store information about the running package and its objects, and are defined in the
System namespace. Some useful system variables include System::PackageName, System::StartTime,
System::MachineName, and System::UserName.
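As noted earlier in this topic, a variable can supply input parameter values for a SQL query. The following is a minimal sketch of an Execute SQL task query over an OLE DB connection; the table name and the User::LastExtractDate variable name are illustrative, not part of the lab environment.

-- The ? parameter marker is mapped to the User::LastExtractDate variable
-- on the Parameter Mapping page of the Execute SQL Task Editor.
SELECT *
FROM dbo.InternetSales
WHERE OrderDate > ?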
Parameters

You can use parameters to pass values to a project or package at run time. When you define a parameter,
you can set a default value, which can be overridden when the package is executed in a production
environment. For example, you could use a parameter to specify a database connection string for a data
source, and use one value during development, and a different value when the project is deployed to a
production environment.

Parameters have three kinds of value:

Design default values, which are specified when the project is developed.

Server default values, which are specified when the project is deployed to an SSIS Catalog and
override design default values.

Execution values, which are specified for a particular instance of execution and override server
default values.

When the project is deployed to an SSIS Catalog, administrators can define multiple environments for the
project and specify server default parameter values for each environment.

SSIS supports two kinds of parameter:

Project parameters, which are defined at the project level and can be used in any packages within the
project.

Package parameters, which are scoped at the package level and are only available within the package
for which they are defined.

Note Parameters are only supported in the project deployment model. When the legacy
deployment model is used, you can set dynamic package properties by using package
configurations. Deployment is discussed in Module 12: Deploying and Configuring SSIS
Packages.
Expressions

SSIS provides a rich expression language that you can use to set values for numerous elements in an SSIS
package, including:

Properties

Conditional Split transformation criteria

Precedence constraint conditions

Expressions are based on Integration Services expression syntax, which uses similar functions and
keywords to common programming languages like Microsoft C#. Expressions can include variables and
parameters, enabling you to set values dynamically based on specific run time conditions.
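For example, the following expression (used in the For Loop demonstration later in this module) concatenates a project parameter, a string literal, and a variable cast to a string to build a file path at run time:

@[$Project::folderPath] + "Demo" + (DT_WSTR,1)@[User::counter] + ".txt"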
Demonstration: Using Variables and Parameters

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, double-click VariablesAndParameters.sln to open the
solution in SQL Server Data Tools.

3.

4. On the View menu, click Other Windows, and click Variables.

5. In the Variables pane, click the Add Variable button and add a variable with the following
properties:

Name: fName
Data type: String
Value: Demo1.txt
2.
In the Project.params [Design] window, click the Add Parameter button and add a parameter with
the following properties:

Name: folderPath
Value: D:\10777A\Demofiles\Mod05\Files\
Sensitive: False
Required: True

3.
1.
On the Control Flow.dtsx package design surface, in the Connection Managers pane, click the
Demo.txt connection manager and press F4.
2.
In the Properties pane, in the Expressions property box, click the ellipsis (...) button. Then in the
Property Expressions Editor dialog box, in the Property box, select ConnectionString and in the
Expression box, click the ellipsis (...) button.
3.
In the Expression Builder dialog box, expand the Variables and Parameters folder, and drag the
$Project::folderPath parameter to the Expression box. Then in the Expression box, type a plus (+)
symbol. Then drag the User::fName variable to the Expression box to create the following
expression.

@[$Project::folderPath] + @[User::fName]
4.
In the Expression Builder dialog box, click Evaluate Expression and verify that the expression
produces the result D:\10777A\Demofiles\Mod05\Files\Demo1.txt. Then click OK to close the
Expression Builder dialog box, and in the Property Expressions Editor dialog box, click OK.
5.
Run the project, and when it has completed, stop debugging. Ignore the failure of the Delete Files
and Delete Folder tasks if the demo folder did not previously exist.
6.
View the contents of the D:\10777A\Demofiles\Mod05\Demo folder and verify that Demo1.txt has
been copied.
Lesson 3
Using Containers

You can create containers in SSIS packages to group related tasks together or define iterative processes.
Using containers in packages helps you create complex workflows and create a hierarchy of execution
scopes that you can use to manage package behavior.

This lesson describes the kinds of containers that are available and how to use them in an SSIS package
control flow.

After completing this lesson, you will be able to:

Describe the types of container available in an SSIS package.

Use a For Loop container to repeat a process until a specific condition is met.

Use a Foreach Loop container to process items in an enumerated collection.
Introduction to Containers

SSIS packages can contain the following kinds of container:

Task containers: Each control flow task has its own implicit container.

Sequence containers: You can group tasks and other containers into a sequence container. This
creates an execution hierarchy and enables you to set properties at the container level that apply to
all elements within the container.

For Loop containers: You can use a For Loop container to repeat a portion of the control flow until a
specific condition is met.

Foreach Loop containers: You can use a Foreach Loop container to repeat a portion of the control
flow for each item in an enumerated collection.

Containers can be start or endpoints for precedence constraints and you can nest containers within other
containers.
Sequence Containers

You can use a sequence container to group tasks and other containers together and define a subset of the
package control flow. By using a sequence container, you can:

Disable a logical subset of the package for debugging purposes.

Create a scope for variables.

Manage transactions at a granular level.

To create a sequence container, drag the Sequence Container icon from the SSIS Toolbox pane to the
design surface. Then drag the tasks and other containers you want to include in the sequence into the
sequence container.

Note In the design environment, the sequence container behaves similarly to a grouped
set of tasks. However, unlike a group, a sequence container exists at run time and its
properties can affect the behavior of the control flow.
Demonstration: Using a Sequence Container

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, double-click SequenceContainer.sln to open the
solution in SQL Server Data Tools.

3.

4.

5. Drag a Sequence Container from the SSIS Toolbox to the control flow design surface.

6.

7. Click and drag around the Delete Files and Delete Folder tasks to select them both, and then drag
them both into the sequence container.
8.
Drag a precedence constraint from the sequence container to Create Folder. Then right-click the
precedence constraint and click Completion.
9.
Drag a precedence constraint from the sequence container to Send Failure Notification. Then right-click the precedence constraint and click Failure.
10. Run the package and view the results. Then stop debugging.
11. Click the sequence container and press F4. Then in the Properties pane, set the Disable property to
True.
12. Run the package and note that neither of the tasks in the sequence container is executed. Then stop
debugging.
For Loop Containers

You can use a For Loop container to repeat a portion of the control flow until a specific condition is met.
For example, you could run a task a specified number of times.

Conceptually, a For Loop container behaves similarly to a For loop construct in common programming
languages such as Microsoft C#. A For Loop container uses the following expression-based properties to
determine the number of iterations it performs:

An optional initialization expression that sets the initial value of a counter variable.

An evaluation expression that typically evaluates a counter variable in order to exit the loop when it
matches a specific value.

An iteration expression that typically modifies the value of a counter variable.

To use a For Loop container in a control flow, drag the For Loop Container icon from the SSIS Toolbox to
the control flow surface, and then double-click it to set the expression properties required to control the
number of loop iterations. Then drag the tasks and containers you want to repeat into the For Loop
container on the control flow surface.
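For example, the following property values (the same ones used in the demonstration that follows) initialize a counter variable to 1, increment it on each iteration, and exit the loop when the counter is no longer less than 4, so the contents of the loop run three times:

InitExpression: @counter = 1
EvalExpression: @counter < 4
AssignExpression: @counter = @counter + 1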
Demonstration: Using a For Loop Container

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, double-click ForLoopContainer.sln to open the solution
in SQL Server Data Tools.

3.

4. On the View menu, click Other Windows, and click Variables. Then add a variable with the
following properties:

Name: counter
Data type: Int32
Value: 0

5.

6.

7.

InitExpression: @counter = 1
EvalExpression: @counter < 4
AssignExpression: @counter = @counter + 1
8.
Double-click the Execute Process task and set the following properties, then click OK.
Expressions (on the Expressions tab): Use the Property Expressions Editor to set the following
expression for the Arguments property:
@[$Project::folderPath] + "Demo" + (DT_WSTR,1)@[User::counter] + ".txt"
Note This expression concatenates the folderPath parameter (which has a default value
of "D:\10777A\Demofiles\Mod05\Files\"), the literal text Demo, the value of the counter
variable (converted to a 1-character string using the DT_WSTR data type cast), and the
literal text .txt. Because the For Loop is configured to start the counter with a value of 1
and loop until it is no longer less than 4, this will result in the following arguments for the
Notepad executable:
D:\10777A\Demofiles\Mod05\Files\Demo1.txt
D:\10777A\Demofiles\Mod05\Files\Demo2.txt
D:\10777A\Demofiles\Mod05\Files\Demo3.txt
9.
Drag a precedence constraint from the For Loop container to the Sequence container and rearrange
the control flow if necessary.
10. Run the package, and note that the For Loop starts Notepad three times, opening the text file with
the counter variable value in its name (Demo1.txt, Demo2.txt, and Demo3.txt). Close Notepad each
time it opens, and when the execution is complete, stop debugging.
Foreach Loop Containers

You can use a Foreach Loop container to perform an iterative process on each item in an enumerated
collection. SSIS supports the following enumerators in a Foreach Loop container:

ADO: You can use this enumerator to loop through elements of an ADO object, for example records
in a Recordset.

ADO.NET Schema Rowset: You can use this enumerator to iterate through objects in an ADO.NET
schema, for example tables in a dataset or rows in a table.

File: You can use this enumerator to iterate through the files in a folder.

From Variable: You can use this enumerator to iterate through the items in an enumerable object
stored in a variable.

Item: You can use this enumerator to iterate through a property collection for an SSIS object.

NodeList: You can use this enumerator to iterate through the result set of an XML Path Language
(XPath) expression.

SMO: You can use this enumerator to iterate through a collection of SQL Server Management
Objects.

To use a Foreach Loop container in a control flow:

1. Drag the Foreach Loop Container icon from the SSIS Toolbox to the control flow surface.

2.

3.

4. Drag the tasks you want to perform during each iteration into the Foreach Loop container and
configure their properties appropriately to reference the collection value variable.
Demonstration: Using a Foreach Loop Container

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, double-click ForeachLoopContainer.sln to open the
solution in SQL Server Data Tools.

3.

4.

5.

6.

7.

8.

9.
10. Create a success precedence constraint from the Create Folder task to the Foreach Loop container,
and a failure precedence constraint from the Foreach Loop container to the Send Failure
Notification task.
11. Run the package, closing each instance of Notepad as it opens. When the package execution has
completed, stop debugging and verify that the D:\10777A\Demofiles\Mod05\Demo folder contains
each of the files in the D:\10777A\Demofiles\Mod05\Files folder.
Lab Scenario

In this lab, you will continue to develop the ETL solution for the Adventure Works data warehouse. You
have created data flows to extract customer and sales order data and load it into the staging database.
Now you must encapsulate these data flows in a control flow that executes the data flows and notifies an
operator by e-mail when the data flows succeed, or if a failure occurs.

As well as the Internet and reseller sales data that the solution currently processes, Adventure Works has
an accounts system that records payments from resellers. Details of these payments are exported to
comma-delimited files, and you need to include this data in the ETL solution. Because the location and file
names of these files may change in the future, you must create a package that can be easily adapted. You
have decided to use a project-level parameter for the folder path and a variable for the file name so that
your package can determine the complete file path dynamically at run-time when loading the payments
data into the staging database.

Having used precedence constraints to define a control flow for the customer and Internet sales order
data flows, you now want to combine the data flows into a discrete sequence so that they can be
configured as a unit. The tasks to send an e-mail notification will remain outside of the sequence and will
be executed on success or failure of the sequence as a whole.

Finally, you have created a control flow that loads data from a single payments file, but the accounts
system actually exports a file for each country where Adventure Works has resellers. You must modify the
control flow to iterate through all of the payments files in the folder and load them all into the staging
database.
Lab 5A: Implementing Control Flow in an SSIS Package

You have implemented data flows to extract data and load it into a staging database as part of the ETL
process for your data warehousing solution. Now you want to coordinate these data flows by
implementing a control flow that notifies an operator of the outcome of the process.

The main tasks for this exercise are as follows:

1. Prepare the lab environment.

2. Examine an existing package.

3.

4.

Task 1: Prepare the lab environment

Ensure the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

Task 2: Examine an existing package
Examine the settings for the precedence constraint connecting the Extract Resellers task to the Send
Failure Notification task to determine the conditions under which this task will be executed.
Examine the settings for the Send Mail tasks, noting that they both use the Local SMTP Server
connection manager.
On the Debug menu, click Start Debugging to run the package, and observe the control flow as the
task executes. Then, when the task has completed, on the Debug menu, click Stop Debugging.
In the C:\inetpub\mailroot\Drop folder, double-click the most recent file to open it in Microsoft
Outlook. Then read the email message and close Microsoft Outlook.
Open the Extract Internet Sales Data.dtsx package and examine its control flow.
Add a Send Mail task to the control flow, configure it with the following settings, and create a
precedence constraint that runs this task if the Extract Internet Sales task succeeds.
SmtpConnection: A new SMTP Connection Manager named Local SMTP Server that connects
to the localhost SMTP server
From: ETL@adventureworks.msft
To: Student@adventureworks.msft
Priority: Normal
Add a second Send Mail task to the control flow, configure it with the following settings, and create a
precedence constraint that runs this task if either the Extract Customers or Extract Internet Sales
task fails.
SmtpConnection: The Local SMTP Server connection manager you created previously.
From: ETL@adventureworks.msft
To: Student@adventureworks.msft
Priority: High
Set the ForceExecutionResult property of the Extract Customers task to Failure. Then run the
package and observe the control flow.
When package execution is complete, stop debugging and verify that the failure notification email
message has been delivered to the C:\inetpub\mailroot\Drop folder. You can double-click the email
message to open it in Microsoft Outlook.
Set the ForceExecutionResult property of the Extract Customers task to None. Then run the
package and observe the control flow.
When package execution is complete, stop debugging and verify that the success notification email
message has been delivered to the C:\inetpub\mailroot\Drop folder.
Close the AdventureWorksETL project when you have completed the exercise.
Results: After this exercise, you should have a control flow that sends an email message if the Extract
Internet Sales task succeeds, or sends an email message if either the Extract Customers or Extract
Internet Sales task fails.
You need to enhance your ETL solution to include the staging of payments data that is generated in
comma-separated value (CSV) format from a financial accounts system. You have implemented a simple
data flow that reads data from a CSV file and loads it into the staging database, and you must now modify
the package to construct the folder path and file name for the CSV file dynamically at run time instead of
relying on a hard-coded name in the settings of the Data Flow task.
The main tasks for this exercise are as follows:
1.
2.
Create a variable.
3.
Create a parameter.
4.
View the contents of the D:\10777A\Accounts folder and note the files it contains. In this exercise, you
will modify an existing package to create a dynamic reference to one of these files.
Open the Extract Payment Data.dtsx package and examine its control flow. Note that it contains a
single Data Flow task named Extract Payments.
View the Extract Payments data flow and note that it contains a flat file source named Payments
File, and an OLE DB destination named Staging DB.
View the settings of the Payments File source and note that it uses a connection manager named
Payments File.
In the Connection Managers pane, double-click Payments File, and note that it references the
Payments.csv file in the D:\10777A\Labfiles\Lab05A\Starter\Ex2 folder. This file has the same
data structure as the payments file in the D:\10777A\Accounts folder.
On the Execution Results tab, find the following line in the package execution log.
[Payments File [2]] Information: The processing of the file
D:\10777A\Labfiles\Lab05A\Starter\Ex2\Payments.csv has started
Create a variable with the following properties:

Name: fName
Data type: String
Value: Payments - US.csv

Create a project-level parameter with the following properties:

Name: AccountsFolderPath
Value: D:\10777A\Accounts\
Sensitive: False
Required: True
Select the Payments File connection manager and view its properties in the Properties pane.
Click the ellipsis (...) button for the Expressions property to open the Property Expressions Editor
dialog box, and then set the ConnectionString property to an expression that concatenates the
AccountsFolderPath parameter and fName variable. Your expression should look like the following.
@[$Project::AccountsFolderPath] + @[User::fName]
Run the package and view the execution results to verify that the data in the
D:\10777A\Accounts\Payments - US.csv file was loaded.
Close SQL Server Data Tools when you have completed the exercise.
Results: After this exercise, you should have a package that loads data from a text file based on a
parameter that specifies the folder path where the file is stored, and a variable that specifies the file name.
You have created a control flow that loads Internet Sales data and sends a notification email message to
indicate whether the process succeeded or failed. You now want to encapsulate the data flow tasks for
this control flow in a sequence container so you can manage them as a single unit.
You have also successfully created a package that loads payments data from a single CSV file based on a
dynamically derived folder path and file name. Now you must extend this solution to iterate through all of
the files in the folder and import data from each of them.
The main tasks for this exercise are as follows:
1.
2.
Open the Extract Internet Sales Data.dtsx package and modify its control flow so that:
The Extract Customers and Extract Internet Sales tasks are contained in a Sequence container
named Extract Customer Sales Data.
The Send Failure Notification task is executed if the Extract Customer Sales Data container
fails.
The Send Success Notification task is executed if the Extract Customer Sales Data container
succeeds.
Run the package to verify that it successfully completes both Data Flow tasks in the sequence and
then executes the Send Success Notification task.
Add a Foreach Loop container to the control flow and drag the existing Extract Payments Data Flow
task into the Foreach Loop container.
Configure the Foreach Loop container with the following settings on the Collection tab of the
Foreach Loop Editor dialog box:
Folder: C:\
Files: *.*
Note The value you specify for the Folder property will be overridden by the expression
you have set for the Directory property.
The Collection tab of the Foreach Loop Editor dialog box should look like the following.
On the Variable Mappings tab, add the fName user variable and map it to index 0 (which represents
the file name and extension value retrieved by the Foreach File Enumerator).
Run the package and count the number of times the Foreach loop is executed.
When execution has completed, stop debugging and view the execution results to verify that all files
in the D:\10777A\Accounts folder were processed.
Close SQL Server Data Tools when you have completed the exercise.
Results: After this exercise, you should have a package that encapsulates two data flow tasks in a
sequence container, and another package that uses a Foreach loop to iterate through the files in a folder
specified in a parameter and uses a Data Flow task to load their contents into a database.
Lesson 4
Managing Consistency

SSIS solutions are generally used to transfer data from one location to another. Often, the overall SSIS
solution can include multiple data flows and operations, and it may be important to ensure that the
process always results in data that is in a consistent state, even if some parts of the process fail.

This lesson discusses techniques for ensuring data consistency when packages fail.

After completing this lesson, you will be able to:

Configure failure behavior for tasks, containers, and packages.

Use transactions to ensure data consistency.

Use checkpoints to restart failed packages.
Configuring Failure Behavior

An SSIS package control flow can contain nested hierarchies of containers and tasks, and you can use the
following properties to control how a failure in one element of the control flow determines the overall
outcome of the package.

FailPackageOnFailure: When set to True, the failure of the task or container results in the failure of
the package in which it is defined. The default value for this property is False.

FailParentOnFailure: When set to True, the failure of the task or container results in the failure of its
container. If the item with this property is not in a container, then its parent is the package, in which
case this property has the same effect as the FailPackageOnFailure property. When setting this
property on a package that is executed by an Execute Package task in another package, a value of
True causes the calling package to fail if this package fails. The default value for this property is False.

You can use these properties to achieve fine-grained control of package behavior in the event of an error
that causes a task to fail.
Using Transactions

You can use transactions to ensure that a set of data operations either all succeed or all fail as a unit. The
TransactionOption property of a package, container, or task controls its participation in transactions,
and can have the following values:

Required: This executable will start a new transaction, or enlist in its parent's transaction if one
already exists.

Supported: This executable will enlist in a transaction if its parent is participating in one.

NotSupported: This executable will not participate in a transaction.

SSIS transactions rely on the Microsoft Distributed Transaction Coordinator (MSDTC), a system
component that coordinates transactions across multiple data sources. An error will occur if an SSIS
package attempts to start a transaction when the MSDTC service is not running.
SSIS supports multiple concurrent transactions within a single hierarchy of packages, containers, and tasks;
but it does not support nested transactions. To understand how multiple transactions behave in a
hierarchy, consider the following facts:
If the child container includes a task with a TransactionOption value of Supported, the task will not
participate in the existing transaction.
If the child container contains a task with a TransactionOption value of Required, the task will start
a new transaction. However, the new transaction is unrelated to the existing transaction, and the
outcome of one transaction will have no effect on the other.
Demonstration: Using a Transaction

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, right-click Setup.cmd and click Run as administrator.
Click Yes when prompted to confirm the action.

3.

4. In the D:\10777A\Demofiles\Mod05 folder, double-click Transactions.sln to open the solution in
SQL Server Data Tools.

5.

6. Run the package and note that the SQL Command task fails. Then stop debugging.

7.

8. In the D:\10777A\Demofiles\Mod05 folder, run Setup.cmd as Administrator again to reset the
database.

9. In SQL Server Data Tools, click anywhere on the control flow surface and press F4. Then in the
Properties pane, set the TransactionOption property to Required.
10. Click the Copy Products task, and in the Properties pane, set the FailPackageOnFailure property to
True and ensure the TransactionOption property is set to Supported.
11. Repeat the previous step for the Update Prices task.
12. Run the package and note that the SQL Command task fails again. Then stop debugging.
13. In SQL Server Management Studio, view the contents of the dbo.StagingTable and
dbo.ProductionTable tables in the DemoDW database, noting that dbo.ProductionTable is empty,
even though the Copy Products task succeeded. The transaction has rolled back the changes to the
production table because the Update Prices task failed.
14. In SQL Server Data Tools, double-click the Update Prices task and change the SQL Statement
property to UPDATE ProductionTable SET Price = 100. Then click OK.
15. Run the package and note that all tasks succeed. Then stop debugging.
16. In SQL Server Management Studio, view the contents of the dbo.StagingTable and
dbo.ProductionTable tables in the DemoDW database, noting that dbo.ProductionTable now
contains products with valid prices.
Using Checkpoints

Another way you can manage data consistency is to use checkpoints. Checkpoints enable you to restart
a failed package after the issue that caused it to fail has been resolved. Any tasks that were previously
completed successfully are ignored, and the execution resumes at the point in the control flow where the
package failed. While checkpoints do not offer the same level of atomic consistency as a transaction, they
can provide a useful solution when a control flow includes a long-running or resource-intensive task that
you do not wish to repeat unnecessarily, such as downloading a large file from an FTP server.

Checkpoints work by saving information about work in progress to a checkpoint file. When a failed
package is restarted, the checkpoint file is used to identify where in the control flow to resume execution.

To enable a package to use checkpoints, you must set the following properties of the package:

CheckpointFileName: The full file path where you want to save the checkpoint file.

SaveCheckpoints: A Boolean value used to specify whether or not the package should save
checkpoint information to the checkpoint file.

CheckpointUsage: An enumeration with one of the following values:

Never: The package will never use a checkpoint file to resume execution and will always begin
execution with the first task in the control flow.

IfExists: The package will use a checkpoint file to resume execution if one exists; otherwise,
execution begins with the first task in the control flow.

Always: The package will always use a checkpoint file to resume execution, and will fail if one
does not exist.
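For example, the demonstration that follows enables checkpoints by setting these package properties:

CheckpointFileName: D:\10777A\Demofiles\Mod05\Checkpoint.chk
CheckpointUsage: IfExists
SaveCheckpoints: True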
Demonstration: Using a Checkpoint

1. Ensure that the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.

2. In the D:\10777A\Demofiles\Mod05 folder, right-click Setup.cmd and click Run as administrator.
Click Yes when prompted to confirm the action.

3.

4. In the D:\10777A\Demofiles\Mod05 folder, double-click Products.csv to open it with Microsoft Excel,
and note that it contains details for three more products. Then close Excel.

5. In the D:\10777A\Demofiles\Mod05 folder, double-click Checkpoints.sln to open the solution in
SQL Server Data Tools.

6.

7.

CheckpointFileName: D:\10777A\Demofiles\Mod05\Checkpoint.chk
CheckpointUsage: IfExists
SaveCheckpoints: True

8.

9. Run the package, and note that the data flow task fails. Then stop debugging.
10. In the D:\10777A\Demofiles\Mod05 folder, note that a file named Checkpoint.chk has been created,
and that the file system tasks that succeeded have created a folder named Data and copied the
Products.csv file into it.
11. In SQL Server Data Tools, view the Data Flow tab for the Load to Staging Table task, and double-click the Derive Columns transformation. Then change the expression for the NewPrice column to
100 and click OK.
12. View the Control Flow tab, and then run the package. Note that the Create Folder and Copy File
tasks, which succeeded previously, are not re-executed. Only the Load to Staging Table task is
executed.
13. Stop debugging, and verify that the Checkpoint.chk file has been deleted now that the package has
been executed successfully.
14. In SQL Server Management Studio, view the contents of the dbo.StagingTable, and note that it now
contains data about six products.
Lab Scenario

In this lab, you will continue to develop the Adventure Works ETL solution.

A similar problem exists with the reseller sales data extraction, but in this case you want to take a different
approach. The data flow that extracts the reseller data has the potential to be a long-running operation,
and you want to avoid repeating it in the event that the subsequent data flow to extract reseller sales
orders fails. To accomplish this, you intend to use a checkpoint so that if the reseller sales order extraction
fails, you can resolve the problem and restart the package without having to re-extract the reseller data.
Lab 5B: Using Transactions and Checkpoints

You have created an SSIS package that uses two data flows to extract, transform, and load Internet sales
data. You now want to ensure that package execution always results in a consistent data state, so that if
any of the data flows fail, no data is loaded.

The main tasks for this exercise are as follows:

1. Prepare the lab environment.

2. View the data in the database.

3.

4. Implement a transaction.

5. Observe transaction behavior.

Task 1: Prepare the lab environment

Ensure the MIA-DC1 and MIA-SQLBI virtual machines are both running, and then log on to
MIA-SQLBI as ADVENTUREWORKS\Student with the password Pa$$w0rd.
Start SQL Server Management Studio and connect to the localhost database engine instance by
using Windows authentication.
In the Staging database, view the contents of the dbo.Customers and dbo.InternetSales tables to
verify that they are both empty.
Open the Extract Internet Sales Data.dtsx package and examine its control flow.
Run the package, noting that the Extract Customers task succeeds, but the Extract Internet Sales
task fails. When execution is complete, stop debugging.
In SQL Server Management Studio, verify that the InternetSales table is still empty, but the
Customers table now contains customer records.
In SQL Server Management Studio, execute the following Transact-SQL query to reset the staging
tables.
TRUNCATE TABLE Staging.dbo.Customers
Configure the Extract Customer Sales Data sequence container in the Extract Internet Sales
Data.dtsx package so that it requires a transaction.
Ensure that the Extract Customers and Extract Internet Sales tasks both support transactions, and
configure them so that if they fail, their parent also fails.
Run the Extract Internet Sales Data.dtsx package, noting once again that the Extract Customers
task succeeds, but the Extract Internet Sales task fails. Note also that the Extract Customer Sales
Data sequence container fails. When execution is complete, stop debugging.
In SQL Server Management Studio, verify that both the InternetSales and Customers tables are
empty.
View the data flow for the Extract Internet Sales task, and modify the expression in the Calculate
Sales Amount derived column transformation to remove the text / (OrderQuantity %
OrderQuantity). The completed expression should match the following code sample.
UnitPrice * OrderQuantity
Run the Extract Internet Sales Data.dtsx package, noting that the Extract Customers and Extract
Internet Sales tasks both succeed. When execution is complete, stop debugging.
In SQL Server Management Studio, verify that both the InternetSales and Customers tables contain
data.
Close the AdventureWorksETL project when you have completed the exercise.
Results: After this exercise, you should have a package that uses a transaction to ensure that all data flow
tasks succeed or fail as an atomic unit of work.
You have created an SSIS package that uses two data flows to extract, transform, and load reseller sales
data. You now want to ensure that if any task in the package fails, it can be restarted without re-executing
the tasks that had previously succeeded.
The main tasks for this exercise are as follows:
1.
2.
3.
Implement checkpoints.
4.
Use SQL Server Management Studio to view the contents of the dbo.Resellers and
dbo.ResellerSales tables in the Staging database on the localhost database engine instance.
Open the Extract Reseller Data.dtsx package and examine its control flow.
Run the package, noting that the Extract Resellers task succeeds, but the Extract Reseller Sales task
fails. When execution is complete, stop debugging.
In SQL Server Management Studio, verify that the ResellerSales table is still empty, but the Resellers
table now contains reseller records.
In SQL Server Management Studio, execute the following Transact-SQL query to reset the staging
tables.
TRUNCATE TABLE Staging.dbo.Resellers
CheckpointFileName: D:\10777A\ETL\CheckPoint.chk
CheckpointUsage: IfExists
SaveCheckpoints: True
Configure the properties of the Extract Resellers and Extract Reseller Sales tasks so that if they fail,
the package also fails.
View the contents of the D:\10777A\ETL folder and verify that no file named CheckPoint.chk exists.
Run the Extract Reseller Data.dtsx package, noting once again that the Extract Resellers task
succeeds, but the Extract Reseller Sales task fails. When execution is complete, stop debugging.
View the contents of the D:\10777A\ETL folder and verify that a file named CheckPoint.chk has been
created.
In SQL Server Management Studio, verify that the ResellerSales table is still empty, but the Resellers
table now contains reseller records.
View the data flow for the Extract Reseller Sales task, and modify the expression in the Calculate
Sales Amount derived column transformation to remove the text / (OrderQuantity %
OrderQuantity). The completed expression should match the following code sample.
UnitPrice * OrderQuantity
Run the Extract Reseller Data.dtsx package, noting that the Extract Resellers task is not
re-executed, and package execution starts with the Extract Reseller Sales task, which failed on the
last attempt. When execution is complete, stop debugging.
In SQL Server Management Studio, verify that the ResellerSales table now contains data.
Close SQL Server Data Tools when you have completed the exercise.
Results: After this exercise, you should have a package that uses checkpoints to enable execution to be
restarted at the point of failure on the previous execution.
Module Review and Takeaways

Review Questions

1.

2.

3.