
MCT USE ONLY. STUDENT USE PROHIBITED

OFFICIAL MICROSOFT LEARNING PRODUCT

20776A
Performing Big Data Engineering on
Microsoft Cloud Services

Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.

The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement by Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not
responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement by Microsoft of the site or the products contained
therein.
© 2018 Microsoft Corporation. All rights reserved.

Microsoft and the trademarks listed at https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/en-us.aspx are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.

Product Number: 20776A

Part Number (if applicable): X21-57491

Released: 10/2018
MICROSOFT LICENSE TERMS
MICROSOFT INSTRUCTOR-LED COURSEWARE

These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its
affiliates) and you. Please read them. They apply to your use of the content accompanying this agreement which
includes the media on which you received it, if any. These license terms also apply to Trainer Content and any
updates and supplements for the Licensed Content unless other terms accompany those items. If so, those terms
apply.

BY ACCESSING, DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS.
IF YOU DO NOT ACCEPT THEM, DO NOT ACCESS, DOWNLOAD OR USE THE LICENSED CONTENT.

If you comply with these license terms, you have the rights below for each license you acquire.

1. DEFINITIONS.

a. “Authorized Learning Center” means a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, or such other entity as Microsoft may designate from time to time.

b. “Authorized Training Session” means the instructor-led training class using Microsoft Instructor-Led
Courseware conducted by a Trainer at or through an Authorized Learning Center.

c. “Classroom Device” means one (1) dedicated, secure computer that an Authorized Learning Center owns
or controls that is located at an Authorized Learning Center’s training facilities that meets or exceeds the
hardware level specified for the particular Microsoft Instructor-Led Courseware.

d. “End User” means an individual who is (i) duly enrolled in and attending an Authorized Training Session
or Private Training Session, (ii) an employee of a MPN Member, or (iii) a Microsoft full-time employee.

e. “Licensed Content” means the content accompanying this agreement which may include the Microsoft
Instructor-Led Courseware or Trainer Content.

f. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training session
to End Users on behalf of an Authorized Learning Center or MPN Member, and (ii) currently certified as a
Microsoft Certified Trainer under the Microsoft Certification Program.

g. “Microsoft Instructor-Led Courseware” means the Microsoft-branded instructor-led training course that
educates IT professionals and developers on Microsoft technologies. A Microsoft Instructor-Led
Courseware title may be branded as MOC, Microsoft Dynamics or Microsoft Business Group courseware.

h. “Microsoft IT Academy Program Member” means an active member of the Microsoft IT Academy
Program.

i. “Microsoft Learning Competency Member” means an active member of the Microsoft Partner Network
program in good standing that currently holds the Learning Competency status.

j. “MOC” means the “Official Microsoft Learning Product” instructor-led courseware known as Microsoft
Official Course that educates IT professionals and developers on Microsoft technologies.

k. “MPN Member” means an active Microsoft Partner Network program member in good standing.
l. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic device
that you personally own or control that meets or exceeds the hardware level specified for the particular
Microsoft Instructor-Led Courseware.

m. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led Courseware.
These classes are not advertised or promoted to the general public and class attendance is restricted to
individuals employed by or contracted by the corporate customer.

n. “Trainer” means (i) an academically accredited educator engaged by a Microsoft IT Academy Program
Member to teach an Authorized Training Session, and/or (ii) a MCT.

o. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and additional
supplemental content designated solely for Trainers’ use to teach a training session using the Microsoft
Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint presentations, trainer
preparation guide, train the trainer materials, Microsoft One Note packs, classroom setup guide and Pre-
release course feedback form. To clarify, Trainer Content does not include any software, virtual hard
disks or virtual machines.

2. USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is licensed on a one copy
per user basis, such that you must acquire a license for each individual that accesses or uses the Licensed
Content.

2.1 Below are five separate sets of use rights. Only one set of rights applies to you.

a. If you are a Microsoft IT Academy Program Member:


i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User who is enrolled in the Authorized Training Session, and only immediately prior to the
commencement of the Authorized Training Session that is the subject matter of the Microsoft
Instructor-Led Courseware being provided, or
2. provide one (1) End User with the unique redemption code and instructions on how they can
access one (1) digital version of the Microsoft Instructor-Led Courseware, or
3. provide one (1) Trainer with the unique redemption code and instructions on how they can
access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure each End User attending an Authorized Training Session has their own valid licensed
copy of the Microsoft Instructor-Led Courseware that is the subject of the Authorized Training
Session,
v. you will ensure that each End User provided with the hard-copy version of the Microsoft Instructor-
Led Courseware will be presented with a copy of this agreement and each End User will agree that
their use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement
prior to providing them with the Microsoft Instructor-Led Courseware. Each individual will be required
to denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching an Authorized Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Authorized Training Session,
vii. you will only use qualified Trainers who have in-depth knowledge of and experience with the
Microsoft technology that is the subject of the Microsoft Instructor-Led Courseware being taught for
all your Authorized Training Sessions,
viii. you will only deliver a maximum of 15 hours of training per week for each Authorized Training
Session that uses a MOC title, and
ix. you acknowledge that Trainers that are not MCTs will not have access to all of the trainer resources
for the Microsoft Instructor-Led Courseware.

b. If you are a Microsoft Learning Competency Member:


i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User attending the Authorized Training Session and only immediately prior to the
commencement of the Authorized Training Session that is the subject matter of the Microsoft
Instructor-Led Courseware provided, or
2. provide one (1) End User attending the Authorized Training Session with the unique redemption
code and instructions on how they can access one (1) digital version of the Microsoft Instructor-
Led Courseware, or
3. you will provide one (1) Trainer with the unique redemption code and instructions on how they
can access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure that each End User attending an Authorized Training Session has their own valid
licensed copy of the Microsoft Instructor-Led Courseware that is the subject of the Authorized
Training Session,
v. you will ensure that each End User provided with a hard-copy version of the Microsoft Instructor-Led
Courseware will be presented with a copy of this agreement and each End User will agree that their
use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement prior to
providing them with the Microsoft Instructor-Led Courseware. Each individual will be required to
denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching an Authorized Training Session has their own valid
licensed copy of the Trainer Content that is the subject of the Authorized Training Session,
vii. you will only use qualified Trainers who hold the applicable Microsoft Certification credential that is
the subject of the Microsoft Instructor-Led Courseware being taught for your Authorized Training
Sessions,
viii. you will only use qualified MCTs who also hold the applicable Microsoft Certification credential that is
the subject of the MOC title being taught for all your Authorized Training Sessions using MOC,
ix. you will only provide access to the Microsoft Instructor-Led Courseware to End Users, and
x. you will only provide access to the Trainer Content to Trainers.
c. If you are a MPN Member:
i. Each license acquired on behalf of yourself may only be used to review one (1) copy of the Microsoft
Instructor-Led Courseware in the form provided to you. If the Microsoft Instructor-Led Courseware is
in digital format, you may install one (1) copy on up to three (3) Personal Devices. You may not
install the Microsoft Instructor-Led Courseware on a device you do not own or control.
ii. For each license you acquire on behalf of an End User or Trainer, you may either:
1. distribute one (1) hard copy version of the Microsoft Instructor-Led Courseware to one (1) End
User attending the Private Training Session, and only immediately prior to the commencement
of the Private Training Session that is the subject matter of the Microsoft Instructor-Led
Courseware being provided, or
2. provide one (1) End User who is attending the Private Training Session with the unique
redemption code and instructions on how they can access one (1) digital version of the
Microsoft Instructor-Led Courseware, or
3. you will provide one (1) Trainer who is teaching the Private Training Session with the unique
redemption code and instructions on how they can access one (1) Trainer Content,
provided you comply with the following:
iii. you will only provide access to the Licensed Content to those individuals who have acquired a valid
license to the Licensed Content,
iv. you will ensure that each End User attending a Private Training Session has their own valid licensed
copy of the Microsoft Instructor-Led Courseware that is the subject of the Private Training Session,
v. you will ensure that each End User provided with a hard copy version of the Microsoft Instructor-Led
Courseware will be presented with a copy of this agreement and each End User will agree that their
use of the Microsoft Instructor-Led Courseware will be subject to the terms in this agreement prior to
providing them with the Microsoft Instructor-Led Courseware. Each individual will be required to
denote their acceptance of this agreement in a manner that is enforceable under local law prior to
their accessing the Microsoft Instructor-Led Courseware,
vi. you will ensure that each Trainer teaching a Private Training Session has their own valid licensed
copy of the Trainer Content that is the subject of the Private Training Session,
vii. you will only use qualified Trainers who hold the applicable Microsoft Certification credential that is
the subject of the Microsoft Instructor-Led Courseware being taught for all your Private Training
Sessions,
viii. you will only use qualified MCTs who hold the applicable Microsoft Certification credential that is the
subject of the MOC title being taught for all your Private Training Sessions using MOC,
ix. you will only provide access to the Microsoft Instructor-Led Courseware to End Users, and
x. you will only provide access to the Trainer Content to Trainers.

d. If you are an End User:


For each license you acquire, you may use the Microsoft Instructor-Led Courseware solely for your
personal training use. If the Microsoft Instructor-Led Courseware is in digital format, you may access the
Microsoft Instructor-Led Courseware online using the unique redemption code provided to you by the
training provider and install and use one (1) copy of the Microsoft Instructor-Led Courseware on up to
three (3) Personal Devices. You may also print one (1) copy of the Microsoft Instructor-Led Courseware.
You may not install the Microsoft Instructor-Led Courseware on a device you do not own or control.

e. If you are a Trainer:


i. For each license you acquire, you may install and use one (1) copy of the Trainer Content in the
form provided to you on one (1) Personal Device solely to prepare and deliver an Authorized
Training Session or Private Training Session, and install one (1) additional copy on another Personal
Device as a backup copy, which may be used only to reinstall the Trainer Content. You may not
install or use a copy of the Trainer Content on a device you do not own or control. You may also
print one (1) copy of the Trainer Content solely to prepare for and deliver an Authorized Training
Session or Private Training Session.
ii. You may customize the written portions of the Trainer Content that are logically associated with
instruction of a training session in accordance with the most recent version of the MCT agreement.
If you elect to exercise the foregoing rights, you agree to comply with the following: (i)
customizations may only be used for teaching Authorized Training Sessions and Private Training
Sessions, and (ii) all customizations will comply with this agreement. For clarity, any use of
“customize” refers only to changing the order of slides and content, and/or not using all the slides or
content, it does not mean changing or modifying any slide or content.

2.2 Separation of Components. The Licensed Content is licensed as a single unit and you may not
separate their components and install them on different devices.

2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights above, you may
not distribute any Licensed Content or any portion thereof (including any permitted modifications) to any
third parties without the express written permission of Microsoft.

2.4 Third Party Notices. The Licensed Content may include third party code that Microsoft, not the
third party, licenses to you under this agreement. Notices, if any, for the third party code are included
for your information only.

2.5 Additional Terms. Some Licensed Content may contain components with additional terms,
conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also
apply to your use of that respective component and supplement the terms described in this agreement.

3. LICENSED CONTENT BASED ON PRE-RELEASE TECHNOLOGY. If the Licensed Content’s subject
matter is based on a pre-release version of Microsoft technology (“Pre-release”), then in addition to the
other provisions in this agreement, these terms also apply:

a. Pre-Release Licensed Content. This Licensed Content subject matter is on the Pre-release version of
the Microsoft technology. The technology may not work the way a final version of the technology will
and we may change the technology for the final version. We also may not release a final version.
Licensed Content based on the final version of the technology may not contain the same information as
the Licensed Content based on the Pre-release version. Microsoft is under no obligation to provide you
with any further content, including any Licensed Content based on the final version of the technology.

b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly or
through its third party designee, you give to Microsoft without charge, the right to use, share and
commercialize your feedback in any way and for any purpose. You also give to third parties, without
charge, any patent rights needed for their products, technologies and services to use or interface with
any specific parts of a Microsoft technology, Microsoft product, or service that includes the feedback.
You will not give feedback that is subject to a license that requires Microsoft to license its technology,
technologies, or products to third parties because we include your feedback in them. These rights
survive this agreement.

c. Pre-release Term. If you are a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, MPN Member or Trainer, you will cease using all copies of the Licensed Content on
the Pre-release technology upon (i) the date which Microsoft informs you is the end date for using the
Licensed Content on the Pre-release technology, or (ii) sixty (60) days after the commercial release of the
technology that is the subject of the Licensed Content, whichever is earliest (“Pre-release term”).
Upon expiration or termination of the Pre-release term, you will irretrievably delete and destroy all copies
of the Licensed Content in your possession or under your control.
4. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some
rights to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you more
rights despite this limitation, you may use the Licensed Content only as expressly permitted in this
agreement. In doing so, you must comply with any technical limitations in the Licensed Content that only
allows you to use it in certain ways. Except as expressly permitted in this agreement, you may not:
• access or allow any individual to access the Licensed Content if they have not acquired a valid license
for the Licensed Content,
• alter, remove or obscure any copyright or other protective notices (including watermarks), branding
or identifications contained in the Licensed Content,
• modify or create a derivative work of any Licensed Content,
• publicly display, or make the Licensed Content available for others to access or use,
• copy, print, install, sell, publish, transmit, lend, adapt, reuse, link to or post, make available or
distribute the Licensed Content to any third party,
• work around any technical limitations in the Licensed Content, or
• reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation.

5. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to
you in this agreement. The Licensed Content is protected by copyright and other intellectual property laws
and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in the
Licensed Content.

6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations.
You must comply with all domestic and international export laws and regulations that apply to the Licensed
Content. These laws include restrictions on destinations, end users and end use. For additional information,
see www.microsoft.com/exporting.

7. SUPPORT SERVICES. Because the Licensed Content is “as is”, we may not provide support services for it.

8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you fail
to comply with the terms and conditions of this agreement. Upon termination of this agreement for any
reason, you will immediately stop all use of and delete and destroy all copies of the Licensed Content in
your possession or under your control.

9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible for
the contents of any third party sites, any links contained in third party sites, or any changes or updates to
third party sites. Microsoft is not responsible for webcasting or any other form of transmission received
from any third party sites. Microsoft is providing these links to third party sites to you only as a
convenience, and the inclusion of any link does not imply an endorsement by Microsoft of the third party
site.

10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.

11. APPLICABLE LAW.


a. United States. If you acquired the Licensed Content in the United States, Washington state law governs
the interpretation of this agreement and applies to claims for breach of it, regardless of conflict of laws
principles. The laws of the state where you live govern all other claims, including claims under state
consumer protection laws, unfair competition laws, and in tort.
b. Outside the United States. If you acquired the Licensed Content in any other country, the laws of that
country apply.

12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws
of your country. You may also have rights with respect to the party from whom you acquired the Licensed
Content. This agreement does not change your rights under the laws of your country if the laws of your
country do not permit it to do so.

13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS
AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE
AFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY
HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT
CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND
ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.

14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP
TO US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL,
LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.

This limitation applies to


o anything related to the Licensed Content, services, content (including code) on third party Internet
sites or third-party programs; and
o claims for breach of contract, breach of warranty, guarantee or condition, strict liability, negligence,
or other tort to the extent permitted by applicable law.

It also applies even if Microsoft knew or should have known about the possibility of the damages. The
above limitation or exclusion may not apply to you because your country may not allow the exclusion or
limitation of incidental, consequential or other damages.

Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this
agreement are provided below in French.

Remarque : Le contenu sous licence étant distribué au Québec, Canada, certaines des clauses
dans ce contrat sont fournies ci-dessous en français.

EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à vos seuls risques et périls. Microsoft n’accorde aucune autre garantie
expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection des
consommateurs, que ce contrat ne peut modifier. Là où elles sont permises par le droit local, les garanties
implicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contrefaçon sont exclues.

LIMITATION DES DOMMAGES-INTÉRÊTS ET EXCLUSION DE RESPONSABILITÉ POUR LES
DOMMAGES. Vous pouvez obtenir de Microsoft et de ses fournisseurs une indemnisation en cas de dommages
directs uniquement à hauteur de 5,00 $ US. Vous ne pouvez prétendre à aucune indemnisation pour les autres
dommages, y compris les dommages spéciaux, indirects ou accessoires et pertes de bénéfices.
Cette limitation concerne:
• tout ce qui est relié au contenu sous licence, aux services ou au contenu (y compris le code)
figurant sur des sites Internet tiers ou dans des programmes tiers; et
• les réclamations au titre de violation de contrat ou de garantie, ou au titre de responsabilité
stricte, de négligence ou d’une autre faute dans la limite autorisée par la loi en vigueur.
Elle s’applique également, même si Microsoft connaissait ou devrait connaître l’éventualité d’un tel dommage. Si
votre pays n’autorise pas l’exclusion ou la limitation de responsabilité pour les dommages indirects, accessoires
ou de quelque nature que ce soit, il se peut que la limitation ou l’exclusion ci-dessus ne s’appliquera pas à votre
égard.

EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits
prévus par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre
pays si celles-ci ne le permettent pas.

Revised July 2013



Acknowledgements
Microsoft Learning would like to acknowledge and thank the following for their contribution towards
developing this title. Their effort at various stages in the development has ensured that you have a good
classroom experience.

John Sharp – Content Developer


John Sharp is a developer and author working for Content Master. He specializes in distributed systems
and cloud-based solutions. He has authored books on various technologies for Microsoft Press, and has
spent a lot of time working with the Microsoft patterns & practices group writing guidance for
implementing large-scale systems in the cloud.

David Coombes – Content Developer


David has many years’ experience in the writing and designing of training courses, technical guides, and
whitepapers. David’s current technology focus is the Microsoft Azure environment, having been involved in
Microsoft cloud technologies from the early days of MOS/BPOS; recent projects have included Azure-based
resource deployments, and the use of machine learning and other data analysis tools. David also has many
years’ experience with networking and infrastructure technologies.

Santosh Kumar Veluguri – Content Developer


Santosh Veluguri has many years' experience in architecting and implementing solutions for several
enterprise programs and projects around the world focusing on Data and Business Intelligence areas.
Santosh has implemented over a dozen Azure based solutions and has trained many architects and
developers on Azure technologies. Santosh's current focus includes Machine Learning, Big Data, Internet
of Things, Artificial Intelligence and Cognitive Services.

Wren Mott – Content Developer


Wren Mott has spent the last two decades working with the Microsoft stack, from traditional on-premises
technologies to modern cloud-based solutions. Having worked with Azure since before its debut almost
eight years ago, he has stayed close to its growth in both breadth and service offerings, working on a
number of projects throughout Europe for some of the largest companies in the world. Recently he has
focused on data science and what the power of Azure can bring to its pursuit, with cybersecurity always at
the fore.

Contents

Module 1: Architectures for Big Data Engineering with Azure
Module Overview 1-1
Lesson 1: Understanding big data 1-3
Lesson 2: Architectures for processing big data 1-6
Lesson 3: Considerations for designing big data solutions 1-11
Lab: Designing a big data architecture 1-18
Module Review and Takeaways 1-21

Module 2: Processing Event Streams using Azure Stream Analytics
Module Overview 2-1
Lesson 1: Introduction to Stream Analytics 2-2
Lesson 2: Configure Stream Analytics jobs 2-14
Lab: Process event streams with Stream Analytics 2-23
Module Review and Takeaways 2-31

Module 3: Performing Custom Processing in Azure Stream Analytics
Module Overview 3-1
Lesson 1: Implementing custom functions and debugging jobs 3-2
Lesson 2: Incorporating Machine Learning into a Stream Analytics job 3-7
Lab: Performing custom processing with Stream Analytics 3-12
Module Review and Takeaways 3-17

Module 4: Managing Big Data in Azure Data Lake Store
Module Overview 4-1
Lesson 1: The Azure Data Lake Store 4-2
Lesson 2: Monitoring and protecting data in Data Lake Store 4-9
Lab: Managing big data in Data Lake Store 4-16
Module Review and Takeaways 4-21

Module 5: Processing big data using Azure Data Lake Analytics
Module Overview 5-1
Lesson 1: Introduction to Azure Data Lake Analytics 5-2
Lesson 2: Analyzing data with U-SQL 5-7
Lesson 3: Sorting, grouping, and joining data 5-23
Lab: Processing big data using Data Lake Analytics 5-36
Module Review and Takeaways 5-40

Module 6: Implementing Custom Operations and Monitoring Performance in Azure Data Lake Analytics
Module Overview 6-1
Lesson 1: Incorporating custom functionality into analytics jobs 6-2
Lesson 2: Optimizing jobs 6-19
Lesson 3: Managing jobs and protecting resources 6-37
Lab: Implementing custom operations and monitoring performance in Azure Data Lake Analytics 6-41
Module Review and Takeaways 6-45

Module 7: Implementing Azure SQL Data Warehouse
Module Overview 7-1
Lesson 1: Introduction to SQL Data Warehouse 7-2
Lesson 2: Designing tables for efficient queries 7-8
Lesson 3: Importing data into SQL Data Warehouse 7-17
Lab: Implementing SQL Data Warehouse 7-28
Module Review and Takeaways 7-32

Module 8: Performing Analytics with Azure SQL Data Warehouse
Module Overview 8-1
Lesson 1: Querying data in SQL Data Warehouse 8-2
Lesson 2: Maintaining performance 8-10
Lesson 3: Protecting data in SQL Data Warehouse 8-19
Lab: Performing analytics with SQL Data Warehouse 8-24
Module Review and Takeaways 8-28

Module 9: Automating Data Flow with Azure Data Factory
Module Overview 9-1
Lesson 1: Introduction to Data Factory 9-2
Lesson 2: Transferring data 9-11
Lesson 3: Transforming data 9-26
Lesson 4: Monitoring performance and protecting data 9-33
Lab: Automating the Data Flow with Azure Data Factory 9-38
Module Review and Takeaways 9-44



About This Course


This section provides a brief description of the course, audience, suggested prerequisites, and course
objectives.

Course Description
This five-day instructor-led course describes how to perform Big Data Engineering on Microsoft Cloud
Services.

Audience
The primary audience for this course is data engineers (IT professionals, developers, and information
workers) who plan to implement big data engineering workflows on Azure.

Student Prerequisites
In addition to their professional experience, students who attend this course should have:

 A good understanding of Azure data services.

 A basic knowledge of the Microsoft Windows operating system and its core functionality.

 A good knowledge of relational databases.

Course Objectives
After completing this course, students will be able to:

 Describe architectures for Big Data Engineering with Azure

 Process Event Streams using Azure Stream Analytics

 Perform Custom Processing in Azure Stream Analytics

 Manage data in Azure Data Lake Store

 Process data in Azure Data Lake Store

 Implement Custom Operations and monitor performance in Azure Data Lake Analytics

 Create a repository to support large-scale analytical processing in Azure SQL Data Warehouse

 Perform analytics with Azure SQL Data Warehouse

 Automate data flow with Azure Data Factory

Course Outline
The course outline is as follows:

 Module 1: ‘Architectures for Big Data Engineering with Azure’ describes the common architectures for
processing big data using Azure tools and services.

 Module 2: ‘Processing Event Streams using Azure Stream Analytics’ explains how to use Azure Stream
Analytics to design and implement stream processing over large-scale data.

 Module 3: ‘Performing Custom Processing in Azure Stream Analytics’ describes how to include
custom functions and incorporate machine learning activities into an Azure Stream Analytics job.

 Module 4: ‘Managing Big Data in Azure Data Lake Store’ explains how to use Azure Data Lake Store
as a large-scale repository of data files.

 Module 5: ‘Processing Big Data using Azure Data Lake Analytics’ describes how to use Azure Data
Lake Analytics to examine and process data held in Azure Data Lake Store.

 Module 6: ‘Implementing Custom Operations and Monitoring Performance in Azure Data Lake
Analytics’ describes how to create and deploy custom functions and operations, integrate with Python
and R, and protect and optimize jobs.

 Module 7: ‘Implementing Azure SQL Data Warehouse’ explains how to use Azure SQL Data
Warehouse to create a repository that can support large-scale analytical processing over data at rest.

 Module 8: ‘Performing Analytics with Azure SQL Data Warehouse’ describes how to use Azure SQL
Data Warehouse to perform analytical processing, how to maintain performance, and how to protect
the data.

 Module 9: ‘Automating the Data Flow with Azure Data Factory’ explains how to use Azure Data
Factory to import, transform, and transfer data between repositories and services.

Course Materials
The following materials are included with your kit:
 Course Handbook: a succinct classroom learning guide that provides the critical technical
information in a crisp, tightly-focused format, which is essential for an effective in-class learning
experience.

 Lessons: guide you through the learning objectives and provide the key points that are critical to the
success of the in-class learning experience.

 Labs: provide a real-world, hands-on platform for you to apply the knowledge and skills learned in
the module. Your instructor will provide you with the lab steps.

 Module Reviews and Takeaways: provide on-the-job reference material to boost knowledge and
skills retention.
 Lab Answer Keys: provide step-by-step lab solution guidance. Your instructor will provide you with
the lab steps.

Important: The lab answer keys are based on the versions of the Azure portal and
associated software that were current at the time of writing. Azure is a constantly evolving
environment, so some of the detailed steps provided in the lab answer keys might not reflect the
exact procedures that you need to perform, although the general principles should remain the
same.

Additional Reading: Course Companion Content on the https://aka.ms/Companion-MOC Site:
searchable, easy-to-browse digital content with integrated premium online resources that supplement the
Course Handbook.

 Modules: include companion content, such as questions and answers, detailed demo steps and
additional reading links, for each lesson. Additionally, they include Lab Review questions and answers
and Module Reviews and Takeaways sections, which contain the review questions and answers, best
practices, common issues and troubleshooting tips with answers, and real-world issues and scenarios
with answers.

o Resources: include well-categorized additional resources that give you immediate access to the
most current premium content on TechNet, MSDN®, or Microsoft® Press®.

Additional Reading: Student Course files on the https://aka.ms/Companion-MOC Site:
includes Allfiles.exe, a self-extracting executable file that contains all required files for the labs and
demonstrations.

o Course evaluation: at the end of the course, you will have the opportunity to complete an
online evaluation to provide feedback on the course, training facility, and instructor.

o To provide additional comments or feedback, or to report a problem with course resources, visit
the Training Support site at https://trainingsupport.microsoft.com/en-us. To inquire about the
Microsoft Certification Program, send an email to certify@microsoft.com.

Virtual Machine Environment


This section provides the information for setting up the classroom environment to support the business
scenario of the course.

Virtual Machine Configuration


In this course, you will use Microsoft® Hyper-V™ to perform the labs.

The following table shows the role of each virtual machine that is used in this course:

Virtual machine Role

20776A-LON-DEV LON-DEV provides the main development environment.

20776A-LON-SQL LON-SQL provides the SQL Server database and services.

MT17B-WS2016-NAT WS2016-NAT provides access to the Internet.

Software Configuration
The following software is installed on the virtual machines:

 Microsoft SQL Server 2016

 Microsoft Azure Storage Explorer version 0.8.16

 Microsoft Power BI Desktop (x64)

 Microsoft CLI 2.0 for Azure

 Microsoft SQL Server Management Studio version 17.2

 Microsoft Visual Studio 2017

 Microsoft Office 2016

Course Files
The files associated with the labs in this course are located in the E:\Labfiles folder on the
20776A-LON-DEV virtual machine.

Classroom Setup
Each classroom computer will have the same virtual machines configured in the same way. All students
and the instructor must perform the following tasks prior to commencing Module 1:
Start the VMs
1. In Hyper-V Manager, under Virtual Machines, right-click MT17B-WS2016-NAT, and then click Start.

2. In Hyper-V Manager, under Virtual Machines, right-click 20776A-LON-SQL, and then click Start.

3. In Hyper-V Manager, under Virtual Machines, right-click 20776A-LON-DEV, and then click Start.

4. Right-click 20776A-LON-DEV, and then click Connect.

5. Log in as Admin with the password Pa55w.rd.

Course Hardware Level


To ensure a satisfactory student experience, Microsoft Learning requires a minimum equipment
configuration for trainer and student computers in all Microsoft Learning Partner classrooms in which
Official Microsoft Learning Product courseware is taught.

 Processor:
o 2.8 GHz 64-bit processor (multi-core) or better

 AMD:

o AMD Virtualization (AMD-V)


o Second Level Address Translation (SLAT) - nested page tables (NPT)

o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (NX Bit)

o Supports TPM 2.0 or greater

 Intel:

o Intel Virtualization Technology (Intel VT)

o Supports Second Level Address Translation (SLAT) – Extended Page Table (EPT)
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (XD bit)

o Supports TPM 2.0 or greater

 Hard Disk: 500GB SSD System Drive

 RAM: 32 GB minimum

 Network adapter

 Monitor: Dual monitors supporting 1440 x 900 minimum resolution


 Mouse or compatible pointing device

 Sound card with headsets

In addition, the instructor computer must:


 Be connected to a projection display device that supports SVGA 1024 x 768 pixels, 16 bit colors.

 Have a sound card with amplified speakers.



Module 1
Architectures for Big Data Engineering with Azure
Contents:
Module Overview 1-1
Lesson 1: Understanding big data 1-3
Lesson 2: Architectures for processing big data 1-6
Lesson 3: Considerations for designing big data solutions 1-11
Lab: Designing a big data architecture 1-18
Module Review and Takeaways 1-21

Module Overview
This module introduces the concept of big data, and what makes the processing of big data different
from other data processing. The module then introduces the main architectures for processing big data,
and looks at the main issues that you should consider when designing a big data solution.

Objectives
After completing this module, you will be able to:
 Explain the concept of big data.

 Describe the Lambda and Kappa architectures for processing big data.

 Describe design considerations for building big data solutions with Azure.

Prerequisites
This section outlines the steps you need to take to set up the environment for this module. To complete
the labs, you will require an Azure trial subscription.

Azure trial subscription

There are demonstrations and labs in this course that require access to Microsoft® Azure®. You need to
allow sufficient time for the setup and configuration of a Microsoft Azure Pass that will provide access for
you and your students.

Access to Microsoft Azure Learning Passes for students of authorized Microsoft Learning
Partners
https://aka.ms/jjhtex

Lesson 1
Understanding big data
Data that is too large, too fast, or too different from traditional data sources is called “big data”. This
information comes from many different sources and resides within multiple departments of a single
company. Big data often answers vital company questions, such as “where should we open our next
branch?”
The underlying information is often too complex to analyze with traditional relational databases, or it’s in
the form of a real-time stream. The source data could be trapped in a file format that can’t be processed
or stored on a disk that is no longer accessible from the company network. Whatever the challenge you
are facing, big data solutions can help you unlock these business insights from your data.

Lesson Objectives
After completing this lesson, you will be able to describe:
 The three Vs of big data.

 The differences between batch processing and stream processing.

 The different use cases for big data.

Defining the three Vs of big data


It’s common to use the term “three Vs” to define the characteristics of big data. These three Vs cover the
following factors:

 Volume. There might simply be a huge amount of data (in the hundreds of terabytes) and this data
has to be stored and processed across multiple systems. Examples include server logs, website
analytics, and genomics datasets.

 Variety. The data does not have a standard structure; it might be completely unstructured, or perhaps
semi-structured. In consequence, it’s not possible to apply any form of schema to the data as it’s
being stored. Examples include photos, free text, Excel® files, or JSON data objects. This list is
practically endless.

 Velocity. The data is being collected from a wide range of devices and sources, and the sources
increase in number as the data increases in volume. Examples include Twitter data, manufacturing
sensors, and mobile phone GPS data.

When you consider which technology to use, it’s helpful to first understand which V you need to
overcome to find your answer. It’s not always straightforward and you might find that, to answer all your
questions, your solution will need to address all three Vs.

What is big data?


https://aka.ms/ivurw8

Listing processing techniques


When there is an insurmountable challenge, the common advice is to break the challenge into smaller,
more digestible pieces, and address each one individually. This advice also applies when you are trying to
solve a volume-based big data problem. Consequently, big data engines use batch processing when they
query huge stores of data.

Batch processing is the execution of a series of steps without intervention from an outside source. You
can visualize this idea by thinking of a SELECT COUNT(*) FROM Y WHERE X > 10 query on a relational
database. This query consists of two steps: retrieve the records where X > 10, and then count them. This
processing technique is very effective for use on large datasets where data isn’t changing rapidly. If data
is changing rapidly—that is, it has high velocity—by the time your batch returns a result, the data might
already be stale. High velocity data requires a new approach for processing information.

Streaming queries are used with high velocity data that has near real-time reporting requirements. A
streaming query analyzes data on arrival, keeping the query’s value in memory. When you request a
result, the query returns it from memory, instead of kicking off a batch job to retrieve it. To highlight the
differences between a streaming query and a batch query, consider the following scenarios:

1. If I were to ask, “how much water is currently in a still body of water?” it would be possible to collect
all the water, measure it, and provide an answer. This is an example of a batch query.

2. But what if that body of water was a flowing river? We couldn’t divert it into a holding container, and
I might want to know the number of gallons that passed in the last hour. A new approach would be
needed. Adding a flow meter that reports the amount of water measured over the last hour would
provide a solution. This is an example of a streaming query.

When dealing with a variety of data, you might need to analyze both large volumes of data and data with
a high velocity. In these situations, both a batch and a streaming query might be needed. This is called a
hybrid approach.
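
To make the contrast concrete, the following sketch expresses the same count first as a batch query and
then as a streaming query written in the Azure Stream Analytics query language (covered in Module 2).
The table, input, and output names (Readings, SensorInput, HourlyCounts) and the EventTime field are
assumptions made for this illustration, not objects defined by the course labs.

-- Batch query (T-SQL): rescans the stored data every time it runs.
SELECT COUNT(*) AS HighReadings
FROM Readings
WHERE X > 10;

-- Streaming query (Stream Analytics query language): evaluates events as they
-- arrive and emits one count at the end of each one-hour tumbling window.
SELECT
    System.Timestamp() AS WindowEnd,
    COUNT(*) AS HighReadings
INTO HourlyCounts
FROM SensorInput TIMESTAMP BY EventTime
WHERE X > 10
GROUP BY TumblingWindow(hour, 1)

The batch version answers “how many rows satisfy the condition right now?”, whereas the streaming
version continuously answers “how many qualifying events arrived in the last hour?” without re-reading
historical data.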

Identifying big data use cases


What exactly does incorporating big data solutions mean to companies? Being able to analyze big data
efficiently might directly influence the cost of manufacturing their product, the amount of revenue they
produce, or the experience of their customers.

Here are three examples of how organizations might use big data to solve business issues:

1. Wingtip Toys produces toys for young children using multiple machines in a large manufacturing
plant they built in 2014. During the last holiday rush, a manufacturing machine broke and delayed
the production of 100,000 Tailspins, their best-selling toy. Customers were outraged and many orders
were cancelled. Since the last holiday season, Wingtip Toys has implemented a big data solution that
monitors its facilities and recommends times of low production when it’s best to perform
maintenance on their manufacturing lines. This predictive maintenance solution minimizes the chance
of production delays during times of holiday rush, helping Wingtip Toys to fulfil their orders.

2. Northwind Traders is a financial firm that specializes in high frequency trading. The traders use
machine learning algorithms to buy and sell securities, making small profits on each trade. Recently,
they created an algorithm that would look at how much a particular stock is discussed on popular
social media platforms, and use the results to execute trades. They implemented big data technology
to run analysis over entire social networks and extract valuable data to train their algorithm.
Managing this volume and variety of data made it possible to implement a new trading strategy,
bringing brand new revenue to the firm.

3. Alpine Ski House is the most popular ski resort in the world. The resort management recently realized
that people were using goggle tan lines to see who spent the most days skiing on the mountain. This
method proved to be inaccurate and many locals were visibly upset when claims that they had
amassed 120 skiing days were not believed. Alpine Ski House implemented a big data solution that
would track skiers’ runs down the mountain, and provide aggregated statistics of the vertical feet they
had skied, along with the number of days spent on the mountain. This approach made a huge
difference to customer satisfaction and set Alpine Ski House apart from their competitors.

Question: What are some examples of big data use cases in your industry?

Lesson 2
Architectures for processing big data
Because real-time data gathering requires the processing of vast amounts of information, it’s vital that
you implement an effective real-time processing architecture that can cope with the rate of data input,
implements the necessary analysis, and generates the outputs that you need.

In this lesson, you will learn about the two main architectures for real-time data processing—Lambda and
Kappa. You will review their similarities and differences and identify when one or the other is the best
choice. You will then identify how Azure implements these architectures as services, along with how to
implement a big data processing architecture on this platform.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the Lambda architecture.


 Describe the Kappa architecture.

 Explain how Azure services implement the Lambda and Kappa architectures.

 Identify how to implement a big data processing architecture on Azure.

The Lambda architecture


Lambda architecture is older than Kappa, and uses batch and stream processing methods to handle huge
quantities of real-time data:

 Batch data processing. Lambda uses batch processing to generate accurate and comprehensive views
of batch data. However, batch processing always lags behind the real-time stream.

 Data streams processing. In addition to batch processing, Lambda simultaneously provides views of
online data through stream processing. These views are not so accurate but, unlike batch data, are
available immediately.
These processes are implemented through three layers:

 Batch layer—this performs the batch processing operations.

 Speed layer—this does the data stream processing without being concerned with accuracy.
 Serving layer—this stores output from both the batch and stream layers, and then responds to ad hoc
queries by providing precomputed views of the data or building custom views, depending on the
query.

Input for analysis comes from a stream of data that provides a series of records. This information might
come from a device or could be captured as part of a sampling process. The Lambda internal architecture
refers to processing as “hot path” or “cold path” flows:

 The hot path provides data stream processing, giving instant access to the data and analytics, but at
the expense of lower accuracy.

 The cold path improves accuracy with comprehensive batch processing, using data refinement
methods. Both methods use the same data, but apply different levels of granularity to obtain their
outputs.

The hot and cold paths perform similar tasks over the same data, but the hot path is more likely to use
approximations and coarse granularity to evaluate data, whereas the cold path is more precise and finer
grained. Apps use the hot path data to provide instant feedback and time critical decision-making, but
subsequently use the cold path data to make more measured evaluations and corrections. For example, if
you need to monitor the flow of water through a sewage treatment works, the hot path (stream
processing) might identify if there has been a sudden surge in water that requires immediate action, such
as opening outputs or closing inputs. For this type of real-time decision, the issue is not the exact flow
level in gallons per hour, but its relative magnitude. Alongside this monitoring of the critical real-time
data, the cold path process could then calculate data such as performance charts, looking at oxygen levels
or flow rates over time.
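
To make the hot path more concrete, the following Stream Analytics-style sketch raises an alert whenever
the average flow reported in a five-minute window exceeds a fixed threshold. The input, output, and field
names (FlowMeterInput, SurgeAlerts, FlowGallonsPerHour, ReadingTime) and the threshold value are
assumptions for this example only; a real solution would tune them to the plant being monitored.

-- Hot path sketch: coarse-grained, low-latency surge detection.
SELECT
    StationId,
    System.Timestamp() AS WindowEnd,
    AVG(FlowGallonsPerHour) AS AvgFlow
INTO SurgeAlerts
FROM FlowMeterInput TIMESTAMP BY ReadingTime
GROUP BY StationId, TumblingWindow(minute, 5)
HAVING AVG(FlowGallonsPerHour) > 50000

The cold path would typically recompute the same measures over the full history of readings at a finer
granularity, for example in a scheduled batch job, to produce the performance charts described above.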

To give another example, if you are developing applications for a professional sports team, such as long-
distance cycle racers, the real-time hot path data might be road speed, pedal cadence, rider power output
or heart rate—all inputs that might require immediate follow-up to maximize an athlete’s performance. In
contrast, the cold path data could involve more long-term reviews of rider performance, perhaps along a
specific section of a track or course, to identify any required changes to training programs.

Limitations
The main challenge with Lambda architecture is that you need to maintain consistency between the hot
and cold path processing logic. If you change the analytical requirements, then both sets of code must be
changed to maintain consistency.

A second consideration is that, after the data has been captured and processed, it is effectively immutable.
Alterations to the model, such as gathering a wider range of data or bringing newer capture devices online,
make it difficult to generate longer-term comparisons of historical data.

The Kappa architecture


Kappa has been developed to overcome some of
the limitations of the Lambda architecture.
One of the key differences is the way that Kappa
addresses the reanalyzing of historic data. In the
Kappa model, the original data is retained as an
historical sequence of “log” files. If you need to
change the model, you can “replay” the old data
from logs and apply the logic of the new
processing model to this data. The hot and cold
paths effectively use the same data. If the model
changes, historical data can be fed back through
the hot path to give the updated results.

In essence, Kappa is a simplification of Lambda, in which the batch processing function is removed. Data is
stored in an append-only immutable log. That log database could be something like Apache Kafka. From
that store, the information is then streamed through computational systems to generate the required
analysis—these computational systems could be Apache Storm, Apache Spark, or Kafka Streams.

There are two main advantages to the Kappa approach:

 The removal of the batch processing function means that Kappa requires maintenance of only one set
of code.

 Migrations and reorganizations are easy, because you simply repopulate a database from the
canonical store.

With Kappa, the serving layer still provides optimized responses to queries. However, these databases are
almost like caches—you can simply wipe and regenerate them from the original log data. You can use any
database type in the serving layer—for example, in-memory, persistent, or special purpose, such as full
text search.

Common Azure services for implementing Lambda and Kappa architectures
In reality, most big data processing systems
implement some form of hybrid combination of
Lambda and Kappa architectures, depending on
factors such as data usage, volume, requirements
for accuracy, and so on. The fact is that solutions
for big data need to meet processing
requirements, rather than be tied to one particular
ideology.

The following Azure components are used to implement Lambda and Kappa architectures for big data
processing:

 Azure Stream Analytics. Enables processing of data streams from event hubs and IoT hubs.

 Azure Machine Learning. Supports analyzing and predictive modeling of data streams.

 Azure Blob storage. Persists data at scale, up to the capacity limits of a storage account.

 Azure Data Lake Storage. Persists data at massive scale, with no limits on file size or the amount of data
stored.

 Azure Data Lake Analytics. Performs batch processing and analysis of big data, implementing very
large data parallel processing.
 Azure SQL Data Warehouse. Provides a massively parallel processing (MPP) capable relational
database that processes large volumes of data.

Note: Azure data lakes provide unstructured storage for very large amounts of data, which
typically ties in with use cases that Lambda and Kappa architectures enable. SQL Data Warehouse
is more suited for storing structured or processed data with a preconfigured schema. However,
SQL Data Warehouse could be used to store the processed output from Lambda or Kappa
architectures.

These options all support batch processing and real-time data stream analysis. For structured data storage
in Azure, you would use the relational database services. These options include:

 SQL Server running on a virtual machine.

 SQL Database, for storage up to 4 TB.

 SQL Data Warehouse for data volumes of 1 TB and more, with typical sizes in the 10 TB range.
Data Lake Analytics provides additional facilities to go beyond what is possible by using structured SQL
data storage, such as:

 Combining C# and SQL to generate more powerful queries.

 Managing unstructured files, such as images, XML or JSON.

 Federating queries to bring together structured data from a database or data warehouse and
unstructured data from a data lake.

Ultimately, your decision on which architecture to implement, and what features to use, depends on your
data source and how you are going to use the data. If you need elasticity and scalability, then Azure Data
Factory provides the facilities for moving data around in the cloud.

Implementing a big data processing architecture on Azure


Azure provides many of the features, facilities and
services that you need to build a big data solution,
regardless of the processing architecture you use.
A key part of this provision is Stream Analytics,
which provides a fully managed engine for
processing large numbers of streamed events.
Stream Analytics can then perform real-time
analytical computations, based on this streaming
data.

The streamed data comes from a wide range of devices and sensors, such as flow rate monitors,
density testers, atmospheric pressure evaluators,
traffic counters, voltmeters, and so on. Streamed data might also come from online presences such as
websites, social media feeds and applications, covering clicks, click rates, page views, video views, and
time spent on a webpage or using an app. Infrastructure systems also provide data streams of information
such as temperature, humidity or even the number of times elevators go up and down, along with
passenger numbers.

You use Stream Analytics to examine very high-volume, often disparate, streams of data, and use that information
to identify and analyze relationships, patterns and trends between the monitored sources. For example,
what is the effect of weather patterns on traffic flow? Using that data, you then carry out tasks in
applications or propagate that information to a website. These days, it’s easy to get a live feed of current
journey times on your commute, but what if you could find out what the expected levels of traffic would
be in an hour or later that afternoon? Would that information affect whether you were likely to go
shopping now or later?

This course is based around a hybrid Lambda/Kappa architecture, as shown on the slide:

 Device data is passed in from event hubs and IoT hubs (depending on the source devices).

 Stream Analytics does its own “hot” stream processing of this data, but also passes unchanged event
data to long-term storage (such as Blob storage or Data Lake Storage).

 Data Lake Analytics processes the same data to provide additional insights, and to generate analytical
models for Data Warehouse.

 Additional, slow moving data (lookups and supporting information from other databases) is ingested
through Data Factory and/or methods such as PolyBase and SQL Server Integration Services (SSIS).

 Stream Analytics and Data Lake Analytics use this data as part of the decision-making process in the
analytical logic.

 Stream Analytics, Data Lake Analytics, and Data Factory might include Azure Machine Learning
modules to provide additional predictive input, based on existing data.

 Power BI and custom apps built using the Azure SDK visualize and report the results.

Check Your Knowledge


Question

Which of the following are features of the Lambda model?

Select the correct answer.

Processing is split into hot and cold paths.

There are two sets of processing logic to maintain.

Streaming data processing is not supported.

Batch data processing is not supported.

It cannot be implemented in Azure.



Lesson 3
Considerations for designing big data solutions
By necessity, big data designs are different both from traditional data warehouse projects, and from large
relational databases, particularly in relation to the number of decision points. Rather than a plug and play
approach using tried and tested technologies, each big data implementation will be unique, bringing
together previously unused combinations of components to create the solution.
In this lesson, you will review the various considerations that you must address when designing big data
solutions. Getting these factors correct at the design stage will help to create a stable, secure environment
that achieves the solution parameters and will cost less overall than a poorly-specified implementation.

Lesson Objectives
After completing this lesson, you will be able to:

 Identify scalability requirements for big data designs.


 Pinpoint throughput factors.

 Implement effective security.

 Create a reliable big data system.


 Identify data sources and integration features to implement with a design.

Scalability
As a first principle, any well-designed big data
system should scale both up and down according
to demand, enabling the system to respond to
changing demand and yet minimize costs. To scale
effectively, you need to consider the following
factors:
 Workload partitioning

 Resource allocation

 Data partitioning

 Data storage

 Design for scaling

 Client affinity

 Platform auto scaling

Workload partitioning
The trick with designing for scalability is to partition the process into discrete and decomposable
elements. Processing elements should be as small as practicable, bearing in mind that distribution of these
elements should maximize utilization of compute units.
MCT USE ONLY. STUDENT USE PROHIBITED
1-12 Architectures for Big Data Engineering with Azure

Resource allocation
To achieve effective scaling, it’s essential that resources are easily allocated to, and deallocated from, any
component in the system. For a virtual machine, for example, it’s often the amount of system memory and
processing power that needs to be adjusted; for databases, it might be the size of log files; for analytics
jobs, it might be the computational units that are assigned to a job.

Data partitioning
You should consider whether to divide the data across database locations or use a database service (such
as Data Lake) that automatically partitions data transparently. Having control over partitions may be
beneficial in some cases, but such benefits can be outweighed by the administrative overhead that might
be required to maintain an optimum partitioning model as data volumes increase. For this reason, using a
storage option that does not directly expose the underlying partition scheme is often the best choice for
large dynamic enterprise-wide applications.

Data storage
You should ensure that you use the right type of data store for each component in the system. For
example, service layer elements in Lambda architectures might work best with an Azure SQL or SQL Server
database that implements a schema. You should also plan your storage on the assumption that data
volumes almost always increase over time, so even without fluctuations on the demand side, there will
typically be an ongoing requirement to scale up your storage.

Design for scaling


You should ensure that your entire design has been architected with scaling in mind. For example, you
might implement stateless services that add or remove instances without affecting current users, along
with auto-detection and auto-configuration of any new instances. This means that the system reacts
dynamically to any scaling actions and users benefit from the changes immediately.

Client affinity
Alongside stateless services within the back end, client connections should also be stateless so that, as new
service endpoints are deployed in a scale-out scenario, client connections do not remain attached to their
initial (and possibly now overloaded) connection. Within Azure, for example, this affinity is configured
through the Load Balancer.

Platform auto scaling


Azure supports automatic scaling out and in, sometimes referred to as horizontal scaling, where instances
of a resource are added or removed as required. This type of scaling is most commonly applied to
compute resources, such as bringing new virtual machines on stream to meet increased compute
demand. For resources such as databases and message queues, such scaling is more challenging and
typically involves changes to data partitioning, which are more likely to be manual (rather than automatic)
operations.

Throughput
Throughput and response times are critical to the
success of any big data system. You will need to
consider how to handle large numbers of inputs,
along with the anticipated data volumes.

When you create the design, you need to determine which operations are hot path, and
which are cold path and can therefore be
processed in batches. Throughput limitations are
more likely to have a negative impact on hot path
operations, because these are typically time
sensitive.

When thinking about throughput, it’s as important not to under provision as it is not to over provision:

 Under provisioning. The consequences of under provisioning will vary depending on the Azure
service. For example, event hubs have a capacity determined by the number of Throughput Units
(TUs) which, if not properly set, can impose a limit on the number of messages that are processed per
minute. TUs are shared across an event hub namespace, which might comprise many event hubs. For
services like this, where the throughput granularity is at a fairly coarse level, it’s important that one
particular service, such as an especially active event hub, doesn’t end up consuming all the potential
throughput and, therefore, generating a processing bottleneck.

 Over provisioning. It’s also important not to set overly high limits because, for some Azure services
(such as SQL Data Warehouse), you are billed for the available capacity, and not the current usage.

When designing for big data processing, it’s therefore important to be able to estimate the likely required
throughput, in addition to ensuring that there will be functions in place to adjust throughput to suit
actual workloads.

Security
Security is another key part of any IT system
design. Data breaches are becoming increasingly
commonplace, with attackers looking to steal
important data that could be sensitive to an
organization and its customers. With big data
designs, data needs to be protected, both when it
is in transit between a client and Azure services—
such as databases or storage—and when it’s at
rest on the storage medium.

Protecting data in transit


One way to protect data in transit is by not
routing traffic over the public internet, and instead
using Azure ExpressRoute for private network connectivity to cloud datacenters.

Another key approach to network security is to make use of Azure security features, including Virtual
Networks (vNets), Network Security Groups (NSGs), and perimeter networks to create a multilayered
defense:

 Azure Virtual Networks (vNets) provide a logical segmentation of the Azure cloud to which only
your subscription has access; you have full control over DNS settings, routing tables, security policies,
and IP address blocks.

 Network Security Groups (NSGs) help you to create rules (ACLs) at various levels, including at
network interfaces, at compute units such as virtual machines, and at the virtual subnet. NSGs,
therefore, enable fine-grained access control by permitting or denying communication between
workloads within a virtual network, and between on-premises and cloud-based systems.

 Azure perimeter networks should be designed so that all inbound packets must pass through
security appliances, such as firewalls, intrusion detection systems (IDS), and intrusion prevention
systems (IPS), before being able to access Azure services. Outbound packets should also pass through
these security appliances, to enable effective policy enforcement, inspection, and auditing.

Protecting data at rest


A key element in the protection of data at rest is effective access control, including account management
and access enforcement, and building on the principle of least privilege. Access control includes Azure
features, such as Role-Based Access Control (RBAC), Access Control Lists (ACLs), and the use of logical
partitions, together with network address, and date and time restrictions.

Identity management
To help prevent unauthorized access, it’s important to be able to establish and verify user identity
correctly. To overcome the limitations and vulnerabilities of traditional username and password
combinations, Azure supports two-factor authentication using smartcards or mobile phone apps to verify
that users are who they say they are. After authentication is validated against Azure Active Directory
(AAD), features such as RBAC and AAD user groups enable you to grant the minimum access required by
users and services.

Conditional access
Taking this a step further, the AAD conditional access feature protects against the potential consequences
of stolen or phished credentials by combining two-factor authentication with the requirement to possess
a device that is managed by Microsoft Intune® before granting access to services or to sensitive accounts,
such as administrator accounts.
Conditional access also blocks access attempts from particular geographic locations, from untrusted
networks, or when access attempts have been made from two or more geographically distant locations
within a short time period.

Encryption
Encryption is another fundamental requirement for a secure big data system—how you implement
encryption differs, depending on whether you are encrypting data in transit or at rest. For data in transit,
technologies such as Virtual Private Networks (VPNs) using IP Security (IPSec) provide encrypted tunnels for
client-to-Azure connections. For data at rest, there are several options, including Azure Storage
Service Encryption (SSE) for storage accounts, with block blobs and page blobs being automatically
encrypted when written to Azure Storage; similarly, Data Lake Store has automatic encryption for data
lake storage. You might also use Azure Disk Encryption to encrypt the OS disks and data disks used by an
Azure Virtual Machine. Client-Side Encryption is built into the Java and the .NET storage client libraries.

In addition to the mechanisms used for encrypting data, another consideration is the management of
encryption keys. In some environments, such as defense, it’s typically mandated that all keys must be
managed by the organization itself. For many Azure services, you can choose to use Azure-managed keys or
your own keys stored in Azure Key Vault or a similar store. Azure Key Vault enables access to keys and other
secrets to be limited to a specific AAD account.

Auditing
Auditing is an important component of any security plan, enabling systems access to be monitored, and
potential threats to be proactively identified and mitigated. By logging Azure operations, including system
access, any unauthorized or accidental changes will be recorded in an audit trail.

Reliability
Ensuring the reliability of a big data processing
system involves a range of tasks, operations, and
considerations. You must set up good monitoring
systems so that you know when issues occur, and
have procedures in place to be able to respond to
these issues in a timely manner. You must be able
to maintain network connectivity, and access to
your data and services. If things go seriously
wrong, you must have a good plan in place to be
able to recover from potential disaster.

Monitor your resources


Monitoring is all about having good system logs,
diagnostics and alerts in place so that, if things go wrong, you know about it as soon as possible.
Although each Azure service has its own logging and diagnostics procedures, you use methods such as
Azure Monitor to provide a single source for monitoring all your Azure resources.

Maintain connectivity
Cloud-based environments such as Azure require a reliable connection to the remote datacenter location.
The likelihood of service outages caused by failures in the connecting internet infrastructure is very low.
This is because the packet-switching technologies used by the internet backbone route traffic around
multiple failures in the network. Therefore, the most likely point of failure is your own connection to the
internet.

To ensure that this connection is secure and resilient, you should use a dedicated connection, such as a
VPN connection, or use ExpressRoute and, ideally, deploy multiple redundant connections. For example,
you could use a dedicated ExpressRoute connection as the primary route and have a VPN as backup.

Protect data in storage


The most important consideration is to replicate your data across multiple locations. With Azure Storage,
you choose the level of replication you require. Data is always replicated locally and, for all levels except
locally redundant storage, the data is also replicated across multiple datacenters. You should, therefore,
ensure that you select the appropriate redundancy level for your data. With Data Lake Store, this
replication is handled automatically, providing locally redundant copies, and multiple copies of data
across one Azure region.

When using database services, such as Azure SQL Database, you configure active geo-replication, so that
your primary database is replicated to up to four readable secondary databases. You might configure
these secondary databases to be in different regions, so you can use them for querying and failover if the
primary database becomes unavailable.
SQL Data Warehouse supports distributed data, where the data is stored across multiple locations that
each act as independent storage and processing units. This means that, in addition to providing resilience,
distributed data enables very high query performance through running queries in parallel across locations.

Maintain service availability


To achieve reliability and resilience for Azure services such as Stream Analytics, Service Bus endpoints,
event hubs and so on, one of the key considerations is to minimize communications latency between
services. One way to reduce the risk of latency related issues is, where possible, to host all services within
the same region. For mission critical services, it might be possible to provide redundancy—for example,
you could configure identical Stream Analytics jobs in multiple Azure regions to achieve geo-redundancy,
with each job using local sources for job inputs and outputs.

Within any big data processing system, there will almost certainly be custom code; you should ensure that
such code includes robust failure handling procedures.

Plan for disaster recovery


An underlying principle for any reliable and robust system is to have a tried and tested disaster recovery
(DR) plan in place. One key piece of any DR plan is how to manage backups, and strategies for restores. In
the Azure environment, there is a range of backup provisions; for example, Azure SQL Database provides an
automatic full backup every week, with differential backups every hour, and a transaction log backup taken
every five minutes. SQL Data Warehouse supports local and geographical backups, with snapshot backups
to Azure Blob storage; you can restore from a restore point in the primary region, or restore to a different
region.

Data Lake Store requires a slightly different approach to DR. Because Data Lake Store uses automated
replicas within a region, you need to maintain your own copies of critical data in a different store account
or location to protect against issues such as accidental deletions. You maintain these copies using tools
such as ADLCopy, Azure PowerShell or Azure Data Factory (Data Factory is particularly useful for
managing recurring data copying and mirroring).

Data sources and integration


When you plan a big data processing system,
there are several key sets of questions that must
be answered:

 Where is your data coming from?

 Where is your data going?

 Does any data need to be transformed before it is used?

Data sources
Source data might come from devices, sensors,
data feeds, databases, flat files, other Azure
services, and other on-premises or cloud services. For example, in Azure, you might use IoT hubs or event
hubs to collect data, or the data may already be online or in other databases, such as Azure SQL Database.

Data destinations
After processing, data might be sent to dashboards, such as Power BI, or be consumed by services, such as
Azure Analysis Services. The data might also be copied for long-term storage to locations such as SQL
Data Warehouse or Data Lake Store.

Data transfers and transformations


Because data might come from a wide range of sources and systems, it’s likely that some form of data
transformations will be needed so that your processing logic works with, and combines, any disparate and
potentially incompatible data sources.

Example: Azure Stream Analytics


The Stream Analytics pipeline starts with a source of streaming data; this might be from devices that pass
data through event hubs or IoT hubs, or from a data store such as Azure Blob storage. The processing is
done using a Stream Analytics job that, in addition to specifying the data source, will specify a
transformation by using a SQL-like query language so that you filter, sort, aggregate, and join streaming
data over a particular time period. The job output could be to an actionable queue, a Power BI dashboard,
or storage such as Data Lake Store, SQL Server database, or Azure blobs or tables.
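
To make the transformation stage concrete, the following is a minimal, hypothetical Stream Analytics
query; the input, output, and column names ([SensorStreamInput], [DashboardOutput], SensorId, FlowRate,
EventTime) are placeholders rather than part of any specific solution. It filters a device stream and
aggregates readings over 30-second windows before writing the results to an output such as a Power BI
dataset:

SELECT
SensorId,
AVG(FlowRate) AS AvgFlowRate,
MAX(FlowRate) AS PeakFlowRate
INTO [DashboardOutput]
FROM [SensorStreamInput] TIMESTAMP BY EventTime
WHERE FlowRate > 0
GROUP BY SensorId, TumblingWindow(second, 30)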

Example: Azure Data Factory


Data Factory pipelines also take data from disparate sources, but then process/transform the data using
compute services, including Azure HDInsight®, Hadoop, Spark, Azure Data Lake Analytics, and Azure
Machine Learning. Data Factory then outputs the processed data to SQL Data Warehouse so that it can be
used by business intelligence (BI) applications.

Check Your Knowledge


Question

Which of the following are valid storage locations in Azure?

Select the correct answer.

Azure SQL Database

Azure SQL Data Warehouse

Azure Data Lake Store

Azure Blob storage

Azure Data Tables



Lab: Designing a big data architecture


Scenario
You have been asked to design a traffic surveillance system that monitors vehicles travelling on public
roads.

The system is based on fixed roadside traffic cameras with automatic number plate recognition (ANPR)
software built in. These cameras are positioned at strategic junctions and locations on the road network,
and capture information about each vehicle that passes.

Information from the ANPR cameras needs to be passed to the Police Central Communications Center
(PCCC). Software at the PCCC determines whether vehicles are speeding, and can trigger functionality that
raises fines or summonses (depending on how much over the speed limit a vehicle is travelling). The PCCC
can also detect whether a vehicle is reported as stolen, and alert a nearby police patrol car to try and
intercept the vehicle.
Police patrol cars are equipped with devices that communicate with the PCCC, transmitting their current
location and speed. These devices can also receive alerts from the PCCC concerning the location of nearby
suspect and stolen vehicles.

Objectives
After completing this lab, you will have considered the requirements for a system that:

 Captures a stream of vehicle details from each ANPR camera and sends it to the PCCC for processing.
 Tracks the locations of police patrol cars, and communicates real-time information about suspect
vehicles to these patrol cars.

 Automates processing for generated speeding fines and summonses.

 Generates real-time reports and other statistical information about vehicle speeds.

 Stores all captured information for future analysis.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Lab Setup
Estimated Time: 60 minutes

Virtual machine: 20776A-LON-DEV

Username: Admin

Password: Pa55w.rd

This is a paper-based lab. Students do not require access to Azure.

Exercise 1: Capturing vehicle data and alerting patrol cars


Scenario
The ANPR cameras capture the speed, location, and registration number for each vehicle that passes. You
want to use this data to:

 Detect whether the vehicle is speeding and/or stolen.



 Alert patrol cars about any speeding/stolen vehicles in their vicinity.

 Generate real-time reports about the speeds captured by each camera (such as the average speed
during the last 30 seconds).

 Detect whether a traffic incident has occurred that could require a patrol car to attend.

You need to determine the most appropriate technologies to implement these features. The solution must
be scalable to handle data from many hundreds of cameras, and be capable of processing data very
efficiently to enable the timely interception of stolen vehicles. It must also be able to detect anomalies,
such as the same car appearing to be in more than one place simultaneously; a car might be fitted with
false registration plates, for example.

The main tasks for this exercise are as follows:

1. Handling incoming speed camera data

2. Communicating between the PCCC and the patrol cars

3. Incorporating supplementary data

 Task 1: Handling incoming speed camera data

 Task 2: Communicating between the PCCC and the patrol cars

 Task 3: Incorporating supplementary data

Results: At the end of this exercise, you will have selected the technologies that support the stream
processing requirements listed in the scenario.

Exercise 2: Performing batch processing


Scenario
In addition to the dynamic stream processing performed by the PCCC, several batch processes are
required, including:

 Generating speeding fines and summonses.

 Creating reports that show how traffic flows vary over time at each speed camera.

 Performing general analyses, such as the likelihood of speeding vehicles also being recorded as
stolen.

You need to determine the most appropriate technologies to implement these features.

The main tasks for this exercise are as follows:

1. Generating fines and summonses

2. Reporting and analyzing data in batch

 Task 1: Generating fines and summonses

 Task 2: Reporting and analyzing data in batch

Results: At the end of this exercise, you will have selected the technologies that support the batch
processing requirements listed in the scenario.

Exercise 3: Storing long-term data and performing analytical processing


Scenario
The data captured and generated by the various streaming and batch processes might be required for
detailed analytical processing in the future. It might also be necessary to combine this data with
information retrieved from other sources.

You have been asked how to store and structure this information.

The main tasks for this exercise are as follows:

1. Determining long-term data storage and processing requirements

2. Migrating data to long-term analytical storage

 Task 1: Determining long-term data storage and processing requirements

 Task 2: Migrating data to long-term analytical storage

Results: At the end of this exercise, you will have selected the technologies that support the data storage
and detailed analytical processing requirements of the system.

Module Review and Takeaways


In this module, you learned about:

 The concept of big data.

 The Lambda and Kappa architectures for processing big data.


 Design considerations for building big data solutions with Azure.

Review Question(s)
Question: What are likely to be your biggest challenges when planning for big data
processing in your organization?

Module 2
Processing Event Streams using Azure Stream Analytics
Contents:
Module Overview 2-1 
Lesson 1: Introduction to Stream Analytics 2-2 

Lesson 2: Configure Stream Analytics jobs 2-14 

Lab: Process event streams with Stream Analytics 2-23 


Module Review and Takeaways 2-31 

Module Overview
In today’s business world, enterprises receive larger amounts of data from different applications and
devices faster than ever before. Being able to process this data as it flows into an organization’s data
platform is crucial to uncover real-time insights, like sales performance or maintenance issues. You use
Microsoft® Azure® Stream Analytics to achieve this real-time analysis on streaming data.

Stream Analytics is a managed service in Azure that you use to create analytic computations on streaming
data. You can connect a Stream Analytics job to different streaming inputs, transform the data using a
SQL-like query language to join, aggregate, sort, and filter data over a given time interval, then output the
data to one or many destinations—including relational and nonrelational data stores, Service Bus queues
or topics, and Power BI™. Customers use Stream Analytics to create a real-time stream processing
workflow quickly to gain insights from streaming data in minutes.

Objectives
By the end of this module, you will be able to:
 Describe Stream Analytics and its purpose.

 Develop and deploy Stream Analytics jobs.

 Configure Stream Analytics jobs for scalability, reliability, and security.



Lesson 1
Introduction to Stream Analytics
Stream Analytics is used to process large amounts of streaming data coming from devices, applications, or
processes. Stream Analytics jobs are configured to query the streaming data to filter the output, look for
patterns, and control the flow of data to the destination. You use Stream Analytics to create automation
workflows, perform real-time reporting, store data for batch processing, or raise alerts based on the
analytics performed on the data.

Lesson Objectives
By the end of this lesson, you should be able to:

 Describe the purpose of Stream Analytics.

 Describe the structure of a Stream Analytics job.

 List the types of data sources.


 List the types of outputs.

 Define a basic query.

 Explain aggregations and groupings.

 Describe how to group data by time.

 Show how to join data from different sources.

 Test and document jobs.


 Create and run a Stream Analytics job.

What is Stream Analytics?


Stream Analytics is a managed Azure service that you use to process large amounts of streaming data
coming from sources such as Azure Event Hubs, Azure IoT Hub, and Azure Blob storage. Stream Analytics
jobs are used as a quick
and reliable way to set up stream processing
workflows that provide valuable insights on
streaming data.

Structure of a Stream Analytics job


All Stream Analytics jobs have a basic structure of
one or more inputs, one or more outputs, and
queries to analyze data from the inputs and
route data to the outputs.

The following table shows the supported inputs and outputs of Stream Analytics jobs. These will be
covered in greater detail later in the module.

Inputs                  Outputs

IoT Hub                 Azure Data Lake Store
Event Hubs              Azure SQL DB
Azure Blob storage      Azure Blob storage
                        Event Hubs
                        Power BI
                        Azure Table Storage
                        Service Bus queues
                        Service Bus topics
                        Azure Cosmos DB

Types of data sources


A Stream Analytics data source is referred to as
an input to the Stream Analytics job. Stream
Analytics jobs can connect to three input
sources—Azure IoT Hub, Azure Blob storage, and
Azure Event Hubs.
Input sources can be one of two types of input:
data stream or reference data.

Data stream inputs


Data stream inputs are streaming events that form an unbounded sequence over time. Event Hubs, IoT
Hubs, and Blob storage are all used as data stream inputs—each Stream Analytics job must have at least
one data stream input. Data stream inputs can be in CSV, JSON, or Avro format.

Reference data inputs


Stream Analytics can connect to static auxiliary datasets that are used as lookup data in the Stream
Analytics job. You use a join in the Stream Analytics query to join the reference data with the data stream
input to perform a lookup. Blob storage is used as the input source for reference data, and the reference
data source blob size is limited to 100 MB.
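
As a minimal sketch of such a lookup (the input names [StreamInput] and [DeviceReference], and the
column names, are placeholders rather than part of any specific solution), a reference data input is simply
joined to the data stream input in the query; reference joins do not require a temporal bound:

SELECT
i.DeviceId,
r.DeviceName,
i.Temperature
INTO [SampleOutput]
FROM [StreamInput] i
JOIN [DeviceReference] r
ON i.DeviceId = r.DeviceId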

Event Hubs as data stream input


Event Hubs is a highly scalable event ingestion service that ingests millions of events per second. Event
hubs are typically used to collect events from many devices, services, and applications, and have seamless
integration with Stream Analytics.

Event hubs have a concept known as a consumer group that provides separate views of the event hub to
enable multiple applications to connect and read the stream independently. It’s a best practice to create a
separate consumer group for each Stream Analytics job that connects to the event hub, though there is a
limit of 20 consumer groups per event hub. If your Stream Analytics job contains multiple SELECT
statements that connect to the same event hub consumer group, you should consider creating multiple
consumer groups and Stream Analytics inputs—one for each reference to the input data stream in your
query.
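
As an illustrative sketch of this guidance (the input and output names, such as [TelemetryCG1],
[TelemetryCG2], [AlertOutput], and [ArchiveOutput], are hypothetical), you would define two job inputs that
point at the same event hub but use different consumer groups, and then reference one input per SELECT
statement:

--[TelemetryCG1] and [TelemetryCG2] are assumed to be separate job inputs over the same
--event hub, each configured with its own consumer group
SELECT DeviceId, Temperature
INTO [AlertOutput]
FROM [TelemetryCG1]
WHERE Temperature > 100
SELECT *
INTO [ArchiveOutput]
FROM [TelemetryCG2]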

IoT Hub as data stream input


IoT Hub is a highly scalable ingestion service that is optimized for IoT scenarios by being able to connect
and communicate with many IoT devices and gateways. Using an IoT hub as an input to a Stream
Analytics job is very similar to using an event hub, though it passes extra metadata specific to IoT hub. The
same consumer group guidance is applicable to event hub and IoT hub inputs.

Blob storage as data stream input


Data stored in Blob storage is typically considered data at rest, but it is sometimes useful to process this
data as an input stream to a Stream Analytics job. One widely used scenario for using Blob storage as a
data stream is to process log files that have been stored in Blob storage but still need to be analyzed to
extract actionable insights.

Blob storage as reference data


To use Blob storage as reference data, you create an input to the Stream Analytics job that is similar to
creating a data stream input, but specify reference data as the type of input. Blob reference data has a
limit of 100 MB per blob, but you use the path pattern property of the input to reference multiple blobs
in your query.
You generate reference data on a schedule and reference this data in the query by taking advantage of
the {date} and {time} substitution tokens. For example, when specifying the input pattern of the blob,
you use ‘input/referenceData/{date}/{time}/data.csv’ to instruct Stream Analytics to look for a reference
blob with the location ‘input/referenceData/2017-08-01/18-30/data.csv’ when querying data on August 1,
2017 at 6:30 PM.

Types of output
Stream Analytics integrates seamlessly with many
types of output. These outputs might be
persistent storage, queues for further data
processing, or Power BI for reporting of the
streaming dataset. Outputs can also be in CSV,
JSON, or Avro format.

The following is a list of Stream Analytics output types and typical scenarios for the given type:

Output Scenario

Azure Data Lake Store Used to store data for further batch processing by Azure Data Lake
Analytics or HDInsight®.

Azure SQL DB Used to store data that is relational in nature and needs to be
accessible via a SQL query.

Azure Blob storage Used to store data in a cost-effective and scalable manner that
provides access to applications and batch processing solutions like
Azure Data Lake Analytics and HDInsight.

Event Hubs Used to ingest data from the output of one Stream Analytics job to
send to another streaming job for further processing.

Power BI Used to create a real-time reporting dataset in Power BI that is used to
create real-time dashboards and metric reports.

Azure Table storage Used to store data in a low latency, highly scalable manner for
integration with downstream applications.

Service Bus queues Used as a first in, first out (FIFO) message delivery service that has
competing consumers. This can be used for round robin processing of
the data by downstream applications.

Service Bus topics Used as a one-to-many message delivery service to send events to be
processed by multiple downstream applications.

Azure Cosmos DB Used as a highly scalable NoSQL document database to enable
applications to query documents on a global scale with low latency.

Define basic queries


You use a SQL-like query language to define
Stream Analytics queries. For syntax purposes,
this query language is similar to T-SQL but differs
in how the query is executed. For details of the
Stream Analytics query language reference, see:
Stream Analytics query language reference
https://aka.ms/swzhvp

You use Stream Analytics queries to select data from one or many inputs into one or many outputs. The
following is an example of a very simple query that you use to pass all events from the input
to the output:

SELECT
*
INTO [SampleOutput]
FROM [SampleInput]

You might also select specific columns and filter the data based on a condition, as follows:

SELECT
ProductName, ProductCategory, Price
INTO [SampleOutput]
FROM [SampleInput]
WHERE Price>=200

The following is an example of how to use multiple outputs in one Stream Analytics job:

SELECT
*
INTO [SampleOutput1]
FROM [SampleInput]
WHERE Price>=200
SELECT
*
INTO [SampleOutput2]
FROM [SampleInput]
WHERE Price<200

Note that, while you can have multiple inputs and multiple outputs in a single Stream Analytics job, it’s
best practice to split unrelated queries into multiple Stream Analytics jobs. This helps optimize the
performance of each Stream Analytics job by reducing complexity and processing steps.

Aggregations and groupings


When querying data streams with Stream
Analytics, many aggregate functions are used to
perform calculations and create an output with
more actionable data. Aggregate functions
create a single value output based on a set of
input values.

You use aggregate functions in the following circumstances:

 In the SELECT statement of a subquery or an outer query.

 In a HAVING clause.

The following is a list of aggregate functions you use with Stream Analytics:

Aggregate function Description

AVG Returns the average of a group of values.

COUNT Returns the number of items in a group as a BIGINT data type.

Collect Returns an array with all values from the time window.

CollectTOP Returns an array of ranked records defined by the ORDER BY clause.

MAX Returns the maximum value in a set of values.

MIN Returns the minimum value in a set of values.

Percentile_Cont Returns a percentile based on a continuous distribution of the dataset
sorted by the ORDER BY clause. The result is interpolated and might be
different from each of the values in the input dataset.

Percentile_Disc Returns a percentile based on the entire dataset sorted by the ORDER
BY clause. The result will be equal to a value in the input dataset.

STDEV Returns the standard deviation of a set of values in the input.

STDEVP Returns the standard deviation for the population for a set of values in
the input.

SUM Returns the sum of the set of values in the input.

TopOne Returns the top result from a set of values based on the ORDER BY
clause.

VAR Returns the variance of a set of values in the input.

VARP Returns the variance for the population of a set of values in the input.

The GROUP BY clause is typically used when aggregate functions are used in the SELECT statement.
GROUP BY clauses are used in Stream Analytics in the same way as a GROUP BY clause is used in T-SQL.

For example, to select the top 10 products by sales quantity over the last three minutes (windowing
functions are described in more detail in the next section):

SELECT
ProductName,
CollectTop(10) OVER (ORDER BY QuantitySold DESC)
FROM [SampleInput] timestamp by time
GROUP BY TumblingWindow(minute, 3), ProductName

Here is a list of common query patterns for Stream Analytics:

Query examples for common Stream Analytics usage patterns


https://aka.ms/esevh2

Grouping data by time


When using aggregate functions and the GROUP
BY clause, you must include a windowing
function in the GROUP BY statement—this is to
instruct Stream Analytics to perform the
aggregate function on only the data inside the
window. There are three windowing functions in
Stream Analytics: sliding window, tumbling
window, and hopping window.

Sliding window
Sliding windows consider all possible windows of
the given length. However, to make the number
of windows manageable for Stream Analytics,
sliding windows produce an output only when an event enters or exits the window. Every window has at
least one event, and each event can be in more than one sliding window.
For example, you have the following input dataset with a time and a value:

Timestamp (seconds) Value

12 1

19 2

24 3

A sliding window of 10 seconds will produce windows ending at the following times:

 12—value 1 enters the window; one value in window (value 1).

 19—value 2 enters the window; two values in window (value 1, value 2).
 22—value 1 exits the window; one value in window (value 2).

 24—value 3 enters the window; two values in window (value 2, value 3).

 29—value 2 exits the window; one value in window (value 3).

A window is not created at time 34, because this would create an empty window (value 3 exits the
window).
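
As a minimal sketch of the SlidingWindow function (the input name [SampleInput] and the DeviceId and
EventTime columns are placeholders), the following query counts events per device over every 10-second
sliding window, and could be used to flag any device that produces two or more events within any
10-second period:

SELECT
DeviceId,
COUNT(*) AS EventCount
FROM [SampleInput] TIMESTAMP BY EventTime
GROUP BY DeviceId, SlidingWindow(second, 10)
HAVING COUNT(*) >= 2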

Tumbling window
Tumbling windows are fixed-size windows that do not overlap and are contiguous. When the timespan of
the window size has passed, a new window is immediately created with the same duration.

For example, to calculate the average temperature by device for each five-minute period:

SELECT
TimeStamp,
DeviceId,
AVG(Temp) AS AvgTemp
FROM [SampleInput] TIMESTAMP BY TimeStamp
GROUP BY DeviceId, TumblingWindow(mi, 5)

Hopping window
Hopping windows are used to specify overlapping windows that are scheduled. Hopping windows are
defined with a windowsize, hopsize, and timeunit.

 Windowsize—describes the size of the window.

 Hopsize—describes how much each window moves forward.

 Timeunit—describes the unit of time for the windowsize and hopsize.

For example, you have the following set of data including a timestamp and event name:

Timestamp (seconds) Name

3 A

5 B

8 C

12 D

16 E

20 F

You specify a hopping window with a windowsize of 10, a hopsize of 5, and seconds for the time unit. This
will create the following windows:

 0-10 seconds—contains events A, B, and C.

 5-15 seconds—contains events C and D.

 10-20 seconds—contains events D, E, and F.

Notice that the windows are inclusive of the end of the window and exclusive of the beginning of the
window. You use the offsetsize parameter to change this behavior.
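
As an illustrative sketch (the input name [SampleInput] and the EventTime column are placeholders), the
following query counts events in 10-second windows that hop forward every 5 seconds, matching the
windows described above; the System.Timestamp property returns the end time of each window:

SELECT
System.Timestamp AS WindowEnd,
COUNT(*) AS EventCount
FROM [SampleInput] TIMESTAMP BY EventTime
GROUP BY HoppingWindow(second, 10, 5)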

TIMESTAMP BY
There’s a timestamp associated with each event that is processed by Stream Analytics. Timestamps are
generally created based on the arrival time to the input source. For example, events in Blob storage have
a timestamp based on the blob’s last modified time; events captured with event hubs are given a
timestamp of when the event arrives. An event timestamp is retrieved by using the System.Timestamp
property in any part of the query.

Many streaming scenarios require processing events based on the time they occur, rather than the time
they are received. For example, point-of-sale systems typically need to process the data based on the
timestamp when the transaction occurs.

To process events based on a custom timestamp or the time they occur, you use the TIMESTAMP BY
clause in the SELECT statement:

SELECT
TransTime,
RegisterId,
TransId,
TransAmt
FROM [SampleInput] TIMESTAMP BY TransTime

Joining data from different sources


Joins in Stream Analytics are used in a similar way
to how joins are used in standard T-SQL.
However, joins in Stream Analytics have a
temporal nature to define how far apart events
can occur and still be joined. This is because joins
without a temporal element in Stream Analytics
would be unbounded and, potentially, might be
an infinite collection. Reference data joins are the
only joins in Stream Analytics that do not require
a temporal bound.
The DATEDIFF function is used in the ON clause
of the JOIN statement to define the time bounds
to join the two datasets.

For example, to join two streaming datasets with a reference dataset:

SELECT
i1.StoreId,
i1.CustEntryTime,
i2.CustExitTime,
r.StoreDescription
FROM [SampleInput1] i1 TIMESTAMP BY CustEntryTime
JOIN [SampleInput2] i2 TIMESTAMP BY CustExitTime
ON DATEDIFF(minute, i1, i2) BETWEEN 0 AND 5
AND i1.StoreId=i2.StoreId
JOIN [ReferenceInput] r
ON i1.StoreId=r.id

Testing and documenting jobs


When you create a Stream Analytics job, it’s useful to test your query before starting it—this avoids the
delays of repeatedly starting and stopping the job while you debug the query. You test jobs directly in the
portal using the query editing blade.
To test the Stream Analytics job in the Azure
portal:

1. Browse to the query editing blade.


2. Right-click the input datasets and click
Upload sample data from file.

3. After you have uploaded test data for each input, click Test in the top ribbon of the blade to test the
query with the sample data.

4. View the output of the query in the browser and download output files if necessary.

After you have successfully tested your query in the portal, you save the query and start the Stream
Analytics job to begin processing events.
You also take advantage of multiple SELECT INTO statements to output intermediate data to test your
data streams and joins.

For example, you have the following query that is producing zero events:

SELECT
r1.StoreName,
r2.ProductName,
i1.ProductId,
i1.ProductQty
INTO [SampleOutput]
FROM [SampleInput] i1
JOIN [ReferenceInput1] r1
ON i1.StoreId=r1.StoreId
JOIN [ReferenceInput2] r2
ON i1.ProductId=r2.ProductId

You can rewrite the query with multiple outputs to test each step and data stream:

WITH StepOne AS
(
SELECT
r1.StoreName,
i1.ProductId,
i1.ProductQty
FROM [SampleInput] i1
JOIN [ReferenceInput1] r1
ON i1.StoreId=r1.StoreId
),
StepTwo AS
(
SELECT
s1.StoreName,
r2.ProductName,
s1.ProductId,
s1.ProductQty
FROM [StepOne] s1
JOIN [ReferenceInput2] r2
ON s1.ProductId=r2.ProductId

)
--Regular output to SampleOutput
SELECT
*
INTO
[SampleOutput]
FROM
StepTwo
--Log input data
SELECT
*
INTO
[TestOutput1]
FROM
[SampleInput]
--Log data from first join
SELECT
*
INTO
[TestOutput2]
FROM
StepOne

Now you can test and view the output for each step of the query to see why your data might not be
joining successfully.

Using the job diagram


Another way to debug your query is by using the job diagram in the Azure portal—this shows a diagram
of each query step, in addition to the inputs and outputs. Metrics are also shown for each query step so
that you can quickly identify any issues.

To use the job diagram:

1. On the Stream Analytics job blade in the Azure portal, click Job diagram in the SUPPORT +
TROUBLESHOOTING section on the left pane.

2. View the metrics for each query step by selecting the query step in the diagram.
3. To view the metrics for partitioned inputs or outputs, you select the ellipses (…) on the input or
output then select Expand partitions.

4. Click a single partition node to view the metrics for that partition.

5. Click the merger node to view the metrics for the merger.

The job diagram gives a helpful visual representation of your job that you use to identify issues and
bottlenecks quickly.

Demonstration: Create and run a Stream Analytics job


Question: Why do data streams have to be joined with a temporal bound?

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You can use Azure SQL DB as an input source for a Stream Analytics job.

Lesson 2
Configure Stream Analytics jobs
It’s quick and easy to create a Stream Analytics job and begin to process incoming streaming data. To
create a scalable and reliable solution, you need to plan the design and configuration for Stream Analytics
deployments.

Lesson Objectives
By the end of this lesson, you should be able to:

 Design Stream Analytics jobs for scalability.

 Explain the functions of streaming units.


 Describe how to handle events reliably in Stream Analytics.

 List the ways of protecting Stream Analytics jobs.

 Describe how to monitor Stream Analytics jobs.

 Explain the process for automating Stream Analytics jobs.

 Run and manage Stream Analytics jobs.

Designing for scalability


To take advantage of the scalability of Stream
Analytics, you should plan with the entire
solution in mind. This means you must think
about how a Stream Analytics query will interact
with the inputs and outputs of the job.

When using Event Hubs or Blob storage as your streaming input, you can take advantage of data
partitioning to achieve maximum parallelism while processing your data.

Event Hubs partitioning


Event Hubs is built with a partitioning pattern
where each consumer only reads from a specific partition. Each partition holds a sequence of events for a
specified retention time. This retention time is defined at an event hub level and is the same for all
partitions. There is no way to delete data from an event hub partition—it must expire after the retention
period.

Event hubs have between two and 32 partitions. These are specified at creation time and cannot be
changed, so you should consider the long term when you design the streaming solution. The number of
partitions in an event hub should correlate directly with the number of concurrent readers—in this case,
Stream Analytics jobs—that you have in your solution. For example, if you plan to have four Stream
Analytics jobs reading from one event hub, you should create the event hub with four partitions.

Blob storage partitioning


Blob storage is designed to have multiple blobs, or files, that are read or written to in parallel. Actions on
a single file must be processed by a single server, but actions across multiple blobs can be scaled across
many servers to increase parallelism and performance.

Azure blobs are stored in a container within a storage account. The partition key for a given blob is the
Azure storage account name, plus the container name and the blob name. In this scenario, each blob can
have its own partition.

Embarrassingly parallel jobs


To take full advantage of input source partitioning to create the most scalable Stream Analytics solutions,
you must create an embarrassingly parallel job. Embarrassingly parallel means that the job connects a
single partition of the input to a single instance of the query, to a single partition of the output. This
concept requires the following:
 You must make sure that the input events are sent to the same partition of your input source if your
query logic requires the same key to be processed by the same query instance. When you send data
to event hubs, this means that the input data must include the PartitionKey value. When using Blob
storage, the events need to be sent to the same partition folder.

o Simple queries, like a select-project-filter query, might not require the same query instance to
process the same key, and you can ignore this requirement.
 Just as the input source must be partitioned, the query must also be partitioned. The Partition By
clause must be used in each step of the query, and they must all be partitioned by the same key. To
be fully parallel, the partitioning key must be set to PartitionId.

 Only event hubs and Blob storage output types allow partitioned output. The partition key must be
set to PartitionId for event hub outputs, but you don’t have to do anything for Blob storage outputs
because of how they are partitioned.

 The number of input partitions must equal the number of output partitions. However, this doesn’t
matter when using Blob storage as an output, because it inherits the partitioning scheme of the
query. The following are example scenarios that allow for fully parallel jobs:

o Four input event hub partitions and Blob storage output.

o Four input event hub partitions and four output event hub partitions.

o Four input Blob storage partitions and four output event hub partitions.

o Four input Blob storage partitions and Blob storage output.

The following example demonstrates an embarrassingly parallel query that uses:

 Event hub input with four partitions.

 Blob storage output.

SELECT
COUNT(*) AS Count,
DeviceId
FROM [SampleInput] PARTITION BY PartitionId
GROUP BY TumblingWindow(second, 30), DeviceId, PartitionId

Because the query has a grouping key (DeviceId), that key needs to be processed by the same query
instance each time. That means the input events must be partitioned when being sent to the event hub
input. Because DeviceId is the grouping key, the PartitionKey value of the input event data should use
DeviceId.
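
By comparison, the following is a minimal sketch of a fully parallel pass-through query that writes to a partitioned event hub output; the output alias and the Speed column are assumptions for illustration. For the job to remain embarrassingly parallel, the output's partition key must be set to PartitionId and the output must have the same number of partitions as the input:

SELECT
    DeviceId,
    Speed
INTO [SampleOutput]
FROM [SampleInput] PARTITION BY PartitionId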
Even though the ideal scenario is to have embarrassingly parallel jobs, sometimes you can’t prevent jobs
that don’t fall into this category. The following are some cases where the Stream Analytics job is not
embarrassingly parallel:

 Using any output other than Event Hubs or Blob storage.



 Having a different number of partitions for input and output sources.

 Multistep queries that use a different partition key for the Partition By clause.

Explaining streaming units


To understand the amount of compute power to
provide to your Stream Analytics job, you must
understand streaming units. Streaming units
(SUs) are the compute resources required to
execute a streaming job, and are a combination
of the measure of CPU, memory, and read/write
rates. SUs provide a way to abstract the user from
having to learn the individual compute resources
required to run Stream Analytics jobs.

You use the Scale blade of the Stream Analytics job to set SUs. It's best practice to begin with six SUs for a
nonpartitioned input source, then use the SU % Utilization metric to find the optimal number of SUs by trial
and error. SUs can be set to 1, 3, 6, and upwards in increments of six. It's generally recommended to allocate
more SUs than the job typically utilizes, to cover cases when memory and other resources go beyond the
average utilized amount.

You’ll find the maximum number of SUs that your job can use by looking at the input, output, and query.
The amount of SUs depends on the number of steps in the query and the number of partitions for each
step. For queries that do not have any partitioned steps, the maximum number of SUs is six. For
partitioned queries, the maximum number of SUs for the job is calculated by multiplying the number of
partitions by the number of partitioned steps, by six SUs for a single step.

For example, consider the following scenarios:

Query properties                                 Maximum streaming units

Four input partitions;                           24
one partitioned step                             (4 partitions x 1 partitioned step x 6 max SUs per step)

Four input partitions;                           30
one nonpartitioned step;                         (6 max SUs for the nonpartitioned step plus
one partitioned step (the SELECT                 (4 partitions x 1 partitioned step x 6 max SUs per step))
statement reads from the partitioned input)

Handling events reliably


You can specify policies for each Stream Analytics
job to reliably handle events that come out of
order or fail when writing to the output.

Late arriving and out of order events


When you handle temporal data streams, you will
encounter instances where the events in the
stream are not received in order or are received
late. Receiving events out of order might occur
for many reasons—for example, latency between
the source device and the input system, clock
skew between the source device and input
system, or a temporary interruption of the
connection between the source device and the input system. You set policies on your Stream Analytics job
to handle late arriving events and events that arrive out of order.

To handle events that are out of order or late arriving, you set Event ordering policies, which consist of a
late arrival tolerance window, an out of order tolerance window, and an action.

 Late arrival tolerance window—the Stream Analytics job will accept late events with a timestamp
that is in the specified window.

 Out of order tolerance window—the Stream Analytics job will accept out of order events with a
timestamp that is in the specified window.

 Action—the Stream Analytics job will either Drop an event that occurs outside the acceptable
window, or Adjust the timestamp to the latest acceptable time.

When using the out of order tolerance window, Stream Analytics will buffer the events up to that window
then reorder the events to make sure they are back in the correct order before processing. The output of
the Stream Analytics job is delayed by the same amount of time as the out of order buffer.

Error policy
When processing streaming data, there might be several reasons why a Stream Analytics job sometimes
fails to write to the output. To remedy this, you specify how errors are handled in the Error policy blade.
You set the Action to one of two settings:

 Drop—drops any events that cause errors when writing to the output.

 Retry—retries writing to the output until the event succeeds.



Protecting Stream Analytics jobs


Most Azure services, including Stream Analytics, are managed through the Azure Resource Manager
(ARM) API. This API includes the Access Control (IAM) mechanism that determines how users and groups
are granted access to interact with services. You use Role-Based Access Control (RBAC) to grant access to
ARM resources.

You manage user and group access to Stream Analytics jobs by using the following default roles:

Role                          Description

Owner                         Provides access to manage everything about the resource, including access.

Contributor                   Provides access to manage everything about the resource except for access.

Reader                        Provides access to view all information about the resource, but not change anything.

Log analytics contributor     Provides access to read all monitoring data and edit monitoring settings, including settings for Azure Log Analytics and Diagnostics.

Log analytics reader          Provides access to read all monitoring data, including settings for Azure Log Analytics and Diagnostics.

Monitoring contributor        Provides access to read all monitoring data and edit monitoring settings.

Monitoring reader             Provides access to read all monitoring data.

User access administrator     Provides access to manage user and group access to the resource.

To assign a role to users or groups:

1. On the Stream Analytics blade, click Access Control (IAM) in the left pane.

2. Click the + Add button on the top ribbon.

3. Select the role for the user or group to be added to, then use the search box to select the user or
group.

4. Click the Save button at the bottom of the blade to complete adding the user or group to the role.
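
Role assignments can also be scripted. The following is a minimal sketch using the AzureRM PowerShell module; the user name, resource group, job name, and subscription ID are placeholders for illustration:

#Grant the Reader role on a single Stream Analytics job (names and IDs are placeholders)
New-AzureRMRoleAssignment -SignInName "analyst@adatum.com" `
    -RoleDefinitionName "Reader" `
    -Scope "/subscriptions/<subscription id>/resourceGroups/SampleRg/providers/Microsoft.StreamAnalytics/streamingjobs/SampleJob"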

For more information about RBAC in Azure, see:

Get started with Role-Based Access Control in the Azure portal


https://aka.ms/b4zzez

Monitoring Stream Analytics jobs


After setting up a Stream Analytics job, you will need to monitor the job to understand resource
consumption and error handling, and to troubleshoot any issues.

To view key Stream Analytics metrics in the Azure portal:
1. On the Stream Analytics Overview blade,
view the Metrics pane.
2. Click the Metrics pane to open the Metric
blade and see a more detailed breakdown of
the metrics.
3. You customize the metric chart by clicking the Edit chart button on the top ribbon and selecting the
metrics you want to see in the chart.

4. Click the Save button to finish customizing the metric chart.

The following shows a list of Stream Analytics metrics you can view in the Azure portal:

Metric                      Description

SU % utilization            Utilization percentage of the SUs assigned to the Stream Analytics job.

Input events                Number of events received by the Stream Analytics job.

Output events               Number of events sent by the Stream Analytics job.

Out-of-order events         Number of events that were received out of order and processed based on the event ordering policy.

Data conversion errors      Number of errors that the Stream Analytics job encountered when attempting to convert data types.

Runtime errors              Number of errors during the execution of the Stream Analytics job.

Late input events           Number of late arriving events processed by the event ordering policy.

Function requests           Number of calls to an Azure Machine Learning function.

Failed function requests    Number of failed Azure Machine Learning function calls.

Function events             Number of events sent to the Azure Machine Learning function.

Input event bytes           Amount of data in bytes received by the Stream Analytics job.

Configure metric alerts


You can define the alerts that will be triggered when a Stream Analytics metric hits a specified threshold.

To set up a metric alert in the Azure portal:

1. On the Stream Analytics blade, click the Monitoring tile.

2. Click the Add alert button on the ribbon on the Metric blade.

3. Enter a name and a description for the alert and choose a metric that is used to trigger the alert.

4. Select a condition and a threshold to trigger the alert.

5. Select a time period for the alert.

6. Enter the email address(es) to which the alert should be sent.

7. Click the OK button to save the alert definition.

An email will be sent to the specified email address(es) when the metric hits the threshold you provided.

Collect and examine diagnostic logs


You can set up the collection of diagnostic logs on your Stream Analytics job that you use to monitor and
troubleshoot the job.

To turn on a collection of diagnostic logs:

1. On the Stream Analytics job blade, click Diagnostic logs in the left panel.

2. Click Add diagnostic setting.

3. Check the Archive to a storage account option.

4. Click Configure then select the storage account to use for diagnostic log collection, and click OK.

5. Under LOG, check the boxes for Execution and Authoring and set the retention policy.
6. Under METRIC, check the box for 1 minute and set the retention policy.

7. Click Save on the top ribbon to save the diagnostic logging settings.

You will now be able to view, search, and filter through the diagnostic logs in the Activity log blade to
perform troubleshooting and auditing on your Stream Analytics job. You can also download the
diagnostic data from your specified storage account if you want to examine and process it locally.
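
You can also enable this archiving from the command line. The following is a minimal sketch using the AzureRM PowerShell module; the resource IDs and the retention period are assumptions for illustration:

#Resource IDs are placeholders for the Stream Analytics job and the storage account
$jobId = "/subscriptions/<subscription id>/resourceGroups/SampleRg/providers/Microsoft.StreamAnalytics/streamingjobs/SampleJob"
$storageId = "/subscriptions/<subscription id>/resourceGroups/SampleRg/providers/Microsoft.Storage/storageAccounts/samplediagstore"

#Archive diagnostic logs to the storage account and retain them for 30 days
Set-AzureRMDiagnosticSetting -ResourceId $jobId -StorageAccountId $storageId -Enabled $true -RetentionEnabled $true -RetentionInDays 30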

Planning for latency and data movement


If possible, it’s a best practice to create all the services for your streaming solution in the same Azure region.
Because the data has a shorter distance to travel, this helps to reduce latency, keeping the analytics on your
streaming data closer to real time.

It’s also recommended to keep the solution in one Azure region to prevent data egress from the Azure
datacenter. You are not charged for data that is streaming into the Azure datacenter, but if your
streaming solution sends data from one datacenter to another, you will be charged for that data egress
between Azure datacenters. It’s also typically recommended to create compute solutions where the data
rests, instead of moving data to different regions to be analyzed. This will reduce the latency, complexity,
and cost of your solution.

Automating Stream Analytics jobs


When you create Stream Analytics jobs, it’s useful
to be able to automate the deployment. You
create an automation script to help with the
quick deployment of the service and the entire
solution. This is useful when you deploy via a
DevOps pipeline—you might want to deploy to
different environments like Dev, Test, and Prod.

You use Azure PowerShell cmdlets to create, update, delete, and monitor Stream Analytics jobs.

Create a Stream Analytics job


The following examples show how to create, check the status of, and delete a Stream Analytics job named
SampleJob with one input and one output. Notice that the definitions for the job, inputs, outputs, and
transformations are provided in JSON files:

#Create a new Stream Analytics job
New-AzureRMStreamAnalyticsJob -ResourceGroupName SampleRg -File "C:\StreamAnalyticsJobDefinition.json" -Name SampleJob -Force

#Add an input
New-AzureRMStreamAnalyticsInput -ResourceGroupName SampleRg -JobName SampleJob -File "C:\Input.json" -Name SampleInput -Force

#Add an output
New-AzureRMStreamAnalyticsOutput -ResourceGroupName SampleRg -File "C:\Output.json" -JobName SampleJob -Name SampleOutput -Force

#Add a transformation
New-AzureRMStreamAnalyticsTransformation -ResourceGroupName SampleRg -File "C:\Transformation.json" -JobName SampleJob -Name SampleTransform

List information about a Stream Analytics job


#Get information about the Stream Analytics job
Get-AzureRMStreamAnalyticsJob -ResourceGroupName SampleRg -Name SampleJob
#List all inputs
Get-AzureRMStreamAnalyticsInput -ResourceGroupName SampleRg -JobName SampleJob
#List all outputs
Get-AzureRMStreamAnalyticsOutput -ResourceGroupName SampleRg -JobName SampleJob

Delete a Stream Analytics job


#Delete the Stream Analytics job
Remove-AzureRMStreamAnalyticsJob -ResourceGroupName SampleRg -Name SampleJob -Force
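
Start, stop, and scale a Stream Analytics job

Jobs scripted in this way can also be started, stopped, and scaled from PowerShell. The following is a minimal sketch that reuses the same assumed resource names; it assumes that the transformation definition JSON file specifies the new streaming unit value for the scaling step:

#Start the Stream Analytics job
Start-AzureRMStreamAnalyticsJob -ResourceGroupName SampleRg -Name SampleJob

#Stop the Stream Analytics job
Stop-AzureRMStreamAnalyticsJob -ResourceGroupName SampleRg -Name SampleJob

#Scale the stopped job by redeploying its transformation from a JSON file that
#contains the new streaming unit value (the file name is an example)
New-AzureRMStreamAnalyticsTransformation -ResourceGroupName SampleRg -JobName SampleJob -File "C:\TransformationScaled.json" -Name SampleTransform -Force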

You also use the Stream Analytics .NET SDK to create and manage Stream Analytics jobs. For more
information, see:
Management .NET SDK: Set up and run analytics jobs using the Azure Stream Analytics API
for .NET

https://aka.ms/quwsdf

Demonstration: Run and manage jobs


Question: Why should you consider setting policies for late arriving and out of order events?

Check Your Knowledge


Question

To make a job embarrassingly parallel, what is the required number of output event hub partitions if the
input event hub has six partitions?

Select the correct answer.

12

Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                              Answer

True or false? The maximum number of SUs for a nonpartitioned query
is three.

Lab: Process event streams with Stream Analytics


Scenario
You work for Adatum as a data engineer, and have been asked to build a traffic surveillance system for
traffic police. This system must be able to analyze significant amounts of dynamically streamed data,
captured from speed cameras and automatic number plate recognition (ANPR) devices, and then
crosscheck the outputs against large volumes of reference data that holds vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that reads vehicle registration
plates.

For the first phase of the project, you will use Stream Analytics, together with Event Hubs, IoT Hubs,
Service Bus, and custom applications to:

 Provide insights into average speeds at various locations.

 Determine the locations of police patrol cars.

 Present vehicle locations on a map.

 Check vehicles recorded by speed cameras against a list of stolen vehicles.


 Determine the nearest patrol car to a speeding vehicle or stolen vehicle, send a dispatch alert to the
nearest patrol car, and show the dispatched patrol car locations on a map.
 Use Stream Analytics monitoring and alerting tools to help identify issues during the system's
deployment, and use the Azure portal and PowerShell to scale up and scale down the system to cope
with particular demands.

Objectives
After completing this lab, you will be able to:

 Create a Stream Analytics job to process event hub data.

 Create a Stream Analytics job to process IoT hub data.

 Reconfigure a Stream Analytics job to send output through a Service Bus queue.

 Reconfigure a Stream Analytics job to process both event hub and static file data.

 Use multiple Stream Analytics jobs to process event hub, IoT hub and static file data, and output
results using a Service Bus and custom application.

 Use the Azure portal and PowerShell to manage and scale Stream Analytics jobs.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Estimated Time: 120 minutes

Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

Exercise 1: Create a Stream Analytics job to process event hub data


Scenario
For the first phase of the project, you will start to build the traffic surveillance system to provide insights
into average speeds at various locations. In this exercise, you will create a Stream Analytics job that
captures speed camera data sent to an event hub from a Visual Studio application (SpeedCameraDevice).
You will configure the Stream Analytics job to send output data to a Power BI dashboard and to Azure
Data Lake Storage, using filters to remove unnecessary fields before storage.
The main tasks for this exercise are as follows:

1. Create a Data Lake Store

2. Create an event hubs namespace and hub

3. Create a Stream Analytics job

4. Configure Stream Analytics job inputs

5. Configure Stream Analytics job outputs


6. Configure a Stream Analytics job query

7. Start a Stream Analytics job

8. Generate event hub data for processing with Stream Analytics

9. Visualize Stream Analytics output using Power BI

10. View Stream Analytics output in Data Lake Store

 Task 1: Create a Data Lake Store

 Task 2: Create an event hubs namespace and hub

 Task 3: Create a Stream Analytics job

 Task 4: Configure Stream Analytics job inputs

 Task 5: Configure Stream Analytics job outputs

 Task 6: Configure a Stream Analytics job query

 Task 7: Start a Stream Analytics job

 Task 8: Generate event hub data for processing with Stream Analytics

 Task 9: Visualize Stream Analytics output using Power BI

 Task 10: View Stream Analytics output in Data Lake Store

Results: At the end of this exercise, you will have created an Azure Data Lake Store, an event hubs
namespace, and a Stream Analytics job. You will then use Stream Analytics to process event hubs data,
and view the results in a Power BI dashboard and in Data Lake Store.

Exercise 2: Create a Stream Analytics job to process IoT hub data


Scenario
You will now add the locations of police patrol cars to the traffic surveillance system. In this exercise, you
will create a second Stream Analytics job that captures patrol car location data from an IoT hub (using a
Visual Studio application, PatrolCarDevice, to generate the raw data). You will configure the Stream
Analytics job to send data to a Power BI report and to Data Lake Storage.

The main tasks for this exercise are as follows:

1. Create an IoT hub

2. Create a new Stream Analytics job

3. Configure Stream Analytics job inputs

4. Configure Stream Analytics job outputs

5. Configure the Stream Analytics job query

6. Start the Stream Analytics job

7. Generate IoT hub data for processing with Stream Analytics

8. Visualize Stream Analytics output using Power BI

9. View Stream Analytics output in Azure Data Lake Store

 Task 1: Create an IoT hub

 Task 2: Create a new Stream Analytics job

 Task 3: Configure Stream Analytics job inputs

 Task 4: Configure Stream Analytics job outputs

 Task 5: Configure the Stream Analytics job query

 Task 6: Start the Stream Analytics job

 Task 7: Generate IoT hub data for processing with Stream Analytics

 Task 8: Visualize Stream Analytics output using Power BI

 Task 9: View Stream Analytics output in Azure Data Lake Store

Results: At the end of this exercise, you will have created a Data Lake Store, an IoT hub, and a new Stream
Analytics job. You will then use Stream Analytics to process IoT hub data, and view the results in a Power
BI report and in Data Lake Store.

Exercise 3: Reconfigure a Stream Analytics job to send output through a


Service Bus queue
Scenario
After you have added the locations of police patrol cars to the system, it becomes clear that a better,
more visual approach is needed: presenting vehicle locations on a map. In this exercise, you will add an
output to the PatrolCarAnalytics Azure Stream Analytics job that captures patrol car locations, and display
the results on a map by using a simple custom Visual Studio application that listens to a Service Bus
queue. This exercise demonstrates how to overcome the shortcomings of trying to show this type of data
in a Power BI report, and illustrates how to send data from a Stream Analytics job to another application
instead of directly to an output such as Power BI.

The main tasks for this exercise are as follows:

1. Create a Service Bus and queue

2. Reconfigure the IoT hub

3. Reconfigure the PatrolCarAnalytics Stream Analytics job

4. Start the Stream Analytics job

5. Prepare an application to receive Stream Analytics data using a Service Bus

6. Generate IoT hub data for processing with Stream Analytics

 Task 1: Create a Service Bus and queue

 Task 2: Reconfigure the IoT hub

 Task 3: Reconfigure the PatrolCarAnalytics Stream Analytics job

 Task 4: Start the Stream Analytics job

 Task 5: Prepare an application to receive Stream Analytics data using a Service Bus

 Task 6: Generate IoT hub data for processing with Stream Analytics

Results: At the end of this exercise, you will have created an Azure Service Bus, reconfigured an existing
IoT hub, and an existing Stream Analytics job. You will then use Stream Analytics to process IoT hub data
and to send results to the Service Bus. Finally, you will use a custom Visual Studio application to view the
output of the Service Bus.

Exercise 4: Reconfigure a Stream Analytics job to process both event hub


and static file data
Scenario
The next requirement for the traffic surveillance system is to add a facility for checking the vehicles
recorded by speed cameras against a list of stolen vehicles. In this exercise, you will edit the Stream
Analytics job from Exercise 1 (TrafficAnalytics) to detect whether a vehicle observed in a speed camera is
stolen. You will create an Azure Storage block blob and upload a file containing static vehicle theft
records, and use this file as input reference data for the Stream Analytics job (in addition to the speed
camera data). To achieve this, you will update the event hub that you used in Exercise 1, and add two
more consumer groups (this is a best practice, because you will use the event hub to provide two
additional inputs to Stream Analytics). You will configure Stream Analytics to send the results to a Power
BI dashboard, and to a JSON format file in Data Lake Store, organized by date and time.

The main tasks for this exercise are as follows:

1. Create a Blob storage account for holding stolen vehicle data

2. Examine the StolenVehiclesReport.csv file


3. Use Azure Storage Explorer to upload a StolenVehiclesReport.csv to the Blob storage container

4. Update the event hub and add two more consumer groups

5. Reconfigure the TrafficAnalytics Stream Analytics job inputs

6. Reconfigure the TrafficAnalytics Azure Stream Analytics job outputs

7. Reconfigure the TrafficAnalytics Stream Analytics job query

8. Start the TrafficAnalytics Stream Analytics job

9. Generate event hub data for processing with Stream Analytics

10. Visualize Stream Analytics output using Power BI

11. View Stream Analytics output in Data Lake Store



 Task 1: Create a Blob storage account for holding stolen vehicle data

 Task 2: Examine the StolenVehiclesReport.csv file

 Task 3: Use Azure Storage Explorer to upload a StolenVehiclesReport.csv to the Blob


storage container

 Task 4: Update the event hub and add two more consumer groups

 Task 5: Reconfigure the TrafficAnalytics Stream Analytics job inputs

 Task 6: Reconfigure the TrafficAnalytics Azure Stream Analytics job outputs

 Task 7: Reconfigure the TrafficAnalytics Stream Analytics job query

 Task 8: Start the TrafficAnalytics Stream Analytics job

 Task 9: Generate event hub data for processing with Stream Analytics

 Task 10: Visualize Stream Analytics output using Power BI

 Task 11: View Stream Analytics output in Data Lake Store

Results: At the end of this exercise, you will have uploaded data to a new Blob storage container, updated
your event hub with new consumer groups, and reconfigured your TrafficAnalytics Azure Stream Analytics
job to use these new inputs. You will then use Stream Analytics to process the event hubs data, and view
the results in a Power BI dashboard, and in Data Lake Store.

Exercise 5: Use multiple Stream Analytics jobs to process event hub, IoT
hub and static file data, and output results using a Service Bus and custom
application
Scenario
For the final part of this initial phase in the development of the traffic surveillance system, you have been
asked to add the ability to determine the nearest patrol car to a speeding vehicle or stolen vehicle, send a
dispatch alert to the nearest patrol car, and then show the dispatched patrol car locations on a map.
Specifically, the system must be able to identify the nearest patrol car to a speeding or stolen vehicle, and
then send a message (using Service Bus) to that patrol car. The message would contain details about the
vehicle’s registration number, location, and speed. Any patrol car situated within five kilometers of the
stolen or speeding vehicle’s most recently reported location could then be dispatched to that location.
The message should contain the ID of the patrol car, the registration number of the stolen vehicle, and
the coordinates of the location where the vehicle was observed. In this exercise, you will create a new
Service Bus topic, and add a subscription to the topic. You will use this topic to send alert messages to
patrol cars about stolen vehicles. Patrol car devices will subscribe to the subscription in this topic.

The main tasks for this exercise are as follows:

1. Create a new Service Bus topic and add a subscription

2. Reconfigure the IoT hub and add a new consumer group

3. Reconfigure the event hub and add a new consumer group

4. Reconfigure the TrafficAnalytics Azure Stream Analytics job inputs



5. Reconfigure the TrafficAnalytics Stream Analytics job outputs

6. Reconfigure the TrafficAnalytics Stream Analytics job query

7. Start the TrafficAnalytics Azure and PatrolCarAnalytics Stream Analytics jobs

8. Generate event hub and IoT hub data for processing with Stream Analytics

9. Start an application to receive Stream Analytics data using a Service Bus

 Task 1: Create a new Service Bus topic and add a subscription

 Task 2: Reconfigure the IoT hub and add a new consumer group

 Task 3: Reconfigure the event hub and add a new consumer group

 Task 4: Reconfigure the TrafficAnalytics Azure Stream Analytics job inputs

 Task 5: Reconfigure the TrafficAnalytics Stream Analytics job outputs

 Task 6: Reconfigure the TrafficAnalytics Stream Analytics job query

 Task 7: Start the TrafficAnalytics Azure and PatrolCarAnalytics Stream Analytics jobs

 Task 8: Generate event hub and IoT hub data for processing with Stream Analytics

 Task 9: Start an application to receive Stream Analytics data using a Service Bus

Results: At the end of this exercise, you will have:

Created a new Service Bus topic, and added a subscription to this topic.

Reconfigured the IoT and event hubs, and added a new consumer group to each hub.
Reconfigured the TrafficAnalytics Azure Stream Analytics job to use these new inputs, and to use the new
Service Bus topic as a job output.

Updated the job query to send data to the Service Bus topic, by using a Visual Studio application.

Exercise 6: Use the Azure portal and Azure PowerShell to manage and scale
Stream Analytics jobs
Scenario
You have been asked how the traffic surveillance system could cope with any large-scale incident or event
that requires additional police resources being brought on stream. You have also been asked how the
system might be monitored and managed, and to demonstrate any potential for automation. In this
exercise, you will monitor a Stream Analytics job, and create an alert when the job uses more than a
threshold number of streaming units. You will then use the Azure portal to scale up the job, and review
the streaming unit utilization. You will use the Azure PowerShell cmdlets for Stream Analytics to stop the
Stream Analytics job, to scale the job back down, and then to restart the job. Finally, you will use job
diagrams to visualize the configurations of your two Stream Analytics jobs.

The main tasks for this exercise are as follows:

1. Add a monitoring alert to a Stream Analytics job

2. Use the Azure portal to scale up a Stream Analytics job



3. Use Azure PowerShell to stop a Stream Analytics job

4. Use Azure PowerShell to scale down and restart a Stream Analytics job

5. Use job diagrams to visualize Stream Analytics job configurations

6. Lab closedown

 Task 1: Add a monitoring alert to a Stream Analytics job

 Task 2: Use the Azure portal to scale up a Stream Analytics job

 Task 3: Use Azure PowerShell to stop a Stream Analytics job

 Task 4: Use Azure PowerShell to scale down and restart a Stream Analytics job

 Task 5: Use job diagrams to visualize Stream Analytics job configurations

 Task 6: Lab closedown

Results: At the end of this exercise, you will have:

Added a monitoring alert to a Stream Analytics job.

Used the Azure portal to scale up a Stream Analytics job.

Used Azure PowerShell to stop a Stream Analytics job.

Used Azure PowerShell to scale down and restart a Stream Analytics job.

Used job diagrams to visualize Stream Analytics job configurations.

Question: What data types would you process using Stream Analytics within your
organization?

Question: How might you use multiple stream analytics jobs within your organization?

Module Review and Takeaways


Having completed this module, you should be able to:

 Describe Stream Analytics and its purpose.

 Develop and deploy Stream Analytics jobs.


 Configure Stream Analytics jobs for scalability, reliability, and security.

Review Question(s)
Question: How might you implement Stream Analytics within your organization?

Module 3
Performing Custom Processing in Azure Stream Analytics
Contents:
Module Overview 3-1 
Lesson 1: Implementing custom functions and debugging jobs 3-2 

Lesson 2: Incorporating Machine Learning into a Stream Analytics job 3-7 

Lab: Performing custom processing with Stream Analytics 3-12 


Module Review and Takeaways 3-17 

Module Overview
This module describes how to use custom functions in Microsoft® Azure® Stream Analytics, and explains
how to use Microsoft Azure Machine Learning with Stream Analytics. It also covers how to test and debug
Stream Analytics jobs.

Objectives
By the end of this module, you will be able to:

 Use custom functions that are implemented by using JavaScript in Stream Analytics jobs.
 Integrate Machine Learning models into a Stream Analytics job.

Lesson 1
Implementing custom functions and debugging jobs
This lesson explains how to create user-defined functions (UDFs) and use them in Stream Analytics. It also
explains how to test and debug Stream Analytics jobs.

Lesson Objectives
By the end of this lesson, you should be able to:

 Create user-defined functions in Stream Analytics.

 Call user-defined functions.

 Describe how to send data to Azure Function Apps.


 Test Stream Analytics jobs.

 Debug Stream Analytics jobs.

Creating user-defined functions (UDFs)


Stream Analytics supports the creation of UDFs written in the JavaScript programming language. With
the support of JavaScript's extensive library, users can develop complex routines for data
transformation.

Use the following steps to create a UDF in Stream Analytics:

1. Log in to the Azure portal and open the specific Stream Analytics job where you need a UDF.

2. Under Job Topology > Functions, you add a new UDF.

3. Click Add to add a new UDF.

4. You require the following information:


o Function alias. The name of the function.

o Function type. This is either JavaScript UDF or machine learning. For this purpose, select
JavaScript UDF.

o Output type. Select the required output type. The supported data types are array, bigint,
datetime, float, nvarchar(max), and record. The "Any" type can be used if you want a function to
return different data types in different scenarios; in that case, you interpret the returned value at the function call.

5. By default, the portal provides a sample UDF as follows:

// Sample UDF which returns the sum of two values.
function main(arg1, arg2) {
    return arg1 + arg2;
}

6. Customize the function as per the requirements, and click Create.

7. This UDF can now be called from the query within a Stream Analytics job just like any scalar function.
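
For example, the following is a minimal sketch of a custom UDF; the function name, parameters, and speeding logic are assumptions chosen for illustration. It returns 1 when an observed speed exceeds a limit, and 0 otherwise:

// Hypothetical example: return 1 if the observed speed exceeds the limit, otherwise 0.
// Numeric Stream Analytics types (bigint, float) arrive here as JavaScript Number values.
function main(speed, speedLimit) {
    if (speed === null || speedLimit === null) {
        return 0;
    }
    return speed > speedLimit ? 1 : 0;
}

If this function were created with the alias IsSpeeding, a query would invoke it as UDF.IsSpeeding(speed, limit), as described in the next topic.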

Collect() function. Collect() is a built-in function in the Azure Stream Analytics Query Language and is
used to return an array containing all the records from a specific time window.

It’s important to note that, even though UDFs normally work on a row-by-row basis, they also take an
entire dataset for a specific time window as an input if you use the Collect() function. An example of using
the Collect() function is as follows:

SELECT Collect() AS allCollectedEvents
FROM AzureStreamAnalyticsInput
GROUP BY TumblingWindow(second, 30)

Calling a UDF from a query


After creating a JavaScript UDF in Stream
Analytics, the logical next step is to invoke a
function call to execute that function.
To call a UDF from a Stream Analytics query, you
use the following steps:
1. Log in to the Azure portal and open the
specific Stream Analytics job that has the
UDF defined.
2. Under Job Topology > Query, you will find the query written for the Stream Analytics job.
3. You invoke the UDF function using the format UDF.function_name(parameters). For example, in the
following sample query, Text2Int is a custom UDF invoked from the query:

SELECT
time,
UDF.Text2Int(offset_parameter) AS Text2IntOffset
INTO
ASAOutput
FROM
ASAInputStream

It’s important to note certain conversions between Stream Analytics data types and JavaScript data
types. For example, JavaScript can only represent integers exactly up to 2^53, and the JavaScript Date type
only supports millisecond precision. The following table provides a mapping between them:

Azure Stream Analytics (ASA) data types     JavaScript data types

bigint                                      Number

double                                      Number

nvarchar(MAX)                               String

DateTime                                    Date

Record                                      Object

Array                                       Array

NULL                                        Null

Sending data to Azure Functions


Azure Function Apps is another Azure platform-as-a-service (PaaS) offering that helps users to build, host,
and manage applications without needing to maintain physical hardware such as servers. You write Azure
Functions code in languages such as C#, JavaScript, Python, PowerShell, and PHP.

You can send data to Azure Function Apps by using the Service Bus queue sink. The Azure Function App
can listen for incoming data on the queue and perform the necessary processing. Typical use cases where
this approach would be highly beneficial include:

 Large-scale data processing from IoT devices.

 Identification of illegal transactions such as in fraud detection.

 Analysis of data patterns—for example, machine learning.

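For example, the following is a minimal sketch of a query that routes suspect events to a Service Bus queue output for an Azure Function to process; the input alias, output alias, column names, and threshold are assumptions for illustration:

SELECT
    TransactionId,
    AccountId,
    Amount
INTO [SuspectTransactions]
FROM [TransactionStream]
WHERE Amount > 10000
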
Testing jobs
After you’ve created a Stream Analytics job, it’s
important to test it against a sample dataset. The
Azure portal provides functionality to upload a
sample file to test the Stream Analytics job. You
should use the following steps:

1. Log in to the Azure portal and open the specific Stream Analytics job that you need to test with
sample data.

2. Under Job Topology > Query, you find the query for the Stream Analytics job. Click Query to open
the actual query.

3. Right-click on the input where the dataset needs to be uploaded for testing. You will see two options:

o Upload sample data from file. Use this option to upload an input file for test purposes.

o Sample data from input. This option extracts a portion of data based on Start date time and
duration of data collection.
4. When the test data is either uploaded or extracted, you click the Test button. This triggers the
execution of the Stream Analytics job with the test data.

5. After you execute the Stream Analytics job, the results are displayed on screen and available for
download.

It’s important to test your Stream Analytics job with sample data, to see if the function or query is working
as expected, before executing the job with the full dataset.

Debugging jobs
Stream Analytics provides a highly efficient
platform for quickly creating large-scale data
processing applications. As the application/query
logic becomes more complex, it can become
difficult to debug jobs when you run into issues.
For this purpose, it’s important to understand
how to debug Stream Analytics jobs.

One of the standard principles in debugging applications is to divide the application/query logic into
discrete stages and log the outputs at each stage. In simple terms, the application/query logic should be
split into multiple steps and the intermediate output should be logged either to a database or a file,
whichever is convenient. You can use a query such as SELECT * INTO StepOutput FROM StepInput to
capture intermediate results for debugging purposes. This approach makes it easier to pinpoint the step at
which an error occurred.
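
As a sketch of this approach (the step, input, and output names are assumptions for illustration), a query can be broken into named steps with a WITH clause, and an intermediate step can be routed to its own diagnostic output while debugging:

WITH FilteredReadings AS (
    SELECT DeviceId, Reading
    FROM [SensorInput]
    WHERE Reading IS NOT NULL
)

SELECT * INTO [DebugOutput] FROM FilteredReadings

SELECT DeviceId, AVG(Reading) AS AvgReading
INTO [FinalOutput]
FROM FilteredReadings
GROUP BY DeviceId, TumblingWindow(minute, 5)

When the problem step has been identified, the diagnostic SELECT INTO statement can be removed so that the job no longer writes the extra output.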

The Azure portal provides an activity log for Stream Analytics jobs, covering events at the Critical, Error,
Warning, and Informational levels. It’s useful to review the activity log to understand issues and resolve
them accordingly. Note that informational events often contain the actual details of the cause of an error,
whereas the corresponding error events frequently state only that an error has occurred.

It’s important to note that the Azure portal also provides a capability to add an activity log alert when
errors occur. You can configure the Azure portal to send alerts via an email message, SMS or webhook.

Demonstration: Creating a UDF and using it in a Stream Analytics job


In this demonstration, you will see how to create a UDF using JavaScript, and then call this UDF from a
Stream Analytics query.

This demonstration analyzes the prices of stocks in a stock market, and attempts to work out which stocks
are more volatile than others.

The MostVolatile UDF function returns the stock ticker, the number of price changes, and the maximum
and minimum prices for the most volatile stock item from all the stock market price changes recorded in a
given time window.

The definition of “most volatile” is:

 The price must have changed at least 10 times in the time window.

 The difference between the maximum and minimum prices must be greater than that of all other
stocks that have changed prices at least five times.

 In the event of a tie, the number of price changes decides which is the most volatile.

Question: Can you use Stream Analytics to process data for fraud or threat detection?

Check Your Knowledge


Question

Which of the following is not an event captured by a Stream Analytics activity log?

Select the correct answer.

Critical

Error

Warning

Informational

Important

Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                              Answer

True or false? You can’t have more than one output in a Stream
Analytics job.

Lesson 2
Incorporating Machine Learning into a Stream Analytics
job
This lesson describes how to use Machine Learning models in Stream Analytics jobs.

Lesson Objectives
At the end of this lesson, you should be able to:

 Understand the purpose of Machine Learning.

 Create a Machine Learning UDF in Stream Analytics.


 Handle complex data returned from a UDF.

 Invoke a Machine Learning model from a Stream Analytics query.

Overview of Machine Learning


Machine Learning is a fully managed cloud
service that enables users to develop, deploy and
share predictive analytics solutions. To solve
problems, people historically look at data,
understand what it represents, and use that data
to find solutions. However, by using Machine
Learning, you have the tools and algorithms that
analyze large volumes of preprocessed data to
produce a Machine Learning model that is
consumed by any downstream applications.
Machine Learning is an iterative process that
includes the following key components:

Azure Machine Learning Studio. This is a graphical tool you use to design, implement, test, and deploy
machine learning functions. Machine Learning Studio provides a large library of preprocessing routines
you use to prepare data for machine learning from the raw data, run experiments on the prepared data
using machine learning algorithms, and test the model. After an effective model is found, Machine
Learning Studio also helps to deploy the model.

Data preprocessing modules. Machine Learning provides a large library of data preprocessing modules
you use to process and prepare raw data into processed data on which machine learning algorithms are
executed.
Machine Learning algorithms. Machine Learning provides many algorithms such as Multiclass Decision
Jungle, Two-Class Boosted Decision Tree, One-vs-All Multiclass, Bayesian Linear Regression, Boosted
Decision Tree Regression, Neural Network Regression.

Machine Learning API. After an effective model is deployed, Machine Learning provides a rich API for
downstream applications to consume this model—for example RESTful web services.

Creating a Machine Learning UDF


Stream Analytics provides the functionality to
create a Machine Learning UDF that is called
from the Stream Analytics query within the
Stream Analytics job. Two types of UDFs are
available in Stream Analytics—the JavaScript
UDF, which you learned about earlier in this
module, and the Machine Learning UDF.

Before you create a Machine Learning UDF, you need to publish your Machine Learning model as a web
service. After your web service is published, you use the following steps to create a Machine Learning UDF
in Stream Analytics:

1. Log in to the Azure portal and open the specific Stream Analytics job where you need to add a UDF.

2. Under Job Topology > Functions, you add a new UDF.

3. Click Add to add a new UDF using the following information:

o Function alias. The name of the function.
o Function type. Select Azure ML.
o URL. The URL for the web service.
o Key. The API key for the web service.

4. Click Create to create the Machine Learning UDF in Stream Analytics.

Accessing complex types in queries


UDFs, including those implemented by Machine
Learning, can return complex data structures
such as records and arrays. Stream Analytics
provides mechanisms that enable you to access
this data inside a query.

Records
A record is a collection of name and value pairs
that JSON uses extensively. A typical JSON file
has a structure similar to that of name and value
pairs. For example, consider the following sensor
information for a single event in JSON format:

{
"DeviceIdentification":"ABC123",
"LocationInformation":{"Lat":"100", "Long":"200"},
"SensorInformation":{
"Temperature": "70",
"Humidity": "50",
"CustomSensor01": "10",
"CustomSensor02": "20",
"CustomSensor03": "30"
}
}

Many events similar to this are produced by the sensors and are then sent as a data stream to Stream
Analytics. The Stream Analytics job processes this information to either store it in a database or send it to
another application for further processing.

To access this information in a Stream Analytics query, you use the dot notation to reference each specific
field. For example:

SELECT
DeviceIdentification,
LocationInformation.Lat,
LocationInformation.Long,
SensorInformation.Temperature,
SensorInformation.Humidity
FROM input

Arrays
Arrays are an ordered collection of values. Arrays are particularly useful when you don’t know the number
of sets of name and value pairs that are coming for each element within that event. Therefore, you use an
array to represent that information. Consider the same sensor information in the preceding example but
include many point measurements in the form of an array, as follows:

{
"DeviceIdentification": "ABC123",
"LocationInformation": {"Lat": "100", "Long": "200"},
"SensorInformation":
[
{
"Temperature": "50",
"Humidity": "50",
"CustomSensor01": "10",
"CustomSensor02": "20",
"CustomSensor03": "30"
},
{
"Temperature": "60",
"Humidity": "60",
"CustomSensor01": "20",
"CustomSensor02": "30",
"CustomSensor03": "40"
},
{
"Temperature": "70",
"Humidity": "70",
"CustomSensor01": "30",
"CustomSensor02": "40",
"CustomSensor03": "50"
}
]
}

In this example, SensorInformation is an array structure as denoted between [ and ] brackets and has three
array elements—each element has five name/value pairs.

Stream Analytics provides functionality like CROSS APPLY that makes it easy to extract information from
such complex array data types:

SELECT
t.DeviceIdentification,
t.LocationInformation.Lat,
t.LocationInformation.Long,
flat.ArrayValue.Temperature,
flat.ArrayValue.Humidity,
flat.ArrayValue.CustomSensor01,
flat.ArrayValue.CustomSensor02,
flat.ArrayValue.CustomSensor03
FROM
Input t
CROSS APPLY GetArrayElements(t.SensorInformation) AS flat

The preceding query produces a flat table that has three records, as follows:

deviceidentification   lat     long    temp   humidity   customsensor01   customsensor02   customsensor03

"ABC123"               "100"   "200"   "50"   "50"       "10"             "20"             "30"

"ABC123"               "100"   "200"   "60"   "60"       "20"             "30"             "40"

"ABC123"               "100"   "200"   "70"   "70"       "30"             "40"             "50"

Invoking a Machine Learning model from a query


After the Machine Learning model is deployed as
a web service, the next step is to use it in a
Stream Analytics job. The following sample code
shows how to invoke a web service (in this case
weatherForecast) from the Stream Analytics job
query:

WITH weatherForecast AS (
    SELECT inputParam, weatherForecast(inputParam) AS result
    FROM input
)
SELECT
    inputParam,
    result.[expectedTemperature]
INTO output
FROM weatherForecast

weatherForecast(inputParam) calls the Machine Learning UDF with the inputParam and receives the result.

Demonstration: Calling a Machine Learning model from a Stream Analytics


job
In this demonstration, you will see how to create and use a machine learning experiment that analyzes
data retrieved by Stream Analytics.

This demonstration continues the stock market scenario. Stream Analytics captures the tickers and new
price each time a price change occurs. The Stream Analytics job uses machine learning to detect whether
a price is unusual for the stock (much higher than might be expected given the price history of the stock).

Question: Can you use Stream Analytics with Machine Learning?

Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                              Answer

True or false? To flatten JSON arrays in Stream Analytics, you use
CROSS APPLY.

Lab: Performing custom processing with Stream Analytics


Scenario
You work for Adatum as a data engineer, and you’ve been asked to build a traffic surveillance system for
traffic police. This system must be able to analyze significant amounts of dynamically streamed data—
captured from speed cameras and automatic number plate recognition (ANPR) devices—and then
crosscheck the outputs against large volumes of reference data that holds vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that reads vehicle registration
plates.

For the second phase of the project, you will use Stream Analytics, together with Event Hub, IoT Hubs,
Service Bus, Machine Learning, and custom applications to:

 Post messages to a Service Bus queue for all vehicles that are speeding, by using a simple JavaScript
UDF function to determine whether a vehicle’s speed is above a particular limit.

 Identify vehicles that appear to be using the same registration number, by using a JavaScript UDF
function to determine whether a vehicle with the same registration (not necessarily speeding) has
been spotted at two locations that are an impossible distance apart within a given timeframe.

 Identify traffic flow issues, such as road blockages, or excessive speeds, by using Machine Learning
with Stream Analytics to detect consistent speed anomalies from a speed camera; for example, if
speeds are consistently very low for a period, the cause could be a traffic accident or incident.

Objectives
After completing this lab, you will be able to:
 Use a Stream Analytics UDF function to identify specific data points.

 Use a Stream Analytics UDF function to identify duplicate data records.

 Use Machine Learning with Stream Analytics to identify data anomalies.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Lab Setup
Estimated Time: 90 minutes
Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

This lab uses the following resources from Lab 2, all in resource group CamerasRG:

Data Lake Store: adls<name><date>

Event Hub: camerafeeds<name><date>

IoT Hub: patrolcars<name><date>

Service Bus: locationalerts<name><date>



Storage account: datastore<name><date>

Streaming Analytics jobs:

TrafficAnalytics

PatrolCarAnalytics

Exercise 1: Use a Stream Analytics UDF function to identify specific data


points
Scenario
For the first part of this next phase of the traffic surveillance system, you will add functionality to identify
speeding vehicles. In this exercise, you will add logic to the analytics that capture vehicle speeds, and post
a message to a Service Bus queue for all vehicles that are speeding. You will add a simple JavaScript
function that ascertains whether a vehicle is speeding. You will then use an updated version of the app
from Lab 2 to display the velocity of cars that were caught speeding.
The main tasks for this exercise are as follows:

1. Update the event hub and add a consumer group

2. Add a queue to the Service Bus

3. Configure the Location Alerts app

4. Reconfigure the TrafficAnalytics Stream Analytics job

5. Start the Stream Analytics jobs


6. Start the Speed Camera app

7. Start the Patrol Car app

8. Start the Location Alerts app

9. View the results

10. Close jobs and apps

 Task 1: Update the event hub and add a consumer group

 Task 2: Add a queue to the Service Bus

 Task 3: Configure the Location Alerts app

 Task 4: Reconfigure the TrafficAnalytics Stream Analytics job

 Task 5: Start the Stream Analytics jobs

 Task 6: Start the Speed Camera app

 Task 7: Start the Patrol Car app

 Task 8: Start the Location Alerts app

 Task 9: View the results

 Task 10: Close jobs and apps



Results: At the end of this exercise, you will have added a new consumer group to your event hub, added
a new queue to your Service Bus, reconfigured an existing Stream Analytics job to use these resources,
and added a UDF that returns an integer value indicating whether a vehicle is speeding. You will also have
tested this logic using Visual Studio apps.

Exercise 2: Use a Stream Analytics UDF to identify duplicate data records


Scenario
For the next part of the traffic surveillance system, you will build logic to identify vehicles that appear to
be using the same registration number. In this exercise, you will use a JavaScript function to determine
whether a vehicle with the same registration (not necessarily speeding) has been spotted at two locations
that are an impossible distance apart within a given timeframe. You will store the details (times,
registrations, locations), and this data will be used in a later lab exercise.
The main tasks for this exercise are as follows:

1. Update the event hub and add a consumer group

2. Create and configure a new Stream Analytics job

3. Start the Speed Camera app

4. Examine the generated data

5. Close jobs and apps

 Task 1: Update the event hub and add a consumer group

 Task 2: Create and configure a new Stream Analytics job

 Task 3: Start the Speed Camera app

 Task 4: Examine the generated data

 Task 5: Close jobs and apps

Results: At the end of this exercise, you will have added a new consumer group to your event hub, and
created a new Stream Analytics job that uses a UDF to identify duplicate vehicle registrations. You will also
have tested this logic using a Visual Studio app.

Exercise 3: Use Machine Learning with Stream Analytics to identify data anomalies

Scenario
For the final part of this phase of the traffic surveillance system, you will use Machine Learning with
Stream Analytics to detect consistent speed anomalies from a speed camera, to help locate traffic
accidents or other road incidents.
In this exercise, you will use Machine Learning with Stream Analytics to detect consistent speed anomalies
from a speed camera. If speeds are consistently very low for a period, the cause could be a traffic accident
or incident, so you should display a message on screen using an updated version of the app from Exercise
1.

The main tasks for this exercise are as follows:

1. Update the event hub and add a consumer group

2. Create and configure a new Stream Analytics job

3. Configure and start the Speed Camera app

4. Stop the Speed Camera app and the CaptureTrafficData Stream Analytics job

5. Download the training data

6. Create a Machine Learning workspace

7. Create a machine learning experiment

8. Create a trained model

9. Expose the trained model as a web service

10. Add another consumer group to the Speed Camera event hub
11. Create a new Service Bus queue

12. Create and configure a new Stream Analytics job

13. Start the Stream Analytics jobs


14. Configure the Location Alerts app

15. Start the Patrol Car and Speed Camera apps

16. Start the Location Alerts app and view the results
17. Lab closedown

 Task 1: Update the event hub and add a consumer group

 Task 2: Create and configure a new Stream Analytics job

 Task 3: Configure and start the Speed Camera app

 Task 4: Stop the Speed Camera app and the CaptureTrafficData Stream Analytics job

 Task 5: Download the training data

 Task 6: Create a Machine Learning workspace

 Task 7: Create a machine learning experiment

 Task 8: Create a trained model

 Task 9: Expose the trained model as a web service

 Task 10: Add another consumer group to the Speed Camera event hub

 Task 11: Create a new Service Bus queue

 Task 12: Create and configure a new Stream Analytics job

 Task 13: Start the Stream Analytics jobs

 Task 14: Configure the Location Alerts app

 Task 15: Start the Patrol Car and Speed Camera apps

 Task 16: Start the Location Alerts app and view the results

 Task 17: Lab closedown

Results: At the end of this exercise, you will have added new consumer groups to your event hub, and
created a new Stream Analytics job that works with a Visual Studio app to generate training data. You will
then create a machine learning experiment to detect anomalous data, train your model using the
generated training data, and then deploy the trained model as a web service. You will also have created a
second new Stream Analytics job that uses the Machine Learning web service, and tested the model using
Visual Studio apps.

Question: How might you use Stream Analytics UDF functions to identify specific data points
within your organization?

Question: What requirements does your organization have for deduplicating information?

Module Review and Takeaways


In this module, you have learned how to:

 Use custom functions using JavaScript in Stream Analytics.

 Incorporate Machine Learning into Stream Analytics jobs.

Review Question(s)
Question: What requirements does your organization have for implementing custom
functions and debugging jobs?

Module 4
Managing Big Data in Azure Data Lake Store
Contents:
Module Overview 4-1 
Lesson 1: The Azure Data Lake Store 4-2 

Lesson 2: Monitoring and protecting data in Data Lake Store 4-9 

Lab: Managing big data in Data Lake Store 4-16 


Module Review and Takeaways 4-21 

Module Overview
Microsoft® Azure® Data Lake Store is a hyperscale distributed file service that is part of the Azure Data
Lake collection of services. Data Lake Store plays a key role in the process of the management and
analysis of big data. Data Lake Store, sometimes referred to as an analytics store, provides a staging area
for substantial amounts of data where transformation or preparation and other analytics jobs are
performed. Data Lake Store fully incorporates tools and interfaces—such as Visual Studio®, PowerShell™,
and U-SQL™—that are commonly used by developers, data scientists, engineers, and architects. It is
extensible through its compatibility with open-source big data solutions like those found within the
Hadoop ecosystem.

Objectives
By the end of this module, you will be able to:

 Use Data Lake Store for managing massive datasets.

 Protect and monitor data that resides in a Data Lake Store.



Lesson 1
The Azure Data Lake Store
This lesson describes how to use the Data Lake Store service to create and manage large-scale storage
structures. You use a Data Lake Store to hold vast amounts of data, unconstrained by the storage capacity
limits of a single computer. However, to utilize large-scale storage effectively, you have to understand
how to structure and manage the data so that you find it quickly.

Lesson Objectives
At the end of this lesson, you should be able to:

 Describe the purpose of the Data Lake Store service.

 Create a Data Lake Store and storage account.

 Create a hierarchy in which to store data.

 Populate the Data Lake Store with data using the Azure portal and other methods.
 Describe how to optimize access to a Data Lake Store.

What is a Data Lake Store?


Data Lake Store is a cloud-enabled repository
that has unlimited capacity. It is designed to
store or stage structured, semi-structured, and
unstructured data. By building on concepts
utilized by the Apache Hadoop Distributed File
System (HDFS), Data Lake Store provides an
enterprise-ready, cluster-less analytics store that
encrypts data throughout the data life cycle, and
manages access and authorization through a
Role-Based Access Control (RBAC) model. Data
Lake Store is compatible with many existing
analytical solutions and is used in place of HDFS
with services such as Apache Storm, Spark, MapReduce, and Hive running on HDInsight 3.2 and later.

There are no limits for account sizes, file sizes, or the amount of data stored in a Data Lake store. It holds
files ranging in size from kilobytes to petabytes, with no limit on the duration during which a file is stored.
Examples of information typically held in a Data Lake store include CSV or text files, data streamed from
medical devices or automobiles, video and audio files, databases, and much more.

Data Lake Store is an analytics store that is primed for high capacity I/O operations and compute
functions. This distinguishes Data Lake Store from other cloud-based storage solutions that are typically
designed for relatively static data, usually stored as blobs (such as in OneDrive® or Azure storage
containers). However, the complexity and design of the underlying storage system comes at a premium
when compared to other Azure storage solutions, and might make it more expensive to use. Therefore, it is
important to store data efficiently to minimize these costs.

For more information, see:


Overview of Data Lake Store
https://aka.ms/W80vfs

Comparing Azure Data Lake Store and Azure Blob storage


https://aka.ms/Mv8yk6

Creating a Data Lake Store


You use the Azure portal to create a Data Lake
Store. When creating a Data Lake Store, the
owner must populate the following attributes:

Name: The name of your Data Lake Store must be lowercase and contain between three and 24
alphanumeric characters. The Data Lake Store will also be given a suffix so that the name will
resemble: myadsl.azuredatalakestore.net

Subscription: Any available subscription that is assigned to your Azure Active Directory
(Azure AD) account.

Resource Group: All resources within Azure must reside in a Resource Group (RG). You have the
option to create one or to select an existing one from a drop-down menu.

Location: At the time of writing, the available regions for Data Lake Stores are:
 East US 2
 Central US
 North Europe
Other regions are expected to come online.

Pricing: The pricing model is either pay-as-you-go or a monthly payment based on
terabytes—the cost varies between regions.

Encryption: Encryption is enabled by default using keys that are managed by the Data Lake
Store service. Other options include:
 No encryption
 Encryption using keys from a personal Key Vault account

For detailed information on using the Azure portal to create a Data Lake Store account, see:

Get started with Data Lake Store using the Azure portal
https://aka.ms/Yixagp

You can also create a Data Lake Store programmatically, using interfaces that are available for PowerShell,
Azure CLI, and Visual Studio. Modules are also available for open source languages, such as Python. You
use these to automate many common Data Lake Store tasks.

The following example shows how to create a Data Lake Store using the PowerShell interface:

Creating a Data Lake Store using PowerShell


# Log in to Azure
Login-AzureRmAccount

# Specify a subscription
Set-AzureRmContext -SubscriptionId <subscription ID>

# Register the Data Lake Store provider


Register-AzureRmResourceProvider -ProviderNamespace "Microsoft.DataLakeStore"

# Create a new Data Lake Store account


$resourceGroupName = <name of existing resource group>
$dataLakeStoreName = <name of new Data Lake Store>
New-AzureRmDataLakeStoreAccount -ResourceGroupName $resourceGroupName -Name $dataLakeStoreName -Location "East US 2"

# Verify that the Data Lake Store account has been created
Test-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName

This example uses the default settings for encryption and the pricing tier, but you can modify these values
by using the –Encryption and –Tier parameters to the New-AzureRmDataLakeStoreAccount command.

For additional information, see:

Create a Data Lake Store account


https://aka.ms/Nuklln

Storage structures within a Data Lake Store


Data Lake Store implements a hierarchical file
system structure, similar to that used by many
operating systems including Windows®, Linux,
and HDFS. At the top is a root folder to which
you add new subfolders. You upload files to any
folder (including the root) if you have the
appropriate security rights (described in Lesson 2
of this module).

You use systematic folder names to organize your data. For example, you could create
separate folders for different types of data
(product details, accounts, users, and so on).
Another recommended approach for handling time-sensitive data is to create a folder for each day,
possibly with subfolders for the hour of the day. Azure Stream Analytics supports this structure when you

specify a Data Lake Store as an output sink. However, whichever scheme you use, you should ensure that
you can locate data quickly—don’t just create folders in which to “dump” data and depend on the
file name to determine what the file contains. Additionally, analytics processors such as Data Lake
Analytics work better with a few large data files, rather than many small ones—you should structure
datasets into large chunks, each of which is processed as a single item. Microsoft recommends that you
organize datasets into files of 256 MB or larger.

Note: If you have many small files, consider using a preprocessor that combines these files
into larger pieces before passing them to an analytics processor.

The simplest way to create folders, upload files, and navigate the structure of a Data Lake Store is to use
the Data Explorer that is available on the Data Lake Store blade in the Azure portal. However, if you are
automating tasks, you use the programmatic interfaces that are available to the many programming
languages supported by Azure.

The following code shows how to create a new folder and view the contents of a folder by using
PowerShell:

Creating and viewing folders in a Data Lake Store


$dataLakeStoreName = <name of Data Lake Store>

# Create a folder named newdirectory


$myrootdir = "/"
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $myrootdir/newdirectory

# List all files and folders in the root directory
Get-AzureRmDataLakeStoreChildItem -AccountName $dataLakeStoreName -Path $myrootdir
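
The example above creates a single folder. The folder-per-day approach described earlier in this topic can
be automated in the same way. The following sketch builds a dated path from the current date and hour;
the account name and the camerafeeds folder are hypothetical placeholders, and the cmdlet is assumed to
create any missing parent folders in the path.

Creating a date-based folder hierarchy using PowerShell

$dataLakeStoreName = <name of Data Lake Store>

# Build a path such as /camerafeeds/2018/06/21/14 from the current date and hour
$now = Get-Date
$datedFolder = "/camerafeeds/{0:yyyy}/{0:MM}/{0:dd}/{0:HH}" -f $now

# Create the folder (assumed to also create any missing parent folders in the path)
New-AzureRmDataLakeStoreItem -Folder -AccountName $dataLakeStoreName -Path $datedFolder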

Demonstration: Creating and using a Data Lake Store


In this demonstration, you will see how to:

 Use the Azure portal to create a new Data Lake Store.

 Use the Azure portal to create Data Lake Store folders and upload data.

 Use PowerShell to access Data Lake Store resources.

 Use the Azure portal to download data from Data Lake Store.

Populating the Data Lake Store


You transfer data into a Data Lake Store as the
result of a streaming process—for example, when
you specify a Data Lake Store as the output sink
of a Stream Analytics job. You can also perform
large-scale bulk or batch transfers directly from
other repositories.

Bulk data loading happens when you want to transfer anything from a single file to a large
dataset—either as a one-off task (for example,
historical data) or on a schedule (such as log
shipping). There are several ways to accomplish a
bulk load and more are in development. The size
of your dataset, your network bandwidth, and other factors will play a role in deciding which method or
toolset you use.
Perhaps the most straightforward method of bulk loading extremely large datasets is to use the Azure
disk shipping service, where you copy data to a physical drive using a special tool and physically ship the
disk to a regional datacenter. For more information on using the Import/Export service with Data Lake,
see:
Use the Azure Import/Export service for offline copy of data to Data Lake Store
https://aka.ms/Uc9jzx

Whilst the Import/Export service provides a fast way to onboard large amounts of data, for smaller
datasets you will likely use one of several software-based tools, such as PowerShell, AzCopy, AdlCopy,
Visual Studio, or Azure Data Factory.
You upload and download individual files programmatically using PowerShell. You use the Import-
AzureRmDataLakeStoreItem cmdlet to upload a file, and the Export-AzureRmDataLakeStoreItem cmdlet
to download a file.
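
The following sketch shows both cmdlets in use. The account name and file paths are hypothetical
placeholders, and the -AccountName parameter name is assumed to match the other Data Lake Store cmdlets
used in this module.

Uploading and downloading a single file using PowerShell

$dataLakeStoreName = <name of Data Lake Store>

# Upload a local file to a folder in the Data Lake Store
Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path "C:\datafolder\VehicleData.csv" -Destination "/camerafeeds/VehicleData.csv"

# Download the same file back to the local computer
Export-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path "/camerafeeds/VehicleData.csv" -Destination "C:\datafolder\VehicleData-copy.csv"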

If you need to transfer multiple files quickly, there’s a two-stage process. The first stage involves moving
the data into Blob storage, typically using the AzCopy utility. The second stage involves moving the data
across Azure’s data plane from Blob storage to your Data Lake Store; the AdlCopy utility is ideal for this
part of the process.

The following commands show how to use AzCopy and AdlCopy to transfer all CSV files in a specified
folder on an on-premises computer into a Data Lake Store:

Copy data into a Data Lake Store


# Copy CSV files from C:\datafolder to Blob storage
$account="<blob storage account name>"
$container="<blob container name>"
$key="<blob storage access key>"
$sourceFolder="C:\datafolder"
AzCopy.exe /Source:$sourceFolder /Dest:https://$account.blob.core.windows.net/$container /DestKey:$key /Pattern:"*.csv"

# Copy CSV files from Blob storage to Data Lake Storage


$destFolder="<destination folder in Data Lake Storage>"
AdlCopy /source https://$account.blob.core.windows.net/$container/ /dest adl://$dataLakeStoreName.azuredatalakestore.net/$destFolder/ /sourcekey $key /Pattern:"*.csv"

Note that AdlCopy operates in two modes: standalone and by using a Data Lake Analytics account. In
standalone mode, the AdlCopy utility uses resources provided by the Data Lake Store service, and
performance might be unpredictable, especially if you are transferring large files. Using a Data Lake
Analytics account causes AdlCopy to run as an analytics job using the resources that you specify (and are
billed for). This mode of operation is more predictable. For more detailed information on using AzCopy
and AdlCopy, see:

Transfer data with the AzCopy feature on Windows


https://aka.ms/I1nouq

Copy data from Azure Storage Blobs to Data Lake Store


https://aka.ms/Stlp6r

Note: Utilities are also available for transferring data from HDInsight cluster storage
(Distcp), and Azure SQL Database (Sqoop).

You might also use the Cloud Explorer extension in Visual Studio—this is useful if you’re working on
complex projects. Visual Studio gives access to Blob storage and Data Lake Stores, providing the capability
for you to transfer data in and out of Data Lake Store. You also use Cloud Explorer to transfer data
between stores held in separate Azure accounts. However, Visual Studio enforces a queuing mechanism
for requests, so the rate of transfer in and out of Data Lake Store will depend on how many operations
you are attempting to perform from Visual Studio at any given time.

Optimizing access to a Data Lake Store


Ingesting large volumes of data into a Data Lake
Store on a regular basis is a time-consuming
operation that requires careful planning and
preparation. The most important factor is
arguably the network bandwidth available for
uploading data from the source to Azure—make
sure that you have sufficient capacity available. If
necessary, you should consider using Azure
ExpressRoute to implement a private dedicated
connection directly to an Azure datacenter. For
more information, see:

ExpressRoute
https://aka.ms/Huxmw2

You should also ensure that the tools you are using (AzCopy, AdlCopy, PowerShell, and so on) are making
the best use of resources. Maximize parallelization wherever possible. For example, if you are using
AdlCopy, run using a Data Lake Analytics account and specify an appropriate number of Data Lake
Analytics units. If you are using the PowerShell Import-AzureRmDataLakeStoreItem cmdlet, specify the
PerFileThreadCount and ConcurrentFileCount parameters appropriately. Use the PerFileThreadCount
parameter to set the number of threads that are used in parallel for uploading each file. The
ConcurrentFileCount parameter indicates the maximum number of files that might be uploaded
concurrently.
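
As a sketch of these parameters in use (the folder names and the values shown are illustrative only; the
best settings depend on your files and the available bandwidth):

Uploading a folder with increased parallelism

$dataLakeStoreName = <name of Data Lake Store>

# Upload a local folder, using up to 32 threads per file and up to 8 files at a time
# (illustrative values only; tune these for your own data and network capacity)
Import-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path "C:\datafolder" -Destination "/bulkdata" -Recurse -PerFileThreadCount 32 -ConcurrentFileCount 8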

For further information and considerations on using these parameters, see:

Performance guidance while using PowerShell


https://aka.ms/Bu68sn

It’s also important to understand that you are responsible for handling disaster recovery. Data Lake Store
is robust and managed by Microsoft, but you should always ensure that you have at least two copies of
critical data stored in separate regions to protect you from unscheduled outages and other regional
disasters. You automate this process by using scripts that run AdlCopy to copy data from one store to
another, or by using Data Factory to perform these tasks according to a regular schedule. This approach is
discussed further in Module 9: Automating the Data Flow with Azure Data Factory.

You should also protect your data to ensure that it can’t be overwritten or deleted (accidentally or
maliciously) by applying the appropriate security controls, as discussed in Lesson 2. Additionally, you
should consider applying an Azure resource lock over a Data Lake Storage account to prevent the entire
account from being removed.
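
For example, the following sketch applies a delete lock to a Data Lake Store account. It assumes the
New-AzureRmResourceLock cmdlet from the AzureRM.Resources module; the lock name and resource names are
placeholders.

Applying a resource lock to a Data Lake Store account

# Prevent the account from being deleted (data in the store can still be read and modified)
New-AzureRmResourceLock -LockName "ProtectDataLakeStore" -LockLevel CanNotDelete -ResourceName <name of Data Lake Store> -ResourceType "Microsoft.DataLakeStore/accounts" -ResourceGroupName <resource group name>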

Demonstration: Working with Data Lake Store using PowerShell


In this demonstration, you will see how to:
 Part 1—use PowerShell to upload data to Data Lake Store:

o Use PowerShell to upload a single file directly to Data Lake Store.

o Use Data Explorer to verify the file upload.


o Repeat the file upload with an increased PerFileThreadCount.

 Part 2—upload a set of files to Data Lake Store from Blob storage:

o Use AzCopy to upload files to new Blob storage.

o Use Cloud Explorer in Visual Studio to examine the uploaded blob.

o Use AdlCopy to transfer files from Blob storage to Data Lake Store.
Question: What makes a Data Lake Store an unlikely choice for the replacement of
corporate file shares?

Lesson 2
Monitoring and protecting data in Data Lake Store
In this lesson, you will learn about the techniques that are available for protecting the data held in Data
Lake Store, and how to prevent unauthorized access to this data. You will also see how to monitor data
and track attempts to access that data.

You will consider the Access Control List (ACL) model employed by Data Lake Store. Most of the
techniques and concepts will be familiar to those with a Microsoft background but one subject in
particular—the POSIX ACL model—might be entirely new. You will also learn about additional features
such as network security and encryption, which typically work in the same way as they do in other Azure
services.

Lesson Objectives
By the end of this lesson, you should be able to:
 Describe how to encrypt data in Data Lake Store, and manage the encryption keys.

 Prevent access to Data Lake Store requests that originate from unknown sites.

 Authenticate users who need to access data.


 Authorize the operations that these users perform.

 Audit the operations that take place in a Data Lake Store.

 Explain how to apply authentication and security in applications that use a Data Lake Store.

Encrypting data
All interactions with a Data Lake Store take place
over an HTTPS connection. This protocol helps to
ensure that all data that enters or exits the
service is encrypted. However, a Data Lake Store
also encrypts data as it is stored—even if a
successful attempt is made to break into the
service, the data itself is unusable without the
appropriate encryption keys.

A Data Lake Store comes with encryption enabled by default. The easiest way to manage
the encryption of data held within a store is to let
the service manage it for you. However, there will
be times when even the most ironclad terms and conditions professed by the hosting company will not be
enough to satisfy your Chief Information Security Officer (CISO) or others within your organization. Data
Lake provides options, albeit with overhead, for these circumstances.

You use Azure’s Key Vault service for the management of encryption keys. There are two modes of master
encryption key (MEK) management in Data Lake Store:

 Service managed keys

 Customer managed keys



Access to the MEK, or MEKs, is required to access data within a store. The following comparison lists the
capabilities of the two methods. Note that, after you choose the method at the time
of creation, it can’t be changed unless data is migrated to a new store.

When is my data encrypted?
 Service managed keys: It is encrypted prior to being stored.
 Customer managed keys: It is encrypted prior to being stored.

Where does my MEK reside?
 Service managed keys: Key Vault.
 Customer managed keys: Key Vault.

Are encryption keys stored in the clear anywhere outside of the Key Vault?
 Service managed keys: No.
 Customer managed keys: No.

Is it possible to retrieve my MEK from Key Vault?
 Service managed keys: No. After the MEK is stored in Key Vault, it is locked and is then only
used for encryption and decryption.
 Customer managed keys: No. After the MEK is stored in Key Vault, it is locked and is then only
used for encryption and decryption.

Who owns the Key Vault instance and the MEK?
 Service managed keys: The Data Lake Store service.
 Customer managed keys: The owner of the Azure subscription also owns the instance of Key Vault
that stores the MEK. Note that a Hardware Security Module (HSM) can also be used—as provided by
the service.

Can you revoke access to the MEK for the Data Lake Store service?
 Service managed keys: No.
 Customer managed keys: Yes. You manage access through your instance of Key Vault and by adding
and removing permissions for the Data Lake service.

Can you permanently delete the MEK?
 Service managed keys: No.
 Customer managed keys: Yes, but the data will be unreadable. You will need a backup of the MEK
to access the data.

The following list summarizes the three key types used within the design of Data Lake encryption:

 Master Encryption Key (MEK). Associated with a Data Lake Store account, stored in Key Vault,
and asymmetric. The MEK can be managed by the Data Lake Store service or by you.

 Data Encryption Key (DEK). Associated with a Data Lake Store account, persistently stored and
managed by the Data Lake Store service, and symmetric. The DEK is encrypted by the MEK and stays
with the data.

 Block Encryption Key (BEK). Associated with a block of data, not stored persistently, and
symmetric. The BEK is built using the DEK and the block of data.

Encryption in Data Lake Store is transparent. This feature means that data is encrypted before being
persisted, and then decrypted prior to retrieval. This is especially important to note in the context of
applications that access the store through its APIs. The good news is that, by using this model, no special
consideration is required when interacting with Data Lake.

For more information about Data Lake Store encryption, see:


Encryption of data in Data Lake Store

https://aka.ms/Unxahe
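
If your organization requires customer managed keys, the store must be configured to use them when it is
created. The following sketch outlines one way to do this from PowerShell. It assumes that the
New-AzureRmDataLakeStoreAccount cmdlet accepts -Encryption, -KeyVaultId, -KeyName, and -KeyVersion
parameters, that a Key Vault and key already exist, and that the Data Lake Store service has been granted
permission to use the key; the names are placeholders.

Creating a Data Lake Store that uses a customer managed key

$resourceGroupName = <name of existing resource group>
$dataLakeStoreName = <name of new Data Lake Store>

# Retrieve the existing Key Vault and the key that will act as the MEK
$vault = Get-AzureRmKeyVault -VaultName <key vault name> -ResourceGroupName $resourceGroupName
$key = Get-AzureKeyVaultKey -VaultName $vault.VaultName -Name <key name>

# Create the store, specifying customer (user) managed encryption
New-AzureRmDataLakeStoreAccount -ResourceGroupName $resourceGroupName -Name $dataLakeStoreName -Location "East US 2" -Encryption UserManaged -KeyVaultId $vault.ResourceId -KeyName $key.Name -KeyVersion $key.Version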

Implementing network isolation


Like other data services provided by Azure, Data
Lake comes with a simple software-based firewall
that locks down HTTPS access to specified IP
addresses. You use this mechanism to prevent
access from unexpected locations, even if users at
these locations appear to have valid credentials.

If you use this method of security, it's important that clients have an IP address from a static range
of values—otherwise they might be periodically
unable to connect. Additionally, if you enable the
firewall for a Data Lake Store, you should also
enable IP access to Azure services and
applications (this is a separate setting). If you fail to do this, services such as Data Lake Analytics and Azure
Stream Analytics will be blocked by the firewall, even if they are running using the same Azure
subscription as the owner of the Data Lake Store.

Note that the firewall is disabled by default, but you enable it and set client IP address ranges using the
Azure portal. You can also perform these tasks programmatically. The following example uses PowerShell:

Configuring the Data Lake Store firewall


# Install the latest version of the AzureRM.DataLakeStore module (this includes the firewall cmdlets)
Install-Module -Name AzureRM.DataLakeStore -Force

# Log in to Azure
Login-AzureRmAccount

# Specify a subscription
Set-AzureRmContext -SubscriptionId <subscription ID>

# Enable the firewall for the specified Data Lake Store account
$accountName = "<account name>"
Set-AzureRmDataLakeStoreAccount -Name $accountName -FirewallState Enabled

# Enable Azure services and applications to connect (including Data Lake Analytics)
Set-AzureRmDataLakeStoreAccount -Name $accountName -AllowAzureIpState Enabled

# Create a firewall rule


$firewallRuleName = "MyRule"
$startIpAddress = "127.0.0.1"
$endIpAddress = "127.0.0.2"
Add-AzureRmDataLakeStoreFirewallRule -AccountName $accountName -Name $firewallRuleName -StartIpAddress $startIpAddress -EndIpAddress $endIpAddress

Authenticating access to a Data Lake Store


Data Lake Store utilizes Azure Active Directory
(Azure AD) to authenticate users who run custom
applications and services. Your code operates
with Azure AD using two mechanisms:

 End-user authentication

 Service-to-service authentication

End-user authentication requires that you provide an Azure AD native application that
authenticates users against your own directory by
generating an OAuth 2.0 access token from the
user’s credentials. You can attach this token to
requests made to the Data Lake Store by your custom application, and (if it is valid), the user’s request will
be processed. The Azure AD native application might use the OAuth 2.0 pop-up to prompt users for their
credentials, or the custom application can provide its own mechanism to gather this information and pass
users directly to the Azure AD native application for authentication. For more information about creating
and using an Azure AD native application to authenticate users, see:

Use the portal to create an Azure AD application and service principal that accesses resources
https://aka.ms/Kk4w9t

Service-to-service authentication occurs when a custom web application, which operates on behalf of
users, needs to access a resource, such as Data Lake Store. The web application runs using its own identity
rather than impersonating a user, and must therefore be authenticated before it retrieves or modifies data
in the store. This approach requires you to create an Azure AD web application to perform the
authentication and issue OAuth 2.0 access tokens. Your custom web application passes identity
information to the Azure AD web application, which validates the credentials and returns access tokens
that your web application uses to access the Data Lake Store. For more information, see:

Service-to-service authentication with Data Lake Store using Azure AD


https://aka.ms/Av3ahf
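
As an illustration of the service-to-service approach, the following sketch registers an Azure AD
application and creates a service principal for it using the AzureRM.Resources cmdlets. The display name
and URI are hypothetical, and adding a client secret or certificate to the application (so that your web
application can request tokens) is not shown.

Registering an Azure AD application and service principal

# Register an application entry in Azure AD for the custom web application
$app = New-AzureRmADApplication -DisplayName "TrafficWebApp" -IdentifierUris "https://adatum.example.com/trafficwebapp"

# Create a service principal so that the application can be granted access to resources
$sp = New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId

# The service principal object ID ($sp.Id) can now be used in RBAC role assignments and ACLs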

Authorizing users
User authorization operates at two levels in Data
Lake Store. You use RBAC to specify or limit the
operations that a user performs across the entire
store; you apply Access Control Lists (ACLs) to
control the operations that users perform over
specific files and folders in the store.

The simplest way to assign RBAC roles to users is through the Access Control blade for the Data
Lake Store in the Azure portal. This blade
provides a GUI that displays a number of
predefined roles (including Owner, Contributor,
and Reader) each of which have an associated set
of privileges. For example, the Owner role has complete control over the Data Lake Store, the Contributor

role creates folders and uploads files to the store, and the Reader role accesses files held in the store but
does not modify them. You assign users from your Azure AD directory to one or more of these roles to
give them the corresponding access rights. Note that RBAC is not specific to Data Lake Store, and can be
applied across many Azure services. For more information about RBAC, visit:

Get started with Role-Based Access Control in the Azure portal


https://aka.ms/Q0bvlb

ACLs give you a finer-grained degree of control over individual files and folders. They define three
privileges: Read (R), Write (W), and Execute (X). You assign these privileges to different sets of users:
owners, named users and groups, and everyone else. An owner is the user who creates a file or folder.
Named users and groups refer to identities held in your Azure AD directory (this includes system-defined
groups and identities, such as “Azure Key Vault” and “Azure Machine Learning”). You assign ACLs to files
and folders using the Access tool in Data Explorer in the Azure portal. You also assign ACLs
programmatically when you upload files and create folders. For more information, see:

Assign users or a security group as ACLs to the Data Lake Store file system
https://aka.ms/Yqqghf

Note: Users who have the Owner RBAC role for the Data Lake Account are known as
“superusers” for ACL purposes. This means that they are not subject to any of the restrictions
imposed by ACLs, and always have full control over all files and folders, regardless of whether
they have been assigned Read, Write, or Execute privileges.

In the context of ACLs, the terms Read, Write, and Execute are a result of history (they have been inherited
from POSIX), and don’t necessarily mean exactly what their names imply. Read and Write permissions over
a file enable the specified user or group to read or write (append) the contents of that file. Execute
permission has no meaning over files in a Data Lake Store and is ignored. Read and Write permissions
over a folder work in conjunction with the Execute privilege and enable a user to read the contents of a
folder (this requires Read and Execute permissions), or to write to a folder (this requires Write and Execute
privileges). Execute permission by itself gives the user the ability to access a file in that folder and traverse
through the folder into subfolders—but not to actually list the contents of the folder. Only the owner of a
file or folder, or a superuser, can set the permissions for that file or folder.

For example, to read the file mydata.txt, located in the folder /folder1/folder2, you must have:

 Execute access on the root folder (/)

 Execute access on the folder1 folder

 Execute access on the folder2 folder

 Read access on the mydata.txt file

To modify the mydata.txt file, you must have:

 Execute access on the root folder (/)


 Execute access on the folder1 folder

 Execute access on the folder2 folder

 Write access on the mydata.txt file



To delete the mydata.txt file, you must have:

 Execute access on the root folder (/)

 Execute access on the folder1 folder

 Execute and Write access on the folder2 folder

Note that you don’t require any permissions on the mydata.txt file itself.

To list all the files in the folder2 folder, you must have:

 Execute access on the root folder (/)

 Execute access on the folder1 folder

 Execute and Read access on the folder2 folder

For a more detailed discussion on ACLs with Data Lake Store, see:

Access control in Data Lake Store


https://aka.ms/Uzmh9w
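
The following sketch shows how the RBAC and ACL settings described above might be applied from PowerShell,
using the mydata.txt example. It assumes the New-AzureRmRoleAssignment and
Set-AzureRmDataLakeStoreItemAclEntry cmdlets; the account name, paths, and user object ID are placeholders.

Assigning an RBAC role and ACL entries using PowerShell

$dataLakeStoreName = <name of Data Lake Store>
$userObjectId = <object ID of an Azure AD user>

# RBAC: give the user read access over the Data Lake Store account as a whole
$account = Get-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName
New-AzureRmRoleAssignment -ObjectId $userObjectId -RoleDefinitionName "Reader" -Scope $account.Id

# ACLs: grant Execute on every folder in the path, and Read on the file itself,
# matching the requirements listed above for reading /folder1/folder2/mydata.txt
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path "/" -AceType User -Id $userObjectId -Permissions Execute
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path "/folder1" -AceType User -Id $userObjectId -Permissions Execute
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path "/folder1/folder2" -AceType User -Id $userObjectId -Permissions Execute
Set-AzureRmDataLakeStoreItemAclEntry -AccountName $dataLakeStoreName -Path "/folder1/folder2/mydata.txt" -AceType User -Id $userObjectId -Permissions Read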

Auditing and diagnostics


In production systems, it is typically compulsory
to have an audit trail for regulatory and
compliance measures. In addition, diagnostic
logging is also useful for troubleshooting access
issues in development environments.
Audit and diagnostic logging (also known as
request logging) is disabled by default, but you
enable it through the Diagnostic Log blade for a
Data Lake Store in the Azure portal.

Note: Request logging stores information about every request made to the Data Lake Store. Audit logging captures more detailed
information; a single request could perform multiple operations, and each of these operations are
recorded in the audit log. The audit log also includes the identity of the caller who performs the
operation.

Like logging in other Azure services, Data Lake Store logging enables you to capture auditing and request
log information to three different sinks:

 Storage account. This is useful for the batch processing of historic logs. A Blob storage account
needs to be either created or defined. When logging has been enabled, you download the log data
from the Diagnostics Log blade for the Data Lake Store in the Azure portal, or retrieve it directly from
the Blob storage account.

 Event hub. This is useful in cases where you need to alert specific events in real time. For example,
you might run a Stream Analytics job that filters and processes this data, and perhaps incorporate
Machine Learning to spot unusual access patterns that signify an attempt to break in to the system
and steal data.

 Log analytics. This destination is useful for operations teams who use Azure Operations Manager and
its dashboard capabilities.
For more information, see:

Accessing diagnostic logs for Data Lake Store

https://aka.ms/Pxudma
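
If you prefer to enable logging from a script rather than the portal, the following sketch outlines one
possible approach. It assumes the Set-AzureRmDiagnosticSetting cmdlet from the AzureRM.Insights module and
that the Data Lake Store log categories are named Audit and Requests; the storage account ID is a
placeholder.

Enabling audit and request logging to a storage account

# Identify the Data Lake Store and the storage account that will receive the logs
$account = Get-AzureRmDataLakeStoreAccount -Name <name of Data Lake Store>
$storageAccountId = <resource ID of a Blob storage account>

# Send audit and request logs for the store to the storage account
Set-AzureRmDiagnosticSetting -ResourceId $account.Id -StorageAccountId $storageAccountId -Enabled $true -Categories Audit,Requests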

Demonstration: Using Data Lake Store in an application


In this demonstration, you will see how to:

 Add a guest user to Azure AD.

 Use an app to test user access to Data Lake Store.


 Set and test permissions to Data Lake Store resources.

Check Your Knowledge

Question
Which of the following statements gives a correct difference between a Windows filesystem
and that of a Data Lake Store?

Select the correct answer.

 The Data Lake Store uses the POSIX model of access control.

 Files inherit the permissions of the folder in which they are created.

 Data Lake Store folder permissions are fully compatible with NTFS.

 All directories in a path must have explicit permissions assigned.

Lab: Managing big data in Data Lake Store


Scenario
You work for Adatum as a data engineer, and have been asked to build a traffic surveillance system for
traffic police. This system must be able to analyze significant amounts of dynamically streamed data—
captured from speed cameras and automatic number plate recognition (ANPR) devices—and then
crosscheck the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that reads vehicle registration
plates.

For the next phase of the project, you will use a range of tools to enable batch mode, and automated
operations, with Data Lake Store. You will also add security to your Data Lake Store, using custom ACLs
and data encryption that uses your own managed key.

Objectives
After completing this lab, you will be able to:

 Perform Data Lake Store operations using PowerShell.


 Upload bulk data to Data Lake Store.

 Set ACLs on Data Lake Store folders and files.

 Encrypt Data Lake Store data using your own key.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps regularly, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Estimated Time: 90 minutes

Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

Exercise 1: Perform Data Lake Store operations using PowerShell


Scenario
To enable automated and batch mode operations with Data Lake Store, you will first investigate the use of
PowerShell cmdlets for Data Lake Store. In this exercise, you will use PowerShell to deploy a new Data
Lake Store, and to manage folders and files in this store.

The main tasks for this exercise are as follows:


1. Install AzCopy

2. Install AdlCopy

3. Prepare the PowerShell environment

4. Use PowerShell to create a new Data Lake Store

5. Use PowerShell to manage files and folders in Data Lake Store



 Task 1: Install AzCopy

 Task 2: Install AdlCopy

 Task 3: Prepare the PowerShell environment

 Task 4: Use PowerShell to create a new Data Lake Store

 Task 5: Use PowerShell to manage files and folders in Data Lake Store

Results: At the end of this exercise, you will have used PowerShell cmdlets to:

Create a new Data Lake Store (in a new resource group).

Create a folder in this store.


Upload a file to the folder.

Display the files in the folder.

Exercise 2: Upload bulk data to Data Lake Store


Scenario
To further enhance automated and batch mode operations with Data Lake Store, you will next investigate
the command-line tools that enable bulk data upload to Azure Blob storage, and bulk transfers of data
from Azure Blob storage to Data Lake Store. In this exercise, you will use AzCopy to upload data to Blob
storage, and then use AdlCopy to transfer this data to a Data Lake Store.

The main tasks for this exercise are as follows:

1. Create a new Storage account and Blob container


2. Upload files to the new Blob container

3. Verify the file upload using Cloud Explorer

4. Use AdlCopy to copy the files from Blob storage to Data Lake Store

5. Verify the data copy using Data Explorer

 Task 1: Create a new Storage account and Blob container

 Task 2: Upload files to the new Blob container

 Task 3: Verify the file upload using Cloud Explorer

 Task 4: Use AdlCopy to copy the files from Blob storage to Data Lake Store

 Task 5: Verify the data copy using Data Explorer



Results: At the end of this exercise, you will have:

Created a new Azure storage account and Blob container.

Uploaded a set of files and folders to the container using AzCopy.

Verified the upload using Cloud Explorer in Visual Studio.


Copied the data from Blob storage to Data Lake Store using AdlCopy.

Verified the copy using Data Explorer in the Azure portal.

Exercise 3: Set ACLs on Data Lake Store folders and files


Scenario
Adatum’s new traffic surveillance system will use Data Lake Store for primary data storage; therefore, it’s
essential that this data is protected. In this exercise, you will add security to your Data Lake Store, using
custom ACLs. To test and evaluate the security model in Data Lake Store, you will use a guest user
account.

The main tasks for this exercise are as follows:

1. Create a Microsoft account for a Data Lake guest user

2. Add the guest user account to Azure Active Directory

3. Set guest user permissions for a Data Lake Store folder

4. Start a PowerShell session as the guest user


5. Use PowerShell to attempt to list folder contents

6. Set permissions for the root folder

7. Use PowerShell to list the folder contents after a permissions change

8. Use PowerShell to attempt to read a file

9. Set permissions for a file

10. Use PowerShell to read a file after a permissions change


11. Use PowerShell to attempt to upload a new file

12. Edit folder permissions to enable file upload

13. Use PowerShell to overwrite a file

14. Use PowerShell to attempt to overwrite a file after a folder permissions change

 Task 1: Create a Microsoft account for a Data Lake guest user

 Task 2: Add the guest user account to Azure Active Directory

 Task 3: Set guest user permissions for a Data Lake Store folder

 Task 4: Start a PowerShell session as the guest user

 Task 5: Use PowerShell to attempt to list folder contents

 Task 6: Set permissions for the root folder

 Task 7: Use PowerShell to list the folder contents after a permissions change

 Task 8: Use PowerShell to attempt to read a file

 Task 9: Set permissions for a file

 Task 10: Use PowerShell to read a file after a permissions change

 Task 11: Use PowerShell to attempt to upload a new file

 Task 12: Edit folder permissions to enable file upload

 Task 13: Use PowerShell to overwrite a file

 Task 14: Use PowerShell to attempt to overwrite a file after a folder permissions
change

Results: At the end of this exercise, you will have created a guest user account in Azure AD, and then
tested the ability of this account to view folder contents, open files, and upload and update files in Data
Lake Store, depending on the specific permissions that are set. You will use the Azure portal to manage
the permissions, and use PowerShell as the guest user environment.

Exercise 4: Encrypt Data Lake Store data using your own key
Scenario
As you have already seen, Adatum is building a traffic surveillance system that will use Data Lake Store
for primary data storage. Therefore, you will also configure additional security for your Data Lake Store by
setting up data encryption using your own managed key. In this exercise, you will create a new Key Vault
and key, use this key to protect a new Data Lake Store, and then investigate the effects of key deletion
and key restoration on the ability to access data.

The main tasks for this exercise are as follows:

1. Create a new Key Vault account and key

2. Configure a new Data Lake storage account to use Key Vault encryption

3. Use Data Explorer to upload a file and verify its contents

4. Back up the Key Vault key, and then delete the key

5. Attempt to access encrypted data without a key



6. Attempt to upload data to encrypted storage without a key

7. Restore a deleted key and verify data access

8. Lab closedown

 Task 1: Create a new Key Vault account and key

 Task 2: Configure a new Data Lake storage account to use Key Vault encryption

 Task 3: Use Data Explorer to upload a file and verify its contents

 Task 4: Back up the Key Vault key, and then delete the key

 Task 5: Attempt to access encrypted data without a key

 Task 6: Attempt to upload data to encrypted storage without a key

 Task 7: Restore a deleted key and verify data access

 Task 8: Lab closedown

Results: At the end of this exercise, you will have created a new Key Vault and key, used this key to
encrypt a new Data Lake Store, uploaded data to the store, created a key backup and then deleted the
key. You will also have attempted to access and upload data after key deletion, and restored the deleted
key and verified data access.

Question: Why might you set a default permission entry on a folder in Data Lake Store?

Question: Is encryption using Key Vault the only way to encrypt data at rest properly in Data
Lake Store?

Module Review and Takeaways


In this module, you learned about:

 Using Data Lake Store for managing massive datasets.

 Protecting and monitoring data that resides in a Data Lake Store.



Module 5
Processing big data using Azure Data Lake Analytics
Contents:
Module Overview 5-1 
Lesson 1: Introduction to Azure Data Lake Analytics 5-2 

Lesson 2: Analyzing data with U-SQL 5-7 

Lesson 3: Sorting, grouping, and joining data 5-23 


Lab: Processing big data using Data Lake Analytics 5-36 

Module Review and Takeaways 5-40 

Module Overview
Microsoft® Azure® Data Lake Analytics (ADLA) provides a framework and set of tools that you use to
analyze data held in Microsoft Azure Data Lake Store, and other repositories. This module describes in
detail how ADLA works, and how you use it to create and run analytics jobs.

Objectives
At the end of this module, you will be able to:
 Describe the purpose of Azure Data Lake Analytics, and how you create and run jobs.

 Explain how to use U-SQL to process and analyze data.

 Describe how to use windowing to sort data and perform aggregated operations, and how to join
data from multiple sources.

Lesson 1
Introduction to Azure Data Lake Analytics
Azure Data Lake Analytics (ADLA) is a platform-as-a-service (PaaS) offering that enables advanced
analytics and batch processing at scale on big data. At first glance, it might appear that ADLA and Azure
Stream Analytics fulfil similar roles. However, they are designed and optimized for different scenarios;
Stream Analytics is intended for processing large-scale streaming data sources, whereas ADLA is designed
to analyze big data at rest, ideally held in a Data Lake Store (although you can also retrieve data from
other sources).

Lesson Objectives
In this lesson, you will learn:

 The purpose of ADLA.

 How ADLA works.


 How to create and run ADLA jobs.

 How to use Visual Studio® tools to package and submit ADLA jobs.

What is Data Lake Analytics?


If the Data Lake Store is used to collect raw
material, or data, then Data Lake Analytics
provides a toolset with which you begin to
construct solutions. You can think of the data
analysis pipeline as comprising five main phases:

Source -> Ingest -> Processing -> Storage -> Delivery

 “Source” represents the original data, in its native format.

 “Ingest” is the process by which this data is consumed by the pipeline.

 “Processing” defines the initial analytics and other transformations performed on the ingested data.

 “Storage” describes the way in which the transformed data is held, optimized in structures to support
the operations and queries required by the business.
 “Delivery” covers the way in which the data is used; it could be subjected to further detailed analysis,
joined with other datasets to provide additional insights, summarized to generate an overall view of
the system, and presented graphically, possibly with drill-down capabilities to support detailed
investigation. Some of these tasks might be implemented by performing an additional “Processing”
iteration.

ADLA is concerned with the processing phase, taking data that has been ingested into large-scale storage
(such as Data Lake Store), transforming it, and then processing it for analytics and data storage purposes.
Within the data processing phase, there are the concepts of “hot” and “cold” data streams. While Stream
Analytics is the best choice for real-time, inbound data analysis (hot data), ADLA is optimized for jobs that
take minutes, hours, or even days against cold data.

ADLA behavior is not dissimilar to that of a MapReduce programming model that you would find in
traditional Hadoop environments. In this model, low-level algorithms typically authored in Java create a
framework by which input data is separated into “chunks” that are processed independently and in
parallel—this reduces the overall execution time of a given job or jobs. ADLA tracks these process
mappings and can “stitch” them back together for the final output. This can also be thought of as its own
pipeline and represents a typical ADLA job at a high level. However, unlike many of the other options for
batch processing—for example, SQL Data Warehouse and Hadoop—ADLA is a true PaaS solution because
the interface is abstracted from the underlying distributed architecture. ADLA is focused on running jobs,
writing scripts, and managing processing tasks. Microsoft provides the infrastructure for building and
executing jobs, and you are insulated from the low-level details of ensuring that sufficient resources are
available to perform your tasks.

How Data Lake Analytics works


You define a job using a language called U-SQL—
this is a declarative language that extends the
transactional SQL (T-SQL) dialect used by
Microsoft SQL Server® to process relational data.

ADLA takes the U-SQL description of a job, parses it to make sure it is syntactically correct, and then
compiles it into an internal representation. ADLA
breaks down this internal representation into
stages of execution. Each stage performs a task,
such as extracting the data from a specified
source, dividing the data into partitions,
processing the data in each partition, aggregating
the results in a partition, and then combining the results from across all partitions. Partitioning is used to
improve parallelization, and the processing for different partitions is performed concurrently on different
processing nodes. The data for each partition is determined by the U-SQL compiler, according to the way
in which the job retrieves and processes the data.

The tasks that run in parallel in each stage are referred to as “vertices” by ADLA. Each vertex is managed
and scheduled by using a distributed component called Yet Another Resource Negotiator (YARN). This is
the same component that is used by many Hadoop installations. YARN takes the responsibility for
handing tasks to the available computing resources, and retrieving the results from those resources. This
enables the system to decouple the resource management responsibilities of the platform from the
processing components; Microsoft might modify or replace the management component at any time
without affecting any existing U-SQL jobs.

Computing resources are allocated to jobs in Analytics Units (AUs). Each AU represents the processing
power of two CPU cores and 6 GB of RAM. The more AUs you have allocated to a job, the greater the
potential degree of parallelization, and the faster the job will run. However, each AU has an associated
cost, so increasing the number of AUs will have a financial impact.

Although ADLA splits jobs into vertices for the purposes of parallelism, individual vertices might take
longer to complete than one another. Therefore, if one aspect of the job—an extremely complex
calculation, for example—inherently takes 10 minutes to complete, then adding more AUs might not
decrease the time to completion. In this instance, you might need to take a closer look at algorithm
optimization. Adding additional AUs will enable ADLA to break up the totality of a job, but only if the
characteristics of the job allow for this.

For more information about how AUs are applied to vertices, see:

Understanding the ADL Analytics Unit


https://aka.ms/F3ee25

Module 6: Implementing Custom Operations and Monitoring Performance in Azure Data Lake Analytics
describes the tools that are available for examining the way in which a job has been broken down into
vertices.

Creating and running jobs


An ADLA job runs in the context of an ADLA
account. The ADLA account defines the security
and scalability settings for jobs.

The simplest way to create an ADLA account is to use the Azure portal; you specify a name for the
account, your Azure subscription, resource group,
and location. You must also provide the details of
the Data Lake Store account. ADLA will use this
account to act as the primary data source for
ADLA jobs. You will add additional Data Lake Store
accounts later, if you need to process and combine
data held in different stores.

Note: Important: The Data Lake Store account must be located in the same region as the
ADLA account. This is to minimize the time and costs required to access data.

Additionally, you specify the pricing tier for the account. The pricing tier determines how many AUs will
be available to run jobs.
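
If you need to script this step, the following sketch creates an ADLA account using the
AzureRM.DataLakeAnalytics module (the same module that provides the Get-AdlAnalyticsAccount and
Submit-AdlJob cmdlets used in the next topic). The names are placeholders, and the New-AdlAnalyticsAccount
cmdlet name is assumed.

Creating an ADLA account using PowerShell

$rg = <name of existing resource group>
$adls = <name of existing Data Lake Store in the same region>
$adla = <name of new ADLA account>

# Create the ADLA account, using the specified Data Lake Store as its default (primary) store
New-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla -Location "East US 2" -DefaultDataLakeStore $adls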

You then create an ADLA job as a U-SQL script. The following example shows a simple U-SQL script that
reads stock market price data (tickers and prices) from a CSV file, calculates the maximum price for each
ticker, and saves the results to another CSV file.

U-SQL job that calculates the maximum price for a stock market ticker
@priceData =
EXTRACT Ticker string,
Price int,
HourOfDay int
FROM "/StockMarket/StockPrices.csv"
USING Extractors.Csv(skipFirstNRows: 1);

@maxPrices =
SELECT Ticker, MAX(Price) AS MaxPrice
FROM @priceData
GROUP BY Ticker;

OUTPUT @maxPrices
TO "/output/MaxPrices.csv"
USING Outputters.Csv(outputHeader: true);

You will learn more about U-SQL in Lesson 2.



After you have defined the work for the job, you submit it for processing. Again, the simplest way to
achieve this is to use the Azure portal. At this point, the U-SQL code is compiled and broken down into
stages and vertices. The vertices are then queued ready for submission by YARN to the ADLA processors.
You give the job a priority to determine which of your jobs should run ahead of others that you have
submitted. When a job reaches the front of the queue, the vertices are run. The number of ADLA
processors available to execute the vertices is governed by the number of AUs you assign to the job (you
specify this number when you submit the job—up to the value permitted by the pricing plan for the ADLA
account). The higher the number of AUs, the greater the parallelism and the faster the job runs. When the
vertices have completed their work, the results are combined and aggregated into the final result. The job
pane in the Azure portal enables you to track these tasks, view how the job is being processed, and
examine the results.

Tools for packaging and running jobs


Apart from using the Azure portal, you can also create and submit ADLA jobs by using PowerShell.

To run a U-SQL job from PowerShell, use the Submit-AdlJob cmdlet. This cmdlet expects you to
specify the U-SQL either as an inline script (using the -Script parameter), or as a file containing a
U-SQL script (using the -ScriptPath parameter). You track the progress of a job using the Get-AdlJob
cmdlet, and you wait for a job to complete using the Wait-AdlJob cmdlet. When a job has
completed, you retrieve the results from Data Lake Store using the Export-AdlStoreItem cmdlet.

The following PowerShell script shows how to run the U-SQL job shown in the previous topic. The
StockPriceJob.usql file referenced by this example contains the U-SQL code:

PowerShell script that runs a U-SQL job


# Log in to Azure
Login-AzureRmAccount

# Specify a subscription
Set-AzureRmContext -SubscriptionId <Subscription ID>

$adla = "<ADLA Account>"
$rg = "<Resource Group>"

# Verify that the ADLA account exists
Get-AdlAnalyticsAccount -ResourceGroupName $rg -Name $adla

# Submit an ADLA job
$scriptpath = "D:\Demofiles\Mod05\Demo 1\StockPriceJob.usql"
$job = Submit-AdlJob -AccountName $adla -ScriptPath $scriptpath -Name "Demo"

# Repeatedly view the status of the job
$job = Get-AdlJob -AccountName $adla -JobId $job.JobId

# Wait for the job to complete
Wait-AdlJob -Account $adla -JobId $job.JobId

# Download the results
$adls = "<ADLS Account for ADLA>"

Export-AdlStoreItem -AccountName $adls -Path "/output/MaxPrices.csv" -Destination "D:\Demofiles\Mod05\Demo 1\MaxPrices.csv"

Visual Studio provides another environment for building and running ADLA jobs, if you have the Data
Lake and Stream Analytics Tools for Visual Studio installed. These tools are available as part of Visual
Studio 2017—you can also download these tools for older versions of Visual Studio from the following
page:
Plug-in for Data Lake and Stream Analytics development using Visual Studio
https://aka.ms/Wds9j1

The Azure Data Lake Tools for Visual Studio provide a number of templates for building Data Lake
applications, including a U-SQL project template. Use this template to incorporate and debug user-
defined functions and other code items in your jobs more easily than you can through the Azure portal.
You can submit jobs directly from Visual Studio, single-step through your code, and track the progress of a
job as it runs. You can also run and debug jobs locally on your own computer rather than in the cloud. This
process is described in Module 6.

Demonstration: Creating and running a Data Lake Analytics job


In this demonstration, you will see how to:
 Create and run a job using the Azure portal.

 Run a job using PowerShell.

 Run a job using the Data Lake Tools for Visual Studio.
Verify the correctness of the statement by placing a mark in the column to the right.

Statement: True or false? Data Lake Analytics is ideally suited to processing fast-moving streams of data.

Answer:

Lesson 2
Analyzing data with U-SQL
U-SQL is the language that you use to describe an ADLA job. It is primarily a declarative language: you
write U-SQL code that specifies the results that you want to see rather than the process to be performed
to obtain those results; working out how to perform that processing is the job of the ADLA compiler and
the YARN scheduler. This lesson describes how to use U-SQL to implement a job.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe the purpose of U-SQL and explain how it works.

 Explain the structure of a U-SQL job.

 Declare and use variables in a U-SQL job.

 Use complex types to hold structured data in a U-SQL job.


 Use U-SQL to handle input from multiple sources.

 Partition the output data from a U-SQL job.

 Store and query data in the U-SQL Catalog.


 Describe the C# functions and operators that you use in a U-SQL job.

What is U-SQL?
U-SQL is the language that you use to implement
ADLA jobs. It represents a hybrid language that
takes features from both SQL and C#, and
provides declarative and procedural capabilities. If
you are from a development background, U-SQL
will present many concepts, constructs, and a
syntax with which you will be familiar. If you are
from an engineering or DBA background, you will
also benefit from U-SQL’s origins, which are based
on SQL and T-SQL. U-SQL abstracts the parallelism
and the distributed architecture from the scripts
that you create, making it simpler to write scripts
that perform complex tasks.

The intention of U-SQL is to provide a simple way to describe complex processing using SQL-like syntax,
combined with the ability to customize the way in which this processing works and transform data by
using procedural code. You extend the capabilities of U-SQL by implementing your own user-defined
functions, operators, and aggregators. This gives you the ability to implement analytical processes that
would be difficult to achieve using SQL alone. Module 6: Implementing Custom Operations and
Monitoring Performance in Azure Data Lake Analytics describes how to implement custom functionality in
more detail.

The hybrid nature of U-SQL demands careful attention when defining your jobs. U-SQL has the case
sensitivity of C#, even for keywords—for example, you must be careful to use “SELECT” and “FROM” rather
than “Select” and “From”. Additionally, U-SQL uses C# comparison operators and expressions. This means
that you must use operators such as “==” to test for equality, and not “=”.

The key abstraction used by U-SQL is the rowset. A rowset is a tabular structure containing unordered
rows of data. Each row in a rowset has the same schema that defines a set of columns, and each row can
be up to 4 MB in size. However, the number of rows in a rowset is potentially unlimited, and it is the task
of ADLA to determine how to divide up a rowset into manageable subsets for processing, and then
combine the results.
The columns in a rowset might be simple types, such as int, float, double, string, char, or DateTime, or
they can be complex structures such as maps and arrays (a map is a collection that provides key/value
lookup functionality).

Note: The simple data types available correspond to those used by C#, and include the
nullable types (types that hold null values).

For more information about the data types available, see:

Built-in U-SQL types


https://aka.ms/X5w93k

Structure of a U-SQL job


A minimal U-SQL job contains three primary elements:

 The EXTRACT statement that identifies the structure of the input data and where you retrieve it.

 The processing statements that specify the transformations and other analytical operations to be performed over the input data.

 The OUTPUT statement that describes what to do with the transformed data.

Note: All statements in U-SQL code must finish with a semicolon (;) terminator.

The EXTRACT statement has the following format:

EXTRACT <schema>

FROM <source>
USING <extractor>

 The <schema> part lists the individual fields in the input data and the types of data that they contain.
The input data will typically comprise schema-less text information (in the form of CSV, TSV, or
possibly JSON text files). The <schema> section maps the individual elements in the data to
identifiable columns in each row of a rowset, and converts the input data into types
that can be processed. The data types are specified as C# types, as described earlier. The following
code fragment shows an example, describing the fields in a company’s personnel data:

EXTRACT EmployeeID int,


ForeName string,
LastName string,
Department int

For detailed information about the type conversions performed by the EXTRACT section, see:
U-SQL built-in extractors
https://aka.ms/Su2pbo

 The <source> part lists the input data files. There might be one or more of these files. In its simplest
form, the FROM clause names a file in the default Data Lake Store associated with the ADLA account.
For example:
FROM "/PersonnelData/PersonnelFile.csv"
However, you can also specify a file in another Data Lake Store account or Blob storage (you must
first register the Data Lake Store account or Blob storage account with the ADLA account, to provide
the necessary security settings and keys). Here are some examples:
// myadlsaccount is a separate ADLS account
FROM "adl://myadlsaccount.azuredatalakestore.net/Data/MoreData.csv"
// Blob storage
FROM "wasb://mycontainer@mystorageaccount/Data.csv"
 The USING clause specifies the extractor used to read the data from the input file. ADLA provides
three built-in extractors: Extractors.Csv (which handles CSV files), Extractors.Tsv (for TSV files), and
Extractors.Text (for reading more generalized text files). If you need to retrieve data held in a different
format (such as JSON), you can implement your own custom extractor. Module 6 covers this topic in more
detail.

The built-in extractors take parameters that modify the way in which they work. For example, you
can switch the row delimiters used when an extractor parses a file, change the encoding in use, and
specify that certain rows (such as headers) should be skipped. For more details, see:
Extractor parameters (U-SQL)
https://aka.ms/K1pza1
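
The built-in extractors accept several of these parameters in combination. The following is a minimal sketch: the pipe-delimited input file path is hypothetical, and the delimiter and skipFirstNRows parameter names are taken from the extractor parameters reference above, which also lists the other options (such as encoding and row delimiters).

Using extractor parameters to read a pipe-delimited text file
@pipeData =
    EXTRACT Ticker string,
            Price int,
            HourOfDay int
    FROM "/StockMarket/StockPrices.txt"
    USING Extractors.Text(delimiter: '|', skipFirstNRows: 1);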

The value returned by the EXTRACT statement is a reference to the rowset generated from the data.
You then use this reference to specify the processing to perform. The processing typically takes the
form of a series of SQL SELECT statements that incorporate user-defined functions, operators, and
aggregators. You might also use common SQL clauses, such as WHERE, ORDER BY, and GROUP BY.
There is also an IF … ELSE construct available. As before, you assign the rowset generated
as the result of the processing to a reference variable, which you then pass in to further SQL
statements. For example, the following code fragment finds the number of employees in each
department, and then refines this rowset to find all departments with more than 100 employees:
@personnelData =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM "/PersonnelData/Personnel.csv"
USING Extractors.Csv(skipFirstNRows: 1);
@numInEachDepartment = SELECT COUNT(*) AS Num, Department
FROM @personnelData
GROUP BY Department;

@bigDepartments = SELECT *
FROM @numInEachDepartment
WHERE Num > 100;

For more details, see:

SELECT expression (U-SQL)


https://aka.ms/Iymw28

The OUTPUT statement describes where to send the results of the processing, and in what format. You
reference an outputter to save the data. You specify the location of the data with the TO clause—as with
the EXTRACT statement, this might specify a file in the default Data Lake Store, a separate Data Lake
Store, or Blob storage. ADLA provides built-in outputters for CSV, TSV, and generalized text data but, if
you require a different format, you can create your own custom outputter. Module 6 describes this
topic in more detail. Like extractors, the built-in outputters take parameters that modify the format of the
data, such as adding column headers. For further information, see:

Outputter parameters (U-SQL)


https://aka.ms/M165gh

The following example shows an outputter that sends the data to a CSV file in the default Data Lake Store.
Note that the folder mentioned does not have to already exist—the outputter will create it if necessary:

OUTPUT @bigDepartments
TO "/Departments/BigDepts.csv"

USING Outputters.Csv(outputHeader: true);

Note: Important: An outputter will overwrite a file that already exists, so be careful that
you don’t lose any important results. Additionally, outputters are atomic; they either write the
entire results rowset (on success) or they don’t write anything (if an error occurs during
processing). You should never get partial results.

Declaring and using scalar variables


A scalar variable holds a single value. You use scalar variables in your U-SQL jobs to localize values and make your scripts easier to maintain.

You use the DECLARE statement to create a variable and initialize it with a value. The variable declaration includes the type of data that the variable holds. You then reference the variable in your U-SQL code by prefixing the variable name with the @ sign. You can reference a variable anywhere you would use a constant expression of the same type in EXTRACT, SELECT, and OUTPUT statements.

The following example uses a scalar variable to avoid repeating the same location for Data Lake Store
storage in a job:

Declaring and using a variable in an ADLA job


DECLARE @myAdlAccount string = "adl://myadlaccount.azuredatalakestore.net/";

@data =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM @myAdlAccount + "Personnel.csv"
USING Extractors.Csv();

@results = SELECT …;

OUTPUT @results
TO @myAdlAccount + "Results.csv"
USING Outputters.Csv();

Note: Unlike T-SQL, you must create and initialize a variable in the same statement; you
can’t declare a variable and use the T-SQL SET statement to assign it a value later. Additionally,
you can’t declare two variables with the same name.

For more information about declaring variables, see:

DECLARE variables (U-SQL)


https://aka.ms/I1q3yx

Using complex types


In addition to scalar variables, you can also create variables that hold multiple values: rowsets, arrays, and maps.

The result of an EXTRACT statement is a rowset variable. You define your own rowset variables with hard-coded data using the VALUES table constructor. This is useful if you need to test a piece of functionality quickly over a small, controlled subset of data. The following code shows an example. Note that you do not use the DECLARE keyword to create a rowset variable:

Creating a rowset variable with hard-coded values


// TestData is a rowset variable
@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 1),
(100, "oiuoiu", "poipoi", 1),
(101, "slkjlkjlkj", "nbmnbmn", 2),
(102, "xzcvxvn", "bemnyu", 3)
) AS Employees(EmployeeID, ForeName, LastName, Department);

// Use the rowset variable like any other rowset


@query =
SELECT Department, COUNT(*) AS NumInDept
FROM @testData
GROUP BY Department;

For more detailed information, see:

U-SQL SELECT selecting from the VALUES table value constructor


https://aka.ms/bnybn3

You create arrays and maps using the generic structured types SQL.ARRAY and SQL.MAP. The SQL.ARRAY
type holds a list of values (you specify the type), and you use subscript notation to store and retrieve data.
A SQL.MAP object holds a list of key/value pairs. When you add an item to the map, you assign it a
unique key. You look up items by providing the key.

Although you might define SQL.ARRAY and SQL.MAP variables with DECLARE statements, it’s more
common to create and populate them by using SELECT statements. The following code creates a
SQL.MAP object from the personnel data shown in the previous example. The map uses the employee ID
as the key, and the department number as the value. The second SELECT statement counts the number of
items in the map where the value is 1. This yields the number of employees who work in that department:

Creating and using a SQL.MAP object


@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 1),
(100, "oiuoiu", "poipoi", 1),
(101, "slkjlkjlkj", "nbmnbmn", 2),
(102, "xzcvxvn", "bemnyu", 3)
) AS Employees(EmployeeID, ForeName, LastName, Department);

@empMap =
SELECT new SQL.MAP<int, int?>{{EmployeeID, Department}} AS emp_dept
FROM @testData;

@query =
SELECT COUNT(*) AS Num
FROM @empMap
WHERE emp_dept.First().Value == 1;

For further information on using the SQL.ARRAY and SQL.MAP types, see:

Complex built-in U-SQL types


https://aka.ms/Lhc0wu

An array or map is a single column that contains a collection of values. This is different from rowset data,
where each column contains a single value, and you use multiple rows if you need to store multiple
values. If you want to combine the data in an array or map with a rowset, you must first convert the array
or map data into a set of separate rows (one for each value in the array or map). You achieve this using the
EXPLODE function.
The EXPLODE function takes an array or map as an argument, and generates an exploded rowset where
each row contains a single value from the array or map. However, you can't use the rowset generated in
this way as an ordinary rowset; you can only work with it by using the CROSS APPLY operator. The CROSS APPLY
operator takes each item in the rowset on its left side and combines it with the corresponding values in
the rowset on its right side. Each row in the rowset from the left side is repeated for each matching row
on the right side. If there are no corresponding values in the rowset on the right side, the row from the
left side is omitted from the results (there's also an OUTER APPLY operator that includes rows with no
matching values and combines them with a null value instead).

In the following example, the test data in the PriceMovement rowset represents the prices for different
stock market items over time (for simplicity, the time is not included in the data). The prices are held as a
single string for each item. The @result variable generates a new rowset where the prices for each item
are converted into an array. The @exploded rowset uses the EXPLODE function to convert the array in
each row into a new rowset called Temp, with a single column named Price. The CROSS APPLY operator
then combines the data in the Temp rowset with the data in the @result variable. The final rowset
contains the ticker (from the @result rowset) repeated for each price from the Temp rowset:

Using the EXPLODE function with the CROSS APPLY operator


@stockPrices =
SELECT * FROM
( VALUES
("AAAA", "20, 21, 20, 19, 18, 19, 20, 22, 25, 22, 28, 27"),
("BBBB", "56, 58, 60, 65, 64, 63, 62"),
("CCCC", "77, 76, 74, 72, 68, 65, 67, 68"),
("DDDD", "45, 46, 45, 44, 43, 45, 44, 46, 47, 45"),
("EEEE", "1, 3, 6, 11, 15, 12, 15, 14, 15")
) AS PriceMovement(Ticker, Prices);

@result =
SELECT Ticker,
new SQL.ARRAY<string>(Prices.Split(',')) AS PricesArray
FROM @stockPrices;

@exploded =
SELECT Ticker, Price.Trim() AS Price
FROM @result
CROSS APPLY
EXPLODE(PricesArray) AS Temp(Price);

OUTPUT @exploded
TO "/Output/StockPrices.csv"
USING Outputters.Csv();

The results generated by this code look like this:

"AAAA","20"

"AAAA","21"

"AAAA","20"

"AAAA","19"

"AAAA","18"

"AAAA","19"

"AAAA","20"

"AAAA","22"

"AAAA","25"

"AAAA","22"

"AAAA","28"

"AAAA","27"

"BBBB","56"

"BBBB","58"

"BBBB","60"

"BBBB","65"

"BBBB","64"

"BBBB","63"

"BBBB","62"

"CCCC","77"
"CCCC","76"

"CCCC","74"

"CCCC","72"
"CCCC","68"

"CCCC","65"

"CCCC","67"
"CCCC","68"

"DDDD","45"

"DDDD","46"
"DDDD","45"

"DDDD","44"

"DDDD","43"
"DDDD","45"

"DDDD","44"

"DDDD","46"

"DDDD","47"

"DDDD","45"

"EEEE","1"
"EEEE","3"

"EEEE","6"

"EEEE","11"

"EEEE","15"

"EEEE","12"

"EEEE","15"

"EEEE","14"

"EEEE","15"

Handling multiple input files


You use the EXTRACT statement in a U-SQL job to
list more than one input file, either explicitly or by
using file-name pattern matching, as shown in the
following examples:

// Multiple files (explicit)
FROM "/PersonnelData/PersonnelFile.csv",
     "adl://myadlsaccount.azuredatalakestore.net/Data/MoreData.csv"

// Multiple files (using file-name pattern matching – all CSV files in the
// specified directory, beginning with a "P")
FROM "/PersonnelData/P{*}.csv"

However, note that all files must have the same input format (such as CSV, TSV) and should contain data
that maps to the same set of fields. For more information, see:

Input files (U-SQL)


https://aka.ms/Kldyrb

If you have multiple input files of different formats, you must use multiple EXTRACT statements, and
process the files for each EXTRACT statement separately.
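
As a sketch of this approach (the file names are hypothetical, and the schema reuses the personnel example from earlier in this lesson), you can extract a CSV file and a TSV file separately and then combine the two rowsets:

Extracting and combining files that have different formats
@csvData =
    EXTRACT EmployeeID int,
            ForeName string,
            LastName string,
            Department int
    FROM "/PersonnelData/Personnel.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@tsvData =
    EXTRACT EmployeeID int,
            ForeName string,
            LastName string,
            Department int
    FROM "/PersonnelData/Personnel.tsv"
    USING Extractors.Tsv();

// Combine the two rowsets after extraction
@allPersonnel =
    SELECT * FROM @csvData
    UNION ALL
    SELECT * FROM @tsvData;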

A useful feature of input filename matching is the ability to reference virtual columns in the EXTRACT
statement. For example, a Data Lake Store might hold a vast number of files, and one common way to
organize these files is to split them out into directories based on the year, month, day, and hour in which
the files were created (this approach is commonplace for systems that generate files by streaming, such as
Stream Analytics). If you need to process a set of files for a specific period, you add a virtual column to the
EXTRACT statement that identifies the period in question, and then populate this column in the SELECT
statement that processes the data. The EXTRACT statement parses the value of this virtual column, and
uses it to identify the folders and files from which it should fetch the data.

The following example shows an EXTRACT statement that fetches sales data held in CSV format from the
folder for a specific date. The CSV file only holds the productID, numSold, and pricePerItem fields; the
values for the date and filename virtual columns are determined by the WHERE clause of the SELECT
statement:

Using a virtual column to match filenames


@salesData =
EXTRACT productID string,
numSold int,
pricePerItem int,
date DateTime,
fileName string
FROM "/SalesData/{date:yyyy}/{date:MM}/{date:dd}/{fileName}"
USING Extractors.Csv();

@data =
SELECT productID, numSold, pricePerItem
FROM @salesData
WHERE date == DateTime.Today AND fileName LIKE "%.csv";

Outputting data to multiple files


You typically use the OUTPUT statement to write
data to a single output file. You use multiple
OUTPUT statements to send data to different files
in different formats if required.
You also use the OUTPUT statement to split the
data according to the way in which it has been
partitioned for processing. In this case, you use the
{*} wild card pattern in the output file name. This is
useful for lengthy jobs, where the processing on
some nodes takes longer than the processing on
others, or if the processing requires more nodes
than you have available through the number of
AUs allocated to the job. The wild card pattern will be replaced with the numeric ID of the node that
generated the data.

The following example generates a series of files containing the results of a U-SQL job. Each file contains
the data from a single processing node:

Sending the output to multiple files


@salesData =
EXTRACT productID string,
numSold int,
pricePerItem int,
year int,
month int,
day int,
fileName string
FROM "/SalesData/{year}/{month}/{day}/{fileName}"
USING Extractors.Csv();

@data =
SELECT …
FROM @salesData
WHERE year == DateTime.Today.Year
AND month == DateTime.Today.Month
AND day BETWEEN 20 AND 30
AND fileName LIKE "%.csv";

OUTPUT @data
TO "/SalesOutputs/{*}.csv"
USING Outputters.Csv();

Using the U-SQL Catalog


The U-SQL Catalog acts like a SQL database server,
running as part of your ADLA account. You create
new databases and tables in this instance, and use
it to persist data in much the same way as any
other instance of SQL Server. The files that
constitute the database are held in your default
Data Lake Store account. Note that any tables that you create must have a clustered index and a
distribution scheme (such as HASH) before you can insert data into them.

This example shows how to create a new database


and table in a U-SQL job, and copy data from a
data source into this table. Notice that the
columns in the table are defined using C# data
types:

Creating and populating a U-SQL Catalog database and table in a U-SQL job
// Recreate the database that holds the Personnel data (drop any existing copy first)
DROP DATABASE IF EXISTS Personnel;
CREATE DATABASE IF NOT EXISTS Personnel;

// Create table
CREATE TABLE IF NOT EXISTS Personnel.dbo.PersonnelDept
(
EmpID int,
FName string,
LName string,
Dept int,
INDEX EmpIdx CLUSTERED (EmpID ASC) DISTRIBUTED BY HASH (EmpID)
);

// Fetch data from a CSV file


@personnelData =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM "/PersonnelData/Personnel.csv"
USING Extractors.Csv(skipFirstNRows: 1);

// Copy the data into the PersonnelDept table in the database


INSERT INTO Personnel.dbo.PersonnelDept(EmpID, FName, LName, Dept)
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData;

Note: Important: You can’t INSERT into a table and read from it within the same U-SQL job.
This is because the INSERT and read operations could be spread across different vertices and be
performed in parallel; the results would be unreliable due to the possibility of data mutating as it
is read.

A table can have an unlimited number of rows, in addition to multiple indexes over various columns. To
help maintain performance, ADLA uses statistics to optimize queries performed against tables. These
statistics enable ADLA to determine the most efficient way to retrieve data from a table using the indexes
available; one index might be more suitable for satisfying a query than another, or ADLA might choose to
use multiple indexes to retrieve data and merge the results together, for example. You create the statistics
for a table using the CREATE STATISTICS command. The statistics are static, and if you change the
distribution of data in a table (by dropping or adding a large number of rows), they might become out of
date, resulting in poorer query performance. In this case, you use the UPDATE STATISTICS command to
refresh the statistics. The “WITH INCREMENTAL = ON” clause causes the statement to recompute the
statistics only for new rows that have been added since the statistics were created or last updated. If you
specify “WITH INCREMENTAL = OFF”, the command will drop the existing statistics for the table and
generate a completely new set. This could take a significant time over a large table.

The following examples illustrate the syntax for the CREATE STATISTICS and UPDATE STATISTICS
commands:

Creating and updating statistics over a table in the U-SQL Catalog


CREATE STATISTICS EmpStats ON Personnel.dbo.PersonnelDept(EmpID) WITH FULLSCAN;

UPDATE STATISTICS EmpStats ON Personnel.dbo.PersonnelDept WITH INCREMENTAL = ON;

It’s possible to create views in the catalog, in addition to tables. A view is a persistent, named query. A
view doesn’t hold any data of its own; instead, it references a SELECT statement that retrieves data from
one or more tables. You retrieve data from a view using a SELECT statement, in much the same way as
you fetch data from a table. Note that a view is different from a rowset variable because you can reuse it
across multiple jobs without having to define it every time.

You create a view using the CREATE VIEW command. This example creates a simple view that limits the
employees to those who work in Department 2:

Creating a view in the U-SQL Catalog


CREATE VIEW IF NOT EXISTS Personnel.dbo.Dept2
AS SELECT EmpID, FName, LName
FROM Personnel.dbo.PersonnelDept
WHERE Dept == 2;
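
For example, a minimal sketch that assumes the Personnel database and Dept2 view created above: a later job can query the view in the same way as a table.

Querying a view in the U-SQL Catalog
@dept2Employees =
    SELECT EmpID, FName, LName
    FROM Personnel.dbo.Dept2;

OUTPUT @dept2Employees
TO "/Departments/Dept2Employees.csv"
USING Outputters.Csv(outputHeader: true);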

For more information about using the ADLA catalog, see:

Data Definition Language (DDL) statements (U-SQL)


https://aka.ms/Xvc1ww

Note: There are currently no specific security commands for the U-SQL Catalog. Instead,
you should use ADLA access control lists (ACLs) and RBAC to protect the data in the /catalog
folder in the ADLA account. For more information, see Module 4: Managing Big Data in Azure
Data Lake Store.

Using C# functions and operators in a U-SQL job


U-SQL utilizes the C# type system, and can incorporate C# functions and operators in expressions, in addition to those of SQL.

A U-SQL job has access to the System, System.Data, System.Text, System.Text.RegularExpressions, and System.Linq namespaces in the Common Language Runtime (CLR) class libraries. These class libraries contain a range of valuable methods, including the full range of string manipulation functions from the .NET Framework, date and time functions, and maths functions. You can also add types that are held in other class libraries—you will learn more about this in Module 6.
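
For instance, here is a small sketch that assumes a @stockPrices rowset with Ticker and Prices string columns (like the one used later in this topic); you can call string and regular expression methods from these namespaces directly in SELECT and WHERE clauses:

Calling CLR methods in U-SQL expressions
@cleaned =
    SELECT Ticker.Trim().ToUpper() AS Ticker,
           Prices.Length AS PriceListLength
    FROM @stockPrices
    // Keep only tickers that consist of exactly four uppercase letters
    WHERE System.Text.RegularExpressions.Regex.IsMatch(Ticker.Trim(), "^[A-Z]{4}$");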

The following example illustrates the use of the string Split function, to parse a string that contains
elements separated by the semicolon (;) character into its individual components. Note that the Split
function returns an array that is stored in a SQL.ARRAY object. The second SELECT statement retrieves the
rows from the array and outputs the individual elements as fields:

Using the string Split function


// Sample Data
@someBooks =
SELECT * FROM
( VALUES
("The Book Thief; Markus Zusak; 2005"),
("The Girl with the Dragon Tattoo; Stieg Larsson; 2005"),
("The Silver Linings Playbook; Matthew Quick; 2008"),
("Sarah's Key; Tatiana de Rosnay; 2006")
) AS Book(BookInfo);

// Split the data into an array, with each row containing title, author, and year of publication for each book
@splitData =
SELECT new SQL.ARRAY<string>(BookInfo.Split(';')) AS BookData
FROM @someBooks;

// Retrieve the elements from each row in the array


@elements =
SELECT BookData[0] AS Title, BookData[1] AS Author, BookData[2] AS PublicationYear
FROM @splitData;

For a list of the C# functions available with U-SQL, see:

C# Functions and Operators (U-SQL)


https://aka.ms/Bllt9a

All C# operators, with the exception of the assignment operators, are available. This includes the more
esoteric but extremely powerful items such as the ternary conditional operator (?:), the null coalescing
operator (??), and the lambda expression operator (=>).

The ternary conditional operator is a succinct form of the if..else construct; the first operand, which must
be a Boolean expression, is evaluated—if it is true, the value of the second expression is returned as the
result; otherwise the value of the third expression is calculated and used. The following code shows an
example:

Using the ternary conditional operator


// The @data rowset contains employee grades (numeric) and other information, such as employee name.
// Grades 10 and above are senior management.

@result =
SELECT Name, Grade >= 10 ? "Senior Management" : "Worker" AS JobLevel
FROM @data;

The null coalescing operator examines its first operand, and if it is null, returns the value of the second
operand—otherwise, it uses the value of the first operand:

Using the null coalescing operator


// Display employee grades. If the employee grade is null, then display 0

@result =
SELECT Name, Grade ?? 0 AS JobGrade
FROM @data;

You use Lambda expressions to create custom delegate functions for methods that support late-binding
of functions as arguments to expressions. Some library functions in the .NET Framework expect you to
provide your own C# code, which is run as part of the function call. An example of this is the Max function
of the SQL.ARRAY type. The Max function returns the biggest item in the array, but you have to provide
the code that defines the data where you find the maximum value. In the following example, the Max
function is used with a Lambda expression to convert a string containing a numeric value into a number.
The Max function will then return the item with the highest numeric value. Without performing this
conversion, the maximum prices for each stock would be calculated based on the character values of each
string rather than the numbers that they represent:

Using a Lambda expression


@stockPrices =
SELECT * FROM
( VALUES
("AAAA", "20, 21, 20, 19, 18, 19, 20, 22, 25, 22, 28, 27"),
("BBBB", "56, 58, 60, 65, 64, 63, 62"),
("CCCC", "77, 76, 74, 72, 68, 65, 67, 68"),
("DDDD", "45, 46, 45, 44, 43, 45, 44, 46, 47, 45"),
("EEEE", "101, 103, 106, 101, 99, 97, 95, 94, 95")
) AS PriceMovement(Ticker, Prices);

@result =
SELECT Ticker,
new SQL.ARRAY<string>(Prices.Split(',')) AS PricesArray
FROM @stockPrices;

@highestPrices =
// Use a Lambda expression to convert the string value of each element in the array to a number
SELECT Ticker, PricesArray.Max(p => Convert.ToInt32(p)) AS HighestPrice
FROM @result;

OUTPUT @highestPrices
TO "/Output/HighestPrices.csv"
USING Outputters.Csv();

The results look like this:

"AAAA",28

"BBBB",65

"CCCC",77

"DDDD",47

"EEEE",106

For a complete list of the operators (C# and SQL) that you might use in U-SQL code, see:

Operators (U-SQL)
https://aka.ms/E1c5o8

Note: The CSHARP operator

Occasionally, there might be some ambiguity between a C# function and a SQL function with the
same name. Additionally, some C# constants might cause the U-SQL compiler to fail. This is
because the U-SQL compiler assumes all uppercase words in a statement are U-SQL reserved
words; you can’t create your own functions or variables that use completely uppercase letters.
However, C# defines some of its own constants as uppercase, the most common example being
Math.PI. A SELECT statement such as the following—that calculates the volume of a sphere with a
given radius—will fail to compile:
SELECT 4/3 * Math.PI * Math.Pow(radius, 3)
FROM ...
You disable the U-SQL parser from parsing an expression by applying the CSHARP operator to
the expression, as shown here:
SELECT 4/3 * CSHARP(Math.PI) * Math.Pow(radius, 3)
FROM ...

Demonstration: Performing an analysis using U-SQL


In this demonstration, you will see how to:

 Use an EXTRACT statement to combine multiple job inputs using virtual fields.

 Save data in parallel to the U-SQL Catalog and to a CSV file.

 Test a job, before deploying to the cloud.



Check Your Knowledge


Question

Which function should you use to convert the


data in a SQL.ARRAY object into a rowset?

Select the correct answer.

CROSS APPLY

MAP

EXPLODE

CONVERT

Use a Lambda expression



Lesson 3
Sorting, grouping, and joining data
Many analyses require you to aggregate and summarize data. This also frequently means that you
combine data from different sources. These sources could be unstructured files, structured databases, or a
mixture of both. In this lesson, you will learn how to use U-SQL to sort, group, and combine data.

Lesson Objectives
After completing this lesson, you will be able to:

 Sort data to organize it.

 Partition data by using windows.


 Use aggregations to reduce data into group values.

 Use the ADLA analytics functions to help spot trends in data.

 Join data from different datasets into a rowset.

 Use federated queries over data held in distributed SQL databases.

Sorting data
Sorting is a fundamental data processing operation you use to present and handle data in a specific sequence. U-SQL provides the ORDER BY clause to perform this task.

You can use ORDER BY in a SELECT statement to handle data in a specific sequence. However, this sequence is not guaranteed to be maintained because the data might proceed through further steps before being output. Therefore, the most common use of ORDER BY is with the OUTPUT statement, which occurs as the final step in the process.
The next example uses the ORDER BY clause with the OUTPUT statement to write Personnel data out to a
file in Department order. By default, the data is sorted in ascending order. The DESC modifier used in this
example causes the data to be written in descending order of Department instead:

Sorting data on output


@personnelData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 1),
(100, "oiuoiu", "poipoi", 1),
(101, "slkjlkjlkj", "nbmnbmn", 2),
(102, "xzcvxvn", "bemnyu", 3),
(103, "iuywe", "bvche", 3),
(104, "quytue", "lkjo", 1)
) AS Employees(EmployeeID, ForeName, LastName, Department);

@result =
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData;

OUTPUT @result
TO "/Output/PersonnelFile.csv"
ORDER BY Department DESC
USING Outputters.Csv();

The results look like this:

102,"xzcvxvn","bemnyu",3

103,"iuywe","bvche",3

101,"slkjlkjlkj","nbmnbmn",2
99,"jhgjhgjh","iuoioui",1

100,"oiuoiu","poipoi",1

104,"quytue","lkjo",1

Apart from not guaranteeing to preserve the sequence of the data, using ORDER BY in a SELECT
statement has other implications. For example, if you have a large set of data, ordering the data while
processing it can become a very time-consuming operation. To limit the effects of such a potentially costly
task, ORDER BY in a SELECT statement must be combined with the FETCH/OFFSET clauses, which restrict
the sort to a subset of the data.
The FETCH clause specifies the number of rows to retrieve, and the optional OFFSET clause indicates where
to start fetching the data from (the default OFFSET is 0). The following code sorts the data in the
@personnelData rowset from the previous example, but only fetches three rows, starting at the third row
(OFFSET 2 means discard the first two rows):

Sorting and limiting data on SELECT


@result =
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData
ORDER BY Department
OFFSET 2 ROWS
FETCH 3 ROWS;

Note that the FETCH clause curtails the SELECT operation completely, and no further data is retrieved
from the rowset or passed on to subsequent steps in the processing. For example, if the @result rowset is
output, it will look like this:

102,"xzcvxvn","bemnyu",3

101,"slkjlkjlkj","nbmnbmn",2

104,"quytue","lkjo",1
For further information on using ORDER BY, see:

ORDER BY and OFFSET/FETCH clauses (U-SQL)


https://aka.ms/Dvf6on

Grouping data
U-SQL supports the GROUP BY and HAVING clauses of SQL to calculate aggregate values across a set of rows, and limit the output based on the results of an aggregated calculation.

U-SQL supports a number of built-in aggregate functions that are commonly found in other implementations of SQL, including AVG, COUNT, MAX, MIN, SUM, STDEV, and VAR. U-SQL also includes the U-SQL specific aggregate functions—ARRAY_AGG and MAP_AGG—that you use to combine data into SQL.ARRAY and SQL.MAP objects respectively. The ANY_VALUE aggregate function picks an arbitrary single value from a group, which you can use for sampling; a short sketch of ARRAY_AGG and ANY_VALUE follows the reference below.

For a description of the built-in aggregate functions, see:

Aggregate functions (U-SQL)


https://aka.ms/On6wxs
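
The following sketch is illustrative only; it reuses the personnel rowset from earlier in this lesson, and assumes ARRAY_AGG and ANY_VALUE behave as described in the reference above. It collects the employee IDs for each department into a SQL.ARRAY and picks one sample surname per department:

Using ARRAY_AGG and ANY_VALUE with GROUP BY
@deptSummary =
    SELECT Department,
           ARRAY_AGG(EmployeeID) AS EmployeeIds,   // collects the IDs into a SQL.ARRAY<int>
           ANY_VALUE(LastName) AS SampleLastName   // picks a single surname from the group
    FROM @personnelData
    GROUP BY Department;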

Note: You create your own custom aggregate functions with U-SQL—this is described in
Module 6.

Remember that, when you use an aggregate function over one or more columns in a rowset, you must
include any remaining non-aggregated columns in a GROUP BY clause. The following example uses the previously used
personnel dataset to calculate the number of employees in each department. Note that you must provide
an alias, using the AS clause, for expressions calculated by using aggregate functions:

Using the COUNT aggregate function and a GROUP BY clause


@personnelData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 1),
(100, "oiuoiu", "poipoi", 1),
(101, "slkjlkjlkj", "nbmnbmn", 2),
(102, "xzcvxvn", "bemnyu", 3),
(103, "iuywe", "bvche", 3),
(104, "quytue", "lkjo", 1)
) AS Employees(EmployeeID, ForeName, LastName, Department);

@result =
SELECT Department, COUNT(EmployeeID) AS NumEmployees
FROM @personnelData
GROUP BY Department;

OUTPUT @result
TO "/Output/DepartmentFile.csv"
USING Outputters.Csv();

If you need to limit which groups are returned, you use the HAVING clause to provide a filter. HAVING
acts like a WHERE clause, but it only applies to aggregates, as shown in the next example:

Using HAVING to limit the groups returned


// Only return departments that have at least two employees
@result =
SELECT Department, COUNT(EmployeeID) AS NumEmployees
FROM @personnelData
GROUP BY Department
HAVING COUNT(EmployeeID) >= 2;

By default, the aggregate functions include duplicate values in their calculations. You exclude duplicates
by including the DISTINCT keyword, like this:

Removing duplicates from a calculation


// If the same employee is found more than once, only count them the first time
@result =
SELECT Department, COUNT(DISTINCT EmployeeID) AS NumEmployees
FROM @personnelData
GROUP BY Department;

Windowing data
Windowing in U-SQL gives you a way to partition data for processing. A window is a set of data, a little like that defined by a GROUP BY clause—you use the same aggregate functions over a window as you do over a group. However, unlike GROUP BY, which reduces data into groups for calculating aggregate values, the aggregated results in a window are available to every row in the window; the window defines the scope of the aggregation.

As an example, compare the results of the following U-SQL statements that process stock price data. The raw data consists of the ticker, the price, and the time at which the price was current. The first U-SQL statement uses an ordinary GROUP BY clause with the AVG function to calculate the average price for each ticker (the same stock item will have several prices, because the market value increases or decreases throughout the day):

Using an ordinary GROUP BY statement to calculate average prices per stock item
@stockPriceData =
EXTRACT Ticker string,
Price int,
Time int
FROM "/Input/StockPrices.csv"
USING Extractors.Csv(skipFirstNRows: 1);

@avgPrices =
SELECT Ticker, AVG(Price) AS AvgPrice
FROM @stockPriceData
GROUP BY Ticker;

The data in the @avgPrices rowset will have this format:

"MQTZ",21

"NARN",19

"NBG5",4

"NEQ4",5

"NMIN",56

The first column contains the ticker, and the second is the average price.

Now suppose you need to display the ticker, current price, and average price. Without windowing, you would
generate an intermediate rowset with the average prices, perform another query to find all the current prices,
and then join the two rowsets over the ticker. Using a window, however, you can perform all of this processing
in a single query.

To define a window, you use the OVER operator with an aggregate function. This operator
specifies how to partition the data for the aggregation. The function is performed once for each
partition, but the results are made available to every row in the partition. The result is that you can combine
aggregated values and nonaggregated columns together in the same query. This example uses the same
data as before, but defines a window over the ticker column, so data is grouped by ticker and the average
price is calculated for each ticker. This time, however, the average price is displayed as part of the data
for each row:

Using a window to calculate average prices per stock item


@prices =
SELECT Ticker, Price, AVG(Price) OVER(PARTITION BY Ticker) AS AvgPrice
FROM @stockPriceData;

The results look like this:

"MQTZ",7,21

"MQTZ",0,21

"MQTZ",26,21

"MQTZ",16,21

"MQTZ",42,21

"MQTZ",27,21

"NARN",25,19

"NARN",17,19

"NARN",18,19
"NARN",21,19

"NARN",19,19

"NARN",0,19

"NARN",17,19

"NMIN",21,56

"NMIN",69,56

"NMIN",25,56

"NMIN",0,56

"NMIN",49,56

"NMIN",66,56

"NMIN",29,56

"NMIN",20,56

"NMIN",85,56

You modify the PARTITION BY clause of the OVER operator to change the extent of the window. For
example, you might reduce the size of the window around each row as it is processed by using the ROWS
clause. The next example partitions the data by ticker, but calculates a rolling average of the price over
the current row and the preceding 10 rows in the partition. Note that, if you window data like this, you
must also order the data in some way:

Using a window that includes the 10 previous rows


@prices =
SELECT Ticker, Price, AVG(Price) OVER(PARTITION BY Ticker ORDER BY Ticker ROWS 10
PRECEDING) AS AvgPrice
FROM @stockPriceData;

There are other options available; for instance, you might specify following rather than preceding rows.
For more information, see:

OVER expression (U-SQL)


https://aka.ms/Kcyr2d

U-SQL supplies ranking functions, RANK, ROW_NUMBER, NTILE, and DENSE_RANK, that you use to
determine the ranking value of each row in a partition. For example, the RANK function returns the order
in which an item in the partition is ranked according to the order of the data in the partition.

The following code sorts stock records by price within each ticker, and displays the rank and dense rank for
each row. Rows with the same price receive the same ranking values.

Ranking data in a window

@prices =
SELECT Ticker, Price,
       RANK() OVER(PARTITION BY Ticker ORDER BY Price) AS Rank,
       DENSE_RANK() OVER(PARTITION BY Ticker ORDER BY Price) AS DenseRank
FROM @stockPriceData;

Results look like this:

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1

"NARN",0,1,1
"NARN",0,1,1

"NARN",1,19,2

"NARN",1,19,2
"NARN",1,19,2

"NARN",1,19,2

"NARN",2,23,3
"NARN",2,23,3

"NARN",2,23,3

"NARN",2,23,3
"NARN",3,27,4

"NARN",4,28,5

"NARN",4,28,5
"NARN",5,30,6

"NARN",5,30,6

"NARN",5,30,6

...

Notice that rows with the same stock price receive the same ranking values. Subsequent ranks take into
account the number of items already ranked; if there are 18 items with rank 1, the 19th item has rank 19. If
you want to rank data without any gaps in the ranking sequence, use the DENSE_RANK function instead
(its values appear in the final column of the output above).

For more information on the ranking functions, see:

Ranking functions (U-SQL)


https://aka.ms/Xyupyl

Using the built-in analytics functions


The built-in analytics functions that are available
with U-SQL operate in a similar manner to the
ranking functions. They calculate a value based on
the data in a window. The following list
summarizes these functions:
 FIRST_VALUE. Returns the value in the first
row for the specified column in the window.

 LAST_VALUE. Returns the value in the final


row for the specified column in the window.
 CUME_DIST. Calculates the cumulative
distribution of values for the specified column
in the window.
 PERCENT_RANK. Calculates the rank of a row as a percentage of the number of rows in the window. It
returns a value between 0 (for the first row) and 1 (for the final row).
 PERCENTILE_CONT and PERCENTILE_DISC. Calculate the specified percentile value (between 0 and 1)
of each row in the window. PERCENTILE_CONT assumes a continuous distribution of values in the
column (the value returned might not actually be the value held by any row), and PERCENTILE_DISC
assumes a discrete distribution.

 LEAD and LAG. Enable access to a row that either follows (LEAD) or precedes (LAG) the current row.
You specify the number of rows ahead (or behind) to look from the current row. These functions are
useful if you need to examine the data in the next or previous rows while examining the current row.
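
As an illustrative sketch (it uses the same @stockPriceData rowset as the earlier windowing examples, and assumes the functions behave as summarized in the list above), FIRST_VALUE and PERCENT_RANK can be combined in a single SELECT:

Using FIRST_VALUE and PERCENT_RANK over a window
@prices =
    SELECT Ticker, Price,
           FIRST_VALUE(Price) OVER(PARTITION BY Ticker ORDER BY Price) AS LowestPrice,
           PERCENT_RANK() OVER(PARTITION BY Ticker ORDER BY Price) AS PriceRankPercent
    FROM @stockPriceData;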

The following example uses the LAG function to retrieve the data in the Price column from the previous
row in the window together with the current row. If there is no previous row, it returns the default value -1.
Note that, in this case, you do not need to explicitly limit the data referenced by the ORDER BY clause
(usually, you would need to provide a FETCH/OFFSET clause when using ORDER BY in a SELECT statement,
as described earlier in this lesson).

Using the LAG function to retrieve the previous row


@prices =
SELECT Ticker, Price, LAG(Price, 1, -1) OVER(PARTITION BY Ticker ORDER BY Price) AS
PreviousPrice
FROM @stockPriceData;

For more information, see:

Analytic functions (U-SQL)


https://aka.ms/G5lcsz

Joining and pivoting data


U-SQL supports the full range of SQL join operations to combine data from multiple rowsets. These include:

 Inner joins. This is the most common type of join, where the data in one rowset is combined with the data in another across a common key. Only rows that have a matching key value are joined. If a row in one rowset has a key that is not found in the other, it is omitted from the results. This type of join is frequently used with rowsets that have a parent/child relationship.

 Outer joins. In an outer join, if a row has a key with no corresponding row in the other rowset, it is
combined with a dummy row containing null values and output.

There are three types of outer join:

 Left (in which unmatched rows from the rowset that is mentioned first are combined with null rows, but not the other way around).

 Right (in which unmatched rows from the rowset that is mentioned second are combined with null rows, but not the other way around).

 Full (in which unmatched rows from both rowsets are combined with null rows).

 Cross join. This type of join generates the Cartesian product of both rowsets. Each row in the first
rowset is combined with every row in the other. If the first rowset contains N rows, and the second
contains M rows, the resulting rowset contains N * M rows.

This example shows an inner join between product and sales data for a retail outlet. The product rowset
contains information such as the product name, description, and a unique identifier (productID; the key).
The sales rowset contains information about each purchase made for the product. The ON clause joins the
two rowsets over the productID column.

Performing an inner join to combine product and sales information


@productData =
SELECT * FROM
( VALUES
(1, "sprocket", 10),
(2, "flange", 15),
(3, "widget", 26),
(4, "dibber", 31),
(5, "grommet", 35),
(6, "baldrick", 18)
) AS Products(ProductID, ProductName, UnitPrice);

@salesData =
SELECT * FROM
( VALUES
(100, 1, 20),
(101, 4, 2),
(102, 4, 3),
(103, 5, 10)
) AS Purchases(PurchaseID, ProductID, NumSold);

@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice, S.NumSold
FROM @productData AS P
INNER JOIN @salesData AS S
ON P.ProductID == S.ProductID;

OUTPUT @result
TO "/Output/ProductSalesFile.csv"
USING Outputters.Csv();

The results look like this:

1,"sprocket",10,20

4,"dibber",31,3
4,"dibber",31,2

5,"grommet",35,10
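
For comparison, a cross join over the same two rowsets pairs every product row with every sales row; because @productData contains six rows and @salesData contains four, the result contains 24 rows. A minimal sketch:

Performing a cross join
@allCombinations =
    SELECT P.ProductID, P.ProductName, S.PurchaseID, S.NumSold
    FROM @productData AS P
    CROSS JOIN @salesData AS S;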

This next example uses a left outer join to find all products that are yet to be purchased (there are no sales
records for them). This type of query could be used to identify products that might be worth curtailing. In
this case, the two rowsets are joined over the ProductID column as before but, for products that have no
sales, the value in the ProductID column in the @salesData rowset will be null.

Using a left outer join to find products that have never been purchased
@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice, S.NumSold
FROM @productData AS P
LEFT JOIN @salesData AS S
ON P.ProductID == S.ProductID
WHERE S.ProductID == null;

This time, the results look like this. Note that there is no NumSold data:

2,"flange",15,

3,"widget",26,

6,"baldrick",18,

Additionally, U-SQL implements two other types of join operation:


 Semijoin. This type of join compares data in one rowset with the results of a subquery. There are two
versions; a left semijoin finds all rows in the rowset that have matching rows in the results returned by
the subquery; a right semijoin returns only the rows from the subquery that have a match in the
rowset.

 Antisemijoin. This is similar to the semijoin except that it finds all rows in the rowset that don’t have
a match in the results of a subquery. Again, there are two variants—left and right.

This example shows a left semijoin to find all products that have been purchased at least once:

Using a left semijoin to find all products that have been purchased at least once.
@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice
FROM @productData AS P
LEFT SEMIJOIN (SELECT ProductID
FROM @salesData) AS S
ON P.ProductID == S.ProductID;

The output looks like this:



Note that, in contrast with the first example, each product appears only once regardless of how many
sales it is responsible for:

1,"sprocket",10

4,"dibber",31

5,"grommet",35
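
The complementary antisemijoin (a sketch over the same rowsets) returns the products that have never been purchased, without producing the null columns seen in the earlier left outer join example:

Using a left antisemijoin to find products that have never been purchased
@neverPurchased =
    SELECT P.ProductID, P.ProductName, P.UnitPrice
    FROM @productData AS P
    LEFT ANTISEMIJOIN (SELECT ProductID
                       FROM @salesData) AS S
    ON P.ProductID == S.ProductID;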
For more information about joins, see:

U-SQL SELECT selecting from joins


https://aka.ms/Ysqops

U-SQL also supports PIVOT and UNPIVOT operations. The PIVOT operation transforms a set of values in a
specified column in a rowset into a series of columns, and uses a specified aggregation to calculate the
values for the data for these columns. UNPIVOT performs the opposite task, taking the data for a set of
columns and converting them into rows. For detailed information, see:

PIVOT and UNPIVOT (U-SQL)


https://aka.ms/Dbhzhj

Performing federated queries


You use the Azure portal to add data sources to an
ADLA account that reference Data Lake Store and
Blob storage. However, U-SQL provides the
CREATE DATA SOURCE statement that you use to
connect to SQL Server databases held in SQL
Database, SQL Data Warehouse, or on-premises
servers. You then use the SELECT … FROM
EXTERNAL … statement to send requests to these
sources to perform queries as part of a U-SQL job.
Each database acts autonomously, returning the
data back to the U-SQL job where it is processed
and joined with data from other sources if
required.

To create an external data source, perform the following tasks:

1. Use the New-AzureRmDataLakeAnalyticsCatalogCredential PowerShell cmdlet to create a credential
that U-SQL uses to access the remote SQL Server database. The credential includes the secret
information necessary to connect to the database (typically, the user name and password). The
following cmdlet shows an example; you will be prompted for the username and password when the
cmdlet runs:
New-AzureRmDataLakeAnalyticsCatalogCredential -AccountName "<ADLA Account Name>" -DatabaseName "<Database in ADLA Catalog>" -CredentialName "<Name for your credential>" -Credential (Get-Credential) -DatabaseHost "<AZURE SQL DATABASE SERVER>.database.windows.net" -Port 1433;

2. Using U-SQL, create a data source that references the remote SQL Server database using this
credential. The following code illustrates the syntax:
CREATE DATA SOURCE IF NOT EXISTS <Name of new data source>
FROM AZURESQLDB
WITH (
    PROVIDER_STRING = "Database=<name of SQL Server database on the remote server>;Trusted_Connection=False;Encrypt=True;",
    CREDENTIAL = <Name of the credential created by using PowerShell>,
    REMOTABLE_TYPES = (bool, byte, sbyte, short, int, long, decimal, float, double, string, DateTime)
);
3. Retrieve data from the remote database by using the FROM EXTERNAL clause:

@results =
SELECT <column 1>, <column 2>, …
FROM EXTERNAL <Name of data source> LOCATION "<Table in remote database>";

For additional information on creating external data sources, see:

CREATE DATA SOURCE (U-SQL)

https://aka.ms/Sqx2d7

For details on how to query data in external data sources, see:

U-SQL SELECT selecting from an external rowset

https://aka.ms/Iy8870
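
Putting these steps together, a typical federated query joins the external rowset with data held locally.
The following sketch assumes a data source named MyAzureSqlDb (created by using the steps above) that
exposes a dbo.Customers table, and a hypothetical SalesData.dbo.Orders table in the ADLA catalog; all of
these names are illustrative only.

Joining external SQL Database data with catalog data (illustrative sketch)

@customers =
    SELECT CustomerID, CustomerName
    FROM EXTERNAL MyAzureSqlDb LOCATION "dbo.Customers";

@orders =
    SELECT CustomerID, OrderValue
    FROM SalesData.dbo.Orders;

@customerTotals =
    SELECT C.CustomerName, SUM(O.OrderValue) AS TotalOrderValue
    FROM @orders AS O
    INNER JOIN @customers AS C
    ON O.CustomerID == C.CustomerID
    GROUP BY C.CustomerName;

OUTPUT @customerTotals
TO "/CustomerTotals.csv"
USING Outputters.Csv(outputHeader: true, quoting: false);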

Demonstration: Grouping and analyzing data


In this demonstration, you will see how to:

 Use windowing functions and create a catalog view.

 Create an external data source that references a SQL Database credential.

 Use a federated query to join data from SQL Database with data retrieved from the ADLA catalog.

Check Your Knowledge


Question

Which type of operation enables you to find, in one rowset, all the data that has no corresponding
values in another?

Select the correct answer.

INNER JOIN

PIVOT

OVER

LEFT JOIN

HAVING

Lab: Processing big data using Data Lake Analytics


Scenario
You work for Adatum as a data engineer, and you have been asked to build a traffic surveillance system
for traffic police. This system must be able to analyze significant amounts of dynamically streamed data,
captured from speed cameras and automatic number plate recognition (ANPR) devices, and then
crosscheck the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that can read vehicle registration
plates.

For this phase of the project, you are going to use Data Lake Analytics to calculate the average speeds
detected by speed cameras, use data joins within a Data Lake Analytics job to generate speeding notices
linked to vehicle owner information, scale up the system so that Data Lake Analytics uses speed camera
data stored in the cloud in Azure SQL Database, and use Power BI to present the average speeds using
map visualizations.

Objectives
After completing this lab, you will be able to:

 Create and test a Data Lake Analytics job.

 Use Data Lake Analytics with data joins.

 Use Data Lake Analytics with SQL Database.

 Use Data Lake Analytics to categorize data and present results using Power BI.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Estimated Time: 90 minutes

Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin
Password: Pa55w.rd

Exercise 1: Create and test a Data Lake Analytics job


Scenario
You are going to use Data Lake Analytics to calculate the average speeds detected by speed cameras,
across a city area. In this exercise, you will use Data Lake Analytics to analyze speed camera data, and
calculate the rolling average speed over the previous 25 observations for each speed camera.

The main tasks for this exercise are as follows:

1. Prepare data files for a local Data Lake Analytics instance

2. Create a new Data Lake project

3. Run a Data Lake Analytics job locally

4. Create a Data Lake Analytics account



5. Prepare data files for a cloud Data Lake Analytics instance

6. Run a Data Lake Analytics job in the cloud

7. Edit a Data Lake Analytics job to add rolling averages

8. Run the updated Data Lake Analytics job

 Task 1: Prepare data files for a local Data Lake Analytics instance

 Task 2: Create a new Data Lake project

 Task 3: Run a Data Lake Analytics job locally

 Task 4: Create a Data Lake Analytics account

 Task 5: Prepare data files for a cloud Data Lake Analytics instance

 Task 6: Run a Data Lake Analytics job in the cloud

 Task 7: Edit a Data Lake Analytics job to add rolling averages

 Task 8: Run the updated Data Lake Analytics job

Results: At the end of this exercise, you will have prepared data files for a local Data Lake Analytics
instance, created a new Data Lake project, and run this Data Lake Analytics job locally. Also, you will have
created a Data Lake Analytics account, prepared data files for a cloud Data Lake Analytics instance, run
this Data Lake Analytics job in the cloud—then edited the job to add rolling averages, and run the
updated job.

Exercise 2: Use Data Lake Analytics with data joins


Scenario
You need to be able to generate and send out speeding tickets that are addressed to the registered
owners of vehicles that have been caught exceeding the local speed limit by the cameras. In this exercise,
you will add functionality to your Data Lake Analytics job to generate fines/summonses from speed data
and vehicle owner address information.

The main tasks for this exercise are as follows:

1. Prepare data files for a local Data Lake Analytics instance

2. Create a new Data Lake project

3. Test the Data Lake Analytics job locally


4. Modify the Data Lake Analytics job to use joins

5. Test the modified Data Lake Analytics job locally

6. Run the updated Data Lake Analytics job in the cloud



 Task 1: Prepare data files for a local Data Lake Analytics instance

 Task 2: Create a new Data Lake project

 Task 3: Test the Data Lake Analytics job locally

 Task 4: Modify the Data Lake Analytics job to use joins

 Task 5: Test the modified Data Lake Analytics job locally

 Task 6: Run the updated Data Lake Analytics job in the cloud

Results: At the end of this exercise, you will have prepared data files for a local Data Lake Analytics
instance, created a new Data Lake project and tested this job locally, then modified the job to use joins,
and tested the job locally before running it in the cloud.

Exercise 3: Use Data Lake Analytics with SQL Database


Scenario
You need to scale up the speed camera system to use speed camera data stored in the cloud in SQL
Database, and then ensure that Data Lake Analytics can use this data in a secure manner. In this exercise,
you will replace the “temporary” vehicle owner CSV file used in the previous exercise with data held in a
SQL Database. You will create a data source for this database, using stored database credentials, and
update the job to use this database rather than the CSV file.

The main tasks for this exercise are as follows:

1. Create a SQL Database

2. Upload data to SQL Database and add an index


3. Store a SQL Database credential in the Data Lake Analytics catalog using PowerShell

4. Configure a Visual Studio-based Data Lake Analytics job to use stored credentials

5. Run a Data Lake Analytics job using stored credentials to access data in SQL Database

 Task 1: Create a SQL Database

 Task 2: Upload data to SQL Database and add an index

 Task 3: Store a SQL Database credential in the Data Lake Analytics catalog using
PowerShell

 Task 4: Configure a Visual Studio-based Data Lake Analytics job to use stored
credentials

 Task 5: Run a Data Lake Analytics job using stored credentials to access data in SQL
Database

Results: At the end of this exercise, you will have created a SQL Database, uploaded data to SQL Database
and added an index. You will also have stored a SQL Database credential in the Data Lake Analytics
catalog using PowerShell, configured a Data Lake Analytics job to use stored credentials, and run this job
using stored credentials to access data in SQL Database.

Exercise 4: Use Data Lake Analytics to categorize data and present results
using Power BI
Scenario
You need to be able to present average speed data on a digital map, so that it’s easy to see traffic
patterns across a city area. In this exercise, you will generate analytics for speed cameras, showing the
number of cars that passed each camera in different speed “buckets” (< 10 mph, 10-29 mph, 30-49 mph,
50-69 mph, 70-99 mph, and 100+ mph); this analysis should incorporate current and historical data from
the speed cameras. You will then visualize the data using the ArcGIS map control in Power BI.

The main tasks for this exercise are as follows:


1. Create a new Data Lake project

2. Test the Data Lake Analytics job locally then run it in the cloud

3. Use Power BI to visualize the data

4. Lab closedown

 Task 1: Create a new Data Lake project

 Task 2: Test the Data Lake Analytics job locally then run it in the cloud

 Task 3: Use Power BI to visualize the data

 Task 4: Lab closedown

Results: At the end of this exercise, you will have created a new Data Lake project, tested the job locally,
run it in the cloud, and then used Power BI to visualize the data.

Module Review and Takeaways


In this module, you have learned about:

 The purpose of Azure Data Lake Analytics, and how you create and run jobs.

 Using U-SQL to process and analyze data.


 Using windowing to sort data and perform aggregated operations, and how to join data from
multiple sources.

Module 6
Implementing Custom Operations and Monitoring
Performance in Azure Data Lake Analytics
Contents:
Module Overview 6-1 

Lesson 1: Incorporating custom functionality into analytics jobs 6-2 

Lesson 2: Optimizing jobs 6-19 


Lesson 3: Managing jobs and protecting resources 6-37 

Lab: Implementing custom operations and monitoring performance in Azure


Data Lake Analytics 6-41 

Module Review and Takeaways 6-45 

Module Overview
The built-in functionality of U-SQL in Microsoft® Azure® Data Lake Analytics (ADLA) provides
a powerful platform for performing common analytical operations. Additionally, the ability to use C#
functions and operators inline within U-SQL expressions adds further flexibility. However, there might be
times when you need functionality that cannot be easily implemented by using U-SQL or simple inline C#
expressions. Perhaps your data is in a format that is different from that used by the built-in extractors—or
maybe you need to implement a custom analytics function. ADLA has extension points that you use to
incorporate your own features into jobs.

Another key aspect of ADLA concerns managing and optimizing jobs. An ADLA job can consume
considerable resources, so you might need to control which users actually use an ADLA account. You
should also ensure that your jobs run in as optimal a manner as possible, both in terms of time taken and
resources utilized. Therefore, you need to understand how to monitor jobs and the options available for
tuning them.

Objectives
In this module, you will learn how to:

 Incorporate custom features and assemblies into U-SQL.

 Optimize jobs to ensure efficient operations.

 Implement security to protect ADLA jobs and resources.



Lesson 1
Incorporating custom functionality into analytics jobs
You should consider the functionality built into ADLA through U-SQL as a starting point for performing
analytics operations rather than a definitive set of tools. Think of ADLA as a framework rather than a
complete solution (a little like Hadoop or Spark); you then use this framework to implement the analytical
features and processes required by your organization.

Lesson Objectives
After completing this lesson, you will be able to:

 Describe how to create custom extensions that incorporate code written using the .NET Framework.

 Create custom extractors to retrieve data held in files of various formats.

 Create custom outputters to save data in different formats.


 Implement user-defined functions that invoke custom code as part of the analytical processing for a
job.

 Create user-defined aggregators that combine the data processed by a job.

 Create user-defined operators that perform various tasks during the processing life cycle of a job.

 Incorporate R code into a U-SQL script.

 Incorporate Python code into a U-SQL script.

 Add cognitive capabilities to a U-SQL script.

Building and deploying .NET Framework custom extensions


The .NET Framework is a powerful technology
that you use to implement complex operations
that interact with the operating system and other
resources in a safe and secure manner. You write
code that runs using the .NET Framework in a
variety of languages, including C# and Visual
Basic.

.NET Framework code executes using the


common language runtime (CLR). The CLR is
responsible for handling the environment in
which the code runs, in addition to managing
resources. Applications written for the CLR are
compiled into assemblies that contain pseudo machine language instructions (this is a hardware-neutral
set of instructions that can be deployed to any platform on which the CLR is installed). At runtime, the CLR
loads these assemblies, performs various verification and security checks, then converts the pseudo
machine language instructions into real machine code instructions for the underlying platform—and then
executes the resulting code. For further details on how the .NET Framework and CLR operate, see:

Overview of the .NET Framework


https://aka.ms/Mxcmt9

ADLA incorporates the CLR, enabling you to run .NET Framework code as part of a U-SQL job. However,
you must compile your code into an assembly, and arrange for the assembly to be deployed as part of the
job.

You build and deploy an assembly in two ways:

1. By using a code-behind file that contains your .NET Framework code (typically written using C#). The
Azure Data Lake Tools for Visual Studio support this development technique—they automatically
create an assembly from your code, add it to the job, and then arrange for the assembly to be
removed from memory when the job has completed. The advantage of this approach is that it is
quick and easy. The primary disadvantage is that it limits the reusability of your .NET Framework
code; if you want to include the same routine in several U-SQL jobs, you must copy the source code
for that routine into each job. This can make maintenance difficult.

2. By using Visual Studio® to write your code and compiling it into a separate assembly. You manually
arrange for the assembly to be uploaded to your ADLA account, and add U-SQL meta-statements
that reference the assembly, so that the job knows where to find it. In this way, you reuse the same
assembly across multiple jobs.

Assemblies are stored in the U-SQL catalog held in the underlying Data Lake Store account. You secure
them as you would any other item stored in the U-SQL catalog.
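
As a rough illustration of the second approach, the following U-SQL sketch registers a compiled assembly
that has already been uploaded to the Data Lake Store, and then references it from a job. The database
name, assembly name, and file path are hypothetical.

Registering and referencing a custom assembly (illustrative sketch)

// One-off registration of the uploaded DLL in a catalog database
USE DATABASE StockAnalysis;
CREATE ASSEMBLY IF NOT EXISTS [PriceMovementAnalysis] FROM "/Assemblies/PriceMovementAnalysis.dll";

// In each U-SQL job that needs the custom code
REFERENCE ASSEMBLY StockAnalysis.[PriceMovementAnalysis];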
For more information, see Module 5: Processing Big Data using Azure Data Lake Analytics.

For more information about building and deploying assemblies for U-SQL jobs, see:

Using assemblies
https://aka.ms/Rry3zd

Demonstration: Creating and deploying a .NET Framework assembly to


ADLA
In this demonstration, you will see how to:

 Use a code-behind file to create a .NET Framework assembly.

 Manually create and deploy an assembly locally, for testing.

 Deploy an assembly to an ADLA account in Azure.



Implementing custom extractors


The U-SQL runtime includes six default extractor
functions—two for CSV files, two for generic text
files, and two for TSV files. Each pair of extractors
comes in two flavors—a basic extractor with
default settings, and a more advanced extractor
that enables you to customize features such as
the delimiters, escape characters, rows to skip,
and so on.

The default extractors are defined in the


Microsoft.Analytics.Udo.Extractors class. If you
are familiar with C#, the following code fragment
shows the definition of this class. Each extractor
extends the Microsoft.Analytics.Interfaces.IExtractor abstract base class—this is a requirement of U-SQL:

The definition of the Extractors class


namespace Microsoft.Analytics.Udo
{
public sealed abstract class Extractors
{
public static Microsoft.Analytics.Interfaces.IExtractor Csv(System.Text.Encoding
encoding);
public static Microsoft.Analytics.Interfaces.IExtractor Csv(System.String
rowDelimiter, System.Nullable<System.Char> escapeCharacter, System.String nullEscape,
System.Text.Encoding encoding, System.Boolean quoting, System.Boolean silent,
System.Int32 skipFirstNRows, System.String charFormat);
public static Microsoft.Analytics.Interfaces.IExtractor Text(System.Text.Encoding
encoding);
public static Microsoft.Analytics.Interfaces.IExtractor Text(System.Char
delimiter, System.String rowDelimiter, System.Nullable<System.Char> escapeCharacter,
System.String nullEscape, System.Text.Encoding encoding, System.Boolean quoting,
System.Boolean silent, System.Int32 skipFirstNRows, System.String charFormat);
public static Microsoft.Analytics.Interfaces.IExtractor Tsv(System.Text.Encoding
encoding);
public static Microsoft.Analytics.Interfaces.IExtractor Tsv(System.String
rowDelimiter, System.Nullable<System.Char> escapeCharacter, System.String nullEscape,
System.Text.Encoding encoding, System.Boolean quoting, System.Boolean silent,
System.Int32 skipFirstNRows, System.String charFormat);
}
}

Although these extractors cover many basic cases, you will likely find that your data is frequently held in a
different format, such as JSON or XML. In this case, you will need to create a custom extractor. To do this,
you implement your own class that inherits from the IExtractor abstract class. The IExtractor class defines a
single abstract method named Extract that you should override:
public override IEnumerable<IRow> Extract(IUnstructuredReader input,
IUpdatableRow output)

The purpose of this method is to retrieve the data from the input source then parse it, and pass it back
one row at a time. The value returned should be the data for the next available row. This method is called
by an enumerator in the U-SQL runtime, which is responsible for requesting each row and passing the
rows off for processing.

The parameters to the Extract method are:

 A reference to an IUnstructuredReader object that is used to read the raw input data, in whatever
format it is supplied (JSON, XML, and so on). This object supplies a property named BaseStream that
represents the input data as a stream—you use a StreamReader object to read the data from this
stream. Your code should then parse this data and use it to create a series of rows that will be passed
to U-SQL.

 An IUpdatableRow object that represents a row of data that has been read in using the
IUnstructuredReader object. The value returned is an enumerable collection of these objects.

You are free to implement the Extract method in your own way, according to the format of the data files
from which you are extracting data. For more detailed information and examples on creating custom
extractors, see:

User-defined Extractors
https://aka.ms/Qgy6cb
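
As a minimal sketch rather than a definitive implementation, the following C# class reads a pipe-delimited
text file and emits two columns. The column names, types, and delimiter are illustrative assumptions, and
must match the schema declared in the EXTRACT statement that uses the extractor; the attribute applied
to the class is described in the next topic.

A simple pipe-delimited extractor (illustrative sketch)

using System.Collections.Generic;
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = false indicates that the input file can be split and read by parallel vertices
[SqlUserDefinedExtractor(AtomicFileProcessing = false)]
public class PipeDelimitedExtractor : IExtractor
{
    public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
    {
        // Wrap the raw input stream in a reader and process the data one line at a time
        using (var reader = new StreamReader(input.BaseStream, Encoding.UTF8))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var fields = line.Split('|');

                // Copy the parsed values into the output row; the column names must match the EXTRACT schema
                output.Set<string>("Ticker", fields[0]);
                output.Set<int>("Price", int.Parse(fields[1]));

                // Hand the completed row back to the U-SQL runtime
                yield return output.AsReadOnly();
            }
        }
    }
}

In a U-SQL script, you would reference the assembly containing this class and then specify USING new
PipeDelimitedExtractor(); in the EXTRACT statement.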

Optimizing an extractor
If the format of the input file contains multiple distinct input records, with each record on a single line, the
U-SQL runtime can split the file into pieces and read in each piece in parallel using multiple vertices.
However, if your input file consists of a single item (such as a JSON array or XML document), then the
data can only be extracted by a single vertex. You specify whether the file format supports splitting by
applying the [SqlUserDefinedExtractor(AtomicFileProcessing = false)] attribute to your extractor class
(setting this attribute to true, the default, informs the U-SQL runtime that it should treat the file as a single,
indivisible item).

Creating custom outputters


You use an outputter to save data to persistent
storage. U-SQL provides outputters for writing
data to CSV, TSV, and text data but, as with
extractors, you define your own outputters if you
need to save data in a different format.

A custom outputter extends the


Microsoft.Analytics.Interfaces.IOutputter abstract
base class. This class contains an abstract method
named Output. U-SQL calls the Output method
once for each row being output.

The method takes two parameters:

 An IRow object that contains the data for a row to be output. You retrieve the individual columns
from the IRow object by using the Get<> generic method, and specifying the name of the column.
You obtain the details of each column (such as its name and type) by reading the Schema property of
the IRow object. This property contains a collection of IColumn objects, and each IColumn object
describes a single column.
 An IUnstructuredWriter object that represents the destination. The BaseStream property of this
object contains an output stream that you use to write data to the destination.

The IOutputter class also provides a Close method that you can override to close the destination and
release resources, if necessary.

The following code shows an example of a simple outputter that saves data as an XML document. The
outputter has been simplified; a production version would use techniques such as reflection to support
any data type rather than the limited set available here:

A simple XML outputter


public class SimpleXMLOutputter : IOutputter
{
private string xmlDocType = String.Empty;
private string xmlRowType = String.Empty;
private Stream outputWriter;

// The constructor specifies the XML tags to use to wrap the data (defaults to
<Data><Row>...</Row><Row>...</Row>...</Data>)
public SimpleXMLOutputter(string xmlDocType = "Data", string xmlRowType = "Row")
{
this.xmlDocType = xmlDocType;
this.xmlRowType = xmlRowType;
this.outputWriter = null;
}

public override void Output(IRow input, IUnstructuredWriter output)


{
// If this is the first row, then also output the document type tag
if (this.outputWriter == null)
{
this.outputWriter = output.BaseStream;
string header = $@"<{this.xmlDocType}>{Environment.NewLine}";
this.outputWriter.Write(Encoding.UTF8.GetBytes(header), 0, header.Length);
}

// Get the schema of the row


var columnSchema = input.Schema;

// Iterate through the columns in the row and convert the data to XML encoded
strings
StringBuilder rowData = new StringBuilder($@"<{xmlRowType}>");
foreach (var column in columnSchema)
{
rowData.Append($@"<{column.Name}>");

// This outputter currently only recognizes int, double, string, and DateTime
data.
if (column.Type == typeof(int))
rowData.Append($@"{input.Get<int>(column.Name)}");
if (column.Type == typeof(double))
rowData.Append($@"{input.Get<double>(column.Name)}");
if (column.Type == typeof(string))
rowData.Append($@"{input.Get<string>(column.Name)}");
if (column.Type == typeof(DateTime))
rowData.Append($@"{input.Get<DateTime>(column.Name)}");

rowData.Append($@"</{column.Name}>");
}
rowData.Append($@"</{xmlRowType}>");
rowData.Append(Environment.NewLine);

// Send the XML encoded string to the output stream


string data = rowData.ToString();
this.outputWriter.Write(Encoding.UTF8.GetBytes(data), 0, data.Length);
}

// Write the document end tag, flush any remaining buffered output, and then close
the destination
public override void Close()
{

string trailer = $@"</{this.xmlDocType}>";


this.outputWriter.Write(Encoding.UTF8.GetBytes(trailer), 0, trailer.Length);
this.outputWriter.Flush();
this.outputWriter.Close();
}
}

You use this outputter in a U-SQL job as follows:

Using the simple XML outputter


REFERENCE ASSEMBLY CustomOutputters;

@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 100000, 1),
(100, "oiuoiu", "poipoi", 115000, 2),
(101, "slkjlkjlkj", "nbmnbmn", 250000, 3),
(102, "xzcvxvn", "bemnyu", 33800, 1),
(103, "qutii", "uyfiu", 0, 2),
(104, "sakak", "lkpoi", 190000, 3),
(105, "kjsakjk", "cvnbmnx", -500, 1),
(106, "wqytyr", "psagdj", 101000, 2)
) AS Employees(EmployeeID, ForeName, LastName, Salary, Dept);

// Write the data to an XML file


OUTPUT @testData
TO "/EmployeeData.xml"
USING new CustomOutputters.SimpleXMLOutputter(xmlDocType: "Employees", xmlRowType:
"Employee");

The results look like this:

<Employees>
<Employee><EmployeeID>99</EmployeeID><ForeName>jhgjhgjh</ForeName><LastName>
iuoioui</LastName><Salary>100000</Salary><Dept>1</Dept></Employee>

<Employee><EmployeeID>100</EmployeeID><ForeName>oiuoiu</ForeName><LastName>p
oipoi</LastName><Salary>115000</Salary><Dept>2</Dept></Employee>
<Employee><EmployeeID>101</EmployeeID><ForeName>slkjlkjlkj</ForeName><LastNa
me>nbmnbmn</LastName><Salary>250000</Salary><Dept>3</Dept></Employee>

<Employee><EmployeeID>102</EmployeeID><ForeName>xzcvxvn</ForeName><LastName>
bemnyu</LastName><Salary>33800</Salary><Dept>1</Dept></Employee>

<Employee><EmployeeID>103</EmployeeID><ForeName>qutii</ForeName><LastName>uy
fiu</LastName><Salary>0</Salary><Dept>2</Dept></Employee>

<Employee><EmployeeID>104</EmployeeID><ForeName>sakak</ForeName><LastName>lk
poi</LastName><Salary>190000</Salary><Dept>3</Dept></Employee>

<Employee><EmployeeID>105</EmployeeID><ForeName>kjsakjk</ForeName><LastName>
cvnbmnx</LastName><Salary>-500</Salary><Dept>1</Dept></Employee>

<Employee><EmployeeID>106</EmployeeID><ForeName>wqytyr</ForeName><LastName>p
sagdj</LastName><Salary>101000</Salary><Dept>2</Dept></Employee>
</Employees>

Optimizing an outputter
If the output file format consists of multiple distinct records, you can arrange for the U-SQL runtime to
output groups of records in parallel using separate vertices. You specify whether the file format supports
splitting by applying the [SqlUserDefinedOutputter(AtomicFileProcessing = false)] attribute to your
outputter class.
For more information, see:

Use user-defined outputters


https://aka.ms/Phm7h0

Creating user-defined functions


User-defined functions (UDFs) operate on a row-
by-row basis as part of a U-SQL job. They can
take any number of parameters, but return a
scalar value that is typically used in the SELECT or
WHERE clause (amongst others) of a U-SQL
statement.

You create a user-defined function as a regular


static function of a class. The following code
shows the definition of the SuspiciousMovement
function from Demo 1 in this lesson:

Creating a user-defined function


public class PriceAnalyzer
{
public static bool SuspiciousMovement(string ticker, int newPrice, DateTime
quoteTime)
{
// code for function goes here

}
}

In this example, the function takes three parameters and returns a Boolean value; it’s intended to be used
in the WHERE clause of a U-SQL statement, as shown here. The function runs once for each row
processed. Notice that you prefix the function call with the namespace and class in which the function is
defined:

Calling a user-defined function from a U-SQL script


...
// Use the SuspiciousMovement function to find suspect stock price movements
@result =
SELECT Ticker,
Price,
QuoteTime
FROM @stockData
WHERE PriceMovementAnalysis.PriceAnalyzer.SuspiciousMovement(Ticker, Price,
QuoteTime) == true;


Note: User-defined functions enable you to maintain state between function calls. The class
in which the function is defined creates its own state cache that you use if you need to compare
the data in different rows. The SuspiciousMovement function illustrated in Demo 1 uses this
technique to store the price and quote time of a stock item in a List collection for comparison the
next time another row for the same stock item appears.
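
The following sketch illustrates the general pattern of caching state in the class; this is not the actual
demonstration code, and the dictionary cache and the 20 percent threshold are purely illustrative
assumptions.

Maintaining state between calls in a user-defined function (illustrative sketch)

using System;
using System.Collections.Generic;

namespace PriceMovementAnalysis
{
    public class PriceAnalyzer
    {
        // State retained between calls made within the same vertex: the last price seen for each ticker
        private static readonly Dictionary<string, int> lastPrices = new Dictionary<string, int>();

        public static bool SuspiciousMovement(string ticker, int newPrice, DateTime quoteTime)
        {
            // quoteTime is unused in this simplified sketch
            bool suspicious = false;
            int previousPrice;
            if (lastPrices.TryGetValue(ticker, out previousPrice) && previousPrice > 0)
            {
                // Flag the quote if the price has moved by more than 20 percent since the previous quote
                suspicious = Math.Abs(newPrice - previousPrice) > previousPrice * 0.2;
            }

            // Remember this price for the next row that references the same ticker
            lastPrices[ticker] = newPrice;
            return suspicious;
        }
    }
}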

For more details, see:

Use user-defined functions: UDF


https://aka.ms/Wdfqdt

Creating user-defined aggregators


In contrast to a user-defined function that runs
once for each row, a user-defined aggregator is
used to accumulate a value across a collection of
rows. U-SQL provides a number of built-in
aggregation functions such as SUM, MAX, MIN,
and COUNT, but you extend the range of
aggregations available by defining your own
custom aggregator functions.

You define a user-defined aggregator by creating


a class that extends the
Microsoft.Analytics.Interfaces.IAggregate abstract
base class. IAggregate is a generic class that
expects type parameters for the arguments passed in to the aggregator (the type of the data being
aggregated), in addition to the type of the value returned by the aggregator. The base class contains the
following methods that you should override:

 Init. This method runs once, at the very beginning of the aggregation process. You use this method
to initialize any variables required to calculate the aggregate value.

 Accumulate. This method is executed for each row being aggregated. The parameters contain the
data for the row, passed in by the U-SQL runtime.

 Terminate. This method runs once, at the end of the aggregation process. The value returned by this
method is passed to the U-SQL runtime as the result of the aggregation.

The following example shows a version of the AVG (average) aggregate function (called PositiveAVG) that
discards zero and negative values from its calculations. The value returned is the average of all the
positive data values only:

The PositiveAVG aggregator


// PositiveAVG expects data values to be doubles, and returns a double
public class PositiveAVG : IAggregate<double, double>
{
private double totalOfPositiveValues;
private int numOfPositiveValues;

public override void Init()


{
this.totalOfPositiveValues = 0;
this.numOfPositiveValues = 0;
}

public override void Accumulate(double dataValue)


{
// Only include positive values in the calculation
if (dataValue > 0)
{
this.totalOfPositiveValues += dataValue;
this.numOfPositiveValues++;
}
}

public override double Terminate()


{
if (numOfPositiveValues > 0)
{
return this.totalOfPositiveValues / this.numOfPositiveValues;
}
else
{
return 0;
}
}
}

To run a user-defined aggregator in a U-SQL script, you call the AGG generic function, specifying the
aggregator as the function type parameter, together with the data parameters expected by the
aggregator. Note that, as with any aggregation, you should use a GROUP BY clause to specify the
groupings for the aggregator.

The following example shows how to call the PositiveAVG function from U-SQL:

Calling the PositiveAVG aggregator from U-SQL


REFERENCE ASSEMBLY CustomAggregations;

@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 100000, 1),
(100, "oiuoiu", "poipoi", 115000, 2),
(101, "slkjlkjlkj", "nbmnbmn", 250000, 3),
(102, "xzcvxvn", "bemnyu", 33800, 1),
(103, "qutii", "uyfiu", 0, 2),
(104, "sakak", "lkpoi", 190000, 3),
(105, "kjsakjk", "cvnbmnx", -500, 1),
(106, "wqytyr", "psagdj", 101000, 2)
) AS Employees(EmployeeID, ForeName, LastName, Salary, Dept);

// Find the average salary across the company


@averageSalary =
SELECT AGG<CustomAggregations.PositiveAVG>(Salary) AS AverageSalary
FROM @testData;

// Find the average salary by department


@averageSalaryByDepartment =
SELECT AGG<CustomAggregations.PositiveAVG>(Salary) AS AverageSalary, Dept
FROM @testData
GROUP BY Dept;

// Save the results


OUTPUT @averageSalary
TO "/AverageSalary.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

OUTPUT @averageSalaryByDepartment
TO "/AverageSalaryByDepartment.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

The results in the AverageSalary.csv file look like this:

AverageSalary

131633.33333333334

In the AverageSalaryByDepartment.csv file, they look like this:

AverageSalary Dept

66900 1

108000 2

220000 3

Optimizing user-defined aggregators


The code that implements a user-defined aggregator is opaque to the U-SQL runtime. Therefore, the U-
SQL runtime might not be able to optimize your job and ensure that it maximizes the parallelism available
by dividing the processing across vertices. However, if the calculation performed by an aggregator is
associative, you tag the IAggregate class with the SqlUserDefinedReducer(IsRecursive = true) attribute.
This informs the runtime that it can divide up the input data into parallel chunks, run parallel instances of
the aggregator over each chunk, and then repeat the operator over the results from each chunk to
calculate the final result (this is the recursive bit).

You should note that not all aggregations are associative, and applying the
SqlUserDefinedReducer(IsRecursive = true) attribute to an inherently nonassociative operation will likely
lead to incorrect results.

Note: An associative operator returns the same value regardless of the order in which its
operands are evaluated. For example, the expression a + b + c can be evaluated as (a + b) + c (a
is added to b, and the result is then added to c), or a + (b + c) (b is added to c, and the result is
then added to a). This associativity is used by the built-in SUM aggregate function in U-SQL, and
enables the optimizer to split the data into subsets. The sum of each subset can be computed in
parallel, and the results then added together in a recursive operation. In pseudocode, you might
think of the situation like this:
SUM(a, b, c, d, e) = SUM(SUM(a, b, c), SUM(d, e))

For more information on creating and using user-defined aggregators, see:

Use user-defined aggregates: UDAGG


https://aka.ms/G6iod2

Demonstration: Building and using a user-defined extractor, outputter,


and aggregator
In this demonstration, you will see how to:
 Create and use a custom extractor.

 Create and use a custom outputter.

 Create and use a custom aggregator.

Incorporating R code into a U-SQL script


R is a language designed to perform statistical
analysis. There’s a plethora of R libraries available,
and many complex routines have been published
that users frequently adapt to their own use. You
incorporate these routines into your U-SQL jobs
by using the R extensions for ADLA.
The R extensions are available in a separate
assembly called ExtR. You add the extensions and
the other required supporting files to a catalog
by using the Azure portal; go to the Sample
Scripts page in the ADLA blade, and then click
Install U-SQL Extensions. You must then add a
reference to the ExtR assembly to your U-SQL code.

You invoke R code from U-SQL by creating an instance of the Extension.R.Reducer class using the REDUCE
statement. The reducer passes the data held in a U-SQL rowset to the R script as a data frame named
inputFromUSQL. You manipulate the data in this data frame in the same way that you would any regular
R data frame. You pass the results of the processing back to U-SQL by storing them in another data frame

and returning this data frame, or by writing the results to a data frame named outputToUSQL (this is the
default data frame returned by the reducer if you don’t specify otherwise). The reducer converts the data
frame back into a U-SQL rowset.

Note: Lesson 2 describes reducers and the REDUCE statement in more detail.

There are some limitations in the conversion process between a U-SQL rowset and an R data frame; the
rowset or data frame can only contain columns with the double, string, bool, integer, or byte types,
although you can pass byte arrays by serializing them as strings. You should also be aware that the R
Factor type is not available in U-SQL, but you can set the Boolean stringsAsFactors parameter to true
when you call the R reducer—this will convert string values in the U-SQL rowset into factors in the R data
frame. In your R code, you can transform the data into any valid R data type for processing, providing you
convert the data into a type recognized by U-SQL at the end of the routine.

You can include your R code inline in a U-SQL script. The following example shows a U-SQL script that
calls an R routine to create a basic statistical summary of stock price movement data. The R code
generates a data frame listing the ticker, opening price, closing price, lowest price, and highest price for
each stock:

Calling R code from a U-SQL script


REFERENCE ASSEMBLY [ExtR];

DECLARE @stockData = @"/StockPriceData.csv";


DECLARE @resultsData = @"/StockPriceAnalysis.csv";

@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);

DECLARE @RScript = @"


stockData <- rxImport(inputFromUSQL, outFile = NULL)
refactoredStockData <- rxFactors(stockData, c('Ticker'))
sortedData <- refactoredStockData[order(refactoredStockData[,3]),]
openingPrices <- aggregate(Price ~ Ticker, sortedData, head)
closingPrices <- aggregate(Price ~ Ticker, sortedData, tail)
lowestPrices <- aggregate(Price ~ Ticker, sortedData, min)
highestPrices <- aggregate(Price ~ Ticker, sortedData, max)

results <- data.frame(openingPrices[1], openingPrices[[2]][,1], closingPrices[[2]][,6],


lowestPrices[2], highestPrices[2])
colnames(results) <- c('Ticker', 'OpeningPrice', 'ClosingPrice', 'LowestPrice',
'HighestPrice')
results
";

@RScriptOutput = REDUCE @stockPriceMovements


ON Ticker
PRODUCE Ticker string, OpeningPrice int, LowestPrice int, HighestPrice int,
ClosingPrice int
USING new Extension.R.Reducer(command : @RScript, stringsAsFactors : false,
rReturnType : "dataframe");

OUTPUT @RScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);

You can also deploy your R code as a separate script—you must upload it to the ADLS store associated
with your ADLA account, and use the DEPLOY RESOURCE U-SQL command to enable the U-SQL runtime
to locate the script and include it with the job when it is executed. The reducer references the R code
using the scriptFile parameter. The following example shows how to do this:

Running an R script from a U-SQL job


REFERENCE ASSEMBLY [ExtR];

DEPLOY RESOURCE @"/StockSummaryScript.R"; // R script deployed as a resource

DECLARE @stockData = @"/StockPriceData.csv";


DECLARE @resultsData = @"/StockPriceAnalysis.csv";

@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);

@RScriptOutput = REDUCE @stockPriceMovements


ON Ticker
PRODUCE Ticker string, OpeningPrice int, LowestPrice int, HighestPrice int,
ClosingPrice int
USING new Extension.R.Reducer(scriptFile : "StockSummaryScript.R", stringsAsFactors :
false, rReturnType : "dataframe");

OUTPUT @RScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);

U-SQL currently supports R 3.2.2, and the R runtime includes the base R package and standard R modules
together with the Microsoft ScaleR package. You install and reference other packages (providing they will
run on R 3.2.2). To do this, you upload the zip file containing the package to ADLS, add a DEPLOY
RESOURCE statement that references this zip file to the U-SQL script, and then use the R install.packages
command in your R code to load the package. Note that, for security reasons, your R code cannot
download packages from other locations (such as the CRAN repository), and you must set the package
repository (repos) parameter of the install.packages function to NULL. For further details and examples
showing how to use R with U-SQL, see:

Analyze your data in Azure Data Lake with R (R extension)


https://aka.ms/Widhra

Partitioning big data


R data frames have to fit into memory. If you are handling a large volume of data in a rowset, a single R
data frame holding the equivalent data could cause a vertex to exceed its memory limit. For this reason,
the maximum size of the input and output rowset passed between U-SQL and R cannot exceed 500 MB.
However, you use the REDUCE statement to partition data into smaller rowsets based on the values in a
key column in the rowset. Each partition can be processed separately. This means that your R code might
only be handed a subset of the U-SQL data. Your U-SQL code is responsible for combining the results
from processing this partitioned data back into an overall result.

Incorporating Python code into a U-SQL script


Python is a useful general-purpose language
that, like R, has proved popular for performing
statistical analysis. You incorporate Python code
into a U-SQL script in a similar manner to that of
R scripts.
The ADLA extensions for Python are held in an
assembly named ExtPython. This assembly is
generated when you install U-SQL Extensions
from the Azure portal (they are installed at the
same time as the R extensions).

You call Python code by using the


Extension.Python.Reducer class with the REDUCE
statement. You reference the Python code by using the pyScript parameter. Your Python code must
contain a function named usqlml_main that takes a Pandas DataFrame object as its parameter, and
returns a Pandas DataFrame as the result. The U-SQL runtime converts the rowset passed to the function
into the Pandas DataFrame, and performs the opposite conversion of the result. All U-SQL numeric and
string types are supported, but U-SQL NULL values are converted into NA values in Python (and vice
versa). Note that the reducer does not support index vectors, and all input DataFrames in the Python
function use a zero-based numeric index.

The following code shows an example, calling a Python function that generates the same statistical
summary as the R example in the previous topic:

Calling Python code from a U-SQL script


REFERENCE ASSEMBLY [ExtPython];

DECLARE @stockData = @"/StockPriceData.csv";


DECLARE @resultsData = @"/StockPriceAnalysis.csv";

@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);

DECLARE @PythonScript = @"


def usqlml_main(df):
import pandas as pd

df = df.sort_values(['QuoteTime'])
prices = df.groupby(['Ticker'])
openingPrices = prices['Price'].first()
lowestPrices = prices['Price'].min()
highestPrices = prices['Price'].max()
closingPrices = prices['Price'].last()
result = pd.concat([openingPrices, lowestPrices, highestPrices, closingPrices], axis =
1).reset_index()
result.columns = ['Ticker', 'OpeningPrice', 'LowestPrice', 'HighestPrice',
'ClosingPrice']
return result
";

@PythonScriptOutput = REDUCE @stockPriceMovements


ALL

PRODUCE Ticker string, OpeningPrice int, LowestPrice int, HighestPrice int,


ClosingPrice int
USING new Extension.Python.Reducer(pyScript : @PythonScript);

OUTPUT @PythonScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);

U-SQL currently supports Python version 3.5.1. The Python extensions include all the standard Python
modules plus Pandas, Numpy, and Numexpr.

For more information, visit:

Tutorial: Get started with extending U-SQL with Python


https://aka.ms/Ovc86y

Adding cognitive capabilities to a U-SQL script


Cognitive Services is a collection of machine
learning APIs that cover more than 25 different
scenarios, from sentiment analysis to facial
recognition. These services use machine learning
to make assessments of the accuracy of the
results. For further details about the various
cognitive services available, see:

Cognitive Services
https://aka.ms/Q1j77v

U-SQL supports the following subset of the


functionality available through Cognitive Services:

 Imaging—facial detection: Detects one or more human faces in an image. Rectangles show where
the faces are in the image, along with face attributes that contain machine learning-based predictions
of facial features.
 Imaging—emotion detection: Takes a facial expression in an image as an input, and returns the
confidence across a set of emotions for each face in the image.

 Imaging—object detection and tagging: Returns information about the different items detected in
an image and attempts to label them.

 Imaging—optical character recognition: Detects text in an image and converts recognized


characters into a character stream.

 Text—sentiment analysis: Detects sentiment (positive and negative) in text.

 Text—key phrase extraction: Performs textual analysis and identifies key phrases in blocks of text.

U-SQL provides a series of user-defined extractors, appliers, and processors for calling the supported
Cognitive Services APIs:

 Cognition.Vision.FaceDetectionExtractor. This extractor reads a graphics (JPG) file and returns a


rowset containing information about the faces (position in the image, estimated age, and gender) for
each face detected.

 Cognition.Vision.EmotionExtractor. This is another extractor that operates in a similar manner to the
FaceDetectionExtractor, except that it includes a string indicating the perceived emotion (and the
confidence of that perception) of each face detected.

 Cognition.Vision.ImageExtractor. This extractor identifies objects in a graphics file, and returns a


rowset containing a serialized byte array that has the raw image data for each object detected.
 Cognition.Vision.ImageTagger. This is a processor that can take the rowset generated by an
ImageExtractor. It creates a text description for each object and returns a list in the Tags column of
the results rowset.

 Cognition.Vision.OcrExtractor. This processor also takes the rowset generated by an ImageExtractor


but returns a rowset containing the text results of OCR analysis on each object.

 Cognition.Text.SentimentAnalyzer. This is a processor that performs sentiment analysis on a rowset


containing string data. You can indicate whether the resulting rowset should contain matches for
positive or negative sentiment.

 Cognition.Text.KeyPhraseExtractor. This is a processor that identifies the key phrases from a rowset
containing string data.

 Cognition.Text.Splitter. This is an applier that you use to cross-correlate the results of the
KeyPhraseExtractor with the original text to tokenize the key phrases.
These user-defined extractors, appliers, and processors are supplied in a series of assemblies that are
installed as part of the U-SQL extensions. You must add the appropriate REFERENCE ASSEMBLY
statements in your code to load the corresponding assembly into your job. You invoke a user-defined
processor by using the PROCESS command in a U-SQL script. The results of the processing are returned in
the output rowset.
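
As an example of this pattern, the following sketch runs sentiment analysis over a hypothetical file of
comment text. The assembly names, the constructor argument, and the Sentiment and Conf output
columns follow the published U-SQL cognitive samples, but you should confirm them against the
extensions actually installed in your account.

Analyzing sentiment with the PROCESS command (illustrative sketch)

REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextSentiment];

@comments =
    EXTRACT Text string
    FROM "/Comments.csv"
    USING Extractors.Csv();

@sentiment =
    PROCESS @comments
    PRODUCE Text,
            Sentiment string,
            Conf double
    READONLY Text
    USING new Cognition.Text.SentimentAnalyzer(true);

OUTPUT @sentiment
TO "/CommentSentiment.csv"
USING Outputters.Csv(outputHeader: true, quoting: false);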

Note: Lesson 2 covers user-defined processors and the PROCESS command in more detail.

For more information on using Cognitive Services with U-SQL, and links to sample code, see:

Tutorial: Get started with the cognitive capabilities of U-SQL


https://aka.ms/Xhdfwf

Demonstration: Incorporating R, Python and cognitive capabilities into a


U-SQL job
In this demonstration, you will see how to:

 Incorporate R code into a U-SQL job.


 Call Python code from a U-SQL job.

 Add cognitive capabilities to a U-SQL job.


MCT USE ONLY. STUDENT USE PROHIBITED
6-18 Implementing Custom Operations and Monitoring Performance in Azure Data Lake Analytics

Check Your Knowledge


Question

Under what circumstances is it suitable to apply the SqlUserDefinedReducer(IsRecursive = true) attribute
to a user-defined aggregator?

Select the correct answer.

If the aggregation operation is not associative

If the data is being read from a file using a custom format that is not understood by the built-in U-SQL
aggregators

If the data is stored in the ADLA catalog

If the aggregation operation is associative

If the aggregation operation is commutative

Lesson 2
Optimizing jobs
ADLA is designed to process massive volumes of data as quickly as possible. However, to achieve these
aims, you need to structure your data and implement your processing to take full advantage of the
scalability available to ADLA jobs. This lesson describes how to design jobs that meet these requirements,
together with some issues of which you should be aware when customizing jobs.

Lesson Objectives
After completing this lesson, you will be able to:

 Explain how best to partition data held in the ADLA catalog to improve scalability.

 Monitor how ADLA jobs are parallelized by examining the vertices created to run these jobs.

 Create user-defined processors.

 Create user-defined appliers.


 Create user-defined combiners.

 Create user-defined reducers.

Partitioning data to optimize jobs


Partitioning is concerned with breaking up your
input data into smaller chunks of related data so
that the U-SQL runtime processes these chunks
in parallel. In an ideal situation, U-SQL would
divide up your data into groups that can each be
processed in isolation by parallel vertices, and
then combine the results from each vertex
together to form the final result. To achieve this
ideal, your data should be organized so that all
rows that are related to each other are stored
together. Each collection of related rows is a
partition.

You classify the input data for U-SQL jobs into two primary types:

 Unstructured files, such as CSV, Text, JSON, XML, and other forms of data input where U-SQL has no
prior knowledge of the data schema. U-SQL requires additional information to partition this data
effectively.

 Structured data held in tables in the ADLA catalog. In these cases, the schema is defined by the
definitions of the tables, and the catalog can also contain statistical information about the distribution
and range of values in each column in a table. U-SQL utilizes this information to help optimize jobs.

Partitioning unstructured data


You use an extractor to read unstructured data. If the extractor has the
[SqlUserDefinedExtractor(AtomicFileProcessing = false)] attribute, the input data is considered to be
splittable and can be read by using a set of parallel tasks. The U-SQL runtime will split the file up into a
series of extents, each of which has a maximum size of 250 MB. A single vertex reads up to four extents. A
new vertex is required for each GB of data.

U-SQL cannot easily partition unstructured data that is held in a single file. Even if a file is splittable, when
the data has been read in, the processing performed by the job might necessitate copying rows between
vertices. This is potentially time-consuming and expensive.

For example, the following U-SQL job reads sales data from a CSV file, and then performs an aggregation
that calculates the total sales value of all items sold in New York:

Aggregating sales information in unpartitioned data


@salesData =
EXTRACT
state string,
productID int,
numSold int,
pricePerItem int
FROM "/SalesData/Sales.csv"
USING Extractors.Csv();

@nyData =
SELECT state, SUM(numSold * pricePerItem) AS value
FROM @salesData
WHERE state == "NY"
GROUP BY state;

Although this approach works, it is very inefficient. The issue is that every row has to be retrieved and then
grouped by state to enable the aggregation to be performed. However, only the results for New York are
actually required, and the remainder are discarded. This technique wastes time on unnecessary I/O, and
also wastes compute resources.

A better approach is to partition data manually by using file groups. Module 5 described how it’s possible
to read multiple files, and how you generate filenames for the EXTRACT statement dynamically by using
virtual columns. In the following example, the sales data is now held in separate CSV files, one for each
state:

Using a virtual column to partition data


@salesData =
EXTRACT
state string,
productID int,
numSold int,
pricePerItem int,
filename string
FROM "/SalesData/{filename}.csv"
USING Extractors.Csv();

@nyData =
SELECT state, SUM(numSold * pricePerItem) AS value
FROM @salesData
WHERE filename == "NY"
GROUP BY state;

The U-SQL runtime applies the predicate in the WHERE clause of the SELECT statement to the EXTRACT
statement, effectively eliminating all irrelevant data from the input before it is read. This job now only
retrieves and processes data for New York.

Note: Hint: If you are aiming to achieve maximum efficiency with unstructured data files,
you should implement manual partitioning, and use a file format that supports parallel
processing by being splittable.

Partitioning structured data


The ADLA catalog is designed to hold very large tables. As a consequence, all tables must be created with
a clustered index, and the data must be distributed according to a specified key. The purpose of these two
items is to improve lookup times, and to spread the load evenly across physical I/O devices. However, the
onus is on the designer to specify the correct columns for indexes and the appropriate distribution
strategy. Ideally, you should select a cluster key that minimizes the cost of the most frequent and
expensive queries, especially those that perform range scans. For example, in a table that holds sales data,
if you frequently look up sales by product ranges (or even individual products), then consider clustering
on the product ID column. If necessary, you can cluster the data by multiple columns.

You also have the option to partition tables. A partition ensures that related data is held in the same
locality in the catalog. Careful partition design can eliminate significant overhead when performing
queries by reducing the amount of I/O required.

For example, consider the sales data scenario from the previous section. You could store the sales data in
a table in the catalog rather than as a set of CSV files. The following CREATE TABLE statement shows one
possible implementation of this table:

Creating a table for holding sales data


CREATE TABLE IF NOT EXISTS SalesData.dbo.Sales
(
State string,
ProductID int,
NumSold int,
PricePerItem int,
INDEX SalesIdx CLUSTERED(ProductID ASC)
DISTRIBUTED BY HASH(ProductID)
);

Now consider the following ADLA job. This job generates a report summarizing the total value of all sales
for all products sold in New York:

Finding the value of all sales by product sold in New York


@salesData =
SELECT State,
ProductID,
NumSold,
PricePerItem
FROM SalesData.dbo.Sales;

@nySales =
SELECT ProductID, SUM(NumSold * PricePerItem) AS TotalSalesValue
FROM @salesData
WHERE State == "NY"
GROUP BY ProductID;

OUTPUT @nySales
TO "/NYSales.csv"
USING Outputters.Csv(outputHeader: true, quoting: false);

The index over the ProductID column helps to optimize the GROUP BY clause in the query run by this job,
but the data is potentially spread across the entire database. To satisfy this query, the U-SQL runtime will
need to examine the entire table.

Partitioning the data by state can alleviate the situation. The version of the table shown here has the same
set of columns and index as the previous example, but the data is partitioned by state. This ensures that
the data for a given state is held in the same partition in the database. In this case, performing the
preceding query will only require that the U-SQL runtime retrieves data from the partition holding data
for New York; it can ignore all other partitions.

Partitioning sales data by state


CREATE TABLE IF NOT EXISTS SalesData.dbo.Sales
(
    State string,
    ProductID int,
    NumSold int,
    PricePerItem int,
    INDEX SalesIdx
    CLUSTERED(ProductID ASC)
    PARTITIONED BY (State)
    DISTRIBUTED BY HASH(ProductID)
);
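
When you partition a table like this, the partitions themselves must exist before you can load data into
them. The following is a minimal sketch showing how you might add partitions and then populate the table;
the partition values ("NY" and "NJ") are illustrative assumptions, and the @salesData rowset is assumed to
contain the State, ProductID, NumSold, and PricePerItem columns extracted from the source files:

// Create the partitions before loading any data (the values shown are examples)
ALTER TABLE SalesData.dbo.Sales
ADD IF NOT EXISTS PARTITION ("NY"), PARTITION ("NJ");

// Load the data; each row is stored in the partition that matches its State value.
// Rows whose State does not match an existing partition cause the job to fail unless
// you add an ON INTEGRITY VIOLATION clause to the INSERT statement.
INSERT INTO SalesData.dbo.Sales
SELECT State, ProductID, NumSold, PricePerItem
FROM @salesData;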

Understanding vertices and optimization


As part of the compilation process of a job, the
U-SQL runtime attempts to determine the most
efficient way to run the job. There are two key
performance objectives:
 How best to run a job with the minimum
amount of effort. For example, if the data is
partitioned, and the partitioning strategy
suits the nature of the query, the optimizer
will likely use this information to quickly
home in on the partitions containing the
data required, and discard data that is not
located in these partitions.

 How best to spread the effort required across resources to minimize the response time. To
continue the previous example, it might be possible to retrieve and process rows independently,
generate a set of intermediate results, and then combine these results together to generate the final
output (much like the map/reduce strategy of systems such as Hadoop). The optimizer will determine
whether it’s possible to parallelize the tasks for each stage of the processing, and allocate each task to
a vertex for execution.

Note: This is a simplified description of the work performed by the U-SQL runtime. For
complex jobs, the runtime might have to perform several iterations of reading, processing, and
combining data as it processes and refines results.

A vertex represents the resources used to perform a given task. Each vertex runs using an Analytics Unit
(AU), and each AU provides the processing power of two CPU cores and 6 GB of RAM. Additionally, to
avoid runaway costs, a vertex is allotted a maximum of five hours of runtime before it is forcibly
terminated.

Note: The time and memory limits apply only for each individual vertex. A job that
comprises many stages, each running its own set of vertices, can therefore run for much longer
than five hours and consume more than 6 GB of memory in total.

If you find that a vertex terminates before completion, either by timing out or requiring too much
memory, you should look at the work it’s being asked to do. Typically, this is a result of skewed data;
perhaps there are a vast number of records in one partition compared to others, for example—in which
case, you might need to adjust your partitioning or distribution strategies. It might also be necessary to
rephrase your job to break up the processing into smaller, less resource-hungry steps.

For more advice, see:

Resolve data-skew problems by using Azure Data Lake Tools for Visual Studio
https://aka.ms/Umgfek

The U-SQL optimizer depends on several sources of information when deciding how best to run a job.
These sources include the structure of the data (the optimizer can generate an execution plan more easily
for a table in the ADLA catalog than it can for a data file), and estimates of the amount of data likely to be
retrieved and processed, based on the statistics for each table. These statistics are not maintained
automatically, so if the distribution of data in a table changes significantly, you should regenerate them
by using the UPDATE STATISTICS command.
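
As a sketch of what this looks like in a U-SQL script (the statistics name and the choice of column shown
here are illustrative assumptions, and the exact form of the UPDATE STATISTICS statement should be
checked against the topic linked below):

// Create statistics over the State column of the Sales table (the name is an example)
CREATE STATISTICS IF NOT EXISTS SalesStateStats
ON SalesData.dbo.Sales (State) WITH FULLSCAN;

// Regenerate the statistics after the data distribution has changed significantly
// (the name-ON-table form shown here is an assumption; see the linked topic for the full syntax)
UPDATE STATISTICS SalesStateStats ON SalesData.dbo.Sales;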

For more information, see:


UPDATE STATISTICS (U-SQL)
https://aka.ms/Xp0lmw

Examining jobs and vertices


When you use Visual Studio to run a U-SQL job, you are presented with the Job View window that shows
how the job has been divided up into stages. There is typically an extraction stage at the start of the job
during which the job identifies and fetches the data to be processed, possibly a set of processing stages,
and finally an aggregation stage that combines the results of the processing. There might also be many
other intermediate stages, depending on the complexity of your job. The Job View summarizes the
number of vertices, the total processing time, and the amount of I/O performed by each stage.

You drill into each stage to view the following information:

 Vertex Execution View. This view contains a series of bar charts illustrating the time spent creating,
queuing, and running each vertex. The runtime for each vertex is arguably the item of most
significance. You also use this view to examine the relationships between vertices. For example, one
vertex (referred to as a downstream vertex) might depend on the work performed by another
(referred to as an upstream vertex). If you notice that one vertex is taking much longer than others,
this could be the result of data skew.
 Stage Scatter View. This view displays the amount of I/O performed by each vertex, and the time
spent performing this I/O. Again, this is a useful tool for spotting skewed data.

 Vertex Operator View. Use this view to examine the way in which the various U-SQL operators were
called to process the data in that stage. An operator is an item such as an extractor, outputter, or
aggregator (including the user-defined implementations described in Lesson 1), and also other
features such as processors, appliers, combiners, and reducers (described later in this lesson).

For more information, see:

Use Job Browser and Job View for Azure Data Lake Analytics jobs
https://aka.ms/Kjg61l

Demonstration: Partitioning data in the ADLA catalog to optimize a U-SQL


job
In this demonstration, you will see how to:

 Partition data in the ADLA catalog.

 Assess the impact of incorrect data partitioning.


 Partition data in the ADLA catalog.

Creating user-defined processors


A user-defined processor lets you provide your
own logic for handling data on a row-by-row
basis. A user-defined processor extends the
Microsoft.Analytics.Interfaces.IProcessor abstract
base class. This base class defines the Process
method:

public abstract IRow Process(IRow input, IUpdatableRow output);
The U-SQL runtime calls the Process method once for each row of input. You can enumerate the columns in
the input row by using the same techniques that are available to outputters: retrieve the Schema property
of the input row and iterate through the collection of IColumn objects in this property. You perform your
processing using this data, and write the results to the output variable passed as the second parameter to
the Process method. The IUpdatableRow type provides the generic Set<> function that enables you to store
data in the output variable. The schema of the output variable does not have to be the same as that of the
input variable (you can add or remove columns, for example), provided that it's the same for every row
being output.
The following example incorporates the “suspicious movement” code that was used in Lesson 1, to
analyze stock market prices. This example processes each movement and adds a flag to each row
indicating whether this price movement should be investigated.

Implementing a user-defined processor


public class FlagRowsForInvestigation : IProcessor
{
    public override IRow Process(IRow input, IUpdatableRow output)
    {
        // Retrieve the data for the row from the input
        string ticker = input.Get<string>("Ticker");
        int price = input.Get<int>("Price");
        DateTime quoteTime = input.Get<DateTime>("QuoteTime");

        // Mark suspicious movements with an "X" in the Suspicious flag column
        bool isSuspicious = SuspiciousMovement(ticker, price, quoteTime);
        string suspiciousFlag = isSuspicious ? "X" : "";

        // Create the output row, including the suspicious flag
        output.Set<string>("Ticker", ticker);
        output.Set<int>("Price", price);
        output.Set<DateTime>("QuoteTime", quoteTime);
        output.Set<string>("Suspicious", suspiciousFlag);

        // The value returned should be a read-only copy of the output row
        return output.AsReadOnly();
    }

    public static bool SuspiciousMovement(string ticker, int newPrice, DateTime quoteTime)
    {
        // Same code as earlier demonstration—not shown
    }
}

You use the PROCESS U-SQL command to invoke a user-defined processor. You provide a rowset
containing the data to be processed, and the definition of the expected result set, as shown here:

Calling a user-defined processor from a U-SQL script


REFERENCE ASSEMBLY CustomOperators;

// Retrieve the data from the StockPriceData.csv file in Data Lake Storage
@stockData =
    EXTRACT Ticker string,
            Price int,
            QuoteTime DateTime
    FROM "/StockPriceData.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Use the custom processor to find suspect stock price movements
@result =
    PROCESS @stockData
    PRODUCE Ticker string,
            Price int,
            QuoteTime DateTime,
            Suspicious string
    USING new CustomOperators.FlagRowsForInvestigation();

// Save the results
OUTPUT @result
TO "/Movements.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

Each row produced by the processor will either have an “X” in the Suspicious column, or it will be blank.
Those with an “X” are suspect:
Ticker  Price  QuoteTime                            Suspicious

...
SKGG    34     2017-08-27T15:46:05.7051528+01:00
HXA2    11     2017-08-27T15:45:05.7051528+01:00
BW5P    59     2017-08-27T15:45:05.7051528+01:00    X
YVGA    73     2017-08-27T15:49:05.7051528+01:00
SUR1    27     2017-08-27T15:48:05.7051528+01:00
GHRC    82     2017-08-27T15:48:05.7051528+01:00
DBWM    14     2017-08-27T15:48:05.7051528+01:00
1PZ2    25     2017-08-27T15:45:05.7051528+01:00    X
SWDN    7      2017-08-27T15:45:05.7051528+01:00
CU14    26     2017-08-27T15:46:05.7051528+01:00
KV5X    39     2017-08-27T15:45:05.7051528+01:00
UYP5    2      2017-08-27T15:48:05.7051528+01:00    X
ALNY    91     2017-08-27T15:45:05.7051528+01:00    X
Q3DH    92     2017-08-27T15:46:05.7051528+01:00
LJU5    11     2017-08-27T15:46:05.7051528+01:00
...

IMPORTANT: The code inside a user-defined processor is opaque to the optimization
mechanisms implemented by ADLA. This reduces the ability of ADLA to scale jobs. Therefore, you
should only consider writing a user-defined processor if it’s not possible to perform the same
tasks using U-SQL. You should ensure that your code is as efficient as possible and maximizes the
potential for concurrency.

For more information about creating user-defined processors, see:


Use user-defined processors
https://aka.ms/N9krik

Creating user-defined appliers


User-defined appliers work with the CROSS APPLY operator in U-SQL SELECT statements. The CROSS APPLY
operator combines two rowsets, joining rows from the first rowset (the left rowset) with the corresponding
rows in the second rowset (the right rowset) across a common set of column values. The CROSS APPLY
operator in U-SQL is limited to working with two types of rowsets for the right side: rowsets generated
using the EXPLODE operator (discussed in Module 5), and rowsets generated by using a user-defined applier.

A user-defined applier extends the Microsoft.Analytics.Interfaces.IApplier abstract base class. This class
provides the Apply method that you should override in your own code:

public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)


The Apply method is called once for each row of input from the left rowset. You use this data to create an
output row, in a manner similar to that of a user-defined processor. The primary difference is that this
method is called iteratively as part of an enumeration process by the U-SQL runtime, and this method
must return an object that implements the IEnumerable interface. The simplest strategy for performing
this task is to use the C# yield operator to pass each row back to the U-SQL runtime as it is generated.

The following example shows an applier for the stock market scenario. The input rows contain the ticker,
current price, and quote time. The applier finds the opening price for the stock (code not shown) and the
percentage difference between the opening price and the current price. The output row generated
contains the ticker, opening price, and percentage change in price. You then use this data in a U-SQL job
with the CROSS APPLY operator to join each stock price change with its opening price and percentage
change. An analyst sees at a glance whether a stock is performing poorly or well.

Creating a user-defined applier


public class GetStockAnalytics : IApplier
{
    // Generate a row that contains the opening price and percentage change in price
    // for a stock item, identified by the ticker in the input row
    public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
    {
        // Retrieve the ticker, price, and quote time from the input row
        string ticker = input.Get<string>("Ticker");
        int price = input.Get<int>("Price");
        DateTime quoteTime = input.Get<DateTime>("QuoteTime");

        // Find the opening price for this stock item
        int openingPrice = getOpeningPrice(ticker, price, quoteTime);

        // Calculate the percentage change
        int priceDiff = price - openingPrice;
        double percentChange = ((double)priceDiff / openingPrice) * 100;

        // Generate and return the new row
        output.Set<string>("Ticker", ticker);
        output.Set<int>("OpeningPrice", openingPrice);
        output.Set<double>("PercentChange", percentChange);
        yield return output.AsReadOnly();
    }

    // The getOpeningPrice helper method looks up the opening price for the stock (code not shown)
}

You use this applier in a U-SQL job like this:

Running an applier using the CROSS APPLY operator in a U-SQL script


REFERENCE ASSEMBLY CustomOperators;

// Retrieve the data from the StockPriceData.csv file in Data Lake Storage
@stockData =
    EXTRACT Ticker string,
            Price int,
            QuoteTime DateTime
    FROM "/StockPriceData.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

// Use the custom applier to add the opening price and percentage price change to the data
@result =
    SELECT S.Ticker, S.Price, S.QuoteTime, OpeningPrice, PercentChange
    FROM @stockData AS S
    CROSS APPLY
    USING new CustomOperators.GetStockAnalytics() AS PriceData(Ticker string, OpeningPrice int, PercentChange double);

// Save the results
OUTPUT @result
TO "/PriceChanges.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

The results look similar to this:

Ticker  Price  QuoteTime                            OpeningPrice  PercentChange

2T0Y    0      2017-08-28T11:57:05.7051528+01:00    11            -100
VZXG    1      2017-08-28T12:06:05.7051528+01:00    3             -66.666666666666657
DBZG    8      2017-08-28T12:07:05.7051528+01:00    14            -42.857142857142854
UH1P    3      2017-08-28T11:44:05.7051528+01:00    58            -94.827586206896555
FFAU    1      2017-08-28T11:41:05.7051528+01:00    7             -85.71428571428570
H3XQ    35     2017-08-28T11:40:05.7051528+01:00    48            -27.08333333333333
CU14    0      2017-08-28T11:38:05.7051528+01:00    26            -100
HXA2    4      2017-08-28T12:09:05.7051528+01:00    11            -63.636363636363633
4U0Z    4      2017-08-28T12:16:05.7051528+01:00    51            -92.156862745098039
BW5P    4      2017-08-28T12:51:05.7051528+01:00    59            -93.220338983050837

For more information, see:

Use user-defined appliers


https://aka.ms/Wp87eu

Creating user-defined combiners


A user-defined combiner acts like a JOIN
operator, except that it performs joins that are
more complex than those supported by the
standard U-SQL JOIN feature. A custom
combiner is especially useful if you need to
coalesce nontabular data, such as that held in an
array or document structure, into a tabular result.

A user-defined combiner extends the Microsoft.Analytics.Interfaces.ICombiner abstract base class. This
class defines two overloads for a method named Combine:

public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)

public override IEnumerable<IRow> Combine(List<IRowset> inputs, IUpdatableRow output)

The first version provides references to the rowsets on the left and right sides of the join operation. You
iterate through these rowsets to combine rows in whatever manner you require, and then return each row
one at a time using the C# yield statement. Note that enumerating either rowset is a forward-only action
that you can perform only once, so it’s common practice to cache the data from these rowsets in a list or
similar collection before processing them.

The second version takes the left and right rowsets as a list, but otherwise its purpose is the same as the
first version. It’s more common to override the first version and leave the second with its default
implementation in the ICombiner base class.

As an example, consider the following sample data containing a list of employees in an organization, the
departments in which they work, and the roles that they have fulfilled over time in that department. The
Role History column is a comma separated list of varying length, depending on the roles that the
employee has had:

EmployeeID  Name        Department  Role History
100043      Dgssbyfz    108         associate
100044      Xqqmjsfib   107         associate,employee
100045      Ixrtkcdj    105         associate
100046      Hlscwrcsld  106         associate,employee,teamleader
100047      Licxryj     102         associate,employee
100048      Rghkzuhkjv  105         associate,employee,teamleader,manager,vicepresident

A separate file lists department IDs and names:

Department  Name
100         Sales
101         Marketing
102         Accounts
103         Personnel
104         Engineering
105         Purchasing
106         Manufacturing
107         Design
108         Product Support
109         Customer Liaison

You can join the Employee and Department data quite easily across the department ID column. However,
suppose you want to generate a report that lists each employee and each of their roles in a department,
in a tabular format like this:

EmpID   EmpName     DeptName         Role

100043  Dgssbyfz    Product Support  associate
100044  Xqqmjsfib   Design           associate
100044  Xqqmjsfib   Design           employee
100045  Ixrtkcdj    Purchasing       associate
100046  Hlscwrcsld  Manufacturing    associate
100046  Hlscwrcsld  Manufacturing    employee
100046  Hlscwrcsld  Manufacturing    teamleader


This type of transformation is not easy to generate from the specified input formats using the built-in U-
SQL operators. However, this is the type of operation for which you can create a user-defined combiner.
The following code shows one possible solution:

Creating a user-defined combiner


public class FindDepartmentRoles : ICombiner
{
    // Combine the data in both rowsets to produce a list of employees and the names of the
    // departments in which they have worked
    // The left input data contains the employee ID, employee name, department ID, and comma
    // separated list of roles the employee has performed in that department
    // The right input data contains the department name and department ID
    // The output data should contain employee ID, employee name, department name, and role
    // A single employee will likely generate multiple rows of output
    public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
    {
        // Read the right rowset containing department details into a local List collection
        // (you can only enumerate an IRowset collection once, and we need to perform
        // multiple iterations, so the data must be cached locally)
        var rightRowset = (from row in right.Rows
                           select new
                           {
                               deptID = row.Get<int>("DepartmentID"),
                               deptName = row.Get<string>("DepartmentName")
                           }).ToList();

        // Join the rows in each collection across the Department ID column
        foreach (var leftRow in left.Rows)
        {
            // Find the name for the department
            var department = (from deptInfo in rightRowset
                              where deptInfo.deptID == leftRow.Get<int>("DepartmentID")
                              select new { id = deptInfo.deptID, name = deptInfo.deptName }).FirstOrDefault();

            // Output the employee and department role details
            if (department != null)
            {
                // Split the comma separated list of roles in the employee data into individual values
                string[] rolesForEmployee = leftRow.Get<string>("Roles").Split(',');

                // Iterate through this list of roles for the employee and output each one in turn
                foreach (string role in rolesForEmployee)
                {
                    output.Set<int>("EmpID", leftRow.Get<int>("EmployeeID"));
                    output.Set<string>("EmpName", leftRow.Get<string>("EmployeeName"));
                    output.Set<string>("DeptName", department.name);
                    output.Set<string>("Role", role);
                    yield return output.AsReadOnly();
                }
            }
        }
    }
}

The following U-SQL script shows how to call this combiner using the COMBINE statement. Note that the
ON clause still specifies how to join data in both rowsets, but the FindDepartmentRoles combiner
performs the actual join operation.

Calling the FindDepartmentRoles user-defined combiner from U-SQL


REFERENCE ASSEMBLY CustomOperators;

// Retrieve the input data from the TSV files
@employeeWorkHistoryData =
    EXTRACT EmployeeID int,
            EmployeeName string,
            DepartmentID int,
            Roles string
    FROM "/EmployeeWorkHistory.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1, silent:true);

@departmentDetails =
    EXTRACT DepartmentID int,
            DepartmentName string
    FROM "/Departments.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1);

// Use the custom combiner to join the nonrelational employeeWorkHistoryData with departmentDetails
@empsRoles =
    COMBINE @employeeWorkHistoryData AS E WITH @departmentDetails AS D
    ON E.DepartmentID == D.DepartmentID
    PRODUCE EmpID int,
            EmpName string,
            DeptName string,
            Role string
    USING new CustomOperators.FindDepartmentRoles();

// Save the results
OUTPUT @empsRoles
TO "/EmployeeRoles.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

Optimizing user-defined combiners


To maximize the potential for parallelism in a combiner, you provide information that enables the U-SQL
runtime to partition the data passed in to the left and right parameters. To do this, you apply the
SqlUserDefinedCombiner attribute to the combiner class with one of the following CombinerMode values
(a sketch showing how to apply the attribute follows this list):

 Full. This value indicates that the combiner performs a full join (Cartesian product), and that every
row in the left rowset will be joined with every row in the right rowset. This causes the entire left and
right rowsets to be passed to a single instance of the combiner, and results in the lowest degree of
parallelism.

 Inner. This value specifies that the combiner performs an inner join. Rows in the left rowset will only
be joined with the corresponding rows in the right rowset. Depending on the data sources, the U-SQL
runtime uses this information to partition the rowsets and arrange for each partition to be processed
in parallel.

 Left. This value indicates that the combiner implements a left outer join, and that all rows in the left
rowset will be utilized by every instance of the combiner, even if there are no corresponding rows in
the right rowset.
 Right. This value specifies that the combiner performs a right outer join, and that each instance of the
combiner should be passed for all of the data in the right rowset.
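
For example, the following sketch shows how you might apply the attribute to the FindDepartmentRoles
combiner described earlier, assuming that inner join semantics are appropriate for that data:

using Microsoft.Analytics.Interfaces;

// Indicate that this combiner performs an inner join, so the U-SQL runtime can
// partition the left and right rowsets and process the partitions in parallel
[SqlUserDefinedCombiner(Mode = CombinerMode.Inner)]
public class FindDepartmentRoles : ICombiner
{
    // Combine method as shown earlier in this lesson (not repeated here)
}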


For detailed information on creating and using combiners, see:

Use user-defined combiners


https://aka.ms/Mlm3ck

Creating user-defined reducers


A reducer operates on data when you read it in.
You use a reducer to filter data and generate
custom groupings, and then pass these reduced
rowsets on to U-SQL for further processing. You
use the REDUCE statement in a U-SQL script to
divide the data into groups—the reducer
processes the rows in each group to determine
which rows to include or discard.

A user-defined reducer extends the Microsoft.Analytics.Interfaces.IReducer abstract base class, which
defines the Reduce method as follows:

public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)

The pattern is similar to that of combiners and appliers; the input variable contains the original rows, and
you filter (reduce) or transform these rows, write them to the output variable, and return each instance of
the output variable using the C# yield statement. There is, however, an important difference in the way in
which the data is presented by the U-SQL runtime to the reducer. To optimize the reduction process, the
U-SQL runtime will break the data down into groups, according to the values in a column or list of
columns that you specify at runtime. Each instance of the reducer receives the data for a single group.
This approach enables the U-SQL runtime to parallelize the process; each group can be handled by a
separate vertex. When you iterate through the data in the input rowset, you should therefore remember
that the rowset will contain the data for a single group only.

The following example shows a reducer that finds the number of employees for each role in each
department. It’s assumed that the data is grouped by department (you use the REDUCE statement, shown
later, to do this), so each input rowset will contain the data for a single department:

Creating a user-defined reducer


public class ReduceByRole : IReducer
{
    // Reduce the input rowset to summarize the number of employees in each role in each department
    // The data in the input rowset contains the employee ID, employee name, department ID, and comma
    // separated list of roles that the employee has performed in that department
    // For each department, return:
    //   DepartmentID int,
    //   NumberOfAssociates int,
    //   NumberOfEmployees int,
    //   NumberOfTeamLeaders int,
    //   NumberOfManagers int,
    //   NumberOfVicePresidents int,
    //   NumberOfPresidents int
    public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
    {
        Dictionary<string, int> roles = new Dictionary<string, int>();
        roles["associate"] = 0;
        roles["employee"] = 0;
        roles["teamleader"] = 0;
        roles["manager"] = 0;
        roles["vicepresident"] = 0;
        roles["president"] = 0;

        var groupedRows = (from row in input.Rows
                           select new
                           {
                               dept = row.Get<int>("DepartmentID"),
                               roles = row.Get<string>("Roles")
                           }).ToList();

        foreach (var row in groupedRows)
        {
            // Parse the comma separated list of roles into individual values
            string[] rolesForEmployee = row.roles.Split(',');

            // Iterate through this list of roles for the employee and aggregate the role counts
            foreach (string role in rolesForEmployee)
            {
                roles[role]++;
            }
        }

        // Output the aggregate results
        output.Set<int>("DepartmentID", groupedRows.First().dept);
        output.Set<int>("NumberOfAssociates", roles["associate"]);
        output.Set<int>("NumberOfEmployees", roles["employee"]);
        output.Set<int>("NumberOfTeamLeaders", roles["teamleader"]);
        output.Set<int>("NumberOfManagers", roles["manager"]);
        output.Set<int>("NumberOfVicePresidents", roles["vicepresident"]);
        output.Set<int>("NumberOfPresidents", roles["president"]);
        yield return output.AsReadOnly();
    }
}

You run the reducer from a U-SQL script by using the REDUCE statement. The following code sorts
employee data by department, and then splits the data into groups based on the department values. Each
group is passed to an instance of the ReduceByRole reducer that returns the department ID and the
number of instances of each role in that department.

Calling the ReduceByRole reducer from U-SQL


REFERENCE ASSEMBLY CustomOperators;

// Retrieve the input data from the TSV file
@employeeWorkHistoryData =
    EXTRACT EmployeeID int,
            EmployeeName string,
            DepartmentID int,
            Roles string
    FROM "/EmployeeWorkHistory.tsv"
    USING Extractors.Tsv(skipFirstNRows: 1, silent:true);

// Use the custom reducer to find the number of records for each role in each department
@rolesByDepartment =
    REDUCE @employeeWorkHistoryData
    PRESORT DepartmentID
    ON DepartmentID
    PRODUCE DepartmentID int,
            NumberOfAssociates int,
            NumberOfEmployees int,
            NumberOfTeamLeaders int,
            NumberOfManagers int,
            NumberOfVicePresidents int,
            NumberOfPresidents int
    USING new CustomOperators.ReduceByRole();

// Save the results
OUTPUT @rolesByDepartment
TO "/DepartmentRoleSummary.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);

The results look like this:

DepartmentID  NumberOfAssociates  NumberOfEmployees  NumberOfTeamLeaders  …
100           30092               15673              1136
101           30039               15568              1134
102           30101               15620              1172
103           30110               15662              1157
104           29966               15482              1128
105           29999               15536              1187
106           30025               15504              1170
107           29675               15342              1132
108           29888               15466              1112
109           30105               15514              1157

Optimizing user-defined reducers


As with user-defined aggregators, you can inform the U-SQL runtime that it is safe to read and process the
data in parallel by applying the SqlUserDefinedReducer(IsRecursive = true) attribute to the reducer class.
However, you should only do this if the groups processed by the reducer are independent of each other.
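
For example, the following sketch shows how you might apply the attribute to the ReduceByRole reducer
described earlier, assuming that the reducer meets the requirements for recursive processing described in
the topic linked below:

using Microsoft.Analytics.Interfaces;

// Indicate that the reducer can be applied recursively, enabling the U-SQL runtime
// to process independent groups of rows in parallel
[SqlUserDefinedReducer(IsRecursive = true)]
public class ReduceByRole : IReducer
{
    // Reduce method as shown earlier in this lesson (not repeated here)
}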

For more information, see:

Use user-defined reducers


https://aka.ms/olpxj9

Demonstration: Creating and using user-defined operators


In this demonstration, you will see how to create and run a:

 User-defined processor

 User-defined applier
 User-defined combiner

 User-defined reducer

Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                                      Answer

True or false? A job that uses many vertices running in parallel is always
quicker than a job that performs the same work but uses fewer vertices.

Lesson 3
Managing jobs and protecting resources
ADLA jobs utilize resources that might be expensive, or access data that should remain confidential.
Therefore, you need to protect the data and resources available to an ADLA account to prevent unauthorized
use and access. This lesson describes how to perform these tasks.

Note: Many of the concepts that apply for protecting ADLA jobs are similar to those of
ADLS. For example, you protect an ADLA account at the network level by setting up firewall rules,
and you use Role-Based Access Control (RBAC) to control how users interact with an account.

Lesson Objectives
After completing this lesson, you will be able to:

 Authenticate and authorize users for an ADLA account.


 Manage the resources that are available to ADLA jobs.

 Audit ADLA jobs.

 Monitor the performance of ADLA jobs.

Authenticating and authorizing users


Like ADLS, ADLA utilizes Azure Active Directory
(Azure AD) to authenticate users. You implement
end-user authentication and service-to-service
authentication if you are building applications
that integrate with ADLA. These options are
described in Module 4: Managing Big Data in
Azure Data Lake Store.
For more information, see:

Azure Data Lake authentication options


for .NET
https://aka.ms/Kriq0n

As with ADLS, you authorize access to resources in ADLA at two levels—by using RBAC to control the
operations that users can perform, and by using Access Control Lists (ACLs) to specify which files and
catalogs they can use.

You assign users to roles using the Access Control blade for the ADLA account in the Azure portal. The set
of roles, and the permissions that they enable, are the same as for other services in Azure (Owner,
Contributor, Reader, and so on). For more information about RBAC, see:

Get started with Role-Based Access Control in the Azure portal


https://aka.ms/X6u5vf
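
If you prefer to script role assignments rather than use the Access Control blade, you could use the
generic RBAC cmdlets in the AzureRM module. The following is a minimal sketch; the user name and the IDs
in the scope are placeholders that you would replace with your own values:

# Assign the Contributor role for an ADLA account to a user (placeholder values)
New-AzureRmRoleAssignment -SignInName "user@adatum.com" `
    -RoleDefinitionName "Contributor" `
    -Scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataLakeAnalytics/accounts/<adla-account>"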

ADLA uses a subset of the ACLs available to other services for controlling access to resources in catalogs.
Specifically, you use the Data Explorer for the ADLA account in the Azure portal to assign read/write
permissions at the catalog level; execute permission is not applicable to a catalog and is not an available
option. Note that these permissions apply across an entire catalog. You cannot currently grant or deny
access to individual objects in a catalog at the user level.

The ADLA Overview blade includes the Add User Wizard that you use to step through the tasks of adding
a user to the account, assigning roles, and setting file permissions in a systematic manner.

Note: ADLA is provided with a simple software-based firewall that enables you to restrict
access to users connecting from known locations only. The firewall is disabled by default, but you
can enable it by using the setting in the ADLA blade in the portal. If you have other Azure services
that access the ADLA account, you should also enable access to Azure services through the
firewall (this is a separate switch in the same blade).
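
You can also script firewall rules by using the Data Lake Analytics cmdlets in the AzureRM module. The
following is a minimal sketch, assuming that the firewall has already been enabled as described above;
the account name and IP address range are placeholders:

# Allow connections to the ADLA account from a known range of client IP addresses (placeholder values)
Add-AzureRmDataLakeAnalyticsFirewallRule -Account "<adla-account>" `
    -Name "OfficeNetwork" `
    -StartIpAddress "131.107.0.1" `
    -EndIpAddress "131.107.0.254"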

Managing resources for jobs


An ADLA account is associated with a pricing tier,
which specifies the number of AU hours available
per month. If you have multiple users submitting
jobs to an ADLA account, you need to ensure
that one user or job does not monopolize the
resources available because this could unfairly
limit other users and their jobs. You achieve this
by defining resource policies.

Note: If your account exceeds the number of AU hours indicated by the pricing tier, your
jobs will continue to run, but your account will be charged at the rate of 1.5 USD/AU.

ADLA supports two types of policies:


 Account level policies, which are applied to all jobs run by using an account.

 Job level policies, which apply to jobs run by specific users of the account.

You create a policy using the Properties section in the ADLA blade, in the Azure portal. An account level
policy enables you to specify the following (a PowerShell-based sketch follows this list):

 The maximum number of AUs that an account can use when submitting jobs. These AUs will be
shared by all users running jobs concurrently. For example, if you set the limit to 200, a single user
could execute a job that consumes 200 AUs, or 10 users could run jobs that consume 20 AUs each.

 The maximum number of concurrent jobs that an account can execute. When this limit is reached,
jobs are queued until other jobs have finished running.

 The number of days to save U-SQL job resources (such as scripts and job graphs) in the ADLS account
associated with the ADLA account.
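
If you need to script these account level settings rather than configure them in the portal, the Data Lake
Analytics cmdlets expose them as parameters on the account. The following is a minimal sketch, assuming the
AzureRM module; the account name and values are placeholders, and you should check the parameter names
against the documentation linked at the end of this topic (for example, older versions of the module use
-MaxDegreeOfParallelism instead of -MaxAnalyticsUnits):

# Limit the account to 200 AUs and 20 concurrent jobs, and keep job resources for 14 days (placeholder values)
Set-AzureRmDataLakeAnalyticsAccount -Name "<adla-account>" `
    -MaxAnalyticsUnits 200 `
    -MaxJobCount 20 `
    -QueryStoreRetention 14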

A job level policy is applied to specific users or groups in an ADLA account. It determines:

 The maximum number of AUs that a single job can consume.

 The highest priority that a user can specify for their jobs.

For examples of how to create and apply resource policies, see:

Managing your Azure Data Lake Analytics compute resources (overview)


https://aka.ms/Yqgqo6

Auditing jobs
The data used by an ADLA job might be sensitive.
Additionally, the resources consumed by a job
can be expensive if the job is long-running.
Therefore, it’s vital that you maintain an audit
trail of jobs, so that you can trace back to the
sources of any security and resource issues.

You capture audit information by enabling the diagnostic logs for the account. You can send
audit records to a storage account, stream them to an event hub, or send them to Azure Log Analytics.
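
As a sketch of how you might enable this from PowerShell (assuming the AzureRM module; the resource IDs
are placeholders, and Audit and Requests are the log categories exposed by ADLA):

# Send the ADLA Audit and Requests log categories to a storage account (placeholder resource IDs)
Set-AzureRmDiagnosticSetting `
    -ResourceId "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.DataLakeAnalytics/accounts/<adla-account>" `
    -StorageAccountId "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>" `
    -Enabled $true `
    -Categories Audit, Requests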

When you have enabled auditing, you can view the audit data using the Activity Log section of the ADLA
blade in the Azure portal. You can also download the data directly from the storage account in which it is
held, and analyze it locally.
For more information, see:

Accessing diagnostic logs for Azure Data Lake Analytics


https://aka.ms/Q12w9x

Monitoring jobs
The ADLA monitoring tools provide a low-level
insight into the details of jobs, both as they are
running and after they have completed. The
ADLA blade contains three utilities:

 Job Management. This utility displays the


status of current and previous jobs. If you
click a job, you will see the job graph. This is
similar to the corresponding feature in Visual
Studio, except that you cannot drill into the
details. You use the pane on the left of the
blade to view the inputs, outputs, and other
resources used by the job.

The Duplicate Script button opens an editor window with a copy of the U-SQL script for the job. You can
edit and save this script, and also resubmit it to run the job again.

If a job has failed, you will see the error that caused the failure. You click the error message to obtain
additional information.
 Job Insights. This utility displays a graph showing the number of jobs that have been submitted, and
indicating which ones failed, succeeded, or were cancelled. Currently running jobs are also included.
You will also see whether jobs are being queued before execution. A long queue could be the result
of a series of resource-hogging jobs currently being executed, or might be an indication that the
ADLA policies being applied are too limiting.

If you switch the graph to display Compute Hours, you will see how many AU hours have been consumed
recently.

 Metrics. This utility gives quick access to a set of graphs showing the numbers of jobs or compute
hours for tasks that have succeeded, failed, or been cancelled.

For more information about monitoring jobs using the Azure portal, see:

Troubleshoot Azure Data Lake Analytics jobs using Azure portal

https://aka.ms/Hvh5lq

Verify the correctness of the statement by placing a mark in the column to the right.

Statement                                                                      Answer

True or false? If you exceed the number of AU hours permitted by the pricing
tier for an ADLA account in any given month, you will not be able to run any
more jobs until the following month.

Lab: Implementing custom operations and monitoring performance in Azure Data Lake Analytics
Scenario
You work for Adatum as a data engineer, and you have been asked to build a traffic surveillance system
for traffic police. This system must be able to analyze significant amounts of dynamically streamed data,
captured from speed cameras and automatic number plate recognition (ANPR) devices, and then
crosscheck the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that reads vehicle registration
plates.

For this phase of the project, you are going to use ADLA to perform a range of analyses on data that has
been captured and saved into the traffic surveillance system by using tools such as Azure Stream
Analytics.

Objectives
After completing this lab, you will be able to:

 Use a custom extractor, read JSON file data, and use a custom outputter to XML.

 Optimize a table in the ADLA catalog.

 Implement a custom processor in ADLA.

 Use existing analytics, developed in R, in an ADLA solution.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Estimated Time: 90 minutes

Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin
Password: Pa55w.rd

Exercise 1: Use a custom extractor, read JSON file data, and use a custom
outputter to XML
Scenario
The Azure Stream Analytics job that you worked with in previous labs saves speed camera data to ADLS in
JSON format. You want to be able to read and analyze this data using ADLA. You also want to be able to
write a summary of the data to XML files for use by other tools.

In this exercise, you will create and use a custom extractor, to read JSON file data into ADLA, and a
custom formatter to output data to XML.

The main tasks for this exercise are as follows:

1. Deploy and test a JSON extractor

2. Deploy and test an XML outputter

 Task 1: Deploy and test a JSON extractor

 Task 2: Deploy and test an XML outputter

Results: At the end of this exercise, you will have deployed and tested a custom extractor, and deployed
and tested a custom outputter.

Exercise 2: Optimize a table in the ADLA catalog


Scenario
You want to save speed camera data to the ADLA catalog, but need to ensure that the data is structured
to optimize the analytics that your ADLA jobs perform. The most common types of query that are
currently run are primarily focused on identifying patterns related to speed camera locations, such as
average speeds at each camera location. Therefore, you want to ensure that tables in the ADLA catalog
are optimized for such queries.

In this exercise, you will first create a table in the ADLA catalog, then use a U-SQL job to analyze data at a
specific camera location; finally, you will redistribute the data in the catalog to optimize query
performance.

The main tasks for this exercise are as follows:

1. Create the SpeedData table in the catalog

2. Create a U-SQL job that analyzes data for a specific camera

3. Redistribute the data to optimize data retrieval

 Task 1: Create the SpeedData table in the catalog

 Task 2: Create a U-SQL job that analyzes data for a specific camera

 Task 3: Redistribute the data to optimize data retrieval

Results: At the end of this exercise, you will have created a table in the ADLA catalog, analyzed data in
this table, and redistributed data in the catalog for optimal retrieval.

Exercise 3: Implement a custom processor in ADLA


Scenario
You need to find the proportion of stolen vehicles passing each speed camera; this will then help you
identify the locations that are likely to be "criminal" hotspots, where incidences of vehicle theft are
particularly high.

You have stolen vehicle data covering eight years, organized in folders by year/month/day, with a total of
2,914 separate CSV files. The data contains the vehicle registration, date stolen, and date recovered (which
could be empty). The same vehicle could be reported stolen and recovered several times in these records.
Additionally, the police aren't always informed when a vehicle is recovered, so some records might have
an empty date recovered, even if the vehicle is no longer missing. This means that a vehicle could be
reported as stolen on several dates, but not recovered in the intervening period. The vehicle might be
recovered later. Therefore, to determine whether a vehicle should be considered as stolen, you need to
look at the most recent record in the history. If the vehicle has a recovery date, it is not missing, regardless
of what any previous records might infer. If no recovery date is shown on this record, the date stolen
should be used.

Because of these peculiarities in the data concerning stolen and recovery dates, it’s difficult to process this
data using regular U-SQL operators; you will, therefore, use a custom reducer that performs the necessary
magic.
In this exercise, you will first upload the dataset to ADLS. You will deploy and test the custom reducer, and
then perform some analyses to identify stolen vehicles in the speed camera data.

The main tasks for this exercise are as follows:

1. Preparation: upload stolen vehicle data to ADLS (using AzCopy and Adlcopy)
2. Examine and deploy a custom reducer

3. Test the custom reducer

4. Analyze the speed camera data to check for stolen vehicles

 Task 1: Preparation: upload stolen vehicle data to ADLS (using AzCopy and Adlcopy)

 Task 2: Examine and deploy a custom reducer

 Task 3: Test the custom reducer

 Task 4: Analyze the speed camera data to check for stolen vehicles

Results: At the end of this exercise, you will have uploaded speed camera data to ADLS, examined the
code in the custom reducer, deployed and tested the custom reducer, and then used the reducer to
attempt to identify stolen vehicles in the speed camera data.

Exercise 4: Use existing analytics, developed in R, in an ADLA solution


Scenario
You want to determine if there is any correlation between whether a vehicle is caught speeding and it
possibly being stolen. You are familiar with R, and want to use one of its analytics packages to help with
this analysis.

In this exercise, you will call an R script from a U-SQL job to determine if there is any correlation between
a vehicle being identified as speeding and being identified as being stolen; you will then repeat this
analysis using a Python script in a U-SQL job.

The main tasks for this exercise are as follows:

1. Determining correlations by using R

 Task 1: Determining correlations by using R

Results: At the end of this exercise, you will have used an R script in a U-SQL job.

Question: Why do you need to deploy your own custom extractor for JSON file data?

Question: Why is it important to optimize the table structure in your ADLA catalog?

Module Review and Takeaways


In this module, you learned how to:

 Incorporate custom features and assemblies into U-SQL.

 Optimize jobs to ensure efficient operations.


 Implement security to protect ADLA jobs and resources.

Module 7
Implementing Azure SQL Data Warehouse
Contents:
Module Overview 7-1 
Lesson 1: Introduction to SQL Data Warehouse 7-2 

Lesson 2: Designing tables for efficient queries 7-8 

Lesson 3: Importing data into SQL Data Warehouse 7-17 


Lab: Implementing SQL Data Warehouse 7-28 

Module Review and Takeaways 7-32 

Module Overview
This module describes how to utilize the power of Microsoft® Azure® SQL Data Warehouse to store and
analyze large volumes of data.

Objectives
By the end of this module, you will be able to:

 Describe the purpose and structure of Azure SQL Data Warehouse.


 Design tables to optimize queries performed by analytical processing.

 Use tools and techniques for importing data into SQL Data Warehouse at scale.

Lesson 1
Introduction to SQL Data Warehouse
Historically, companies had to invest heavily in infrastructure up front to store and process massive data
volumes. By using SQL Data Warehouse, companies can now store large volumes of data and process
them efficiently, in terms of cost and effort, without having to maintain an expensive infrastructure.

SQL Data Warehouse is a massively parallel processing (MPP) cloud-based platform that uses distributed
database technology. SQL Data Warehouse is one of the many pay-as-you-go services within Azure that
you can use to store and process large volumes of data, and produce rich analytics and insights very
quickly.

Lesson Objectives
By the end of this lesson, you should be able to:

 Explain what a SQL Data Warehouse is and how it works.


 Describe typical workloads for a SQL Data Warehouse.

 Create a new SQL Data Warehouse.

 Access a SQL Data Warehouse using various tools.

Overview of SQL Data Warehouse

What is SQL Data Warehouse?


SQL Data Warehouse is a massively parallel
processing (MPP) cloud-based distributed
database platform that stores and processes
large volumes of data and provides high-speed
query performance. SQL Data Warehouse stores
the data across many shared-nothing storage
and processing units.

There are four key elements of SQL Data


Warehouse:
 Control node

 Compute nodes

 Storage

 Data Movement Service

Control node—the control node receives SQL queries from users and optimizes them. All applications
and connections communicate via the control node because it is the user-facing part of SQL Data
Warehouse. In fact, the control node is a SQL Server® Database, so connecting to it is exactly like
connecting to a SQL Server Database. The control node receives a SQL query and converts it into multiple
SQL queries that execute on multiple compute nodes. The control node then coordinates the data
movement and computation required to execute those parallel queries over the distributed data.

Compute nodes—the compute nodes are the brains behind the SQL Data Warehouse. Compute nodes
receive the required data and the query from the control node, then execute the query in parallel with
other compute nodes and produce output that is sent back to the control node. The control node
aggregates all the results from various compute nodes and produces the results that are sent back to the
user. Each compute node is a SQL Server Database.

Storage—all data within SQL Data Warehouse is stored in Microsoft Azure Blob storage. Compute nodes
read and write data directly from Blob storage. In SQL Data Warehouse, storage and compute are
independent so they can be scaled up or down separately as per the user’s needs.

Data Movement Service (DMS)—this is an internal Windows® service that is not exposed to the end
user. DMS helps to move data between nodes so that queries execute in parallel.

For a detailed description of the components of SQL Data Warehouse, see:

What is SQL Data Warehouse?


https://aka.ms/Ol95vb

What is a Data Warehouse Unit (DWU)?


SQL Data Warehouse is highly scalable, and compute power can be scaled independently of storage; it's also
important to understand how this translates into cost. The DWU is a chargeable unit of measure that
provides CPU, memory, and IOPS to a SQL Data Warehouse. As you increase the number of DWUs allocated to a
SQL Data Warehouse, query execution performance improves linearly, but the cost also increases. For
detailed information on how the number of DWUs available to a data warehouse can affect performance, see:

Concurrency and workload management in SQL Data Warehouse


https://aka.ms/iiu18g
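
For example, you can change the performance level of an existing data warehouse by using the
Set-AzureRmSqlDatabase cmdlet (listed later in this lesson). The following sketch scales a data warehouse
to 1,000 DWUs, using the same database, server, and resource group names as the provisioning example that
appears later in this lesson:

# Scale the DWDB1 data warehouse to 1000 DWUs
Set-AzureRmSqlDatabase -RequestedServiceObjectiveName "DW1000" -DatabaseName "DWDB1" `
    -ServerName "DWSERVER1" -ResourceGroupName "DWRG1"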

Typical workloads of SQL Data Warehouse


SQL Data Warehouse and Microsoft Azure SQL
Database have many similarities—they can store
relational data and use the same form of
Transact-SQL. However, there are also many
differences. For example, the maximum database size in SQL Data Warehouse is unlimited (1 TB for
SQL Database), but the architecture is designed to handle a smaller number of concurrent queries
(32, compared to 6,400 for SQL Database), each of which operates on massive datasets. Therefore, it's
important to understand which workloads are suitable for SQL Data Warehouse and which ones are more
appropriate for SQL Database.

SQL Data Warehouse is suitable for big data scenarios that involve working with large datasets:

 Long running queries that analyze a large dataset (for example, big data).

 Short running queries that you use for reporting and generating aggregations.

 Importing large volumes of data via batch processes.

 Performing analytics over large volumes of historical data that doesn’t change.

SQL Database is typically preferred for OLTP-type scenarios, characterized as follows:

 Operational workloads that tend to have high frequency of reads and writes.

 Ingesting streaming datasets that incur a large number of small inserts.

 Datasets that require row-by-row processing.

 Large numbers of simple queries that return one row of data.

For further information, see:

Data warehouse workload


https://aka.ms/Kfzycu

Provisioning a SQL Data Warehouse


There are many ways to create a SQL Data
Warehouse—one of the easiest ways is to use the
Azure portal. You will need the following
information and an active Azure subscription to
create a SQL Data Warehouse:

Attribute Description

Database name A unique name for the SQL Data Warehouse within your database server.

Subscription Name of the active Azure subscription that will be billed.

Resource group Name of the resource group where the SQL Data Warehouse will reside—
this provides a way to group resources together in Azure so it’s easy to
locate them.

Source  Three options that could be used as a source are currently available:
 Blank database—creates a blank SQL Data Warehouse.
 Sample—creates a SQL Data Warehouse based on the
AdventureWorks database.
 Backup—creates a SQL Data Warehouse from an existing backup.

Server Name of the database server that contains the SQL Data Warehouse. You
could create a new server that hosts this or choose an existing database
server.

Location The location of the database server that hosts the SQL Data Warehouse.

Attribute Description

Collation The collation you use when you create the SQL Data Warehouse. The
default collation is SQL_Latin1_General_CP1_CI_AS. This cannot be
changed after the database is created.

Performance level This is the compute power required for running the SQL Data Warehouse.
It’s good practice to start with 400 DWUs, and then scale up or down as
required. If needed, you can increase this to improve performance.

You could also create a SQL Data Warehouse using PowerShell™ commands. Remember the following key
information when you create SQL Data Warehouses using PowerShell:

 You require PowerShell version 1.0.3 or higher.

 You must always set the Edition parameter to “DataWarehouse” to create a SQL Data Warehouse.

 The CollationName parameter is optional. If not specified, the SQL Data Warehouse will use the
default collation (SQL_Latin1_General_CP1_CI_AS).
 The MaxSizeBytes parameter is also optional. If not specified, the database maximum size is set as 10
GB.

For example, the following PowerShell command will create a new SQL Data Warehouse, called DWDB1,
on a server called DWSERVER1, in a resource group called DWRG1, and with 400 DWUs:

New-AzureRmSqlDatabase -RequestedServiceObjectiveName "DW400" -DatabaseName "DWDB1" `
    -ServerName "DWSERVER1" -ResourceGroupName "DWRG1" -Edition "DataWarehouse" `
    -CollationName "SQL_Latin1_General_CP1_CI_AS" -MaxSizeBytes 10995116277760

Accessing SQL Data Warehouse


You can access a SQL Data Warehouse in several
ways:

 Azure portal

 SQL Server Management Studio (SSMS)

 Visual Studio

 PowerShell

To connect to a SQL Data Warehouse using any


of these methods, you need to have the
following information available:

 An existing SQL Data Warehouse.

 The fully qualified name of the server that hosts the SQL Data Warehouse (you get this information
from the Azure portal).

 The username and password to access the SQL Data Warehouse.



Access SQL Data Warehouse using Azure portal


After you log in to the Azure portal, you use Query Editor (at the time of writing, this is a preview feature)
to connect and access the SQL Data Warehouse. The Azure portal is typically used for very specific, or
one-off, operations, such as reconfiguring performance levels. Other tools, such as PowerShell, are more
suitable for regular, repeatable, operations such as adding firewall rules or disaster recovery
configurations.

Access SQL Data Warehouse using SQL Server Management Studio


You need to install the SQL Server Management Studio (SSMS) tool on the PC from where you want to
access the SQL Data Warehouse. After you open the SSMS tool, you provide the following information to
connect to the SQL Data Warehouse:

 Server type: database engine.

 Server name: the fully qualified name of the SQL Data Warehouse server. This will be of the form
<server name>.database.windows.net.

 Authentication: either SQL Server Authentication or Active Directory Integrated Authentication. If


you select SQL Server Authentication, you require a username and password to connect.

Access SQL Data Warehouse using Visual Studio


You need Visual Studio 2013 or later (with the latest service packs) to access SQL Data Warehouse using
Visual Studio. You use SQL Server Object Explorer to connect to the SQL Data Warehouse, providing the
following information:

 Server type: database engine.


 Server name: the fully qualified name of the SQL Data Warehouse (<server
name>.database.windows.net).

 Authentication: either SQL Server Authentication or Active Directory Integrated Authentication. If
you select SQL Server Authentication, you require a username and password to connect.

Access SQL Data Warehouse using PowerShell


To access SQL Data Warehouse, you need Azure PowerShell version 1.0.3 or higher. The main SQL Data
Warehouse cmdlets are part of the AzureRM.Sql module. The following cmdlets are available specifically
for managing SQL Data Warehouse from PowerShell (an example follows the list):

 Get-AzureRmSqlDatabase

 Get-AzureRmSqlDeletedDatabaseBackup

 Get-AzureRmSqlDatabaseRestorePoints

 New-AzureRmSqlDatabase

 Remove-AzureRmSqlDatabase

 Restore-AzureRmSqlDatabase

 Resume-AzureRmSqlDatabase

 Select-AzureRmSubscription

 Set-AzureRmSqlDatabase

 Suspend-AzureRmSqlDatabase
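
For example, the following commands pause a SQL Data Warehouse to release compute resources, resume
it, and change its performance level. This is a minimal sketch that reuses the DWDB1, DWSERVER1, and
DWRG1 names from the earlier example; substitute the names of your own database, server, and resource
group:

Pausing, resuming, and scaling a SQL Data Warehouse with PowerShell

# Pause the data warehouse (compute is released, storage is retained)
Suspend-AzureRmSqlDatabase -ResourceGroupName "DWRG1" -ServerName "DWSERVER1" -DatabaseName "DWDB1"

# Resume the data warehouse when it is needed again
Resume-AzureRmSqlDatabase -ResourceGroupName "DWRG1" -ServerName "DWSERVER1" -DatabaseName "DWDB1"

# Scale the data warehouse to a different performance level (DWUs)
Set-AzureRmSqlDatabase -ResourceGroupName "DWRG1" -ServerName "DWSERVER1" -DatabaseName "DWDB1" `
    -RequestedServiceObjectiveName "DW1000"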

For more details about these cmdlets, see:

PowerShell cmdlets and REST APIs for SQL Data Warehouse


https://aka.ms/lt96q8

Question: Would you use SQL Data Warehouse for long running queries that analyze a large
dataset?

Demonstration: Creating and accessing a SQL Data Warehouse


In this demonstration, you will see how to:

 Provision a SQL Data Warehouse using the Azure portal.

 Connect to the SQL Data Warehouse using Visual Studio and SQL Server Management Studio.
 Access the SQL Data Warehouse using PowerShell.

Check Your Knowledge


Question

What is the default size of a SQL Data Warehouse created by using PowerShell, if a size is not specified?

Select the correct answer.

5 GB

10 GB

15 GB

20 GB

25 GB

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You can scale compute power independently of storage in a SQL Data Warehouse.

Lesson 2
Designing tables for efficient queries
SQL Data Warehouse is a very large scale distributed database system. The way you design your tables
directly affects query performance: choosing an appropriate data distribution strategy, and minimizing
data movement, are both vital to achieving good performance.

Lesson Objectives
By the end of this lesson, you should be able to:

 Describe how SQL Data Warehouse distributes data across databases.

 Choose an appropriate data distribution strategy that minimizes data movement.

 Explain the difference between clustered and nonclustered indexes.

 Create heap tables.


 Create columnstore indexes.

 Understand how table partitioning works.

Distributing data in SQL Data Warehouse


SQL Data Warehouse is a massively parallel
processing (MPP) database system—data within
each SQL Data Warehouse is spread across 60
underlying databases. These 60 databases are
referred to as “distributions”. As the data is
distributed, there is a need to organize the data
in a way that makes querying faster and more
efficient.

The basic issue is that, although distributing data helps to spread the load, it also leads to network
inefficiencies, increased I/O, and consequent
poor performance if common queries regularly
require the data warehouse to retrieve data from many databases. For example, if you regularly perform a
query that joins data across two tables, but the data to be joined is held in different databases, then the
data warehouse will have to fetch the data over the network. Therefore, you need to design your data
warehouse carefully to minimize the likelihood of requiring such “data movements”. This requires that you
fully understand how the data is structured, and how it’s likely to be used. Consider the following
scenarios:

 In the sales database for a global organization, the details of customers and the purchases that they
have made are most likely to be used in the same locality (for example, US East, US West, UK,
Australia, and so on). The system is unlikely to perform queries that require combining the details of
sales made in the US East territory with customers located in Australia with any frequency. Therefore,
it would make sense to organize the data according to this locality.

 Queries that report on products sold by the organization might often need to fetch the data for items
that are used together. For example, customers might frequently purchase hammers and nails
together, so it could make sense to ensure that the sales records for hammers and nails are stored in
the same database.
 The corporation performs daily financial analyses of sales, calculating the revenue for the previous
day. In this case, it would make sense to ensure that the financial records for a given day are all held
together.

 If the corporation also performs analyses by week, month, quarter, or year (to help spot trends), the
historical summary information generated by the daily analysis—and for previous weeks, months, and
years—can be stored in lookup tables. These lookup tables are likely to be small compared to the rest
of the financial data. The information that the tables contain is static, so they could be copied to
every database in the data warehouse to reduce the costs of retrieving this data.

Selecting a distribution strategy


There are three ways to distribute data in SQL
Data Warehouse:

 Round robin
 Hashing

 Replication

The round robin method distributes the data equally among all the 60 underlying
distributions. There is no specific key used to
distribute the data. This is the default method
used when no data distribution strategy is
specified.

The round robin method is good for the following scenarios:

 When you cannot identify a single key to distribute your data.

 If your data doesn’t frequently join with data from other tables.

 When there are no obvious keys to join.


 If the table is being used to hold temporary data.

The following example shows how to create a table using the round robin distribution method:

Using the round robin distribution method


CREATE TABLE dbo.CustomerPortfolio
(
CustomerID int NOT NULL,
Ticker char(4) NOT NULL,
VolumeBought int NOT NULL,
WhenBought datetime NOT NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN
)

Hashing is a very common and effective data distribution method. The data is distributed based on the
hash value of a single column that you select, according to some hashing algorithm. This distribution
column then dictates how the data is spread across the underlying distributions. Items that have the same
data value and data type for the hash key always end up in the same distribution.

The following example creates a table using the hash distribution method:

Using the hash distribution method


CREATE TABLE dbo.StockPriceMovement
(
Ticker char(4) NOT NULL,
Price int NOT NULL,
PriceChangedWhen datetime NOT NULL
)
WITH
(
DISTRIBUTION = HASH(Ticker)
)

Note the following key points you need to consider to identify the best distribution column:

 Choose a column that won’t be updated.


 Always choose a column that leads to an even distribution of rows (minimal skew) across all the databases.

 Choose an appropriate distribution column that reduces the data movement activity between
different distributions.

For more information about using the round robin and hash methods of distributing data, see:
Distributing tables in SQL Data Warehouse

https://aka.ms/S5fh5a

Replication is very good for small lookup tables or dimension tables that are frequently joined with other
big tables. Instead of distributing small sections of data across the underlying distributions for small
tables, the replication data strategy creates a copy of the entire table in each of the underlying
distributions.
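
For example, the following statement creates a small lookup table of stock tickers that is copied to every
distribution. This is a sketch; the Stocks table and its columns are illustrative (it corresponds to the table
shown with the REPLICATE policy in the catalog query results below):

Using the replication distribution method

CREATE TABLE dbo.Stocks
(
    Ticker char(4) NOT NULL,
    CompanyName varchar(100) NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE
)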
For further details and considerations for replicating data in SQL Data Warehouse, see:

Design guidance for using replicated tables in Azure SQL Data Warehouse
https://aka.ms/Wkdg8p

You examine the system catalog views to see the distribution policy for each table in the data warehouse,
as follows:

Finding the distribution policy for tables in a data warehouse


SELECT T.name ,p.distribution_policy_desc
FROM sys.tables AS T
JOIN sys.pdw_table_distribution_properties AS P
ON T.object_id = P.object_id

Results
--------
Name Distribution_policy_desc
CustomerPortfolio ROUND_ROBIN
StockPriceMovement HASH
Stocks REPLICATE

Clustered and nonclustered indexes


The purpose of adding indexes to database
tables is to increase the speed of retrieving data
from these tables. An index contains key values
that are created from one or more columns from
a database table. These key values are then
stored in a binary tree structure that enables the
database to find relevant rows quickly.

There are two different types of indexes:

 Clustered indexes
 Nonclustered indexes

Clustered indexes essentially dictate the way data rows are sorted and stored physically in that sorted order when inserted into a table. Clustered indexes are
good for quickly retrieving a range of rows based on the key values—because the table is already sorted,
based on those key values. One of the disadvantages of clustered indexes is that the insertion of data
might seem slow if the data has to be rearranged frequently to maintain the order of the data.
When you create a clustered index, you specify the column or set of columns on which a clustered index is
created. You specify that a table should have a clustered index when you first create it, or you add a
clustered index to a table later. The following code shows an example that adds a clustered index over a
column named EmployeeID in the Employee table. Note that there can only be one clustered index per
table:

Creating a clustered index


CREATE CLUSTERED INDEX Idx1
ON dbo.Employee
(
EmployeeID
)
GO

By contrast, nonclustered indexes do not alter the way in which the data rows are stored in a table.
Nonclustered indexes are created as separate objects from the database table and have pointers back to
the data rows in the table. You create a nonclustered index by using the CREATE INDEX construct without
specifying any other keywords. You can have more than one nonclustered index in a table. Nonclustered
indexes are good for queries that filter on a column. If you create a nonclustered index on that filtered
column, you will improve query performance.
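
For example, the following statement creates a nonclustered index on a column that queries frequently
filter on. This sketch assumes the Employee table from the previous example also has a LastName column:

Creating a nonclustered index

CREATE INDEX Idx2
ON dbo.Employee
(
    LastName
)
GO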

For more information about clustered and nonclustered indexes, see:

Clustered and nonclustered indexes


https://aka.ms/Cpul42

Creating heap tables


Loading large volumes of data into a table that
has clustered or nonclustered indexes is typically
slow because the loading process might need to
update the underlying index objects (based on
the index type). To improve performance, you
create the table as a heap table.

A heap table does not have any specific ordering; it is simply a set of rows. Because the database
does not have to maintain the data in any
specific sequence, performing bulk loading of
data incurs little organizational overhead. When
the table is populated, you then create any
required indexes.

As a rule, if a table has less than 100 million rows, it’s advisable to create it as a heap table within SQL
Data Warehouse. To create a heap table, you specify the HEAP keyword in the WITH clause. The following
example shows the syntax you need to create a heap table:

Creating a heap table


CREATE TABLE heapTable
(
id int NOT NULL,
firstName varchar (50),
lastName varchar (50),
zipCode varchar (10)
)
WITH (HEAP);

If you have an existing table with indexes into which you need to load additional data, you drop the
indexes first, upload the data, and then rebuild the indexes. To do this, you use the DROP INDEX and
CREATE INDEX statements:

Dropping and rebuilding indexes


-- Remove the Idx1 index from the Employee table
DROP INDEX Idx1 ON dbo.Employee
GO

-- Populate the Employee table


-- Rebuild the Idx1 index


CREATE CLUSTERED INDEX Idx1
ON dbo.Employee
(
EmployeeID
)

Creating clustered columnstore indexes


Traditionally, many relational databases
physically store data as sets of rows, where all the
data for a row is located together. This
organization is suitable for queries that fetch a
small number of rows (typically via an index), or
operations that require access to most of the
columns in each row. This behavior is common
for many OLTP systems.

SQL Data Warehouse is geared towards supporting large-scale analytics over big data
rather than OLTP workloads. The types of queries
performed in this environment characteristically
involve aggregations and other summary functions that retrieve data from a small set of columns—but
over a large number of rows. You use a columnstore index to organize and access the data by column.
Queries that fetch large volumes of data by column will consequently run more quickly.

A clustered columnstore index physically reorganizes a table. The data is divided into a series of
rowgroups of up to 1 million rows (approximately) that are compressed to improve I/O performance; the
greater the compression ratio, the more data is retrieved in a single I/O operation. Each rowgroup is then
divided into a set of column segments, one segment for each column. The contents of each column
segment are stored together. When querying data by column, the data warehouse simply needs to read
the column segments for that column. Decompression is performed quickly in memory, and the results
returned to the query. Note that, when you create a clustered columnstore index over a table, you don’t
specify which columns to index; the entire table is indexed.

You create a clustered columnstore index by using CLUSTERED COLUMNSTORE INDEX in the WITH clause
of the CREATE TABLE command:

Creating a clustered columnstore index


CREATE TABLE clusteredColumnstoreTable
(
id int NOT NULL,
firstName varchar (50),
lastName varchar (50),
zipCode varchar (10)
)
WITH (CLUSTERED COLUMNSTORE INDEX);

When you create a table in SQL Data Warehouse, a clustered columnstore index is built by default if no
other index is specified. Clustered columnstore indexes are usually the preferred option when you don’t
know what kind of queries users are likely to run.

It’s important to note that a table created with a clustered columnstore index can also have other
nonclustered indexes defined on the table.

For additional information on how columnstore indexes work, see:

Columnstore indexes—overview
https://aka.ms/Csd3kb

Partitioning tables
You use partitioning to group the data in a table
into a series of chunks according to the value
held by a specified column. The data for each
chunk is stored together. The primary intent
behind partitioning in SQL Data Warehouse is to
improve the performance of bulk load
operations; if the data being uploaded conforms
to the partitioning scheme, SQL Data Warehouse
quickly determines where data should be stored.
Partitioning is also useful with queries that filter
data based on the partition column because data
in partitions that don’t match the filter is quickly
eliminated.

Partitioning works on top of the distribution mechanism implemented by a table—you apply partitioning
to round robin and hashed tables. Bear in mind that, before partitioning, the data is already spread across
60 distributions, and partitioning operates at the database (distribution) level. You should be mindful of
how many partitions you create and the size of each one—dividing the data into a large number of small
partitions might hinder performance rather than improve it. The recommendation is to avoid having
fewer than one million rows per partition per distribution.

Generally, you partition tables based on a date column. For example, you might partition sales data by
month. When you query data for a specific month or a few days of data within a month, the database
retrieves that specific month partition instead of performing a full table scan. Similarly, when you want to
delete a month’s sales data because it is too old, it’s easy to delete a single partition instead of deleting
row by row.

Note that the partitioning scheme used by SQL Data Warehouse is simpler than that implemented by SQL
Database—you specify the partition column and the ranges for each partition. You can’t create your own
partition functions.

The following example creates a table with data distributed by using a hash function, organized as a
columnstore, and partitioned by month. This is a common approach for defining large fact tables with
many millions of rows that are time-oriented:

Creating a partitioned table


CREATE TABLE dbo.StockPriceMovement
(
Ticker char(4) NOT NULL,
Price int NOT NULL,
PriceChangedWhen datetime NOT NULL,
PriceChangedMonth int NOT NULL
)
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = HASH(Ticker),
PARTITION (PriceChangedMonth RANGE FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,
12))
)
GO

For more information about partitioning tables in SQL Data Warehouse, see:

Partitioning tables in SQL Data Warehouse


https://aka.ms/Mduiw6

Demonstration: Creating tables and indexes in a SQL Data Warehouse


In this demonstration, you will see how to:

 Create tables with various distributions.


 Create nonindexed heap tables.

 Create partitioned tables.

 Query the system catalog to examine the structure and indexes of tables.

Question: When you are not sure about which queries are run against a table in SQL Data
Warehouse, what type of index will you use?

Check Your Knowledge


Question

How many distributions are created for a SQL Data Warehouse?

Select the correct answer.

20

40

60

80

100

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You can’t create a nonclustered index for a table defined as a clustered columnstore.

Lesson 3
Importing data into SQL Data Warehouse
When you create a SQL Data Warehouse, the next logical step is to import data into it. This lesson explains
the various ways in which you can import data into SQL Data Warehouse.

Lesson Objectives
By the end of this lesson, you should be able to:

 Import data into a SQL Data Warehouse from a SQL Server database using the bulk copy program
(bcp).

 Import data into a SQL Data Warehouse from a SQL Server database using the AzCopy utility.
 Import data into a SQL Data Warehouse from a SQL Server database using SQL Server Integration
Services (SSIS).

 Import data into a SQL Data Warehouse from various sources using PolyBase.
 Import data into a SQL Data Warehouse from Blob storage.

 Import data into a SQL Data Warehouse from an Azure Data Lake Store.

 Import data into a SQL Data Warehouse from Azure Stream Analytics.

 Describe best practices for loading data into a SQL Data Warehouse.

Importing data from SQL Server with bcp


You use the bcp command-line utility for
importing data from binary and text files into a
SQL Server database—you specify the format of
the data to be read. You can also use bcp to
transfer data out of a database. Additionally, SQL
Data Warehouse supports bcp as a data transfer
mechanism. Note that bcp is good for loading
small datasets but is not the preferred way for
loading large volumes of data. For more
complete documentation on bcp, see:

bcp utility
https://aka.ms/Px5ae0

Use the following two-stage process to import data from SQL Server to SQL Data Warehouse using bcp:

1. Export data from SQL Server to a flat file using bcp.

2. Import data from the flat file to SQL Data Warehouse using bcp.

The following code example illustrates the syntax for exporting data from SQL Server to flat file
(DimProduct.txt) using bcp:

Export data from a SQL Server database to a text file


bcp DimProduct out C:\Output\DimProduct_export.txt -S <Server Name> -d <Database Name> -U
<Username> -P <password> -q -c -t ','

To import the data from the text file to SQL Data Warehouse, you use a command similar to that shown
here:

Importing data from a text file into SQL Data Warehouse


bcp DimProduct in C:\Input\DimProduct.txt -S <Server Name> -d <Database Name> -U
<Username> -P <password> -q -c -t ','

Note that, while loading into a columnstore index table, you can execute multiple concurrent bulk loads
from separate flat files using bcp.

Importing data from SQL Server with AzCopy


If the amount of data is less than 10 TB, consider
using AzCopy—it’s highly optimized for
transferring local data files into Azure Blob
storage.
You first need to use bcp to retrieve the data
from the SQL Server database and move local
files as before—but then use AzCopy to transfer
the files into Blob storage.

The following code shows the syntax of the AzCopy command:

Syntax of the AzCopy command


AzCopy /Source:<source> /Dest:<destination> [Options]

The source and destination for AzCopy can be a local folder, or the name of a container in Blob storage (if
you are uploading files to Blob storage, the container will be created if it doesn’t already exist). The
options available enable you to specify the storage keys to use, in addition to filename-matching patterns,
data delimiters, and other attributes. The following example uploads CSV files from a local folder named
StockData to a container called stocknames in Blob storage:

Uploading a file to Blob storage using AzCopy


AzCopy /Source:StockData /Dest:https://stockdata.blob.core.windows.net/stocknames
/DestKey:<storage key> /Pattern:"*.csv"

For more information about the AzCopy utility, see:

Transfer data with AzCopy on Windows


https://aka.ms/Bihjux

When the file is in Blob storage, you can use PolyBase to transfer the data from this file into the SQL Data
Warehouse. The process is described later in this lesson.

Importing data from SQL Server using the import/export service


To deal with massive volumes of data, Microsoft
provides a separate import/export service. You
can use this service to physically ship hard drives
to Azure datacenters for the data to be uploaded
into virtual drives or tables. You typically use this
service when network bandwidth is not sufficient
for faster data movement. This service is suitable
for the following scenarios:

 Migrating a large amount of data to Azure, usually as part of a cloud migration strategy.

 Backing up data from on-premises to Azure.

 Data recovery from Azure to on-premises.


You require an existing Azure subscription with at least one associated storage account. To use the
import/export service, you need to create an import job or an export job. Each job can be associated
with only one storage account: an import job transfers data to a single storage account, and an export
job transfers data from a single storage account to the hard drives. Each job can handle up to 10
physical hard disk drives. You use the WAImportExport tool to prepare each disk. The tool helps you to
copy data onto the hard drives, encrypts the data with BitLocker, and generates the drive journal files.
The journal files store basic information about the drive, in addition to the storage account
information. The WAImportExport tool is only compatible with 64-bit Windows operating systems.

Use the following steps to create an import job—that is, to transfer data from your on-premises
environment to an Azure datacenter:
1. Identify the data you need and the hard disk drives that are required.

2. Identify the destination Blob storage location within your Azure account.
3. Use the WAImportExport tool to transfer your data to one or more hard disk drives and encrypt them
using BitLocker.

4. Create an import job using Azure portal within your destination Blob storage account.

5. Upload the drive journal files using Azure portal.

6. Provide the carrier details and return address for Microsoft to ship the drives back to you.

7. Ship the hard disk drives to the shipping address provided during the import job creation.

8. Update the tracking information in the import job and submit the job.

Hard disk drives are received at the designated datacenter and processed as per the instructions. When the
data transfer is successful, the hard disk drives are shipped back to the customer.
Use the following steps to create an export job—to export data from Azure that is to be transferred back
to the customer:

1. Identify the data that needs to be exported and the number of hard disk drives that are required.

2. Identify the source Blob storage location within your Azure account.

3. Create an export job using Azure portal within your source Blob storage account.

4. Specify the source blobs or container that need to be exported.

5. Provide the carrier details and return address for Microsoft to ship the drives back to you.

6. Ship the empty hard disk drives to the shipping address provided during the export job creation.

7. Update the tracking information in the export job and submit the job.

Hard disk drives that are received at the designated datacenter are processed as per the instructions. The
drives are encrypted using BitLocker and the keys are uploaded to the Azure portal. When the data
transfer is successful, the hard disk drives are shipped back to the customer.

For more information about the import/export service, see:

Use the Microsoft Azure import/export service to transfer data to Blob storage
https://aka.ms/yeyh37

Importing data from Blob storage using PolyBase and CTAS


You use PolyBase to join relational data with
nonrelational data held in Blob storage or
Hadoop using T-SQL constructs. For example,
you use PolyBase to execute T-SQL queries
against data held in Blob storage, and then copy
this data into SQL Server or SQL Data Warehouse.
The following steps describe the process:

1. Use AzCopy, or perform some other process, to upload the data to Blob storage.

2. Connect to the SQL Data Warehouse using SQL Server Management Studio (SSMS) and
create a master key to protect the private
keys present in the database, as follows:

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'Your Password';

3. Create a scoped database credential using the following syntax to access the external data source—in
this case, Blob storage:

CREATE DATABASE SCOPED CREDENTIAL CredentialsToBlobStorage
WITH IDENTITY = 'Storage Account Name',
SECRET = 'Storage Account Key';

4. Create an external data source in SQL Data Warehouse to point to the Blob storage. Note that
PolyBase uses the Hadoop API with a wasbs URL to access Blob storage:

CREATE EXTERNAL DATA SOURCE AzureDataSource1
WITH (
TYPE = HADOOP,
LOCATION =
'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
CREDENTIAL = CredentialsToBlobStorage
);

5. Define the external data file format within Blob storage. The following example shows how to define
a comma separated data format:

CREATE EXTERNAL FILE FORMAT CommaSeparatedFileFormat
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (FIELD_TERMINATOR =',')
);

6. Create the external table that references Blob storage. The following example assumes that the data
in Blob storage has four columns—id, firstName, lastName and zipCode. In this example, FileLocation
is the name of the CSV file held in the container in the Blob storage account:

CREATE EXTERNAL TABLE ExternalTable1 (
id int NOT NULL,
firstName varchar (50),
lastName varchar (50),
zipCode varchar (10))
WITH (
LOCATION='FileLocation',
DATA_SOURCE = AzureDataSource1,
FILE_FORMAT = CommaSeparatedFileFormat,
REJECT_TYPE = VALUE,
REJECT_VALUE = 0
);

It’s important to note that PolyBase is strongly typed and any records that do not honor the schema
format will be rejected. You control the number or percentage of rows that can be rejected before the
load fails by adjusting the REJECT_TYPE and REJECT_VALUE parameters.
You now access the data held in Blob storage using the external table created in the SQL Data Warehouse.
At this point, the data is not physically copied from Blob storage to the SQL Data Warehouse but it is
accessible to the data warehouse through T-SQL SELECT commands.

Note: External tables defined using PolyBase do not support INSERT, UPDATE, or DELETE
operations.

The next logical thing to do is to use this external table to load the data into a table within the data
warehouse. There are two ways to do this in T-SQL:

 Load the data using CREATE TABLE AS SELECT (CTAS).

 Use the SELECT…INTO construct to copy the data.

CTAS is more configurable than SELECT…INTO. The SELECT…INTO command uses the default
ROUND_ROBIN distribution method with a clustered columnstore index; you cannot change the
distribution method or index. With CTAS, you specify the distribution method to use and define the
indexes to create.

The following example uses CTAS to create a ROUND_ROBIN distributed table that has a clustered
columnstore index from the external table pointing to the Blob storage:

Creating a table using CTAS


CREATE TABLE dbo.DimCustomer
WITH
(
CLUSTERED COLUMNSTORE INDEX,
DISTRIBUTION = ROUND_ROBIN
)
AS SELECT * FROM dbo.ExternalTable1;

It’s important to note that, even though you use PolyBase to execute T-SQL queries over large volumes of
data (big data) in a nonrelational data store, it’s recommended to use PolyBase with CTAS to import the
data into the SQL Data Warehouse for running analytical queries.

For more information on using PolyBase, see:

PolyBase Guide
https://aka.ms/V7vd02

Importing data from SQL Server using SSIS


SQL Server Integration Services (SSIS), part of the
SQL Server product family, is used for building
data integration packages to read, process,
transform, cleanse, and write data from multiple
data sources, including flat files and databases.
A typical SSIS project consists of packages—each
package contains the steps for reading,
transforming, and writing data. SSIS is a visual
development tool that mostly uses drag-and-
drop interfaces to build packages.
To load data into SQL Data Warehouse, you
could use one of the following in SSIS:

 An ADO NET destination that references the data warehouse.

 An OLE DB destination that references the data warehouse.


 An Azure SQL DW Upload Task that copies files containing data exported from SQL Server to Blob
storage, and then uses PolyBase to transfer this data into SQL Data Warehouse; it automates the steps
described in the previous topic.
The last method requires that you export the data from SQL Server first. You can also do this in SSIS, by
using an ADO.NET source and a flat file destination. This process is highly efficient and gives the best
performance of the three preceding options. Additionally, SSIS is very effective if the data needs to be
transformed or cleansed before uploading into the SQL Data Warehouse.

For more information about using SSIS with SQL Data Warehouse, see:

Load data from SQL Server into SQL Data Warehouse (SSIS)
https://aka.ms/Lyyhuk

The Azure SQL DW Upload Task is part of the Azure Feature Pack for SSIS. For more information about
the SSIS tasks that this pack contains, see:
Azure Feature Pack for Integration Services (SSIS)
https://aka.ms/Nck4p0

Importing data from Azure Data Lake Store


Azure Data Lake Store (ADLS) provides unlimited
storage for analytical purposes for many types of
data. To import data from Data Lake Store into
the SQL Data Warehouse, you use PolyBase
together with CTAS.

The following steps explain the process to load data from Data Lake Store to SQL Data
Warehouse. In many cases the steps appear to be
the same as those shown previously, but there
are some differences, mainly due to the different
types of credentials used by ADLS.

1. Connect to the SQL Data Warehouse using SSMS, and use the following command to create a master key to protect the private keys present in
the database:

CREATE MASTER KEY ENCRYPTION BY PASSWORD='Your Password';

2. Create a database scoped credential for accessing the ADLS account. Note that the credential is a
combination of your client ID and an OAuth 2.0 token endpoint. You can obtain these items by
creating an Azure Active Directory application using the Azure portal:

CREATE DATABASE SCOPED CREDENTIAL CredentialsToADLS
WITH IDENTITY = 'Client ID@OAuthTokenEndpoint',
SECRET = '<Key>';

3. Create an external data source in SQL Data Warehouse to point to the Data Lake Store. As before,
note that PolyBase uses the Hadoop API to access ADLS, but this time using a Data Lake URL:

CREATE EXTERNAL DATA SOURCE AzureDataSource2
WITH (
TYPE = HADOOP,
LOCATION = 'adl://<AzureDataLakeAccountName>.azuredatalakestore.net',
CREDENTIAL = CredentialsToADLS
);

4. Define the external data file format within the Data Lake Store. The following example specifies that
data fields are separated by using the '|' character. The DATE_FORMAT parameter indicates how
folders in the ADLS account are organized:

CREATE EXTERNAL FILE FORMAT PlainTextFileFormat
WITH (
FORMAT_TYPE = DelimitedText,
FORMAT_OPTIONS (
FIELD_TERMINATOR = '|',
STRING_DELIMITER = '',
DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss.fff',
USE_TYPE_DEFAULT = FALSE
));

5. Create the external table to access the Data Lake Store. The following code assumes that the Data
Lake Store has four columns in the CSV file—id, firstName, lastName and zipCode. PolyBase uses
recursive folder traversal to read all the files within the folder, in addition to the subfolders specified
in the LOCATION parameter:

CREATE EXTERNAL TABLE ExternalTable2 (
id int NOT NULL,
firstName varchar (50),
lastName varchar (50),
zipCode varchar (10) )
WITH (
LOCATION='/DimCustomer/',
DATA_SOURCE = AzureDataSource2,
FILE_FORMAT = PlainTextFileFormat
);

6. Use the external table to access the data within the Data Lake Store from the SQL Data Warehouse.
7. Copy the data from the Data Lake Store to SQL Data Warehouse via the external table using a CTAS
command. The following example creates a table that is hash distributed across the data warehouse:

CREATE TABLE dbo.DimCustomer
WITH
(
DISTRIBUTION = HASH(id)
)
AS SELECT * FROM dbo.ExternalTable2;

For further information, see:

Load data from Data Lake Store into SQL Data Warehouse
https://aka.ms/Ltxp1c

Importing streaming data from Azure Stream Analytics


To connect and import data from streaming data
sources, you will need Azure Stream Analytics.
Stream Analytics is one of the many Azure
services that provide complex event processing
functionality that is highly available and scalable.
You can use Stream Analytics to consume
streaming data and easily output it to a SQL Data
Warehouse. To do this, use the following steps:

1. Create a new Stream Analytics job within your Azure subscription.

2. Specify the input for the job (for example, this could be a streaming application writing to Blob storage).

3. Specify the output for the job to the SQL Data Warehouse you have created. You need the following
details to configure this:
o Output alias—the name to identify the output.

o Subscription—the name of the Azure subscription.

o Database—the name of the SQL Data Warehouse.


o Server—the name of the server that hosts the SQL Data Warehouse.

o Username—the name of the database user.

o Password—the password of the database user.


o Table—the name of the table where you write the data.

4. Create a streaming query that redirects the data from input stream to the output.
5. Run the job, and start the streaming source that provides the input. The data will appear in the data
warehouse as it is processed by the Stream Analytics job.
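
For example, a minimal streaming query for step 4 might look like the following. The CameraFeed input
alias and DataWarehouseOutput output alias are illustrative and should match the aliases you defined for
your job; in practice, you would project the specific columns that match the schema of the destination
table:

SELECT *
INTO DataWarehouseOutput
FROM CameraFeed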

For more information, see:


Use Azure Stream Analytics with SQL Data Warehouse

https://aka.ms/D2oudf

Best practices for loading data into SQL Data Warehouse


When loading data, you should adopt the
following best practices:

 Drop any noncolumnstore indexes that have been created over a table in the data warehouse, import the data, and then recreate the indexes.
 Use PolyBase and CTAS where applicable; for
example, when loading data from Blob
storage, Hadoop, or ADLS.
 Be careful not to have too many partitions
within a destination table in the data
warehouse because this will affect your query performance and loading times.
 Always hash distribute any huge tables based on an appropriate high-cardinality column (a date column is usually better suited to partitioning than to hash distribution).

 Always use an appropriate column length.

 Always use temporary heap tables for transient data.

 For better query performance, always reindex your tables after loading the data.

For more information, see:

Best practices for Azure SQL Data Warehouse


https://aka.ms/Y5x60s

Demonstration: Importing data into a SQL Data Warehouse


In this demonstration, you will see how to import data into SQL Data Warehouse using:

 AzCopy and PolyBase to upload data via Blob storage.


 PolyBase to read data from ADLS.

Question: Can you execute multiple concurrent bulk loads from separate files into a
columnstore index table using bcp?

Check Your Knowledge


Question

What is the maximum number of physical hard disk drives that can be associated with an
import/export service job?

Select the correct answer.

10

20

30

40

50

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? SQL Server Integration Services (SSIS) is not part of the SQL Server platform.

Lab: Implementing SQL Data Warehouse


Scenario
You work for Adatum as a data engineer, and you have been asked to build a traffic surveillance system
for traffic police. This system must be able to analyze significant amounts of dynamically streamed data—
captured from speed cameras and automatic number plate recognition (ANPR) devices—and then
crosscheck the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that read vehicle registration
plates.

In this phase of the project, you will consolidate the data storage for the traffic surveillance system by
using SQL Data Warehouse as a single data location for static, or rarely updated, information including
stolen vehicle data, vehicle owner data, and speed camera location data. You will also configure the traffic
surveillance system to use the same SQL Data Warehouse to hold dynamic data streamed live from the
speed cameras.

Objectives
After completing this lab, you will be able to:

 Create and configure a new SQL Data Warehouse.

 Design and configure SQL Data Warehouse tables.

 Import static data into SQL Data Warehouse.

 Stream dynamic data to SQL Data Warehouse.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Lab Setup
Estimated Time: 90 minutes

Virtual machine: 20776A-LON-DEV


Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

This lab uses the following resources from Lab 5 and earlier:
 Resource group: CamerasRG

 Data Lake Store: adls<your name><date>

 Azure Stream Analytics job: CaptureTrafficData

 Event Hub: camerafeeds<your name><date>



Exercise 1: Create and configure a new SQL Data Warehouse


Scenario
You are going to consolidate the data storage for the traffic surveillance system by using SQL Data
Warehouse as a single data location for both static and dynamic data. In this exercise, you will create a
new data warehouse for holding traffic data—this warehouse will run on a new database server. You will
then use SQL Server Management Studio to explore the data warehouse, and use scaling to set the
performance level.
The main tasks for this exercise are as follows:

1. Install AzCopy and AdlCopy

2. Create a new database server

3. Create a new SQL Data Warehouse

4. Explore the SQL Data Warehouse using SQL Server Management Studio

5. Use scaling with SQL Data Warehouse

 Task 1: Install AzCopy and AdlCopy

 Task 2: Create a new database server

 Task 3: Create a new SQL Data Warehouse

 Task 4: Explore the SQL Data Warehouse using SQL Server Management Studio

 Task 5: Use scaling with SQL Data Warehouse

Results: At the end of this exercise, you will have created a new database server, created a new data
warehouse, explored the data warehouse using SQL Server Management Studio, and used scaling with
data warehouse.

Exercise 2: Design and configure SQL Data Warehouse tables


Scenario
You’re going to consolidate the data storage for the traffic surveillance system by using SQL Data
Warehouse as a single data location for static and dynamic data. In this exercise, you will design and
create the tables and indexes that are required to support the traffic monitoring system.

The main tasks for this exercise are as follows:

1. Design tables and indexes for a SQL Data Warehouse application

2. Use SQL Server Management Studio to create data warehouse tables and indexes

 Task 1: Design tables and indexes for a SQL Data Warehouse application

 Task 2: Use SQL Server Management Studio to create data warehouse tables and
indexes

Results: At the end of this exercise, you will have designed tables and indexes for a data warehouse
application, and used SQL Server Management Studio to create the required data warehouse tables and
indexes.

Exercise 3: Import static data into SQL Data Warehouse


Scenario
In this exercise, you will consolidate the data storage for the traffic surveillance system by using SQL Data
Warehouse as a single data location for static, or rarely updated, information including stolen vehicle
data, vehicle owner data, and speed camera location data. You will also import static data into the
warehouse. There are three data sources for this exercise:

 Stolen vehicle data in Data Lake Store.

 Vehicle owner data in an on-premises SQL Server database.

 Speed camera location data in a local CSV file.

You will first upload the stolen vehicle data to Data Lake Store (using AzCopy and AdlCopy), as a
temporary staging location. You will then create a local SQL Server database for holding vehicle owner
data, again as a staging location. You will then upload speed camera location data directly into the data
warehouse from a local CSV file using AzCopy, PolyBase, and CTAS (dropping the existing table first, and
using CTAS with the same options that were used to create the table in the first place). You will then
import the staged stolen vehicle data from Data Lake Store into SQL Data Warehouse—leaving the
existing table in place—and then use INSERT INTO to append data to the table. Finally, you will import
the staged vehicle/owner data from the on-premises SQL Server database by using an ADO.NET source
and destination with SQL Server Integration Services—again leaving the existing table in place.

The main tasks for this exercise are as follows:

1. Stage data in Data Lake Store prior to SQL Data Warehouse import

2. Stage data in an on-premises SQL Server database prior to SQL Data Warehouse import

3. Import data from a local CSV file into SQL Data Warehouse

4. Import data from Data Lake Store into SQL Data Warehouse
5. Import data from an on-premises SQL Server database into SQL Data Warehouse

 Task 1: Stage data in Data Lake Store prior to SQL Data Warehouse import

 Task 2: Stage data in an on-premises SQL Server database prior to SQL Data
Warehouse import

 Task 3: Import data from a local CSV file into SQL Data Warehouse

 Task 4: Import data from Data Lake Store into SQL Data Warehouse

 Task 5: Import data from an on-premises SQL Server database into SQL Data
Warehouse

Results: At the end of this exercise, you will have:

Staged data in Data Lake Store prior to SQL Data Warehouse import.

Staged data in an on-premises SQL Server database prior to SQL Data Warehouse import.
Imported data from a local CSV file directly into SQL Data Warehouse.

Imported data from Data Lake Store into SQL Data Warehouse.

Imported data from an on-premises SQL Server database into SQL Data Warehouse.

Exercise 4: Stream dynamic data to SQL Data Warehouse


Scenario
In this exercise, you will consolidate the data storage for the traffic surveillance system by using SQL Data
Warehouse as a single data location for dynamic data streamed live from the speed cameras. You will
stream dynamic data from speed cameras into SQL Data Warehouse from a Stream Analytics job, leaving
the existing table in place. You will then configure a Visual Studio app to use this Stream Analytics job,
and then view the Stream Analytics job data in SQL Data Warehouse.
The main tasks for this exercise are as follows:

1. Configure an Azure Stream Analytics job to output to SQL Data Warehouse

2. Configure a Visual Studio app to use the Stream Analytics job

3. View Stream Analytics job data in SQL Data Warehouse

4. Lab cleanup

 Task 1: Configure an Azure Stream Analytics job to output to SQL Data Warehouse

 Task 2: Configure a Visual Studio app to use the Stream Analytics job

 Task 3: View Stream Analytics job data in SQL Data Warehouse

 Task 4: Lab cleanup

Results: At the end of this exercise, you will have configured a Stream Analytics job to output to SQL Data
Warehouse, configured a Visual Studio app to use the Stream Analytics job, and viewed Stream Analytics
job data in SQL Data Warehouse.

Question: In the table design, why is it recommended that the vehicle speed data is hashed
by camera ID?

Question: What are two of the most important management options for SQL Data
Warehouse that help control your Azure costs?

Module Review and Takeaways


In this module, you have learned about:

 The purpose and structure of Azure SQL Data Warehouse.

 Designing tables to optimize queries performed by analytical processing.


 The tools and techniques you can use for importing data into SQL Data Warehouse at scale.

Module 8
Performing Analytics with Azure SQL Data Warehouse
Contents:
Module Overview 8-1
Lesson 1: Querying data in SQL Data Warehouse 8-2
Lesson 2: Maintaining performance 8-10
Lesson 3: Protecting data in SQL Data Warehouse 8-19
Lab: Performing analytics with SQL Data Warehouse 8-24
Module Review and Takeaways 8-28

Module Overview
This module describes how to query data that is stored in Microsoft® Azure® SQL Data Warehouse and
how to secure this data. It also describes the various ways in which you monitor the SQL Data Warehouse
to maintain good performance.

Objectives
By the end of this module, you will be able to:
 Describe the features of Transact-SQL that are available for use with SQL Data Warehouse.

 Configure and monitor the SQL Data Warehouse to maintain optimal performance.

 Describe how to protect data and manage security in a SQL Data Warehouse.

Lesson 1
Querying data in SQL Data Warehouse
SQL Data Warehouse uses a subset of the Transact-SQL (T-SQL) language to perform database operations.
This lesson summarizes the common features of T-SQL that are and are not available for use with SQL
Data Warehouse. This lesson also explains how SQL Data Warehouse works with machine learning, and
how you generate Power BI reports using the information stored in SQL Data Warehouse.

Lesson Objectives
By the end of this lesson, you should be able to:

 List the features and limitations of T-SQL within SQL Data Warehouse.

 Group data and calculate aggregate values across groups.

 List the features of views that are not available within SQL Data Warehouse.

 Explain the transactional limitations of SQL Data Warehouse.


 Describe the primary T-SQL programmatic constructs available for SQL Data Warehouse.

 Integrate SQL Data Warehouse with Microsoft Azure Machine Learning.

 Use Power BI™ to generate reports from SQL Data Warehouse.

Using T-SQL in SQL Data Warehouse


SQL Data Warehouse is built upon the Microsoft
Parallel Data Warehouse (PDW)—a lot of
functionality is borrowed from PDW and SQL
Server® to create SQL Data Warehouse.
Querying data in SQL Data Warehouse is similar
to SQL Server query syntax. A large majority of
the T-SQL statements and constructs used with
SQL Server are also available to SQL Data
Warehouse. For example, you perform SELECT
statements to retrieve data, including joins and
subqueries. Additionally, you can use T-SQL
programming constructs such as WHILE loops,
variables (with some restrictions—you cannot use a SELECT statement to assign a value to a variable),
dynamic SQL, and stored procedures. However, there are some limitations. SQL Data Warehouse does not
support:

 Triggers (DDL, DML, or logon)

 Primary and foreign keys

 Unique and Check table constraints

 Unique indexes

 Computed columns in tables

 Sparse columns

 User-defined types

 Sequences

 Synonyms

T-SQL language elements


SQL Data Warehouse supports the following language elements:

 Data types including bigint, numeric, bit, smallint, decimal, int, float, date, char, and varchar. SQL Data
Warehouse does not currently support blob types, including varchar(max) and nvarchar(max).

 Control flow elements such as BEGIN...END, BREAK, IF…ELSE, THROW, TRY…CATCH, and WHILE.

 Operators such as Add (+), Subtraction (-), Multiply (*), and Divide (/).

 Wildcard characters.
For the full list of SQL Data Warehouse language elements, see:

Language elements
https://aka.ms/xuijkv

For more information on supported T-SQL constructs in SQL Data Warehouse, see:

Transact-SQL topics
https://aka.ms/fumw8l

For more information on catalog views, and dynamic management views (DMVs), see:
System views
https://aka.ms/ifb8qm

Grouping and aggregating data in SQL Data Warehouse


Grouping and aggregating data is an important
functionality of a data warehouse. The GROUP BY
construct is used to aggregate data and to
provide summarized results. However, unlike SQL
Server, SQL Data Warehouse doesn’t provide full
capabilities; SQL Data Warehouse doesn’t
support GROUP BY constructs that have:

 ROLLUP

 GROUPING SETS

 CUBE

However, this functionality can be achieved by other means—for example, you might use multiple SQL statements merged with UNION ALL.

To achieve ROLLUP functionality across Country and Region, you might use the following three SQL
statements:

 A SQL statement that produces the aggregated data at Country and Region.

 A SQL statement that produces the aggregated data at Country.

 A SQL statement that produces the total aggregation.
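
For example, the following sketch emulates a ROLLUP across Country and Region. It assumes a
hypothetical Sales table with Country, Region, and Amount columns:

Emulating ROLLUP with UNION ALL

SELECT Country, Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY Country, Region

UNION ALL

SELECT Country, NULL AS Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales
GROUP BY Country

UNION ALL

SELECT NULL AS Country, NULL AS Region, SUM(Amount) AS TotalAmount
FROM dbo.Sales;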



For more information on implementing queries that emulate these GROUP BY operations, see:

GROUP BY options in SQL Data Warehouse


https://aka.ms/uj1orl

Using views in SQL Data Warehouse


Views are useful in databases, and especially in
data warehouses. Typically, views are used to
abstract the technical know-how for joining
between different tables and providing a
consistent interface to end users. To improve
performance, you can enforce a join hint in the
view.

When you consider views in SQL Data Warehouse, be aware of the following:

 You can’t use the schema binding option with views.

 You can’t create updateable views—that is, underlying base tables can’t be updated through views.

 You can’t create views using temporary tables.

 You can’t use EXPAND and NOEXPAND hints.


 Indexed views are not possible in SQL Data Warehouse.

Performing transactions
SQL Data Warehouse supports transactions for
data workloads but is limited when compared to
SQL Server. SQL Data Warehouse uses ACID
(Atomicity, Consistency, Isolation and Durability)
transactions—the isolation level is limited to
READ UNCOMMITTED and can’t be modified.

SQL Data Warehouse uses the XACT_STATE() function to report any failed transactions using
the value of -2—this means that the transaction
failed and is marked for rollback.
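
For example, the following sketch wraps a data modification in a transaction and rolls it back if
XACT_STATE() reports that the transaction is marked for rollback. The table name is illustrative:

BEGIN TRY
    BEGIN TRANSACTION;
    DELETE FROM dbo.StockPriceMovement WHERE Price < 0;
    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    IF XACT_STATE() = -2
    BEGIN
        ROLLBACK TRANSACTION;
    END
END CATCH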

SQL Data Warehouse doesn’t support the ERROR_LINE() function—an alternative way to identify the
failing statement is to use query labels.

For more information on query labels, see:

Use labels to facilitate queries in SQL Data Warehouse


https://aka.ms/loq1la
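
For example, you can attach a label to a statement by using the OPTION clause, and then locate that
statement in the dynamic management views by filtering on the label. The table name and label text are
illustrative:

-- Tag the statement with a label
SELECT COUNT(*)
FROM dbo.StockPriceMovement
OPTION (LABEL = 'Count stock price movements');

-- Find the labeled request in the DMVs
SELECT request_id, status, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Count stock price movements';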

THROW and RAISERROR are supported in SQL Data Warehouse but you should note the following:

 User-defined error message numbers using THROW can’t be in the range between 100,000 and
150,000.

 RAISERROR error messages are fixed at 50,000.

 sys.messages is not supported in SQL Data Warehouse.

Programmatic constructs

Using variables
You create variables in SQL Data Warehouse by
using a DECLARE or SET statement—you should
note the following points:

 You use DECLARE to set more than one variable in a single statement.
 You can’t initialize a variable and use the
same variable in the same DECLARE
statement.
 You can set one variable at a time using the
SET statement.

 You can’t use SELECT or UPDATE when assigning values to variables.
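
For example, the following sketch follows these rules: several variables are declared (and optionally
initialized) in a single DECLARE statement, and SET assigns one variable at a time. The variable names
and values are illustrative:

DECLARE @stockTicker char(4) = 'CTSO',
        @rowLimit int = 100,
        @message varchar(50);

SET @message = 'Processing ticker ' + @stockTicker;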

Using IF...ELSE
You use the IF…ELSE construct in the same way as you would in SQL Server—as a conditional construct,
with the IF keyword followed by a Boolean expression. Based on the outcome of the Boolean expression,
the IF clause executes when the condition is true; the ELSE clause executes when the condition is false.
The IF…ELSE construct can be nested, limited only by available memory. The syntax for the IF…ELSE clause
is as follows:

IF Boolean_expression
{ sql statement | statement block }
[ ELSE
{ sql statement | statement block } ]

Using the WHILE loop


Using the WHILE loop construct is the same as in SQL Server—you use it for a repeated execution of a
SQL statement or block, providing the Boolean expression returns TRUE. You control the execution by
using BREAK or CONTINUE keywords within the SQL block under the WHILE loop. The syntax for the
WHILE loop is as follows:

WHILE boolean_expression
{ sql statement | statement block | BREAK }
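
For example, the following sketch combines a variable, an IF test, and a WHILE loop that exits early with
BREAK:

DECLARE @counter int = 1;

WHILE @counter <= 5
BEGIN
    IF @counter = 4
    BEGIN
        BREAK;   -- exit the loop early
    END
    SET @counter = @counter + 1;
END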

Dynamic SQL
Dynamic SQL helps to make your code more generic, readable and flexible. SQL Data Warehouse doesn’t
support large object (blob) data types, so it doesn’t have the varchar(max) and nvarchar(max) data types.
Therefore, to build a long SQL statement dynamically, you need to break the code into multiple shorter
string variables, concatenate them, and use the EXEC command to execute the combined SQL statement.

The following example breaks the SQL into three statements:

DECLARE @sql_statement1 VARCHAR(8000)=' SELECT name '
      , @sql_statement2 VARCHAR(8000)=' FROM sys.system_views '
      , @sql_statement3 VARCHAR(8000)=' WHERE name like ''%table%''';
EXEC( @sql_statement1 + @sql_statement2 + @sql_statement3);

Stored procedures
Stored procedures provide a great way to modularize many lines of code, helping developers to organize
the code into multiple logical chunks. Stored procedures can also take parameters as input and so
become more flexible. Stored procedures promote reusability of code to save time in having to develop
the same code again and again.

Stored procedures in SQL Data Warehouse behave like stored procedures in SQL Server. However, there
are subtle differences between them. They are as follows:
 Stored procedures in SQL Data Warehouse are not precompiled (unlike SQL Server, where all stored
procedures are precompiled).

 SQL Data Warehouse supports nesting up to eight levels whereas SQL Server supports up to 32 levels.
 SQL Data Warehouse doesn’t support @@NESTLEVEL, so you must keep a manual count of the
nested levels.

It’s important to be aware of certain features in SQL Server that are not currently implemented in SQL
Data Warehouse. They are as follows:

 Table-valued, read-only or default parameters.

 Temporary, numbered, CLR or extended stored procedures.

 Encryption or replication options.

 Execution contexts.

 Return statement.
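
Bearing these restrictions in mind, the following is a minimal sketch of a parameterized stored procedure;
the dbo.FactSales table and its columns are assumptions for this example only:

CREATE PROCEDURE dbo.usp_GetSalesByYear @sale_year INT
AS
BEGIN
    -- Return aggregated sales for the requested year.
    SELECT store_id, SUM(amount) AS total_sales
    FROM dbo.FactSales
    WHERE sale_year = @sale_year
    GROUP BY store_id;
END;

You would then run the procedure with a statement such as EXEC dbo.usp_GetSalesByYear 2017.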

Using SQL Data Warehouse with Machine Learning


You use Machine Learning to generate analytical
and predictive models. You incorporate data for
use by these models from a variety of data
sources, including SQL Data Warehouse. To
retrieve data from SQL Data Warehouse, add an
Import Data module to your Machine Learning
experiment and set the properties of this module
as follows:

 Data source type: this is a drop-down—you need to select Azure SQL Database.

 Database server: this is the name of the database server where the SQL Data Warehouse is hosted. It
takes the form <servername>.database.windows.net.
 Database name: this is the name of the SQL Data Warehouse.

 Username and password: these are the credentials you need to connect to the SQL Data
Warehouse.

 Database query: this is the query that should be run to fetch the data. This can be any valid T-SQL
SELECT statement supported by SQL Data Warehouse.

You also use SQL Data Warehouse as the repository for results from a machine learning model. To do this,
add an Export Data module to your experiment. Connect this module to the Results dataset output of a
module that generates a dataset. Set the following properties of the Export Data module:

 Data destination: select Azure SQL Database.

 Database server name: specify the name of the server hosting your SQL Data Warehouse
(<servername>.database.windows.net.).

 Database name: the name of the SQL Data Warehouse.

 Username and password: the credentials for connecting to the SQL Data Warehouse.
 Comma separated list of columns to be saved: specify the fields in the results dataset that you wish
to write to the data warehouse.

 Data table name: the name of the table in which to save the data in the data warehouse. You should
create this table first.
 Comma separated list of datatable columns: the columns to populate in the table (the number
and type of columns must match those in the list of columns to be saved).
 Number of rows written per SQL Azure operation: the batch size of write operations to the data
warehouse.

Note: If you try to export data to a table that does not already exist, the Export Data
module will attempt to create it for you. However, the Export Data module will try to implement
a Unique constraint that is not supported by SQL Data Warehouse, and the export operation will
fail.

Using Power BI to generate reports from SQL Data Warehouse


Power BI is a powerful visualization tool that
enables you to visualize data stored within SQL
Data Warehouse databases. Power BI is available
as a free desktop tool, in addition to the Azure
service known as Power BI Service.
There are three ways to connect to SQL Data
Warehouse from Power BI:

Using the Power BI button in the Azure portal under Azure SQL Data Warehouse
In the Azure portal, under Azure SQL Data
Warehouse, there’s a link for Power BI that directly connects the Power BI Service and logs on to the SQL
Data Warehouse. This straightforward method directly populates the connection information. When the
Power BI Service connects to SQL Data Warehouse, you continue to analyze your dataset in SQL Data
Warehouse.

Using this technique, Power BI uses DirectQuery to communicate with the SQL Data Warehouse. For every
operation performed in Power BI, a query is sent to SQL Data Warehouse in real time, utilizing the
processing power of SQL Data Warehouse. The data is aggregated in SQL Data Warehouse and sent back
to Power BI for display to the user.

Using Get Data in Power BI Service


You can also connect to SQL Data Warehouse from Power BI Service. After you log in to the Power BI
Service portal, you connect to SQL Data Warehouse using the Get Data functionality—specifically by
clicking Get Data, clicking Databases, and then clicking Azure SQL Data Warehouse.

To connect to the SQL Data Warehouse, you need to provide the following details:

 Server name: this is the name of the SQL Data Warehouse server (fully qualified name).

 Database name: this is the name of the SQL Data Warehouse.

When Power BI Service is connected and linked to the SQL Data Warehouse, you select the tables and
columns you require to create the visualizations that form part of reports or dashboards. All of the queries
sent back to the SQL Data Warehouse are DirectQuery based, utilizing the full potential of SQL Data
Warehouse.

Using Microsoft Power BI Desktop


Microsoft Power BI Desktop is a free tool you use to perform rich analytics on different types of data
sources, including consuming data from SQL Data Warehouse. The steps to connect to SQL Data
Warehouse from Power BI Desktop are as follows:
1. After opening the Power BI Desktop tool, click Get Data from the ribbon menu on the top. You find
the SQL Data Warehouse by clicking Get Data, clicking Databases, and then clicking Azure SQL
Data Warehouse.

2. You need to provide the following information to connect to SQL Data Warehouse:
o Server: this is the name of the server where the SQL Data Warehouse is hosted.

o Database: this is the name of the SQL Data Warehouse database.

o Data connectivity mode: this is either Import or DirectQuery. Choose Import if the data
required is a small subset, because you are limited by the processing and storage capabilities of
the PC where you are downloading the data. It’s advisable to choose DirectQuery to utilize the
full computing power of SQL Data Warehouse.

Click OK.

3. Provide user authentication details like Username and Password to connect to SQL Data Warehouse.

4. When the connection is successful, you can perform data analytics on the data stored within SQL Data
Warehouse.

Demonstration: Integrating with Power BI and Machine Learning


In this demonstration, you will see how to:

 Utilize data held in SQL Data Warehouse in a Machine Learning predictive model.

 Visualize data held in SQL Data Warehouse using Power BI.



Check Your Knowledge


Question

Which one of the following commands is supported in SQL Data Warehouse?

Select the correct answer.

ROLLUP

GROUPING SETS

CUBE

GROUP BY

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? SQL Data Warehouse supports updateable views.

Lesson 2
Maintaining performance
This lesson describes how to configure and monitor the performance of a SQL Data Warehouse. This
lesson also gives details of best practices for maintaining performance.

Lesson Objectives
By the end of this lesson, you should be able to:

 Understand how to update database statistics, and rebuild and optimize indexes.

 Describe how to manage workloads to maintain performance.

 Scale compute nodes.


 Manage distributed data.

 Rebuild and optimize indexes.

 Describe the various monitoring tools that are available in the Azure portal.

 Explain best practices for maintaining performance.

Updating statistics
Database statistics play a vital role in optimizing
user queries. It’s important that table statistics are
kept up to date because, generally, the optimizer
produces the most optimal plans under these
conditions. You collect statistics on a single
column or a set of columns or indexes. It’s
advisable to collect or execute statistics after a
data load, or at least once each day. If the
statistics on a table are old, the explain plan
produced by the optimizer for a given query
might not perform in an optimal way.

It’s important to note that, unlike SQL Server, where statistics are automatically generated when data loads
happen, statistics are not created automatically in SQL Data Warehouse. Statistics need to be created
manually. This can be a scheduled job on a nightly basis, or it might be run towards the end of a data load
operation. Be careful not to create too many statistics on a table, because maintaining them adds
overhead. Typically, date fields, columns used in JOIN conditions, columns used within GROUP BY
statements, and columns used in the WHERE clause are good candidates where collecting statistics can
help improve query performance.

The following explains the various Transact-SQL (T-SQL) statements you use to collect statistics, update
statistics, and so on:

 Use the following syntax to create statistics on a single column of a table using default options. By
default, SQL Data Warehouse uses a 20 percent sample size when creating these statistics:

CREATE STATISTICS [name of statistics] ON [schema name].[table name]([column name]);



 If you want to specify the sample size instead of using the default 20 percent, use the following
syntax:

CREATE STATISTICS [name of statistics] ON [schema name].[table name]([column name])
WITH SAMPLE=50 PERCENT;

 To update specific statistics on a table, use the following syntax:

UPDATE STATISTICS [schema name].[table name]([statistics name]);

 To update all statistics objects on a table, use the following syntax:

UPDATE STATISTICS [schema name].[table name];

For more information on database statistics in SQL Data Warehouse, see:

Managing statistics on tables in SQL Data Warehouse


https://aka.ms/oo60vt

Managing workloads
SQL Data Warehouse is designed to deliver
predictable performance when you scale up or
down. SQL Data Warehouse also has features
that control concurrency and resource allocation
(CPU and memory).
A SQL Data Warehouse supports up to 1,024 connections at the same time. These 1,024 connections can
submit queries concurrently, but queries might be queued based on limits such as the maximum number
of concurrent queries and the available concurrency slots.

Concurrent queries: you control the performance of a SQL Data Warehouse by allocating Data Warehouse
Units (DWUs), which are a combination of CPU and memory. The number of DWUs allocated to a given
SQL Data Warehouse determines how many concurrent queries can execute simultaneously. At the time of
writing, a maximum of 32 concurrent queries can be executed at the larger DWU allocations.

Concurrency slots: for every 100 DWUs, four concurrency slots are allocated—for example, 1,000 DWUs
mean 40 concurrency slots. Each query requires a set number of concurrency slots based on the resource
class of the query. Queries executed in the smaller resource class only require one concurrency slot,
whereas queries executed in the higher resource class require more concurrency slots. There are four
resource classes:

 smallrc

 mediumrc

 largerc
 xlargerc
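
Queries run under the resource class assigned to the user that submits them. As a sketch (assuming a user
named LoadUser already exists in the data warehouse), a resource class is typically assigned by adding the
user to the corresponding database role:

-- Run queries submitted by LoadUser with the large resource class.
EXEC sp_addrolemember 'largerc', 'LoadUser';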

The following table summarizes the number of concurrent queries that are available, in addition to the
number of concurrency slots allocated for a given number of DWUs:

Allocated DWUs    Maximum concurrent queries    Concurrency slots allocated    Slots used by smallrc    Slots used by mediumrc    Slots used by largerc    Slots used by xlargerc

DW100 4 4 1 1 2 4

DW200 8 8 1 2 4 8

DW300 12 12 1 2 4 8

DW400 16 16 1 4 8 16

DW500 20 20 1 4 8 16

DW600 24 24 1 4 8 16

DW1000 32 40 1 8 16 32

DW1200 32 48 1 8 16 32

DW1500 32 60 1 8 16 32

DW2000 32 80 1 16 32 64

DW3000 32 120 1 16 32 64

DW6000 32 240 1 32 64 128

Depending on the kind of queries that are running, you can run many more smallrc queries than largerc or
xlargerc queries for the same DWU allocation, because smallrc queries consume fewer concurrency slots.

Memory allocation: the resource class of a query dictates the amount of memory allocated for a query.
Because memory is a fixed resource, it’s important to understand how a SQL Data Warehouse provisions
memory based on the resource class. The following table summarizes the amount of memory allocated in
MB, based on the resource class:

DWU smallrc mediumrc largerc xlargerc

DW100 100 100 200 400

DW200 100 200 400 800

DW300 100 200 400 800

DW400 100 400 800 1600

DW500 100 400 800 1600

DW600 100 400 800 1600

DW1000 100 800 1600 3200



DW1200 100 800 1600 3200

DW1500 100 800 1600 3200

DW2000 100 1600 3200 6400

DW3000 100 1600 3200 6400

DW6000 100 3200 6400 12800

The preceding table shows the memory allocated to each distribution within the SQL Data Warehouse.
There are 60 distributions for a SQL Data Warehouse instance. Therefore, each memory allocation shown in
the table must be multiplied by 60 to understand the overall memory that is allocated for the SQL Data
Warehouse.
It’s important to note that not all queries are constrained by concurrency limits. Queries that access only
the metadata—like queries that access dynamic management views or catalog views—wouldn’t fall under
the concurrency limits.

The following queries adhere to the concurrency limits:


 SELECT

 INSERT-SELECT, UPDATE or DELETE

 ALTER INDEX REBUILD or REORGANIZE

 CREATE INDEX or ALTER TABLE REBUILD

 CREATE CLUSTERED COLUMNSTORE INDEX

 CREATE TABLE AS SELECT (CTAS)


 Data loads and data movement processes by the Data Movement Service (DMS)

The following queries do not adhere to the concurrency limits:

 CREATE or DROP or TRUNCATE table

 ALTER TABLE or ALTER INDEX DISABLE

 DROP INDEX

 CREATE, UPDATE, or DROP STATISTICS

 CREATE LOGIN or ALTER AUTHORIZATION

 CREATE, ALTER or DROP USER

 CREATE, ALTER or DROP PROCEDURE

 CREATE or DROP VIEW

 SELECT from system views and DMVs

 EXPLAIN and DBCC



Scaling compute nodes


The performance of a SQL Data Warehouse is
determined by the number of DWUs allocated.
The larger the number of DWUs, the better the
performance. There’s a linear relationship
between performance and the number of
allocated DWUs. The design of a SQL Data
Warehouse consists of one control node, many
compute nodes (based on the number of DWUs
allocated) and a set of 60 distributions for
storage. The following table explains how many
compute nodes are allocated and the number of
distributions per node for the selected DWU size:

DWU Number of compute nodes Number of distributions per node

100 1 60

200 2 30

300 3 20

400 4 15

500 5 12

600 6 10

1000 10 6

1200 12 5

1500 15 4

2000 20 3

3000 30 2

6000 60 1

It’s important to note that the total number of distributions remains at 60, irrespective of the DWU size.

To manage compute, there are three main functions:


Pause compute: it’s important to save costs when you are not using SQL Data Warehouse—at weekends,
for example. Pausing compute saves on cost by releasing the allocated memory and CPU resources.
Pausing compute terminates all incoming queries and makes sure the database is rolled back to a
consistent state. You pause compute using Azure portal, PowerShell™ or REST APIs.

Resume compute: when the SQL Data Warehouse is required, you resume compute. Resuming compute
allocates the required CPU and memory based on the allocated DWUs for that instance. It also makes the
data available for end-user queries. You resume compute using Azure portal, PowerShell or REST APIs.

Scale compute: you scale up or down, according to user requirements. To improve performance and
execute queries faster, you scale up the compute power to have more CPU and memory resources
allocated to your instance. Similarly, when you do not require the capacity, you scale down the compute.
It’s vital to remember that either scaling up or down will terminate all incoming queries until the system is
reconfigured accordingly. You scale compute by using Azure portal, PowerShell, REST APIs or T-SQL.
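
As an example of the T-SQL option, the following sketch scales the data warehouse used elsewhere in this
module to 1,000 DWUs by modifying its service objective (the database name is an assumption; you
typically run this while connected to the master database on the logical server):

-- Scale the data warehouse to DW1000.
ALTER DATABASE [SampleAzureSQLDataWarehouse] MODIFY (SERVICE_OBJECTIVE = 'DW1000');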

Rebuilding and optimizing indexes


The most commonly used index in SQL Data
Warehouse is the clustered columnstore index—
this is the default type when there is nothing
specified during table creation. Clustered
columnstore is effective for running queries and
utilizes efficient storage. Clustered columnstore
tables organize data into segments, and segment quality is important for executing queries optimally.
Segment quality is good when each row group contains at least 100,000 rows. To identify suboptimally
performing segments within a SQL Data Warehouse, use the dbo.vColumnstoreDensity view, which
identifies tables that have row groups containing fewer than 100,000 rows.

For details of the dbo.vColumnstoreDensity view, see:

Optimizing clustered columnstore indexes


https://aka.ms/ganu0d

Rebuilding indexes provides an efficient way to improve performance by optimizing row groups. You
should ensure that you rebuild indexes with a user who has a large enough resource class. The
recommended minimum resource classes, based on DWUs, are as follows:

 xlargerc: the recommended minimum for DW300 or less.

 largerc: the recommended minimum for DW400 to DW600.

 mediumrc: the recommended minimum for DW1000 and above.

To rebuild an entire clustered columnstore index, you execute the following statement:

ALTER INDEX ALL ON [dbo].[TableName] REBUILD

To rebuild a single partition, you execute the following statement:

ALTER INDEX ALL ON [dbo].[TableName] REBUILD Partition = 1



Monitoring tools
SQL Data Warehouse comes packed with many
tools and techniques that you use to monitor
running queries, queued queries, and so on.
Dynamic management views (DMVs) provide
much of this functionality with database views
that encapsulate the logic of joining different
logging and auditing tables.

Monitoring connections: every login to the SQL Data Warehouse is logged in sys.dm_pdw_exec_sessions—
this contains the previous 10,000 sessions. Session_id is unique for each session created. The following
query retrieves all of the previous 10,000 sessions apart from the one that is currently running this query:

SELECT * FROM sys.dm_pdw_exec_sessions t
WHERE t.status <> 'Closed' AND t.session_id <> session_id();

Monitoring executing queries: every query to the SQL Data Warehouse is logged in
sys.dm_pdw_exec_requests—this contains the previous 10,000 queries that have been executed.
Request_id is unique for each query executed. The following query retrieves all the previous 10,000
queries, apart from the one that is currently running this query:

SELECT * FROM sys.dm_pdw_exec_requests t
WHERE t.status NOT IN ('Completed', 'Failed', 'Cancelled') AND t.session_id <> session_id()
ORDER BY t.submit_time DESC;

Monitoring queued queries: if a query is waiting for a resource or resources, it will be in the state
“acquire resources”. You will find this information in the DMV sys.dm_pdw_waits.

Monitoring data movement and tempdb usage: queries that join or compare data held in different
nodes will need to move that data around as part of the join or comparison operations. This is expensive.
Additionally, these queries are likely to construct temporary tables on each node to hold the data
retrieved from other nodes. If you are performing a large number of queries, this might cause contention
in tempdb on each node. You query the sys.dm_pdw_nodes_db_session_space_usage,
sys.dm_pdw_nodes_exec_sessions, and sys.dm_pdw_nodes_exec_connections DMVs to monitor how
queries are performing, or use the monitoring feature in the Azure portal, which presents this data in a
more convenient manner.

Using labels in SQL Data Warehouse: when you have many queries running on the SQL Data
Warehouse, it is tiresome to search which query belongs to which project or ETL routine. SQL Data
Warehouse provides a concept named “query labels”—you label each query that you execute with an
optional string OPTION (LABEL = 'Project 1: ETL Routine 1: Step 1').

This makes it much easier to find where a particular long-running query belongs and to pinpoint the
problem location.
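
For example, the following sketch labels a query and then finds it again by filtering on the label column of
sys.dm_pdw_exec_requests (the dbo.FactSales table is an assumption for this example):

-- Run a query with a label attached.
SELECT COUNT(*) FROM dbo.FactSales
OPTION (LABEL = 'Project 1: ETL Routine 1: Step 1');

-- Locate the labeled request in the DMV.
SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Project 1: ETL Routine 1: Step 1';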

Using the Azure portal: use the data warehouse blade in the Azure portal to monitor and analyze the
performance of operations. Click Monitoring, and you will see the volume of recent activity. An important
part of this blade is the DWU Usage chart, which enables you to ascertain whether you are close to the
resource limits currently allocated and indicates that you should consider scaling by adding more DWUs.
You tailor the graph by clicking it to see additional information. You also configure alerts to inform an
operator if resources are close to their limit; the operator then decides whether or not to scale the data
warehouse.

The Monitoring blade also shows the workload imposed in the system by query activity. If you click this
graph, you see the individual queries that have been performed recently. You drill down into the details to
see the query execution plans and use this information to assess whether to rewrite queries, add or
remove indexes, or completely restructure a table by changing its distribution policy.

Best practices for maintaining performance


SQL Data Warehouse needs to be well
maintained to give optimal performance for
executing user queries and providing good
return on investment (ROI). It’s always good to
use the following best practices to maintain SQL
Data Warehouse:

 Whenever you do not use SQL Data Warehouse for a long time—for example, at weekends or
overnight—it’s advisable to pause the warehouse to reduce running costs, and resume the next time
you require it.

 Before you use the pause or scale functionality, it’s important to make sure that there are no major
transactions in progress within the data warehouse, because this might affect the time it takes to
pause or scale. Always complete your transactions before starting the pause or scale procedures.

 Always maintain up-to-date database statistics with the entities of SQL Data Warehouse. An older
version of the statistics will affect performance and queries will take longer to execute. Always update
the statistics after a big data load.

 Always use bulk insert instead of individual insert statements because this will improve load
performance.
 You load data into SQL Data Warehouse using methods like bcp, Azure Data Factory or PolyBase—
it’s best to use PolyBase for very large datasets.

 By default, tables are distributed by using the round robin method. However, you use a hash
distribution over a selected column to ensure that your data is organized by that column instead. This
arrangement helps to make join operations over that column with other tables faster.

 Never overuse the partitions. Remember that there are 60 distributions for each SQL Data Warehouse
instance—creating partitions will create partitions on each individual distribution.

 When queries require more memory to execute, you should always use the higher resource class to
maximize performance.

 To increase concurrency, use the lower resource class.

 Always use the DMVs to monitor and analyze user queries.



Demonstration: Monitoring a query and restructuring a table for performance
In this demonstration, you will see how to:

 Use the Azure portal to monitor a query.

 Assess performance and make recommendations for restructuring a table.

Check Your Knowledge


Question

When you create statistics, what is the default sample percentage size when you do not specify a figure?

Select the correct answer.

10%

20%

30%

40%

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You cannot have more than 1,024 connections to a SQL Data Warehouse at any given time.

Lesson 3
Protecting data in SQL Data Warehouse
Authentication, authorization, encryption and auditing are the key pillars for securing data stored in any
system. SQL Data Warehouse provides many ways in which to secure stored data and protect it from
malicious access.

Lesson Objectives
By the end of this lesson, you should be able to:

 Create users and understand how firewall rules need to be configured to give them access.

 Understand how to authorize users for data access.


 Encrypt data stored in SQL Data Warehouse.

 Enable audit logging and analyze audit logs.

Protecting access to SQL Data Warehouse


SQL Data Warehouse utilizes the robust security
features built into Azure. To allow applications to
access data stored in SQL Data Warehouse, it’s
important to approve the IP addresses that
connect and consume stored data. Azure portal,
REST APIs or PowerShell allow firewall rules to be
added or amended as required. You should make
sure that local computers or network firewalls
allow communication on TCP port 1433. All
communication with SQL Data Warehouse is
encrypted—there are two ways in which users
can authenticate with SQL Data Warehouse.

Authenticate via SQL Server authentication


SQL Data Warehouse supports traditional SQL Server authentication with a specific database username
and password. You use the same username for accessing different databases within that server. However,
for databases within other servers, you need to recreate the same username because the SQL Server
authenticated users are local to the database server.
To create a new database user, it’s good practice to create a user login account within the MASTER
database of the specific SQL Data Warehouse instance. This will give that user login access to all databases
within that instance. You can also customize specific databases that the user login accesses. Use the
following syntax to create AppUserLogin as the new login with the password set as Str0ng_passw0rd:

CREATE LOGIN AppUserLogin WITH PASSWORD = 'Str0ng_passw0rd';

When the user login is created, you create a user based on that login. Use the following syntax to create
an AppUser user for the login you created earlier:

CREATE USER AppUser FOR LOGIN AppUserLogin;



Authenticate via Microsoft Azure Active Directory (Azure AD) authentication


Microsoft Azure Active Directory is a service that provides identity and access management facilities.
Azure AD provides a central location that you use to manage access to various Azure services including
SQL Data Warehouse.

To configure Azure AD to authenticate users for SQL Data Warehouse, use the following steps:

1. Create a new directory using Azure AD, and add users that require access to SQL Data Warehouse.

2. Create an administrator account for the SQL Data Warehouse.


3. In the SQL Data Warehouse, create contained database (DB) users who are mapped to the Azure AD
identities.

4. Connect to SQL Data Warehouse using the Azure AD credentials.

Authorizing users
Authorization centers around what a user can do
within the SQL Data Warehouse, including the
objects they see, what data they see, and so on.
This is managed by using the following roles and
permissions:

Permissions: granular permissions control the operations (SELECT, INSERT, UPDATE, DELETE)
that a user performs at column level, table level,
any object level, and so on. This is the most
granular way to provide access to objects within
SQL Data Warehouse. To do this, use the GRANT
command.

The following code shows how to enable a user named MyUser to run queries (SELECT) statements over
MyTable:

Granting permissions over a table


GRANT SELECT ON dbo.MyTable TO MyUser
GO

You can also use the REVOKE command to remove privileges from a user.
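
For example, the following sketch removes the SELECT permission granted above:

REVOKE SELECT ON dbo.MyTable FROM MyUser
GO
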
Database roles: SQL Data Warehouse provides many built-in database roles, which you use to grant users
entire levels of access. Typical database roles include db_datareader and db_datawriter. A database role
has an associated set of privileges. For example, the db_owner role has complete access to the contents of
a database, whereas the db_datareader role reads data from any table in the database but does not
modify it. You can also create custom, user-defined database roles.
You use the sp_addrole stored procedure to define a new role, and sp_addrolemember to assign a user to
a role. The sp_droprolemember revokes the role privileges from a user.

The following example creates a new role named query_role that encapsulates SELECT privilege over three
tables (MyTable, MyTable2, and MyTable3). The code then grants this role to a user named MyUser2:

Creating and assigning a database role


EXEC sp_addrole 'query_role'
GO

GRANT SELECT ON dbo.MyTable TO query_role
GRANT SELECT ON dbo.MyTable2 TO query_role
GRANT SELECT ON dbo.MyTable3 TO query_role
GO

EXEC sp_addrolemember 'query_role', 'MyUser2'
GO

Stored procedures: stored procedures provide another way to control access to users’ actions in SQL
Data Warehouse. You can hide tables behind the code in a stored procedure. You only need to grant
EXECUTE permission on the stored procedure to a user; the user does not need to have direct access to
any tables used by the stored procedure.
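
For example, assuming a hypothetical stored procedure named dbo.usp_GetSalesByYear, the following
sketch lets MyUser run the procedure without granting direct access to the underlying tables:

GRANT EXECUTE ON dbo.usp_GetSalesByYear TO MyUser
GO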

Encrypting data
SQL Data Warehouse uses Transparent Data
Encryption (TDE) for encrypting and decrypting
data at rest in real time. All data, including
backups and transaction log files, is encrypted.
TDE uses a symmetric key (database encryption
key) to encrypt the whole database, and each
instance of SQL Data Warehouse has a unique
server certificate that protects the database
encryption key. These certificates are rotated
automatically by Microsoft at least every 90 days,
to keep the data safe and secure from malicious
activities. SQL Data Warehouse implements the
AES-256 encryption algorithm to encrypt the data. You use the Azure portal or T-SQL to encrypt data in
SQL Data Warehouse.

Using the Azure portal


Use the following steps to enable encryption by using the Azure portal:

1. Open the selected SQL Data Warehouse instance in the Azure portal.

2. Under database, click the Settings button.

3. Select the Transparent Data Encryption option.

4. Select the On setting.

5. Click Save.

Your SQL Data Warehouse will be encrypted.

To disable encryption using the Azure portal, follow the same steps but select the Off setting.

Using T-SQL
To enable encryption using T-SQL, the user should be an administrator or have a “dbmanager” role in the
database. You should connect to the master database within the SQL Data Warehouse instance and use
the following T-SQL syntax to enable encryption:

ALTER DATABASE [SampleAzureSQLDataWarehouse] SET ENCRYPTION ON;

To disable encryption using T-SQL, the user should be an administrator or have a “dbmanager” role in the
database. You should connect to the master database within the SQL Data Warehouse instance and use
the following T-SQL syntax to disable encryption:

ALTER DATABASE [SampleAzureSQLDataWarehouse] SET ENCRYPTION OFF;

Auditing SQL Data Warehouse operations


SQL Data Warehouse records events for auditing
purposes. Audit logs require access to an Azure
Storage account to store this information. When
you set up auditing in the Azure portal, you need
to enable auditing, and then provide the Azure
Storage account where information is saved.
Power BI includes reports and dashboards that
retrieve and display this information efficiently
for reporting purposes.

Threat detection can also be enabled in SQL Data Warehouse to capture threats like SQL injection,
anomalous logins, and so on. Use the Azure portal to enable threat detection and write the results to a
Blob storage account—email alerts are then sent to specified users when such threats are detected.

It’s important to enable auditing and threat detection so that you can monitor who is using the database,
in addition to identifying potential threats that might cause a financial loss or a major disruption to
services.

Check Your Knowledge


Question

What is the encryption algorithm used in SQL Data Warehouse?

Select the correct answer.

AES-64

AES-128

AES-256

AES-512

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? SQL Data Warehouse encrypts all its data at rest.

Lab: Performing analytics with SQL Data Warehouse


Scenario
You work for Adatum as a data engineer, and you have been asked to build a traffic surveillance system
for traffic police. This system must be able to analyze significant amounts of dynamically streamed data,
captured from speed cameras and automatic number plate recognition (ANPR) devices, and then
crosscheck the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed roadside cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that reads vehicle registration
plates.

For this phase of the project, you are going to use data output from Azure Stream Analytics into SQL Data
Warehouse to produce visually-rich reports from the stored data, as a first step in the identification of any
patterns and trends in vehicle speeds at each camera location. You are then going to use Machine
Learning against the data in SQL Data Warehouse to try to predict speeds at a given camera location at a
given time; this information could be used to deploy patrol cars to potential hotspots ahead of time. Next,
you want to be able to query the data in SQL Data Warehouse to find information such as the registration
numbers of all cars that have never been caught speeding. Because you will be performing such queries at
regular intervals, and the datasets are very large, it’s important that the data is properly structured to
optimize the query processes. Finally, because a lot of the traffic data stored in SQL Data Warehouse
includes personal and confidential details, it’s essential that the databases are protected from both
accidental and malicious threats. You will, therefore, configure SQL Data Warehouse auditing and look at
how to protect against various threats.

Objectives
After completing this lab, you will be able to:

 Visualize data stored in SQL Data Warehouse.


 Use Machine Learning with SQL Data Warehouse.

 Assess SQL Data Warehouse query performance and optimize database configuration.

 Configure SQL Data Warehouse auditing and analyze threats.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.

Lab Setup
Estimated Time: 90 minutes
Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

This lab uses the following resources from Lab 7 and earlier:

 Resource group: CamerasRG

 Azure SQL Server: trafficserver<your name><date>

 Azure SQL Data Warehouse: trafficwarehouse

 Machine Learning workspace: Traffic



Exercise 1: Visualize data stored in SQL Data Warehouse


Scenario
You want to be able to identify any patterns and trends in vehicle speeds at each camera location. The
SQL Data Warehouse holds data captured from speed cameras by using Stream Analytics, and you are
going to use this data to produce visualizations in Power BI, as a first step in this analysis.

Note: In this exercise, you will upload the speed data from a CSV file rather than streaming it from the
cameras; the CSV file contains data for a longer period of time than is feasible by using streaming.

The main tasks for this exercise are as follows:

1. Install AzCopy

2. Prepare the environment

3. Upload data to Blob storage

4. Use PolyBase to transfer blob data to SQL Data Warehouse

5. Use Power BI to visualize the data

 Task 1: Install AzCopy

 Task 2: Prepare the environment

 Task 3: Upload data to Blob storage

 Task 4: Use PolyBase to transfer blob data to SQL Data Warehouse

 Task 5: Use Power BI to visualize the data

Results: At the end of this exercise, you will have uploaded data to Blob storage, and then used PolyBase
to transfer this blob data to SQL Data Warehouse. You will then use Power BI to visualize this data, and
look for patterns in the data.

Exercise 2: Use Machine Learning with SQL Data Warehouse


Scenario
You want to preposition patrol cars at potential traffic hotspots. You will use the speed camera data held
in SQL Data Warehouse, together with Machine Learning, to create a model that you use to predict traffic
speeds for each camera for a given time of day. The model uses Decision Forest Regression. This form of
regression runs relatively quickly over the volume of data used by this lab. Neural Network Regression
might generate a slightly more accurate set of predictions, but might take a considerable amount of time
to run. Additionally, remember that the model generates predictions, not cast-iron guarantees of
behavior.

The main tasks for this exercise are as follows:

1. Create experiment

2. Create trained model

3. Deploy the Machine Learning web service

4. Generate predictions by using the Machine Learning web service in an application



 Task 1: Create experiment

 Task 2: Create trained model

 Task 3: Deploy the Machine Learning web service

 Task 4: Generate predictions by using the Machine Learning web service in an application

Results: At the end of this exercise, you will have created a trained model, deployed this model as a web
service, and then used this service in an application to generate traffic speed predictions for particular
camera locations, at particular times of day, and for particular days of the week.

Exercise 3: Assess SQL Data Warehouse query performance and optimize database configuration
Scenario
You want to find the registration numbers of all cars that have never been caught speeding. One way to
do this is to perform a query that finds the registration of every vehicle, and then remove the registration
numbers of vehicles that have been caught speeding from this set. You will be performing this query at
regular intervals, so it’s important that the data is structured to optimize the processes performed to find
the required information. The details of all vehicle registrations are held in the VehicleOwner table in the
data warehouse; the vehicle speed information is stored in the VehicleSpeed table (from the previous
exercises).

The main tasks for this exercise are as follows:

1. Assess performance of a baseline query


2. Assess query performance when using replicated tables

3. Assess query performance when distributing data by vehicle number

4. Assess query performance when using columnstore


5. Assess query performance when distributing linked data to the same node

 Task 1: Assess performance of a baseline query

 Task 2: Assess query performance when using replicated tables

 Task 3: Assess query performance when distributing data by vehicle number

 Task 4: Assess query performance when using columnstore

 Task 5: Assess query performance when distributing linked data to the same node

Results: At the end of this exercise, you will have run a series of queries against SQL Data Warehouse,
assessed how each query is processed, and reconfigured the data structure several times to see the impact
on the query process.

Exercise 4: Configure SQL Data Warehouse auditing and analyze threats


Scenario
You want to ensure that the traffic data stored in SQL Data Warehouse, which includes personal and
confidential details, is properly protected from both accidental and malicious threats; you will, therefore,
configure SQL Data Warehouse auditing and look at how to protect against various threats.

The main tasks for this exercise are as follows:

1. Enable auditing and threat detection

2. Generate audit and threat test data

3. View audit logs and alerts

4. Monitor login failures

5. Lab cleanup

 Task 1: Enable auditing and threat detection

 Task 2: Generate audit and threat test data

 Task 3: View audit logs and alerts

 Task 4: Monitor login failures

 Task 5: Lab cleanup

Results: At the end of this exercise, you will have enabled auditing, and used a sample application that
includes a SQL injection vulnerability to attempt to attack the data warehouse. You will also have
examined the audit logs, including identifying login failures.

Question: If you have an example big data application in your own organization, can you list
any strategies that are currently used to minimize data movement?

Question: Based on your own experience, and your organization’s data structures, can you
think of scenarios where you would not enable auditing and/or threat detection?

Module Review and Takeaways


In this module, you learned how to:

 Use the features of Transact-SQL that are available for SQL Data Warehouse.

 Configure and monitor a SQL Data Warehouse to maintain optimal performance.


 Protect data and manage security in a SQL Data Warehouse.

Module 9
Automating Data Flow with Azure Data Factory
Contents:
Module Overview 9-1
Lesson 1: Introduction to Data Factory 9-2

Lesson 2: Transferring data 9-11

Lesson 3: Transforming data 9-26

Lesson 4: Monitoring performance and protecting data 9-33

Lab: Automating the Data Flow with Azure Data Factory 9-38

Module Review and Takeaways 9-44

Module Overview
Microsoft® Azure® Data Factory is a data orchestration service that ties together many different
compute and storage services to help you build powerful data pipelines. This orchestration service builds
on top of individual data analytics and storage components by integrating each into an overall data flow
application. Data Factory integrates with many different storage services and orchestrates transformations
on many compute services, including Azure Batch, Azure Data Lake Analytics, HDInsight® (Hadoop), and
Azure Machine Learning.

Objectives
By the end of this module, you will be able to:

 Describe the purpose of Data Factory and explain how it works.

 Describe how to create Data Factory pipelines that transfer data efficiently.

 Describe how to perform transformations using a Data Factory pipeline.

 Describe how to monitor Data Factory pipelines and how to protect the data flowing through these
pipelines.

Lesson 1
Introduction to Data Factory
Enterprise data continues to grow with regards to volume, velocity, and variety. It’s crucial for enterprises
to be able to integrate these disparate data sources and data types easily into one cohesive data driven
application. Data Factory is an orchestration tool that provides the integration between each of these
systems to build a set of robust and cohesive data integration units called pipelines. You use Data Factory
to automate and schedule all data movement and transformation activities, all from a single set of
interfaces, where you monitor and manage the pipelines as your data is flowing through the application.

Lesson Objectives
By the end of this lesson, you should be able to:

 Understand what Data Factory is.

 Understand what Data Factory workflows, datasets and linked services are.
 Describe scheduling pipelines, controlling processing windows, and data slices.

Overview of Data Factory


Data Factory is a data orchestration and
integration service you use to schedule and
automate extract, transform and load (ETL)
workflows. Data Factory connects to many types
of databases and other data stores, including
multiple cloud platforms and on-premises data.
You build workflow units known as pipelines to
move data from a source dataset to a sink
dataset using a copy activity. Alternatively, you
might transform the data using a transformation
activity coupled with a compute service, like Data
Lake Analytics, Machine Learning or HDInsight.

Data Factory is automated by scheduling the execution of each copy or transformation activity in a
pipeline. The schedule is broken down into activity units known as time slices. These time slices provide a
recurring schedule for an activity to run. For example, you might schedule an activity to copy data from
an on-premises database to Azure Data Lake Store to run every hour.

Understanding Data Factory workflows


Data Factory is made up of five key concepts:

 Linked services

 Datasets
 Activities

 Pipelines

 Data Management Gateway

Linked services. Linked services are objects that are used to define connection information for Data
Factory to connect to storage or compute services. These linked services will act as connection strings for
Data Factory to use to connect to a particular dataset or to submit a transformation activity to a compute
resource.

Datasets. Datasets are used to define the location and structure of a particular set of data. Data Factory
uses connection information from linked services to connect to a data source or data sink, and uses
information in the dataset definition to find the exact location of the data to be copied. Datasets are the
link between activities and are used to define dependencies in data availability for each activity. The
frequency of the time slice is also defined in the output dataset for each activity.
Activities. Activities define the processing that is performed on the data and require at least one input
and one output dataset. There are two types of activities—copy activities and transformation activities.
Copy activities are used to copy data from the location defined in the input dataset to the location
defined in the output dataset. Transformation activities are used to submit a script file or trigger a job to a
compute resource using the connection information in the compute linked service.

Pipelines. Pipelines, which can have one or more activities, are used to group similar activities into an
overall task. Grouping these activities into a set helps to simplify the data factory by managing the set at a
pipeline level instead of managing each task individually.

Data Management Gateway. The Data Management Gateway (DMG) is used to connect Data Factory
with on-premises data sources. You install DMG on a machine in your network so that it has access to
the on-premises datasets, and you connect DMG to your data factory using a uniquely generated key.
Each key concept of Data Factory is deployed by creating JSON files that define the object’s properties.
When deploying a data factory, linked services are deployed first, then datasets, then pipelines with one
or many activities defined. This order is required because of the dependency between the objects. For
example, you can’t create an Azure Blob dataset without having an Azure Blob linked service with the
connection information for the storage account. Similarly, you can’t create a pipeline with an activity to
copy data from Azure Blob to Data Lake Store before the input and output datasets are created.

Defining datasets and linked services


Storage linked services are defined with
connection information to data stores that will be
used as sources and sinks for copy activities. After
the linked service is created, you create a dataset
that will use that connection information to point
to a specific dataset. For example, you can create
a Microsoft Azure Blob storage linked service
called WasbLinkedService and use the following
JSON to create a dataset that uses that linked
service:

{
    "name": "WasbInput",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": "WasbLinkedService",
        "typeProperties": {
            "fileName": "input.txt",
            "folderPath": "data/raw/",
            "format": {
                "type": "TextFormat",
                "columnDelimiter": ","
            }
        },
        "availability": {
            "frequency": "Hour",
            "interval": 1
        },
        "external": true,
        "policy": {}
    }
}

Many storage linked service types are available, including on-premises and cloud stores. However, some
data stores are only used as a source for a copy activity and some are only used as a sink. For example,
Data Lake Store datasets are used as a source and sink, MongoDB datasets are only used as a source and
Azure Search Index datasets are only used as a sink.

To view the available source and sink dataset types, see:

Supported data stores and formats


https://aka.ms/S7859c

Structure
You use the structure of a Data Factory dataset to define the schema of the dataset. The structure is a
combination of field names and data types for the dataset. The dataset structure is optional, but it’s
necessary in two scenarios: mapping columns of the source dataset to columns in the sink dataset, and
converting data types of the source dataset to native data types of the sink dataset.

To define the structure of a dataset, add the structure property to the properties section of the dataset
definition in the JSON file. For example, the following structure has five columns: guid of type guid,
storeId of type String, productId of type int32, quantity of type int32, and price of type Decimal.

structure:
[
{ "name": "guid", "type": "Guid"},
{ "name": "storeId", "type": "String"},
{ "name": "productId", "type": "int32"},
{ "name": "quantity", "type": "int32"},
{ "name": "price", "type": "Decimal"}
]

Data Factory supports the following data types for use in defining dataset structure:

Int16 Decimal Datetime

Int32 Byte[] Datetimeoffset

Int64 Bool Timespan

Single String

Double GUID

External datasets
External datasets in Data Factory are any datasets that are not included as the output of a data factory
activity. This means that the dataset is generated by an external source, and this dataset will be used as an
input dataset to the first activity in a pipeline. To mark a dataset as external, you set the external tag to
true in the properties section of the dataset JSON file.

Scoped datasets
You set Data Factory pipelines to run in one of two modes, either Scheduled or OneTime. Pipelines that
are set to OneTime mode are defined with scoped datasets, or datasets that are only scoped to that
specific pipeline. Therefore, these datasets can only be accessed by activities that are defined in the same
pipeline as the scoped datasets.

Scheduling pipelines, controlling processing windows and data slices


Data Factory pipelines and activities are
automated by defining a schedule that is a
combination of frequency, or how often the
activity should run, and pipeline availability, or
start and end times that define a window of
availability for the pipeline. These properties
combine to create tumbling windows called time
slices that define when an activity will run.

Pipeline start and end times


Start and end time properties are defined in each
pipeline JSON file. This creates an availability
window between the start property and the end
property, meaning that no time slices will be created before the start time or after the end time. These
properties work with the scheduler property of an activity to determine which time slices you should
create for the activity.

Activity schedule
The scheduler property of an activity defines the duration of the time slices that you will create and the
frequency for the time slices that will be created. For example, if you create a copy activity with the
scheduler frequency set to Hour and the interval set to 4, your activity will run every four hours.

The activity scheduler property is optional. However, when you define the property, it must match the
availability property of the output dataset (the dataset availability property has the same frequency and
interval properties).

Dataset availability and policies


The dataset availability property is similar to the activity scheduler property. The frequency and interval
are also defined here. The available frequencies are Minute, Hour, Day, Week, and Month. If you set the
frequency to Minute, the lowest value allowed for the interval is 15. Therefore, the smallest possible data
factory time slice is 15 minutes.

The schedule of each activity is defined by the output dataset. For this reason, the schedule for the activity
must match the schedule of the output dataset. For example, you might define a Hive activity with a
frequency of Hour and an interval of 6. The output dataset must also have a frequency of Hour and an
interval of 6 to create the hour-long time slices every six hours.

The following table shows the possible dataset availability properties. For more examples of these
properties, see:
Dataset availability
https://aka.ms/Vqhvfv

Name            Description

frequency       Time unit of the data slice. Options: Minute, Hour, Day, Week, Month.

interval        How often a time slice is produced. For example, with a frequency of Day and an
                interval of 5, an all-day time slice is created every five days.

style           States when the processing of a time slice should begin. Options: StartOfInterval,
                EndOfInterval.

anchorDateTime  States the absolute datetime position used by the activity and dataset schedulers
                to create time slice boundaries.

offset          Defines the interval by which the start and end of each time slice are shifted.



You set dataset validation policy properties to perform simple validation on the data before you begin an
activity. If the validation is successful, the slice status is changed to Ready. If the validation is unsuccessful,
the slice status is changed to Failed. The following table shows the possible dataset policy properties:

Name            Description

minimumSizeMB   Validates whether the size of the blob dataset meets the specified threshold.

minimumRows     Validates whether the number of rows in the Azure SQL or Azure Table dataset
                meets the specified threshold.
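For example, a dataset whose hourly slices are only marked Ready when they contain at least one row
might declare the following availability and policy sections (a minimal sketch of the relevant fragments):

"availability": {
    "frequency": "Hour",
    "interval": 1
},
"policy": {
    "validation": {
        "minimumRows": 1
    }
}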

Combining pipeline availability with the activity and dataset schedule

To create the correct activity schedule that has the desired time slices, you must bring the pipeline start
and end times together with the activity and output dataset schedule.

For example, you want to run a copy activity from Azure Blob to Data Lake Store once daily for the entire
month of July 2017.

To set up the correct activity schedule:


1. In the input and output dataset JSON files, create the availability property with a frequency of Day
and an interval of 1.

2. In the activity definition, in the pipeline JSON file, create the scheduler property with a frequency of
Day and an interval of 1 to reflect the schedule of the output dataset.
3. In the pipeline JSON file, set the start property to 2017-07-01T00:00:00Z and the end property to
2017-08-01T00:00:00Z.

This procedure will create daily time slices for each of the 31 days in the pipeline availability interval.
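A minimal sketch of the relevant JSON fragments for this example is shown below; only the scheduling-related
properties are included, and the surrounding pipeline, activity, and dataset definitions are assumed to exist.

In the pipeline JSON file:

"start": "2017-07-01T00:00:00Z",
"end": "2017-08-01T00:00:00Z"

In the copy activity definition:

"scheduler": { "frequency": "Day", "interval": 1 }

In both the input and output dataset JSON files:

"availability": { "frequency": "Day", "interval": 1 }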

Datasets with different frequencies


Most of the time, the schedules for the input dataset, the output dataset, and the activity will be the same.
However, some use cases require an activity to process data on a different schedule from its input dataset.
You can create an activity that runs at a different frequency from its input dataset only when the input
dataset is produced more frequently than the activity and the output dataset.

For example, you are copying reports from Blob storage to Data Lake Store once daily. However, you want
to build weekly aggregate reports of these daily reports. You might create an activity and output dataset
with a weekly frequency to process the aggregates, but the input dataset would have a daily frequency.

In the preceding example, the input is processed on a daily schedule. Because the activity frequency is
weekly, the activity would wait for seven input time slices to complete successfully before running the
single weekly time slice for all seven days of input data.

Performing activities in parallel

Parallel activities in a pipeline


Data Factory pipelines are built as a set of one or
more activities. Generally, the activities grouped
in a single pipeline will be activities that perform
a similar function on different datasets. For
example, you have four reference data tables you
need to copy from SQL Database to Data Lake
Store. You create one pipeline called
CopyReferenceData that has four different copy
activities pointing to four different reference
tables. Because these activities are grouped in
the same pipeline but are not dependent on
each other, they can be run in parallel.

You create multiple activities in a pipeline that will run in parallel if the input of one activity is not
dependent on the output of another activity within the same pipeline.

Parallel processing of time slices


Data Factory processing is based on the pipeline availability and the schedule of the output dataset. You
can set the pipeline availability to begin and end in the past. When you set the availability in the past,
Data Factory automatically creates a time slice for each slice in the available interval and begins to process
the slices—because each of the start dates is in the past.
For example, you have a copy activity with a frequency set to daily. You set the pipeline availability from
2017-06-01T00:00:00Z to 2017-07-01T00:00:00Z and the current date is 2017-06-15. Data Factory will
create time slices for each day in June and will immediately start to process slices for the dates between
2017-06-01 and 2017-06-14 because those slices occurred in the past. However, the slice for 2017-06-15
will not be processed yet because the end of the interval has not occurred.

You set time slices to run in parallel by setting the concurrency property in the policy section of the
activity in the pipeline JSON file. The concurrency setting states how many time slices are processed
simultaneously. The default value is 1 and the maximum value is 10. Running past time slices in parallel
speeds up the processing of the activity.
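For example, the policy section of a copy activity that processes up to five time slices simultaneously might
look like the following sketch (the retry and timeout values are illustrative):

"policy": {
    "concurrency": 5,
    "executionPriorityOrder": "OldestFirst",
    "retry": 2,
    "timeout": "01:00:00"
}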

Using a Data Management Gateway


Having access to on-premises data sources is key
in any scenario that involves enterprise data
integration in the cloud. Data Factory connects to
on-premises data sources via the Data
Management Gateway (DMG) client. The DMG
client is installed on-premises to give access to
data sources inside the corporate network while
connecting with Data Factory via outbound-only
communication. The DMG provides Data Factory
with the access needed to copy data between
data stored on-premises and in the cloud.

Prerequisites to install the Data Management Gateway


 Supported operating systems: Windows 7, Windows 8, Windows 8.1, Windows 10, Windows Server
2008 R2, Windows Server 2012, or Windows Server 2012 R2.

 .NET Framework 4.5.1 or above.

 The recommended gateway machine configuration is at least a 2 GHz processor, 4 cores, 8 GB physical
memory, and 80 GB disk.

Installation options
There are two options to install the DMG. You might install directly from the Microsoft Download Center
and manually register with your data factory. The alternative is to follow the EXPRESS SETUP directly in the
Data Factory authoring tool in the Azure portal.

To manually install and register the DMG:

1. Browse to the Data Management Gateway download page in the Microsoft Download Center:
https://aka.ms/G4rtki

2. Click the Download button and select either the 32-bit or 64-bit version, in addition to the optional
release notes.

3. Wait for the installer to finish downloading then click Run.

4. Select a language on the Welcome page then click Next.


To install using the EXPRESS SETUP from the Azure portal:

1. Browse to the Data Factory blade in the Azure portal.

2. Under the Actions section, click Author and deploy.


3. Click New data gateway, and then click to Create a new data gateway.

4. Provide a name for the data gateway, and then click OK.

5. Under Express Setup, click Install directly on this computer.


6. The data gateway will automatically install and register.

Demonstration: Creating and running an Azure Data Factory pipeline


In this demonstration, you will see how to:

 Prepare a local database for use with Data Factory.

 Create a new Data Factory.

 Create a Data Management Gateway.

 Create Data Factory linked services for source and sink data stores.

 Create Data Factory datasets to represent input and output data.

 Create a Data Factory pipeline with a copy activity to move the data.

 Verify the Data Factory pipeline.

Question: How might you use Data Factory to extract, transform, and load data into and out
of Azure within your organization?

Check Your Knowledge


Question

Which one of the following is not part of Data Factory?

Select the correct answer.

Pipelines

Activities

Datasets

Linked services

Stream Analytics

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You can’t have time slices in the past when scheduling a Data Factory pipeline.

Lesson 2
Transferring data
One of the strengths of using Data Factory to orchestrate your data workflows is that you utilize copy
activities to collocate all data in one place. You use copy activities to connect to and copy data from on-
premises and cloud data sources. The data is then integrated and transformed with transformation
activities to produce the data products that you need to publish. Usually, Data Factory copy activities are
used at the beginning of a data workflow to retrieve input data that is later transformed and at the end of
a data workflow to publish data products.

Lesson Objectives
By the end of this lesson, you should be able to:

 Understand how to copy data to and from different data sources using Data Factory.

 Explain how to parallelize a copy activity.


 Understand how to use Data Factory Copy Wizard.

Using a copy activity


A Data Factory copy activity connects to data by using information from a single input dataset, known as
the source, and uses that data to produce data in the location and form described by the output dataset,
known as the sink.

The copy activity involves the following steps:

1. Read source data from the input data store.

2. Perform column mapping, type conversion,


compression or decompression, serialization,
and deserialization.

3. Write output data to the sink data store.


The copy activity operates based on configurations that are defined in the input dataset, copy activity and
output dataset JSON files.

JSON example for a copy activity

The following is a sample JSON file that defines a copy activity from the Blob store to the Data Lake Store:

{
"name": "ExampleCopyPipeline",
"properties": {
"description": "Copy data from Azure blob to Azure Data Lake Store",
"activities": [
{
"name": "CopyFromBlobToAdls",
"type": "Copy",
"inputs": [
{
"name": "BlobInputDataset"
}
],
"outputs": [
{
"name": "AdlsOutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"executionLocation": "East US 2"
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-07-12T00:00:00Z",
"end": "2017-07-13T00:00:00Z"
}
}

Supported file formats


Copy activities support reading from and writing to the following file formats:
 Text format

 JSON format

 AVRO format
 ORC format

 Parquet format
You can define different properties for each of these file formats that define how the Data Factory copy
activity interacts with the dataset.

Text format
Datasets defined as TextFormat refer to simple text files, like .txt, .csv, and .tsv files. You define the
following properties in the format section of the dataset to define how the text files are read or written.

Property name Property description

columnDelimiter Defines the character used as the column separator in the file. Default
value is “,”.

rowDelimiter Defines the character used as the row separator in the file. Default value
is “\r”, “\n”, or “\r\n” on read and “\r\n” on write.

escapeChar Defines the character used to escape the column delimiter in the input
file.

quoteChar Defines the character used to quote string values in both input and
output datasets. The quoteChar property can’t be used with the
escapeChar property for a dataset.

nullValue Defines the character(s) used to represent a null value in the input and
output datasets. Default value is “\N” or “NULL” on read and “\N” on
write.

encodingName Defines the encoding of the text file. Default value is UTF-8.

firstRowAsHeader Defines whether Data Factory should treat the first line of a dataset as the
header row. Data Factory reads the first row as a header for input
datasets and writes the header as the first row for output datasets.
Default is false.

skipLineCount Defines the number of rows that Data Factory should skip when reading
in the dataset. If this property is used with the firstRowAsHeader
property, the number of lines provided is skipped, and the next line is
treated as the header.

treatEmptyAsNull Defines whether Data Factory should treat empty string values as null.
Default is true.
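For example, the format section of a dataset that reads semicolon-delimited text files with a header row
might look like the following sketch:

"format": {
    "type": "TextFormat",
    "columnDelimiter": ";",
    "firstRowAsHeader": true,
    "nullValue": "NULL",
    "encodingName": "UTF-8"
}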

JSON format

Data Factory parses or writes JSON files based on the configuration properties listed in the format section
of the dataset JSON file.

Property name Property description

filePattern Defines the pattern of the data stored in the JSON file. Optional values
are setOfObjects or arrayOfObjects. The default value is setOfObjects.

jsonNodeReference Defines the JSON path of an array to iterate and extract from when you
copy data from JSON files. To use fields under the root object of the
JSON file, begin the expression with $ as root.

jsonPathDefinition Defines the JSON path expression to use with the column mapping
feature in a Data Factory copy activity.

encodingName Defines the encoding of the text file. Default value is UTF-8.

nestingSeparator Defines the character that is used to separate the nesting levels of the
JSON document. Default value is ‘.’.

AVRO format

To use AVRO format in Data Factory, set the format type property to AvroFormat. There are no other
type properties to define for the AVRO format. Complex data types (unions, maps, arrays, enums, records,
and fixed) are not supported with the AVRO format.

ORC format
To use ORC format in Data Factory, set the format type property to OrcFormat. There are no other type
properties to define for the ORC format. Complex data types (union, list, map, struct) are not supported
with the ORC format. You use Data Factory to read uncompressed ORC files or ORC files compressed with
zlib or snappy. However, Data Factory only writes ORC files using the default compression zlib.

Parquet format
To use Parquet format in Data Factory, set the format type property to ParquetFormat. There are no
other type properties to define for the Parquet format. Complex data types (list, map) are not supported
with the Parquet format. You use Data Factory to read uncompressed Parquet files or Parquet files that
are compressed with lzo, gzip, or snappy. However, Data Factory only writes Parquet files using snappy,
which is the default compression.

Compression options
Compressing large datasets helps with performance while copying and transforming data. You use the
dataset compression property to read from and write to compressed files while performing a copy activity.

The compression section of a dataset has two properties you can set:

 Type—the type of compression to be used in the copy activity. Available compression types are
ZipDeflate, BZip2, Deflate, and GZip.

 Level—the level of compression. Available levels are Optimal or Fastest.
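For example, a dataset that reads GZip-compressed text files might declare the following format and
compression sections (a minimal sketch):

"typeProperties": {
    "folderPath": "inputcontainer/compressed/",
    "format": { "type": "TextFormat", "columnDelimiter": "," },
    "compression": {
        "type": "GZip",
        "level": "Optimal"
    }
}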

Copying data to and from Blob storage


You use Blob storage datasets as a source or a
sink for a Data Factory copy activity. For example,
you create a copy pipeline that copies data from
an on-premises MySQL table to Blob storage—in
this case, you set the sink type for the copy
activity as BlobSink. You then use the output of
that copy activity as the input for a copy activity
that copies the data from Blob storage to SQL
Database—in this case, you set the source type
for the copy activity as BlobSource. In this
scenario, Blob storage is used as both a source
and a sink for two different copy activities.

You use Blob storage as a source or sink with supported source and sink data stores in Data Factory.

For more information on supported source and sink data stores in Data Factory, see:
Supported data stores and formats
https://aka.ms/Hm2orx

It’s important to note the following:

 Copy Activity supports reading data from block, append or page blobs only, and supports writing
data to block blobs only.
 Azure Premium Storage is currently not supported as a sink data store because it’s backed by page
blobs.
 Copy Activity doesn’t delete source data after it is copied from the source data store to the sink data
store.

The following summarizes the key properties required when you use Blob storage as a source or sink
data store:

 BlobSource (source) — recursive. Specifies whether the data is read from underlying subfolders or only
from the specific folder mentioned. Valid values: True (default), False. Mandatory: No.

 BlobSink (sink) — copyBehavior. Specifies the copy behavior when the source data store is BlobSource
or FileSystem. Valid values: PreserveHierarchy (preserves the folder hierarchy as in the source),
FlattenHierarchy (copies all source files into a single level of the target folder), MergeFiles (merges all
source files into a single file or blob). Mandatory: No.
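For example, the typeProperties section of a copy activity that reads recursively from Blob storage and
flattens the folder hierarchy at a Blob sink might look like the following sketch:

"typeProperties": {
    "source": {
        "type": "BlobSource",
        "recursive": true
    },
    "sink": {
        "type": "BlobSink",
        "copyBehavior": "FlattenHierarchy"
    }
}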

For more information, see:

Copy Activity properties


https://aka.ms/Fkxsgd

Copying data to and from Data Lake storage


Data Factory supports the copying of data to and
from Data Lake storage, similar to Blob storage.
You use Data Lake Store as a source or sink data
store with the supported source or sink data
stores for Data Factory.

For more information, see:

Move data by using Copy Activity


https://aka.ms/Wbxpwb

Data Lake Store supports two types of authentication:

 Service Principal Authentication. This is the recommended way to authenticate for executing the
scheduled copying of data.

 User Credential Authentication (OAuth). You need to be aware that tokens might expire when you
use User Credential Authentication.

The following describes the key JSON properties for linked services, authentication, datasets, and copy
activities:

Linked service properties:

 type. Defines the type of the linked service. This should be set as AzureDataLakeStore. Mandatory: Yes.

 dataLakeStoreUri. Gives information about the Data Lake Store account. Should be in one of the
following formats: https://[accountname].azuredatalakestore.net/webhdfs/v1 or
adl://[accountname].azuredatalakestore.net/.

 subscriptionId. Azure subscription ID where the Data Lake Store account resides. Mandatory for a sink
data store.

 resourceGroupName. Name of the resource group where the Data Lake Store account resides.
Mandatory for a sink data store.

Service Principal Authentication properties:

 servicePrincipalId. Specifies the client ID. Mandatory: Yes.

 servicePrincipalKey. Specifies the application key. Mandatory: Yes.

 tenant. Specifies the tenant information. Mandatory: Yes.

User Credential Authentication properties:

 authorization. Authorization URL that is auto generated when a user clicks the Authorize button in the
Data Factory editor. Mandatory: Yes.

 sessionId. Unique session ID that is auto generated and populated by the Data Factory editor.
Mandatory: Yes.

Dataset properties:

 folderPath. Path to the Data Lake Store container and folder. Mandatory: Yes.

 fileName. Name of the file in Data Lake Store. Mandatory: No.

 partitionedBy. Specifies a dynamic path—for example, if data from each hour or day is placed in
separate folders. Mandatory: No.

 format. Format of the dataset. Valid values are TextFormat, JsonFormat, AvroFormat, OrcFormat, and
ParquetFormat. Mandatory: No.

 compression. Specifies the type and level of data compression. Valid values for the compression type
are GZip, Deflate, BZip2, and ZipDeflate; valid values for the level of compression are Optimal and
Fastest. Mandatory: No.

Copy activity properties:

 AzureDataLakeStoreSource — recursive. Specifies whether the data is read from underlying subfolders
or only from the specific folder mentioned. Valid values: True (default), False. Mandatory: No.

 AzureDataLakeStoreSink — copyBehavior. Specifies the copy behavior when the source data store is
BlobSource or FileSystem. Valid values: PreserveHierarchy (preserves the folder hierarchy, as in the
source), FlattenHierarchy (copies all source files into a single level of the target folder), MergeFiles
(merges all source files into a single file or blob). Mandatory: No.
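For example, a Data Lake Store linked service that uses Service Principal Authentication might be defined
as follows (a minimal sketch; the account, credential, subscription, and resource group values are
placeholders):

{
    "name": "AzureDataLakeStoreLinkedService",
    "properties": {
        "type": "AzureDataLakeStore",
        "typeProperties": {
            "dataLakeStoreUri": "https://<accountname>.azuredatalakestore.net/webhdfs/v1",
            "servicePrincipalId": "<application client ID>",
            "servicePrincipalKey": "<application key>",
            "tenant": "<tenant ID>",
            "subscriptionId": "<subscription of the Data Lake Store account>",
            "resourceGroupName": "<resource group of the Data Lake Store account>"
        }
    }
}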

Copying data to and from Azure SQL Data Warehouse


Data Factory supports copying data to and from
Azure SQL Data Warehouse, similar to Blob
storage or Data Lake storage. You use SQL Data
Warehouse as a source or sink data store with the
supported source or sink data stores for Data
Factory.

For more information, see:

Supported data stores and formats


https://aka.ms/T1ac3w

The SQL Data Warehouse connector for Data Factory only supports basic authentication.

The following describes the key JSON properties for linked services, datasets, and copy activities:

Linked service properties:

 type. Defines the type of the linked service. This should be set as AzureSqlDW. Mandatory: Yes.

 connectionString. Specifies the connection information for connecting to SQL Data Warehouse.
Mandatory: Yes.

Dataset properties:

 tableName. Name of the table or view within SQL Data Warehouse. Mandatory: Yes.

SqlDWSource properties:

 sqlReaderQuery. Specifies the SQL query used to read data from SQL Data Warehouse—for example,
select * from SourceTable. Mandatory: No.

 sqlReaderStoredProcedureName. Specifies the name of the stored procedure that reads data from the
source table. Mandatory: No.

 storedProcedureParameters. Specifies the parameters required for the stored procedure, as name and
value pairs. Mandatory: No.

SqlDWSink properties:

 sqlWriterCleanupScript. Specifies a SQL query statement to be executed—typically used to clean up
data for a specific slice. Mandatory: No.

 allowPolyBase. Specifies whether PolyBase should be used instead of BULKINSERT. Valid values: True,
False (default). Mandatory: No.

 polyBaseSettings. PolyBase-specific parameters to use when PolyBase is allowed. Mandatory: No.

 rejectValue. Specifies the number or percentage of records that might be rejected before the SQL
query fails. Valid values: 0, 1, 2, and so on; the default value is 0. Mandatory: No.

 rejectType. Specifies whether the rejectValue parameter is a value or a percentage. Valid values: Value
(default), Percentage. Mandatory: No.

 rejectSampleValue. Specifies the sample number of records before PolyBase recalculates the rejected
records percentage. Valid values: 1, 2, 3, and so on. Mandatory: Yes, when rejectType is set as
Percentage.

 useTypeDefault. Specifies the default type handling for missing values in text files when using PolyBase.
Valid values: True, False (default). Mandatory: No.

 writeBatchSize. Specifies the maximum batch size before data is committed to the SQL table. The
default is 10,000. Mandatory: No.

 writeBatchTimeout. Specifies the amount of waiting time before the batch operation fails to insert
data—for example, 00:15:00 (15 minutes). Mandatory: No.

For more examples of JSON using the above properties, see:

JSON examples for copying data to and from SQL Data Warehouse
https://aka.ms/Qurtuo
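As an illustration, a SqlDWSink that loads data with PolyBase might combine several of the properties
described above as follows (a minimal sketch; the reject thresholds are illustrative values):

"sink": {
    "type": "SqlDWSink",
    "allowPolyBase": true,
    "polyBaseSettings": {
        "rejectType": "percentage",
        "rejectValue": 10.0,
        "rejectSampleValue": 100,
        "useTypeDefault": true
    }
}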

Parallelizing a copy activity


When reading and writing data using Data Factory, you can parallelize the copy activity to improve
performance, so that data is read and written faster. Parallelism is controlled by cloud data movement
units (DMUs). A DMU is a unit of measure for the CPU, memory, and network resources allocated to a
single activity within Data Factory. By default, Data Factory assigns 1 DMU to a copy activity. You increase
DMUs in powers of 2—for example, 1, 2, 4, 8, 16, 32, and so on.

The following JSON sample shows how to set the DMUs to 64 when copying data from Blob storage to
SQL Data Warehouse:

"activities":[
{
"name": "Example Copy Activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "SourceDataset" }],
"outputs": [{ "name": "DestinationDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "SqlDWSink"
},
"cloudDataMovementUnits": 64
}
}
]

The parallelCopies parameter is another good way to increase throughput—parallelCopies indicates the
maximum number of threads that run within a copy activity when reading data from the source and
writing data to the sink in parallel. Data Factory automatically decides the number of parallel copies based
on the source data store, sink data store, load on the host machines, and so on. You override this to a
number between 1 and 32.
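For example, the typeProperties section of a copy activity might combine both settings as follows (the
values shown are illustrative):

"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlDWSink" },
    "parallelCopies": 8,
    "cloudDataMovementUnits": 4
}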

The following pointers help maximize performance:


 When you copy many small files from on-premises to Blob storage, it’s important to increase the
number of parallel copies (parallelCopies parameter). Increasing the number of parallel copies will
increase the throughput. Be aware that the machine that hosts the DMG needs to be of a high
specification to support good data transfers.
 When many big files are being copied, increasing the parallelCopies parameter wouldn’t necessarily
improve performance—given that the copy activity is using the default of 1 DMU. It is advisable to
increase the DMUs (set cloudDataMovementUnits parameter to 4 or 8) to increase the throughput.

Using the Data Factory Copy Data Wizard


One of the first steps in any data project is to
retrieve data from source systems before
transforming it. Data Factory provides a good
wizard to make it easy to copy data from various
types of source system into a destination system.
The Copy Data Wizard automatically creates the
required JSON definitions for pipelines, linked
services, datasets and activities to perform the
data copying job.

The following steps explain how to use the Copy Data Wizard in Data Factory to copy data from source to
destination:
1. Open the Azure portal and go to the specific Data Factory resource.

2. Click the Copy data tile to invoke the Copy Data Wizard within Data Factory.

3. Provide a task name and description for this copy data task.

4. Specify the load schedule. The Data Factory Copy Data Wizard offers two options:

a. Run once now. This option is particularly useful if you need a one-off data copy from source to
destination.
b. Run regularly on schedule. Use this option to schedule the data load on a regular basis as more
data is available in the source system. Data Factory currently has many options to choose from
(hourly, daily, weekly, monthly, and so on). It’s important to note that you can’t select a frequency
of less than 15 minutes.

5. Select the source from where the data is being copied. The Data Factory Copy Data Wizard supports a
wide variety of source systems and provides an easy interface to supply source connection
information.

6. When the source type is selected and connection information is provided, the Copy Data Wizard
automatically scans the sample data and schematic definition to give the user a preview. This helps
users to identify whether the correct source data to be copied to the destination has been chosen.

7. If the source is a text file or set of text files, the Data Factory Copy Data Wizard provides options to
select the column delimiter, row delimiter, skip header rows, and so on.

8. Map the source schema to the destination schema. This is a very important step—the Copy Data
Wizard provides drop-down lists to map columns between source and destination.
You also use the Data Factory Copy Data Wizard to filter data when only a subset of data is required from
the source system. This is particularly helpful to avoid copying the full dataset when the project requires,
for example, data from the previous year.
9. The Data Factory Copy Data Wizard also provides some advanced settings that you tune, such as
setting the number of cloud units required to execute this task, the number of parallel copies to be
used, and so on.

10. When the required options are selected, the Data Factory Copy Data Wizard provides a summary of
all the choices. When the user confirms the choices and deploys them, the Copy Data Wizard
automatically generates the underlying JSON definitions based on the selections made in the wizard.

Performing a staged copy


When you copy data from a source data store to
a destination data store, it might be useful to
store the data in an interim Blob storage. This is
particularly useful in the following scenarios:

 Ingesting data from multiple source


systems into SQL Data Warehouse.
PolyBase is the recommended approach for
loading very large volumes of data into SQL
Data Warehouse. PolyBase requires source
data to be present in Blob storage. It’s
advisable to use a staged copy to capture all
the required source data into an interim Blob
storage before loading into SQL Data Warehouse from Blob storage using PolyBase.

 Hybrid data movement. Hybrid data movement helps you to copy data between on-premises and
the cloud. Copying large volumes of data on slow networks is typically time consuming, so it’s
advisable to compress the data before copying from on-premises to the cloud, and load into an
interim staging data store. You then decompress the data in the interim staging data store before you
load it into a destination data store.
 Firewall restrictions. Typically, it’s difficult to open ports other than port 80 for HTTP or port 443 for
HTTPS because of corporate IT policies. In such cases, it’s advisable to stage data into Blob storage
when you copy data from on-premises to the cloud. The DMG only requires access to port 80 or 443 to
copy data from on-premises to the cloud, whereas loading data directly into SQL Data Warehouse
would require port 1433 to be open.
By using staged copy, you now have a two-step process to load data from a source data store to a
destination data store. To enable staged copy, you use the setting enableStaging in the copy activity. The
following properties help you to enable staging copy and provide the required information to the data
factory to stage data:

 enableStaging. Specifies whether the data needs to be staged. Valid values: TRUE, FALSE (default).
Mandatory: No.

 linkedServiceName. Provides information about the Blob storage used to stage the data. Mandatory:
Yes, if enableStaging is TRUE.

 path. Specifies the path to the Blob storage. Mandatory: No.

 enableCompression. Specifies whether the data must be compressed. Valid values: TRUE, FALSE
(default). Mandatory: No.

The following is a sample JSON that enables staged data when loading data from an on-premises SQL
Server to SQL Database:

"activities":[
{
"name": "Example Copy Activity",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInputDataStore" }],
"outputs": [{ "name": "AzureSQLDBDestinationDataStore" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "StagingBlobStore",
"path": "stagingcontainer/stagingpath",
"enableCompression": true
}
}
}
]

Demonstration: Creating a pipeline using the Data Factory Copy Wizard


In this demonstration, you will see how to:
 Create an Azure SQL Database to act as a destination

 Create a copy data activity and pipeline using the wizard.

 Verify and test the new pipeline.


Question: What type of copy activity will you use within your organization to copy data
from source to destination in Data Factory?

Check Your Knowledge


Question

Which one of the following schedules cannot be set in Data Factory?

Select the correct answer.

Every day

Every month

Every week

Every 10 minutes

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? Data can’t be compressed when using a staged copy in Data Factory.

Lesson 3
Transforming data
After the data is copied from source data stores, the logical next step is to transform it as per the
requirements—so that it’s ready to be loaded on to the destination data store. This lesson looks at how
data transformations are used within Data Factory and how to implement custom transformations. This
lesson also considers how to transform data using Data Lake Analytics and SQL Data Warehouse.

Lesson Objectives
By the end of this lesson, you should be able to:

 Define transformations and see how they work in Data Factory.

 Explain how to transform data using Data Lake Analytics and SQL Data Warehouse.

 Understand how to implement custom transformations in Data Factory.

Defining transformations
To help you to process and transform data, Data
Factory provides two different configurations of
compute environments. These are the on-
demand and bring your own (BYO) compute
environments.

On-demand compute environment


This configuration provides a compute
environment that is fully managed by Data
Factory automatically. In other words, the
compute environment is created as soon as the
job is submitted, and the compute environment
is deleted after the job has completed execution.

Data Factory can automatically create an on-demand HDInsight cluster, based on Windows or Linux, to
process and transform data. The cluster is created in the same region as the storage account that is
associated with the cluster.

The following JSON linked service snippet configures an on-demand Windows-based HDInsight cluster of
size 1 that is used for running the Data Factory job:

{
"name": "SampleHDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:10:00",
"osType": "Windows",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}

It’s important to note that osType can be changed from Windows to Linux to create a Linux-based
HDInsight cluster.

The following explains the key properties used in the above JSON:

 type. This property is always set as HDInsightOnDemand. Mandatory: Yes.

 clusterSize. This specifies the number of worker/data nodes in the HDInsight cluster. Mandatory: Yes.

 timeToLive. This specifies the amount of time the cluster needs to be available so that it can be reused
by other processes before it is deleted after a Data Factory job has finished execution. Because cluster
creation is an expensive and time-consuming job, an appropriate timeToLive would save money and
time—but giving a very high value for timeToLive might increase the costs. Mandatory: Yes.

 version. This specifies the version for the HDInsight cluster. For Windows, the version is 3.1 and for a
Linux cluster, it is 3.2. Mandatory: No.

 linkedServiceName. Name of the Azure Storage linked service to be used by the HDInsight cluster.
Mandatory: Yes.

 additionalLinkedServiceNames. This specifies additional Azure Storage linked services that Data Factory
could use. Mandatory: No.

 osType. This specifies the type of operating system for the HDInsight cluster. Valid values are Windows
(default) or Linux. Mandatory: No.

 hcatalogLinkedServiceName. This specifies the name of the Azure SQL linked service that points to the
HCatalog database. Mandatory: No.

For a more granular configuration of the on-demand HDInsight cluster, see:

Advanced Properties
https://aka.ms/I6xm5c

Bring your own (BYO) compute environment


Use this configuration to specify an existing compute environment where the Data Factory job is
processed. The following compute environments are supported by Data Factory for this purpose:

 Azure HDInsight

 Azure Batch

 Azure Machine Learning

 Azure Data Lake Analytics

 Azure SQL Database/Azure SQL Data Warehouse, SQL Server



Transforming data with Data Lake Analytics


You use an existing compute environment like
Data Lake Analytics to process data for a Data
Factory job. You do this by creating a Data Lake
Analytics linked service that links the Data Lake
Analytics environment to Data Factory.
The following are the key JSON properties you
use to define the Data Lake Analytics connection
information as part of the linked service:

 type. This property is always set as AzureDataLakeAnalytics. Mandatory: Yes.

 accountName. Specifies the name of the Data Lake Analytics account. Mandatory: Yes.

 dataLakeAnalyticsUri. Specifies the URL for Data Lake Analytics. Mandatory: No.

 subscriptionId. Specifies the Azure subscription ID. If not specified, the connection will use the ADF
subscription ID. Mandatory: No.

 resourceGroupName. Specifies the name of the Azure resource group. If not specified, the connection
will use the Data Factory resource group name. Mandatory: No.

There are two ways to authenticate Data Factory to Data Lake Analytics:

Service Principal Authentication. This is the recommended authentication for Data Lake Analytics. You
will need the following key information:

 Application ID

 Application Key
 Tenant ID
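For example, a Data Lake Analytics linked service that uses Service Principal Authentication might be
defined as follows (a minimal sketch; the account, credential, subscription, and resource group values are
placeholders):

{
    "name": "AzureDataLakeAnalyticsLinkedService",
    "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
            "accountName": "<Data Lake Analytics account name>",
            "servicePrincipalId": "<application client ID>",
            "servicePrincipalKey": "<application key>",
            "tenant": "<tenant ID>",
            "subscriptionId": "<subscription ID>",
            "resourceGroupName": "<resource group name>"
        }
    }
}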

User Credential Authentication. To use user credential authentication, you click the Authorize button in
the Azure portal. However, you should be aware that, depending on the scenario, the tokens might expire
and need to be reauthorized.

For more information on token expiration, see:

Token expiration
https://aka.ms/Jp4g3n

Transforming data with SQL Data Warehouse


Data Factory also uses SQL Data Warehouse as its
compute environment. You need to create a
linked service to connect to SQL Data Warehouse
and use it with a Stored Procedure Activity to call
the stored procedure from a Data Factory
pipeline.

The following are the key JSON elements you


need to create a linked service to SQL Data
Warehouse:

 type. This property is always set as AzureSqlDW. Mandatory: Yes.

 connectionString. This specifies the connection information for the SQL Data Warehouse instance—only
basic authentication is supported. Mandatory: Yes.

To create a Data Factory pipeline with a Stored Procedure Activity, create a new pipeline and use the
following JSON snippet to create a SQL Server Stored Procedure Activity that calls the stored procedure:

{
"name": "storedProcedureActivitySamplePipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sampleStoredProcedure",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}',
SliceStart)"
}
},
"outputs": [
{
"name": "storedProcedureSampleOut"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 4
},
"name": "storedProcedureActivitySample"
}
],
"start": "2017-09-02T00:00:00Z",
"end": "2017-09-02T05:00:00Z",
"isPaused": false
}
}

Note the following in the preceding code:



 “SqlServerStoredProcedure” is set in the type property because it is a SQL Server stored procedure.

 The storedProcedureName property is set to sampleStoredProcedure—the name of the stored


procedure that is being called.

 The storedProcedureParameters property block contains only one input parameter that is a
DateTime type.

Integrating with Azure Machine Learning


Data Factory is well integrated with Azure
Machine Learning. You might create a linked
service to Machine Learning so that the endpoint
of a Data Factory pipeline is linked to Machine
Learning batch scoring.

To create a Machine Learning linked service, use


the following sample snippet:

{
"name":
"AzureMachineLearningLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch_scoring_endpoint]/jobs",
"apiKey": "<api_key>"
}
}
}

 type. This property should be set to AzureML. Mandatory: Yes.

 mlEndpoint. This specifies the Machine Learning batch scoring URL. Mandatory: Yes.

 apiKey. This specifies the published workspace model’s API key. Mandatory: Yes.

Batch Execution Activity is used within Data Factory to invoke a Machine Learning web service from a
Data Factory pipeline to make predictions on the data in the batch process. You use the Batch Execution
Activity to invoke both training and scoring Machine Learning web services.
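As an illustration, a pipeline activity that invokes a batch scoring web service through the linked service
above might be sketched as follows (the activity and dataset names are placeholders, and the web service
output port name depends on the published experiment):

{
    "name": "MLBatchScoringActivity",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": "AzureMachineLearningLinkedService",
    "inputs": [ { "name": "ScoringInputDataset" } ],
    "outputs": [ { "name": "ScoringOutputDataset" } ],
    "typeProperties": {
        "webServiceInput": "ScoringInputDataset",
        "webServiceOutputs": {
            "output1": "ScoringOutputDataset"
        }
    },
    "policy": {
        "concurrency": 1,
        "retry": 1,
        "timeout": "02:00:00"
    }
}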

Implementing custom transformations


Sometimes the existing activities that are
available in Data Factory are not sufficient and
custom activities are required. Custom activities
give users the ability to write custom
transformations using .NET.
There are two ways to execute custom activities
written in .NET:

 Using Azure Batch. Azure Batch is an Azure


service you use to run large-scale parallel
and high-performance computing (HPC)
applications efficiently in Azure.
 Using Windows-based Azure HDInsight cluster. You execute custom .NET activities using
Windows-based Azure HDInsight clusters—either by using an existing HDInsight cluster or an on-
demand HDInsight cluster managed by Data Factory.

To create a custom .NET activity, you require Visual Studio 2012 or later, in addition to Azure .NET SDK.

Use the following key steps to create a custom .NET activity:

1. Create a .NET Class Library project to create a new class (for example, SampleDotNetActivity) that
implements the IDotNetActivity interface. The interface consists of one method—the Execute
method—that requires the following four parameters:

o linkedServices. This is an enumerable list of linked services for input/output datasets.

o datasets. This is an enumerable list of input/output datasets.

o activity. This is an object of class Activity that represents the current activity.

o logger. This is an object of class IActivityLogger that logs data for the Data Factory pipeline—
for debugging purposes.

2. Implement the Execute method with the custom transformations that need to be performed using
.NET.

3. Make sure you’ve installed the NuGet Packages for Data Factory and Azure Storage because they’re
required for custom .NET activities.

4. To compile the project, click Build from the Visual Studio menu, and then click Build Solution.

5. After the build is successful, zip the files under bin\debug or bin\release and make sure the .pdb file is
included—this helps to debug issues when there’s a failure.

6. Create a blob container and upload the zip file as a blob to that container. Make sure it’s a general-
purpose Blob storage.

7. This custom activity is used within any Data Factory pipeline by using the activity type
DotNetActivity.
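As an illustration, a pipeline activity that runs the compiled custom activity might be sketched as follows
(the linked service, assembly, namespace, container, and dataset names are placeholders that would
match your own deployment):

{
    "name": "RunSampleDotNetActivity",
    "type": "DotNetActivity",
    "linkedServiceName": "AzureBatchLinkedService",
    "inputs": [ { "name": "InputDataset" } ],
    "outputs": [ { "name": "OutputDataset" } ],
    "typeProperties": {
        "assemblyName": "SampleDotNetActivity.dll",
        "entryPoint": "SampleNamespace.SampleDotNetActivity",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "customactivitycontainer/SampleDotNetActivity.zip"
    },
    "policy": {
        "concurrency": 1,
        "retry": 2,
        "timeout": "01:00:00"
    }
}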

Demonstration: Using Machine Learning in an Azure Data Factory pipeline


In this demonstration, you will see how to:

 Create and deploy an ML model ready for use with Data Factory.

 Upload live data as a test dataset.


 Create a Data Factory Machine Learning linked service.

 Create Data Factory input and output datasets.

 Create a new Data Factory pipeline.

 Verify and test the ML pipeline.

Question: How will you use Data Factory with Machine Learning in your organization?

Check Your Knowledge


Question

What is the name of the method that you should implement when you create a custom
transformation activity in Data Factory?

Select the correct answer.

Run

Execute

Method

Custom

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? Service Principal Authentication is the recommended authentication for
Data Lake Analytics.

Lesson 4
Monitoring performance and protecting data
After creating a data pipeline and activities in Data Factory, it’s important to understand how to check the
status of pipelines and activities, and how to pause and resume pipelines, and monitor their performance.

Lesson Objectives
By the end of this lesson, you should be able to:

 Understand how to check the status of pipelines and activities.

 Understand how to pause and resume pipelines.

 Explain how to monitor performance in Data Factory.

Monitoring performance
Data Factory provides the tools you use to
monitor performance and to understand the
status of pipelines and activities that either have
been executed or are executing now. The Azure
portal offers these capabilities in an intuitive
graphical user interface (GUI) in a web browser.

How to check the status of pipelines


and activities
Use the following steps to check the status of
pipelines and activities using the Azure portal:

1. Log in to the Azure portal.

2. Go to the data factory that contains the pipelines where you want to check the status.

3. Under Actions, click the Diagram icon. This shows the pictorial view of the pipeline.

4. Right-click the pipeline then click Open pipeline to open the pipeline and reveal the activities it
contains.

5. To check the status of the activity, double-click the dataset that is produced by that activity.

6. The dataset shows the summary of the results along with the data slices that are produced inside the
pipeline.

7. You click each data slice to view its detailed information.


For more information on what each status means, see:
View the state of each activity inside a pipeline

https://aka.ms/Oisghj

How to pause and resume pipelines


To pause and resume pipelines within Data Factory, you need to use the Azure PowerShell and run
commands that are essentially called cmdlets.

To pause or suspend a pipeline, you use the Suspend-AzureRmDataFactoryPipeline PowerShell cmdlet as
follows:

Suspend-AzureRmDataFactoryPipeline -ResourceGroupName ADFResourceGroup -DataFactoryName sampleDataFactory -Name samplePipeline

In the preceding example, samplePipeline in sampleDataFactory, under the ADFResourceGroup, is
suspended by issuing the cmdlet.

To resume the pipeline, you use the Resume-AzureRmDataFactoryPipeline PowerShell cmdlet as follows:

Resume-AzureRmDataFactoryPipeline -ResourceGroupName ADFResourceGroup -DataFactoryName sampleDataFactory -Name samplePipeline

In the preceding example, samplePipeline in sampleDataFactory, under the ADFResourceGroup, is
resumed for execution when you issue the cmdlet.

How to examine the activity log of a pipeline and rerun failed pipelines
Data Factory provides good capabilities for users to debug and troubleshoot pipelines using the Azure
portal and Azure PowerShell.

To find out what errors have occurred using the Azure portal, use the following steps:
1. Log in to the Azure portal.

2. Go to the data factory that contains the pipelines where you want to check the status.

3. Under Actions, click the Diagram icon. This shows the pictorial view of the pipeline.

4. Right-click the pipeline then click Open pipeline to open the pipeline and reveal the activities it
contains.

5. To check the status of the activity, double-click the dataset that is produced by that activity.
6. The dataset shows the summary of the results along with the data slices that are produced inside the
pipeline.

7. Click the data slice that has its status set to Failed then click on the activity run that has failed.

8. You can now view the error details and the log files associated with that error. You can also download
the error information.

After errors are analyzed and fixed, the pipeline can be rerun using the Run button on the command bar
of the data slice that failed.

Monitoring and Management app


The Azure portal now provides a useful Monitoring and Management app for Data Factory pipelines. The
app provides a single view of all the information you need to monitor, manage and debug pipelines and
activities.
The Monitoring and Management app is located under Actions on the Data Factory page. After you open
the Monitoring and Management app, it opens in a new page and has the following sections:

 Resource explorer. This is a tree view of all the resources shown in the left pane of the app.

 Diagram view. This is a diagram view at the top, in the middle pane of the app.

 Activity windows. The activity windows are present at the bottom, in the middle pane of the app.

 Multiple tabs. The right pane of the app consists of properties, the activity window explorer and
script tabs.
For more information about the Monitoring and Management app, see:

Monitor and manage Data Factory pipelines by using the Monitoring and Management app
https://aka.ms/Cbr8tq

Managing fault tolerance


When you copy data from the source data store
to the sink data store, it’s important to handle
any records that are not compatible to the sink
data store. Data Factory Copy Activity provides
two ways for you to handle such a situation. You
might abort the activity when incompatible
records are encountered—this is the default
behavior. The alternative is for you to define fault
tolerance and skip incompatible records—this
means you log incompatible records to a
different storage (for example, Blob storage) and
continue processing the remainder of the
records. You then review the rejected records, fix them and rerun the copy activity.

Copy Activity supports three scenarios for detecting and logging incompatible records when fault
tolerance is enabled:

 Incompatible data types between the source data store and the sink data store. If the data types
are different between the source data store and the sink data store, Data Factory rejects the record
and logs it as incompatible—those records are skipped.
 Different number of columns between the source data store and the sink data store. If the
number of columns in the source data store is different from the number of columns in the sink data
store, then such records are rejected and logged as incompatible—those records are skipped.
 Primary key violation when writing to relational database. If the copy activity encounters a
primary key violation when writing to the sink data store, this is logged as “incompatible records”—
those records are skipped.

The following JSON enables fault tolerance in the copy activity by setting enableSkipIncompatibleRow
to true. The redirectIncompatibleRowSettings property provides the linked service to write the
incompatible records to Blob storage:

"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": "BlobStorage",
"path": "errorcontainer/output"
}
}

After the copy activity completes, the total number of skipped records appears in the monitoring section.
You find any incompatible records that have been logged at the following location—it contains the
incompatible record and the error information:

https://[your_blob_account].blob.core.windows.net/[path_if_configured]/[copy_activity
_run_id]/[auto_generated_GUID].csv

Handling security
Data Factory provides the robust security that
you see throughout the various Azure services.
Data movement using Data Factory is certified
for the following:

 HIPAA/HITECH

 ISO/IEC 27001

 ISO/IEC 27018

 CSA STAR
The following are the key points to remember
regarding data security in Data Factory:
 All credentials used to connect to data stores are encrypted using certificates managed by Microsoft.
These certificates are rotated every two years.

 All data in transit is encrypted via HTTPS or TLS, provided the cloud data store supports them.

 Many data stores support data encryption at rest:

o Both SQL Data Warehouse and SQL Database support Transparent Data Encryption (TDE).

o Data Lake Store also encrypts data stored inside. It automatically encrypts data before storing
and decrypts it when it is retrieved.

o Blob storage supports Storage Service Encryption (SSE).

 Security can be further strengthened by using IPSec VPN or Express Route to transfer data between
an on-premises network and Azure.
 Configuring corporate firewalls and Windows firewalls further increases data security.

 Many Azure services such as SQL Database, SQL Data Warehouse, and Data Lake Store, require the
whitelisting of IP addresses from which the services are accessed.

Demonstration: Using the Monitoring and Management app


In this demonstration, you will see how to:

 Use the Diagram view to examine pipelines and datasets.

 Use filters to examine activities.

 Pause and resume a pipeline.

 Use monitoring views to view the status of activities.



Question: How would you use the Azure portal Monitoring and Management app to
monitor, debug and manage Data Factory pipelines and activities within your organization?

Check Your Knowledge


Question

For which of the following standards is Data Factory not certified for data security?

Select the correct answer.

HIPAA

CSA STAR

ISO/IEC 27018

WORLD STAR

Verify the correctness of the statement by placing a mark in the column to the right.

Statement Answer

True or false? You can’t pause or resume a Data Factory pipeline.

Lab: Automating the Data Flow with Azure Data Factory


Scenario
You work for Adatum as a data engineer, and you have been asked to build a traffic surveillance system
for traffic police. This system must be able to analyze significant amounts of dynamically streamed data,
captured from speed cameras and automatic number plate recognition (ANPR) devices, and then cross-
check the outputs against large volumes of reference data holding vehicle, driver, and location
information. Fixed road-side cameras, hand-held cameras (held by traffic police), and mobile cameras (in
police patrol cars) are used to monitor traffic speeds and raise an alert if a vehicle is travelling too quickly
for the local speed limit. The cameras also have built-in ANPR software that can read vehicle registration
plates.

For this final phase of the project, you are going to use Azure Data Factory to automate the management
of data associated with the traffic surveillance system. You will use a Data Factory pipeline to
automatically backup stolen vehicle data from one Data Lake Store to another. You will also use Data
Factory pipelines to perform batch transformations—using Azure Data Analytics to summarize speed
camera data as the data is uploaded from one Data Lake Store to another, and using Azure ML to perform
predictive analytics on speed data as it is uploaded from an Azure Storage blob. In order to ensure the
reliability of your Data Factory pipelines, you are also going to test the monitoring and management
capabilities provided with Azure Data Factory.

Objectives
After completing this lab, you will be able to:

 Use Data Factory to back up data from an Azure Data Lake Store to a second ADLS store.

 Transform uploaded data by running a U-SQL script in an ADLA linked service.

 Transform uploaded data by running an ML model in a Machine Learning linked service.

 Use the monitoring and management app to track progress of a pipeline.

Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Because Microsoft Learning updates the lab steps regularly, they are not included in this manual. Your
instructor will provide you with the lab documentation.

Lab Setup
Estimated Time: 75 minutes

Virtual machine: 20776A-LON-DEV

Username: ADATUM\AdatumAdmin

Password: Pa55w.rd

This lab uses the following resources from previous labs:


 Resource group: CamerasRG

 Azure SQL Data Warehouse: trafficwarehouse

 Azure SQL Server: trafficserver<your name><date>


 Azure Blob storage: speeddata<your name><date>, vehicledata<your name><date>

 Machine Learning namespace: Traffic

 Azure Data Lake Store: adls<your name><date>



 Azure Data Lake Analytics account: speedsdla<your name><date>

Exercise 1: Use Data Factory to back up data from an Azure Data Lake
Store to a second ADLS store
Scenario
You’re using Azure Data Factory to automate the management of data associated with the traffic
surveillance system, and will use a Data Factory pipeline to automatically back up stolen vehicle data from
one Data Lake Store to another.

In this exercise, you will check the data in the original ADLS account, and then create a second ADLS
account that will host the backup location. After creating a new Data Factory, you will create and
configure a service principal so that Data Lake Store authorization can occur during the execution of a
Data Factory pipeline. You will then assign permissions to the Service Principal to enable data copying,
and use the Data Factory Copy Wizard to back up the data from one ADLS account to another.
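
The lab documentation walks through creating the service principal in the portal. Purely as an illustration,
the following sketch shows one way the equivalent objects could be created with AzureRM PowerShell; the
display name, password, URLs, and account names are hypothetical, and the exact parameter types of the Azure
AD and Data Lake Store cmdlets differ between AzureRM versions.

# Hedged sketch: create an Azure AD application and service principal that a Data Factory
# pipeline can use to authenticate against Data Lake Store. All names are hypothetical.

$password = ConvertTo-SecureString "Pa55w.rdADF" -AsPlainText -Force

# Register an Azure AD application to represent the Data Factory pipeline.
$app = New-AzureRmADApplication -DisplayName "TrafficDFApp" `
    -HomePage "https://trafficdf.adatum.example" `
    -IdentifierUris "https://trafficdf.adatum.example" `
    -Password $password

# Create the service principal for that application.
$sp = New-AzureRmADServicePrincipal -ApplicationId $app.ApplicationId

# Give the service principal management-plane rights on the resource group.
New-AzureRmRoleAssignment -RoleDefinitionName "Contributor" `
    -ServicePrincipalName $app.ApplicationId `
    -ResourceGroupName "CamerasRG"

# Grant data-plane (ACL) access on the Data Lake Store root so the copy activity can read and write.
Set-AzureRmDataLakeStoreItemAclEntry -AccountName "adlsdemo" -Path / `
    -AceType User -Id $sp.Id -Permissions All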

The main tasks for this exercise are as follows:

1. Verify the data in the original ADLS account

2. Create a second backup ADLS account

3. Create new Data Factory

4. Create a Service Principal

5. Assign copy data permissions to the Service Principal


6. Use Copy Wizard to back up data from ADLS1 to ADLS2

 Task 1: Verify the data in the original ADLS account

 Task 2: Create a second backup ADLS account

 Task 3: Create new Data Factory

 Task 4: Create a Service Principal

 Task 5: Assign copy data permissions to the Service Principal

 Task 6: Use Copy Wizard to back up data from ADLS1 to ADLS2

Results: At the end of this exercise, you will have:

Verified the data in the original ADLS account.

Created a second ADLS account to act as the backup location.

Created a new Data Factory.

Created a service principal to enable Data Lake Store authorization in a Data Factory pipeline.

Assigned permissions to the Service Principal to enable data copying.


Used the Data Factory Copy Wizard to back up data from one ADLS account to another.

Exercise 2: Transform uploaded data by running a U-SQL script in an ADLA
linked service
Scenario
You are using Azure Data Factory to automate the management of data associated with the traffic
surveillance system, and will use Azure Data Factory pipelines to perform batch transformations. By using
an Azure Data Lake Analytics linked service, you will summarize speed camera data as the data is uploaded
from one Data Lake Store to another.

In this exercise, you will upload a CSV file containing camera speed data to ADLS, ready for processing in
ADLA by using a U-SQL script. To authorize ADLA to process this data in a Data Factory pipeline, you will
add the Service Principal from Exercise 1 as a Contributor to the ADLA account. You will then create Data
Factory linked services for Azure Data Lake Analytics, for Data Lake Store (for the input and output
datasets), and for Azure Storage (as the script location for the U-SQL script). Next, you will create Data
Lake Store input and output datasets, and create a script for the Data Lake Analytics U-SQL activity that
extracts summary data for a specific speed camera. Finally, you will create and deploy a new pipeline to
run this activity, and check the results to verify the U-SQL data transformation.
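
To give a feel for the kind of transformation this exercise performs, the following sketch (with hypothetical
file paths, container, storage account, and column names) writes a U-SQL script that summarizes the readings
for a single camera, and then uploads it to the Azure Storage container that the pipeline's U-SQL activity
will reference as its script location.

# Hedged sketch: save a U-SQL script and upload it to blob storage for use by the
# Data Lake Analytics U-SQL activity. All names, paths, and columns are hypothetical.

$usql = @'
@speeds =
    EXTRACT CameraID string, SpeedLimit int, Speed int
    FROM "/CameraData/speeds.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@summary =
    SELECT CameraID,
           COUNT(*) AS Observations,
           AVG((double) Speed) AS AverageSpeed,
           MAX(Speed) AS MaxSpeed
    FROM @speeds
    WHERE CameraID == "Camera121"
    GROUP BY CameraID;

OUTPUT @summary
TO "/CameraData/Camera121Summary.csv"
USING Outputters.Csv(outputHeader: true);
'@

Set-Content -Path ".\SummarizeSpeeds.usql" -Value $usql

# Upload the script to the container that the Azure Storage (script) linked service points to.
$context = New-AzureStorageContext -StorageAccountName "speeddatademo" -StorageAccountKey "<storage key>"
Set-AzureStorageBlobContent -File ".\SummarizeSpeeds.usql" -Container "scripts" `
    -Blob "SummarizeSpeeds.usql" -Context $context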

The main tasks for this exercise are as follows:

1. Prepare the environment

2. Add the Service Principal as Contributor to the ADLA account


3. Create an Azure Data Lake Analytics linked service

4. Create a Data Lake Store linked service for input and output datasets

5. Create an Azure Storage Blob linked service for the U-SQL script

6. Create Data Lake Store input and output datasets

7. Create and deploy a new pipeline

8. Verify the U-SQL data transformation

 Task 1: Prepare the environment

 Task 2: Add the Service Principal as Contributor to the ADLA account

 Task 3: Create an Azure Data Lake Analytics linked service

 Task 4: Create a Data Lake Store linked service for input and output datasets

 Task 5: Create an Azure Storage Blob linked service for the U-SQL script

 Task 6: Create Data Lake Store input and output datasets

 Task 7: Create and deploy a new pipeline

 Task 8: Verify the U-SQL data transformation



Results: At the end of this exercise, you will have:

Prepared your environment and uploaded test data to your Data Lake Store.

Added the Service Principal as a Contributor to the ADLA account.


Created an Azure Data Lake Analytics linked service.

Created a Data Lake Store linked service for input and output datasets.

Created an Azure Storage Blob linked service for the U-SQL script.
Created Data Lake Store input and output datasets.

Created and deployed a new pipeline.

Verified the U-SQL data transformation.

Exercise 3: Transform uploaded data by running an ML model in a Machine
Learning linked service
Scenario
You are using Azure Data Factory to automate the management of data associated with the traffic
surveillance system, and will use Azure Data Factory pipelines to perform batch transformations. By using
an Azure ML linked service, you will perform predictive analytics on speed data as it’s uploaded from an
Azure Storage blob.

In this exercise, you will start your Data Warehouse, because this was the original data source for the
deployed ML model you are going to use, and then obtain the API key and batch execution URL for this ML
model. You will use these parameters to create an ML linked service, and then create Azure Storage input
and output datasets that link to live test input data and to an output results location, respectively.
Finally, you will create and deploy a new pipeline, and check the results to verify the ML data
transformation.
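
As an illustration of how the API key and batch execution URL are used, a Data Factory Azure ML linked
service is defined as JSON containing these two values and can be deployed with PowerShell. In the sketch
below, the endpoint, key, and resource names are placeholders rather than the lab's actual values.

# Hedged sketch: define and deploy an AzureML linked service for batch execution.
# The endpoint URL, API key, and resource names are placeholders.

$mlLinkedService = @'
{
  "name": "TrafficMLLinkedService",
  "properties": {
    "type": "AzureML",
    "typeProperties": {
      "mlEndpoint": "https://<region>.services.azureml.net/workspaces/<workspace>/services/<service>/jobs",
      "apiKey": "<api key>"
    }
  }
}
'@

Set-Content -Path ".\TrafficMLLinkedService.json" -Value $mlLinkedService

# Deploy the linked service definition into the data factory.
New-AzureRmDataFactoryLinkedService -ResourceGroupName "CamerasRG" `
    -DataFactoryName "TrafficDF" `
    -File ".\TrafficMLLinkedService.json"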

The main tasks for this exercise are as follows:

1. Start the Data Warehouse

2. Obtain the API key and batch execution URL for a deployed ML model

3. Create an ML linked service

4. Create Azure Storage input and output datasets


5. Create and deploy a new pipeline

6. Verify the ML data transformation



 Task 1: Start the Data Warehouse

 Task 2: Obtain the API key and batch execution URL for a deployed ML model

 Task 3: Create an ML linked service

 Task 4: Create Azure Storage input and output datasets

 Task 5: Create and deploy a new pipeline

 Task 6: Verify the ML data transformation

Results: At the end of this exercise, you will have:

Started your Data Warehouse.

Obtained the API key and batch execution URL for the deployed ML model.

Created an ML linked service in Data Factory.

Created Azure Storage input and output datasets.

Created and deployed a new pipeline.

Verified the ML data transformation.

Exercise 4: Use the Monitoring and Management app to track progress of
a pipeline
Scenario
You are using Azure Data Factory to automate the management of data associated with the traffic
surveillance system. To ensure the reliability of your Data Factory pipelines, you’re going to test the
monitoring and management capabilities provided with Azure Data Factory.

In this exercise, you will use the Diagram View in the Monitoring and Management app to see the status
of the Traffic DF Copy Pipeline that you created in Exercise 1. You will then use filters and views to find
specific status information on the inputs and outputs, and on the copy activity in the Traffic DF Copy
Pipeline.
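
The exercise uses the browser-based Monitoring and Management app; the same status information can also be
queried from PowerShell. The following sketch assumes the AzureRM.DataFactories module and hypothetical
factory and dataset names:

# Hedged sketch: check slice and run status from PowerShell as a complement to the
# Monitoring and Management app. Resource, factory, and dataset names are hypothetical.

# List the slices of the copy pipeline's output dataset over a one-day window.
Get-AzureRmDataFactorySlice -ResourceGroupName "CamerasRG" `
    -DataFactoryName "TrafficDF" `
    -DatasetName "BackupVehicleData" `
    -StartDateTime "2018-01-01T00:00:00Z" `
    -EndDateTime "2018-01-02T00:00:00Z"

# Drill into the runs behind one slice to see errors and retry attempts.
Get-AzureRmDataFactoryRun -ResourceGroupName "CamerasRG" `
    -DataFactoryName "TrafficDF" `
    -DatasetName "BackupVehicleData" `
    -StartDateTime "2018-01-01T00:00:00Z"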

The main tasks for this exercise are as follows:

1. Use the Diagram View to see overall job statuses

2. Use filters and views to find specific status information

3. Lab clean up

 Task 1: Use the Diagram View to see overall job statuses

 Task 2: Use filters and views to find specific status information

 Task 3: Lab clean up


Results: At the end of this exercise, you will have:

Used the Diagram View to see overall job statuses.


Used filters and views to find specific status information.

Question: Why might you pause a deployed Data Factory pipeline?

Question: Why might you choose to use Service Principal Authentication with a Data Factory
pipeline?

Module Review and Takeaways


In this module, you learned:

 The purpose of Data Factory and how it works.

 How to create Data Factory pipelines that transfer data efficiently.


 How to perform transformations using a Data Factory pipeline.

 How to monitor Data Factory pipelines and how to protect the data flowing through these pipelines.

Course Evaluation

Your evaluation of this course will help Microsoft understand the quality of your learning experience.
Please work with your training provider to access the course evaluation form.

Microsoft will keep your answers to this survey private and confidential and will use your responses to
improve your future learning experience. Your open and honest feedback is valuable and appreciated.