20776A
Performing Big Data Engineering on
Microsoft Cloud Services
Information in this document, including URL and other Internet Web site references, is subject to change
without notice. Unless otherwise noted, the example companies, organizations, products, domain names,
e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with
any real company, organization, product, domain name, e-mail address, logo, person, place or event is
intended or should be inferred. Complying with all applicable copyright laws is the responsibility of the
user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in
or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical,
photocopying, recording, or otherwise), or for any purpose, without the express written permission of
Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
The names of manufacturers, products, or URLs are provided for informational purposes only and
Microsoft makes no representations and warranties, either expressed, implied, or statutory, regarding
these manufacturers or the use of the products with any Microsoft technologies. The inclusion of a
manufacturer or product does not imply endorsement by Microsoft of the manufacturer or product. Links
may be provided to third party sites. Such sites are not under the control of Microsoft and Microsoft is not
responsible for the contents of any linked site or any link contained in a linked site, or any changes or
updates to such sites. Microsoft is not responsible for webcasting or any other form of transmission
received from any linked site. Microsoft is providing these links to you only as a convenience, and the
inclusion of any link does not imply endorsement by Microsoft of the site or the products contained
therein.
© 2018 Microsoft Corporation. All rights reserved.
Released: 10/2018
MICROSOFT LICENSE TERMS
MICROSOFT INSTRUCTOR-LED COURSEWARE
These license terms are an agreement between Microsoft Corporation (or based on where you live, one of its
affiliates) and you. Please read them. They apply to your use of the content accompanying this agreement which
includes the media on which you received it, if any. These license terms also apply to Trainer Content and any
updates and supplements for the Licensed Content unless other terms accompany those items. If so, those terms
apply.
BY ACCESSING, DOWNLOADING OR USING THE LICENSED CONTENT, YOU ACCEPT THESE TERMS.
IF YOU DO NOT ACCEPT THEM, DO NOT ACCESS, DOWNLOAD OR USE THE LICENSED CONTENT.
If you comply with these license terms, you have the rights below for each license you acquire.
1. DEFINITIONS.
a. “Authorized Learning Center” means a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, or such other entity as Microsoft may designate from time to time.
b. “Authorized Training Session” means the instructor-led training class using Microsoft Instructor-Led
Courseware conducted by a Trainer at or through an Authorized Learning Center.
c. “Classroom Device” means one (1) dedicated, secure computer that an Authorized Learning Center owns
or controls that is located at an Authorized Learning Center’s training facilities that meets or exceeds the
hardware level specified for the particular Microsoft Instructor-Led Courseware.
d. “End User” means an individual who is (i) duly enrolled in and attending an Authorized Training Session
or Private Training Session, (ii) an employee of an MPN Member, or (iii) a Microsoft full-time employee.
e. “Licensed Content” means the content accompanying this agreement which may include the Microsoft
Instructor-Led Courseware or Trainer Content.
f. “Microsoft Certified Trainer” or “MCT” means an individual who is (i) engaged to teach a training session
to End Users on behalf of an Authorized Learning Center or MPN Member, and (ii) currently certified as a
Microsoft Certified Trainer under the Microsoft Certification Program.
g. “Microsoft Instructor-Led Courseware” means the Microsoft-branded instructor-led training course that
educates IT professionals and developers on Microsoft technologies. A Microsoft Instructor-Led
Courseware title may be branded as MOC, Microsoft Dynamics or Microsoft Business Group courseware.
h. “Microsoft IT Academy Program Member” means an active member of the Microsoft IT Academy
Program.
i. “Microsoft Learning Competency Member” means an active member of the Microsoft Partner Network
program in good standing that currently holds the Learning Competency status.
j. “MOC” means the “Official Microsoft Learning Product” instructor-led courseware known as Microsoft
Official Course that educates IT professionals and developers on Microsoft technologies.
k. “MPN Member” means an active Microsoft Partner Network program member in good standing.
l. “Personal Device” means one (1) personal computer, device, workstation or other digital electronic device
that you personally own or control that meets or exceeds the hardware level specified for the particular
Microsoft Instructor-Led Courseware.
m. “Private Training Session” means the instructor-led training classes provided by MPN Members for
corporate customers to teach a predefined learning objective using Microsoft Instructor-Led Courseware.
These classes are not advertised or promoted to the general public and class attendance is restricted to
individuals employed by or contracted by the corporate customer.
n. “Trainer” means (i) an academically accredited educator engaged by a Microsoft IT Academy Program
Member to teach an Authorized Training Session, and/or (ii) an MCT.
o. “Trainer Content” means the trainer version of the Microsoft Instructor-Led Courseware and additional
supplemental content designated solely for Trainers’ use to teach a training session using the Microsoft
Instructor-Led Courseware. Trainer Content may include Microsoft PowerPoint presentations, trainer
preparation guide, train the trainer materials, Microsoft OneNote packs, classroom setup guide and
Pre-release course feedback form. To clarify, Trainer Content does not include any software, virtual hard
disks or virtual machines.
2. USE RIGHTS. The Licensed Content is licensed, not sold. The Licensed Content is licensed on a one copy
per user basis, such that you must acquire a license for each individual that accesses or uses the Licensed
Content.
2.1 Below are five separate sets of use rights. Only one set of rights applies to you.
2.2 Separation of Components. The Licensed Content is licensed as a single unit and you may not
separate its components and install them on different devices.
2.3 Redistribution of Licensed Content. Except as expressly provided in the use rights above, you may
not distribute any Licensed Content or any portion thereof (including any permitted modifications) to any
third parties without the express written permission of Microsoft.
2.4 Third Party Notices. The Licensed Content may include third party code that Microsoft, not the
third party, licenses to you under this agreement. Notices, if any, for the third party code are included
for your information only.
2.5 Additional Terms. Some Licensed Content may contain components with additional terms,
conditions, and licenses regarding its use. Any non-conflicting terms in those conditions and licenses also
apply to your use of that respective component and supplement the terms described in this agreement.
a. Pre-Release Licensed Content. This Licensed Content subject matter is based on the Pre-release version of
the Microsoft technology. The technology may not work the way a final version of the technology will
and we may change the technology for the final version. We also may not release a final version.
Licensed Content based on the final version of the technology may not contain the same information as
the Licensed Content based on the Pre-release version. Microsoft is under no obligation to provide you
with any further content, including any Licensed Content based on the final version of the technology.
b. Feedback. If you agree to give feedback about the Licensed Content to Microsoft, either directly or
through its third party designee, you give to Microsoft without charge, the right to use, share and
commercialize your feedback in any way and for any purpose. You also give to third parties, without
charge, any patent rights needed for their products, technologies and services to use or interface with
any specific parts of a Microsoft technology, Microsoft product, or service that includes the feedback.
You will not give feedback that is subject to a license that requires Microsoft to license its technology,
technologies, or products to third parties because we include your feedback in them. These rights
survive this agreement.
c. Pre-release Term. If you are a Microsoft IT Academy Program Member, Microsoft Learning
Competency Member, MPN Member or Trainer, you will cease using all copies of the Licensed Content on
the Pre-release technology upon (i) the date which Microsoft informs you is the end date for using the
Licensed Content on the Pre-release technology, or (ii) sixty (60) days after the commercial release of the
technology that is the subject of the Licensed Content, whichever is earlier (“Pre-release term”).
Upon expiration or termination of the Pre-release term, you will irretrievably delete and destroy all copies
of the Licensed Content in your possession or under your control.
4. SCOPE OF LICENSE. The Licensed Content is licensed, not sold. This agreement only gives you some
rights to use the Licensed Content. Microsoft reserves all other rights. Unless applicable law gives you more
rights despite this limitation, you may use the Licensed Content only as expressly permitted in this
agreement. In doing so, you must comply with any technical limitations in the Licensed Content that only
allow you to use it in certain ways. Except as expressly permitted in this agreement, you may not:
• access or allow any individual to access the Licensed Content if they have not acquired a valid license
for the Licensed Content,
• alter, remove or obscure any copyright or other protective notices (including watermarks), branding
or identifications contained in the Licensed Content,
• modify or create a derivative work of any Licensed Content,
• publicly display, or make the Licensed Content available for others to access or use,
• copy, print, install, sell, publish, transmit, lend, adapt, reuse, link to or post, make available or
distribute the Licensed Content to any third party,
• work around any technical limitations in the Licensed Content, or
• reverse engineer, decompile, remove or otherwise thwart any protections or disassemble the
Licensed Content except and only to the extent that applicable law expressly permits, despite this
limitation.
5. RESERVATION OF RIGHTS AND OWNERSHIP. Microsoft reserves all rights not expressly granted to
you in this agreement. The Licensed Content is protected by copyright and other intellectual property laws
and treaties. Microsoft or its suppliers own the title, copyright, and other intellectual property rights in the
Licensed Content.
6. EXPORT RESTRICTIONS. The Licensed Content is subject to United States export laws and regulations.
You must comply with all domestic and international export laws and regulations that apply to the Licensed
Content. These laws include restrictions on destinations, end users and end use. For additional information,
see www.microsoft.com/exporting.
7. SUPPORT SERVICES. Because the Licensed Content is “as is”, we may not provide support services for it.
8. TERMINATION. Without prejudice to any other rights, Microsoft may terminate this agreement if you fail
to comply with the terms and conditions of this agreement. Upon termination of this agreement for any
reason, you will immediately stop all use of and delete and destroy all copies of the Licensed Content in
your possession or under your control.
9. LINKS TO THIRD PARTY SITES. You may link to third party sites through the use of the Licensed
Content. The third party sites are not under the control of Microsoft, and Microsoft is not responsible for
the contents of any third party sites, any links contained in third party sites, or any changes or updates to
third party sites. Microsoft is not responsible for webcasting or any other form of transmission received
from any third party sites. Microsoft is providing these links to third party sites to you only as a
convenience, and the inclusion of any link does not imply an endorsement by Microsoft of the third party
site.
10. ENTIRE AGREEMENT. This agreement, and any additional terms for the Trainer Content, updates and
supplements are the entire agreement for the Licensed Content, updates and supplements.
12. LEGAL EFFECT. This agreement describes certain legal rights. You may have other rights under the laws
of your country. You may also have rights with respect to the party from whom you acquired the Licensed
Content. This agreement does not change your rights under the laws of your country if the laws of your
country do not permit it to do so.
13. DISCLAIMER OF WARRANTY. THE LICENSED CONTENT IS LICENSED "AS-IS" AND "AS
AVAILABLE." YOU BEAR THE RISK OF USING IT. MICROSOFT AND ITS RESPECTIVE
AFFILIATES GIVE NO EXPRESS WARRANTIES, GUARANTEES, OR CONDITIONS. YOU MAY
HAVE ADDITIONAL CONSUMER RIGHTS UNDER YOUR LOCAL LAWS WHICH THIS AGREEMENT
CANNOT CHANGE. TO THE EXTENT PERMITTED UNDER YOUR LOCAL LAWS, MICROSOFT AND
ITS RESPECTIVE AFFILIATES EXCLUDE ANY IMPLIED WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT.
14. LIMITATION ON AND EXCLUSION OF REMEDIES AND DAMAGES. YOU CAN RECOVER FROM
MICROSOFT, ITS RESPECTIVE AFFILIATES AND ITS SUPPLIERS ONLY DIRECT DAMAGES UP
TO US$5.00. YOU CANNOT RECOVER ANY OTHER DAMAGES, INCLUDING CONSEQUENTIAL,
LOST PROFITS, SPECIAL, INDIRECT OR INCIDENTAL DAMAGES.
This limitation also applies even if Microsoft knew or should have known about the possibility of the damages. The
above limitation or exclusion may not apply to you because your country may not allow the exclusion or
limitation of incidental, consequential or other damages.
Please note: As this Licensed Content is distributed in Quebec, Canada, some of the clauses in this
agreement are provided below in French.
Remarque : Ce contenu sous licence étant distribué au Québec, Canada, certaines des clauses
dans ce contrat sont fournies ci-dessous en français.
EXONÉRATION DE GARANTIE. Le contenu sous licence visé par une licence est offert « tel quel ». Toute
utilisation de ce contenu sous licence est à votre seul risque et péril. Microsoft n’accorde aucune autre garantie
expresse. Vous pouvez bénéficier de droits additionnels en vertu du droit local sur la protection des
consommateurs, que ce contrat ne peut modifier. Là où elles sont permises par le droit local, les garanties
implicites de qualité marchande, d’adéquation à un usage particulier et d’absence de contrefaçon sont exclues.
EFFET JURIDIQUE. Le présent contrat décrit certains droits juridiques. Vous pourriez avoir d’autres droits
prévus par les lois de votre pays. Le présent contrat ne modifie pas les droits que vous confèrent les lois de votre
pays si celles-ci ne le permettent pas.
Acknowledgements
Microsoft Learning would like to acknowledge and thank the following for their contribution towards
developing this title. Their effort at various stages in the development has ensured that you have a good
classroom experience.
Contents
Module 1: Architectures for Big Data Engineering with Azure
Module Overview 1-1
Module 6: Implementing Custom Operations and Monitoring Performance in Azure Data Lake
Analytics
Module Overview 6-1
Lab: Automating the Data Flow with Azure Data Factory 9-38
Course Description
This five-day instructor-led course describes how to perform Big Data Engineering on Microsoft Cloud
Services.
Audience
The primary audience for this course is data engineers (IT professionals, developers, and information
workers) who plan to implement big data engineering workflows on Azure.
Student Prerequisites
In addition to their professional experience, students who attend this course should have:
• A basic knowledge of the Microsoft Windows operating system and its core functionality.
• A good knowledge of relational databases.
Course Objectives
After completing this course, students will be able to:
• Implement custom operations and monitor performance in Azure Data Lake Analytics.
• Create a repository to support large-scale analytical processing in Azure SQL Data Warehouse.
Course Outline
The course outline is as follows:
Module 1: ‘Architectures for Big Data Engineering with Azure’ describes the common architectures for
processing big data using Azure tools and services.
Module 2: ‘Processing Event Streams using Azure Stream Analytics’ explains how to use Azure Stream
Analytics to design and implement stream processing over large-scale data.
Module 3: ‘Performing Custom Processing in Azure Stream Analytics’ describes how to include
custom functions and incorporate machine learning activities into an Azure Stream Analytics job.
Module 4: ‘Managing Big Data in Azure Data Lake Store’ explains how to use Azure Data Lake Store
as a large-scale repository of data files.
Module 5: ‘Processing Big Data using Azure Data Lake Analytics’ describes how to use Azure Data
Lake Analytics to examine and process data held in Azure Data Lake Store.
Module 6: ‘Implementing Custom Operations and Monitoring Performance in Azure Data Lake
Analytics’ describes how to create and deploy custom functions and operations, integrate with Python
and R, and protect and optimize jobs.
Module 7: ‘Implementing Azure SQL Data Warehouse’ explains how to use Azure SQL Data
Warehouse to create a repository that can support large-scale analytical processing over data at rest.
Module 8: ‘Performing Analytics with Azure SQL Data Warehouse’ describes how to use Azure SQL
Data Warehouse to perform analytical processing, how to maintain performance, and how to protect
the data.
Module 9: ‘Automating the Data Flow with Azure Data Factory’ explains how to use Azure Data
Factory to import, transform, and transfer data between repositories and services.
Course Materials
The following materials are included with your kit:
• Course Handbook: a succinct classroom learning guide that provides the critical technical information in a crisp, tightly focused format, which is essential for an effective in-class learning experience.
• Lessons: guide you through the learning objectives and provide the key points that are critical to the success of the in-class learning experience.
• Labs: provide a real-world, hands-on platform for you to apply the knowledge and skills learned in the module. Your instructor will provide you with the lab steps.
• Module Reviews and Takeaways: provide on-the-job reference material to boost knowledge and skills retention.
• Lab Answer Keys: provide step-by-step lab solution guidance. Your instructor will provide you with the lab steps.
Important: The lab answer keys are based on the versions of the Azure portal and
associated software that were current at the time of writing. Azure is a constantly evolving
environment, so some of the detailed steps provided in the lab answer keys might not reflect the
exact procedures that you need to perform, although the general principles should remain the
same.
• Modules: include companion content, such as questions and answers, detailed demo steps and additional reading links, for each lesson. Additionally, they include Lab Review questions and answers and Module Reviews and Takeaways sections, which contain the review questions and answers, best practices, common issues and troubleshooting tips with answers, and real-world issues and scenarios with answers.
o Resources: include well-categorized additional resources that give you immediate access to the
most current premium content on TechNet, MSDN®, or Microsoft® Press®.
o Course evaluation: at the end of the course, you will have the opportunity to complete an
online evaluation to provide feedback on the course, training facility, and instructor.
o To provide additional comments or feedback, or to report a problem with course resources, visit
the Training Support site at https://trainingsupport.microsoft.com/en-us. To inquire about the
Microsoft Certification Program, send an email to certify@microsoft.com.
The following table shows the role of each virtual machine that is used in this course:
Software Configuration
The following software is installed on the virtual machines:
Course Files
The files associated with the labs in this course are located in the E:\Labfiles folder on the 20776A-LON-DEV virtual machine.
Classroom Setup
Each classroom computer will have the same virtual machines configured in the same way. All students and the instructor must perform the following tasks prior to commencing Module 1:
Start the VMs
1. In Hyper-V Manager, under Virtual Machines, right-click MT17B-WS2016-NAT, and then click Start.
2. In Hyper-V Manager, under Virtual Machines, right-click 20776A-LON-SQL, and then click Start.
3. In Hyper-V Manager, under Virtual Machines, right-click 20776A-LON-DEV, and then click Start.
Processor:
o 2.8 GHz 64-bit processor (multi-core) or better
AMD:
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (NX Bit)
Intel:
o Supports Second Level Address Translation (SLAT) – Extended Page Table (EPT)
o Hardware-enforced Data Execution Prevention (DEP) must be available and enabled (XD bit)
RAM: 32 GB minimum
Network adapter
Module 1
Architectures for Big Data Engineering with Azure
Contents:
Module Overview 1-1
Lesson 1: Understanding big data 1-3
Module Overview
This module introduces the concept of big data, and what makes the processing of big data different
from other data processing. The module then introduces the main architectures for processing big data,
and looks at the main issues that you should consider when designing a big data solution.
Objectives
After completing this module, you will be able to:
Explain the concept of big data.
Describe the Lambda and Kappa architectures for processing big data.
Describe design considerations for building big data solutions with Azure.
Prerequisites
This section outlines the steps you need to take to set up the environment for this module. To complete the labs, you will require an Azure trial subscription.
Azure trial subscription
Access to Microsoft Azure Learning Passes for students of authorized Microsoft Learning
Partners
https://aka.ms/jjhtex
Lesson 1
Understanding big data
Data that is too large, too fast, or too different from traditional data sources is called “big data”. This
information comes from many different sources and resides within multiple departments of a single
company. Big data often answers vital company questions, such as “where should we open our next
branch?”
The underlying information is often too complex to analyze with traditional relational databases, or it’s in
the form of a real-time stream. The source data could be trapped in a file format that can’t be processed
or stored on a disk that is no longer accessible from the company network. Whatever the challenge you
are facing, big data solutions can help you unlock these business insights from your data.
Lesson Objectives
After completing this lesson, you will be able to describe:
The three Vs of big data.
Velocity. The data is being collected from a wide range of devices and sources, and the sources
increase in number as the data increases in volume. Examples include Twitter data, manufacturing
sensors, and mobile phone GPS data.
When you consider which technology to use, it’s helpful to first understand which V you need to
overcome to find your answer. It’s not always straightforward and you might find that, to answer all your
questions, your solution will need to address all three Vs.
Here are three examples of how organizations might use big data to solve business issues:
1. Wingtip Toys produces toys for young children using multiple machines in a large manufacturing
plant they built in 2014. During the last holiday rush, a manufacturing machine broke and delayed
the production of 100,000 Tailspins, their best-selling toy. Customers were outraged and many orders
were cancelled. Since the last holiday season, Wingtip Toys has implemented a big data solution that
monitors its facilities and recommends times of low production when it’s best to perform
maintenance on their manufacturing lines. This predictive maintenance solution minimizes the chance
of production delays during times of holiday rush, helping Wingtip Toys to fulfill their orders.
2. Northwind Traders is a financial firm that specializes in high frequency trading. The traders use
machine learning algorithms to buy and sell securities, making small profits on each trade. Recently,
they created an algorithm that would look at how much a particular stock is discussed on popular
social media platforms, and use the results to execute trades. They implemented big data technology
to run analysis over entire social networks and extract valuable data to train their algorithm.
Managing this volume and variety of data made it possible to implement a new trading strategy,
bringing brand new revenue to the firm.
3. Alpine Ski House is the most popular ski resort in the world. The resort management recently realized
that people were using goggle tan lines to see who spent the most days skiing on the mountain. This
method proved to be inaccurate and many locals were visibly upset when claims that they had
amassed 120 skiing days were not believed. Alpine Ski House implemented a big data solution that
would track skiers’ runs down the mountain, and provide aggregated statistics of the vertical feet they
had skied, along with the number of days spent on the mountain. This approach made a huge
difference to customer satisfaction and set Alpine Ski House apart from their competitors.
Question: What are some examples of big data use cases in your industry?
Lesson 2
Architectures for processing big data
Because real-time data gathering requires the processing of vast amounts of information, it’s vital that
you implement an effective real-time processing architecture that can cope with the rate of data input,
implement the necessary analysis, and generate the outputs that you need.
In this lesson, you will learn about the two main architectures for real-time data processing—Lambda and
Kappa. You will review their similarities and differences and identify when one or the other is the best
choice. You will then identify how Azure implements these architectures as services, along with how to
implement a big data processing architecture on this platform.
Lesson Objectives
After completing this lesson, you will be able to:
Explain how Azure services implement the Lambda and Kappa architectures.
Batch layer—this performs comprehensive processing over the complete set of captured data, trading immediacy for accuracy.
Speed layer—this does the data stream processing without being concerned with accuracy.
Serving layer—this stores output from both the batch and stream layers, and then responds to ad-hoc queries by providing precomputed views of the data or building custom views, depending on the query.
Input for analysis comes from a stream of data that provides a series of records. This information might
come from a device or could be captured as part of a sampling process. The Lambda internal architecture
refers to processing as “hot path” or “cold path” flows:
The hot path provides data stream processing, giving instant access to the data and analytics, but at
the expense of lower accuracy.
The cold path improves accuracy with comprehensive batch processing, using data refinement
methods. Both methods use the same data, but apply different levels of granularity to obtain their
outputs.
The hot and cold paths perform similar tasks over the same data, but the hot path is more likely to use
approximations and coarse granularity to evaluate data, whereas the cold path is more precise and finer
grained. Apps use the hot path data to provide instant feedback and time critical decision-making, but
subsequently use the cold path data to make more measured evaluations and corrections. For example, if
you need to monitor the flow of water through a sewage treatment works, the hot path (stream
processing) might identify if there has been a sudden surge in water that requires immediate action, such
as opening outputs or closing inputs. For this type of real-time decision, the issue is not the exact flow
level in gallons per hour, but its relative magnitude. Alongside this monitoring of the critical real-time
data, the cold path process could then calculate data such as performance charts, looking at oxygen levels
or flow rates over time.
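To make this concrete, a hot-path rule of this kind could be written in the Stream Analytics query language that Module 2 introduces. The following is a minimal sketch, not production logic; the input, output, and field names, and the threshold, are hypothetical:

-- Raise an alert whenever the average flow over the last 30 seconds
-- exceeds a threshold. The point is relative magnitude, not precision.
SELECT
    SensorId,
    AVG(FlowRate) AS AvgFlow
INTO [SurgeAlertOutput]
FROM [FlowSensorInput] TIMESTAMP BY ReadingTime
GROUP BY SensorId, TumblingWindow(second, 30)
HAVING AVG(FlowRate) > 500

The cold path would compute its performance charts and longer-term flow rates from the same raw events, stored unchanged in Blob storage or Data Lake Store.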
To give another example, if you are developing applications for a professional sports team, such as long-
distance cycle racers, the real-time hot path data might be road speed, pedal cadence, rider power output
or heart rate—all inputs that might require immediate follow-up to maximize an athlete’s performance. In
contrast, the cold path data could involve more long-term reviews of rider performance, perhaps along a
specific section of a track or course, to identify any required changes to training programs.
Limitations
The main challenge with Lambda architecture is that you need to maintain consistency between the hot
and cold path processing logic. If you change the analytical requirements, then both sets of code must be
changed to maintain consistency.
A second consideration is that, after the data has been captured and processed, it is effectively
immutable. Alterations to the model, such as gathering a wider range of data or bringing newer capture
devices online, make it difficult to generate longer-term comparisons of historical data.
In essence, Kappa is a simplification of Lambda, in which the batch processing function is removed. Data is
stored in an append-only immutable log. That log database could be something like Apache Kafka. From
that store, the information is then streamed through computational systems to generate the required
analysis—these computational systems could be Apache Storm, Apache Spark, or Kafka Streams.
The removal of the batch processing function means that Kappa requires maintenance of only one set
of code.
Migrations and reorganizations are easy, because you simply repopulate a database from the
canonical store.
With Kappa, the serving layer still provides optimized responses to queries. However, these databases are
almost like caches—you can simply wipe and regenerate them from the original log data. You can use any
database type in the serving layer—for example, in-memory, persistent, or special purpose, such as full
text search.
Azure provides a number of services that you can use to implement these architectures:
Azure Machine Learning. Supports analysis and predictive modeling of data streams.
Azure Data Lake Storage. Persists data at massive scale with unlimited objects or file sizes.
Azure Data Lake Analytics. Performs batch processing and analysis of big data, implementing very
large data parallel processing.
Azure SQL Data Warehouse. Provides a massively parallel processing (MPP) capable relational
database that processes large volumes of data.
Note: Azure data lakes provide unstructured storage for very large amounts of data, which
typically ties in with use cases that Lambda and Kappa architectures enable. SQL Data Warehouse
is more suited for storing structured or processed data with a preconfigured schema. However,
SQL Data Warehouse could be used to store the processed output from Lambda or Kappa
architectures.
These options are all for batch processing and real-time data stream analysis. For structured data storage
in Azure, you would use the large relational database processing features. These options include:
SQL Data Warehouse for data volumes of 1 TB and more, with typical sizes in the 10 TB range.
Data Lake Analytics provides additional facilities to go beyond what is possible by using structured SQL
data storage, such as:
Federating queries to bring together structured data from a database or data warehouse and
unstructured data from a data lake.
Ultimately, your decision on which architecture to implement, and what features to use, depends on your
data source and how you are going to use the data. If you need elasticity and scalability, then Azure Data
Factory provides the facilities for moving data around in the cloud.
You use Stream Analytics to examine very high-volume, often disparate, streams of data, and use that information
to identify and analyze relationships, patterns and trends between the monitored sources. For example,
what is the effect of weather patterns on traffic flow? Using that data, you then carry out tasks in
applications or propagate that information to a website. These days, it’s easy to get a live feed of current
journey times on your commute, but what if you could find out what the expected levels of traffic would
be in an hour or later that afternoon? Would that information affect whether you were likely to go
shopping now or later?
This course is based around a hybrid Lambda/Kappa architecture, as shown on the slide:
Device data is passed in from event hubs and IoT hubs (depending on the source devices).
Stream Analytics does its own “hot” stream processing of this data, but also passes unchanged event
data to long-term storage (such as Blob storage or Data Lake Storage).
Data Lake Analytics processes the same data to provide additional insights, and to generate analytical
models for Data Warehouse.
Additional, slow moving data (lookups and supporting information from other databases) is ingested
through Data Factory and/or methods such as PolyBase and SQL Server Integration Services (SSIS).
Stream Analytics and Data Lake Analytics use this data as part of the decision-making process in the
analytical logic.
Stream Analytics, Data Lake Analytics, and Data Factory might include Azure Machine Learning
modules to provide additional predictive input, based on existing data.
Power BI and custom apps built using the Azure SDK visualize and report the results.
Lesson 3
Considerations for designing big data solutions
By necessity, big data designs are different both from traditional data warehouse projects, and from large
relational databases, particularly in relation to the number of decision points. Rather than a plug and play
approach using tried and tested technologies, each big data implementation will be unique, bringing
together previously unused combinations of components to create the solution.
In this lesson, you will review the various considerations that you must address when designing big data
solutions. Getting these factors correct at the design stage will help to create a stable, secure environment
that achieves the solution parameters and will cost less overall than a poorly-specified implementation.
Lesson Objectives
After completing this lesson, you will be able to describe the scalability, throughput, security, and reliability considerations for a big data solution, and identify typical data sources and destinations.
Scalability
As a first principle, any well-designed big data system should scale both up and down according to demand, enabling the system to respond to changing demand and yet minimize costs. To scale effectively, you need to consider the following factors:
Workload partitioning
Resource allocation
Data partitioning
Data storage
Client affinity
Workload partitioning
The trick with designing for scalability is to partition the process into discrete and decomposable
elements. Processing elements should be as small as practicable, bearing in mind that distribution of these
elements should maximize utilization of compute units.
Resource allocation
To achieve effective scaling, it’s essential that resources are easily allocated to, and deallocated from, any
component in the system. For a virtual machine, for example, it’s often the amount of system memory and
processing power that needs to be adjusted; for databases, it might be the size of log files; for analytics
jobs, it might be the computational units that are assigned to a job.
Data partitioning
You should consider whether to divide the data across database locations or use a database service (such
as Data Lake) that automatically partitions data transparently. Having control over partitions may be
beneficial in some cases, but such benefits can be outweighed by the administrative overhead that might
be required to maintain an optimum partitioning model as data volumes increase. For this reason, using a
storage option that does not directly expose the underlying partition scheme is often the best choice for
large dynamic enterprise-wide applications.
Data storage
You should ensure that you use the right type of data store for each component in the system. For
example, service layer elements in Lambda architectures might work best with an Azure SQL or SQL Server
database that implements a schema. You should also plan your storage on the assumption that data
volumes almost always increase over time, so even without fluctuations on the demand side, there will
typically be an ongoing requirement to scale up your storage.
Client affinity
Alongside stateless services within the back end, client connections should also be stateless so that, as new
service endpoints are deployed in a scale up scenario, client connections do not remain attached to their
initial (and possibly now overloaded) connection. Within Azure, for example, this affinity is configured
through the Load Balancer.
Throughput
Throughput and response times are critical to the success of any big data system. You will need to consider how to handle large numbers of inputs, along with the anticipated data volumes.
Under provisioning. The consequences of under provisioning will vary depending on the Azure
service. For example, event hubs have a capacity determined by the number of Throughput Units
(TUs) which, if not properly set, can impose a limit on the number of messages that are processed per
minute. TUs are shared across an event hub namespace, which might comprise many event hubs. For
services like this, where the throughput granularity is at a fairly coarse level, it’s important that one
particular service, such as an especially active event hub, doesn’t end up consuming all the potential
throughput and, therefore, generating a processing bottleneck.
Over provisioning. It’s also important not to set overly high limits because, for some Azure services
(such as SQL Data Warehouse), you are billed for the available capacity, and not the current usage.
When designing for big data processing, it’s therefore important to be able to estimate the likely required
throughput, in addition to ensuring that there will be functions in place to adjust throughput to suit
actual workloads.
Security
Security is another key part of any IT system design. Data breaches are becoming increasingly commonplace, with attackers looking to steal important data that could be sensitive to an organization and its customers. With big data designs, data needs to be protected, both when it is in transit between a client and Azure services—such as databases or storage—and when it’s at rest on the storage medium.
Another key approach to network security is to make use of Azure security features, including Virtual
Networks (vNets), Network Security Groups (NSGs), and perimeter networks to create a multilayered
defense:
Azure Virtual Networks (vNets) provide a logical segmentation of the Azure cloud to which only
your subscription has access; you have full control over DNS settings, routing tables, security policies,
and IP address blocks.
Network Security Groups (NSGs) help you to create rules (ACLs) at various levels, including at
network interfaces, at compute units such as virtual machines, and at the virtual subnet. NSGs,
therefore, enable fine-grained access control by permitting or denying communication between
workloads within a virtual network, and between on-premises and cloud-based systems.
Azure perimeter networks should be designed so that all inbound packets must pass through
security appliances, such as firewalls, intrusion detection systems (IDS), and intrusion prevention
systems (IPS), before being able to access Azure services. Outbound packets should also pass through
these security appliances, to enable effective policy enforcement, inspection, and auditing.
Identity management
To help prevent unauthorized access, it’s important to be able to enforce the establishment of correct
identity. To overcome the limitations, and vulnerabilities, of traditional username and password
combinations, Azure supports two-factor authentication using smartcards or mobile phone apps to verify
that users are who they say they are. After authentication is validated against Azure Active Directory
(AAD), features such as role-based access control (RBAC) and AAD user groups enable you to grant the minimum access required by
users and services.
Conditional access
Taking this a step further, the AAD conditional access feature protects against the potential consequences
of stolen or phished credentials by combining two-factor authentication with the requirement to possess
a device that is managed by Microsoft Intune® to access services, such as administrator accounts.
Conditional access also blocks access attempts from particular geographic locations, from untrusted
networks, or when access attempts have been made from two or more geographically distant locations
within a short time period.
Encryption
Encryption is another fundamental requirement for a secure big data system—how you implement
encryption differs, depending on whether you are encrypting data in transit or at rest. For data in transit,
technologies such as Virtual Private Networks (VPNs) using IP Security (IPsec) use tunnels to provide
encrypted client-to-Azure connections. For data at rest, there are several options, including Azure Storage
Service Encryption (SSE) for storage accounts, with block blobs and page blobs being automatically
encrypted when written to Azure Storage; similarly, Data Lake Store has automatic encryption for data
lake storage. You might also use Azure Disk Encryption to encrypt the OS disks and data disks used by an
Azure Virtual Machine. Client-Side Encryption is built into the Java and the .NET storage client libraries.
In addition to the mechanisms used for encrypting data, another consideration is the management of
encryption keys. In some environments, such as defense, it’s typically mandated that all keys must be
managed by the organization itself. For many Azure services, you can choose between Azure-managed keys
and your own keys stored in Azure Key Vault or a similar key management service. Key Vault enables access
to keys and other secrets to be limited to a specific AAD account.
Auditing
Auditing is an important component of any security plan, enabling systems access to be monitored, and
potential threats to be proactively identified and mitigated. By logging Azure operations, including system
access, any unauthorized or accidental changes will be recorded in an audit trail.
Reliability
Ensuring the reliability of a big data processing system involves a range of tasks, operations, and considerations. You must set up good monitoring systems so that you know when issues occur, and have procedures in place to be able to respond to these issues in a timely manner. You must be able to maintain network connectivity, and access to your data and services. If things go seriously wrong, you must have a good plan in place to be able to recover from potential disaster.
Maintain connectivity
Cloud-based environments such as Azure require a reliable connection to the remote datacenter location.
The likelihood of service outages caused by failures in the connecting internet infrastructure is very low.
This is because the packet-switching technologies used by the internet backbone route traffic around
multiple failures in the network. Therefore, the most likely point of failure is your own connection to the
internet.
To ensure that this connection is secure and resilient, you should use a dedicated connection, such as a
VPN connection, or use ExpressRoute and, ideally, deploy multiple redundant connections. For example,
you could use a dedicated ExpressRoute connection as the primary route and have a VPN as backup.
When using database services, such as Azure SQL Database, you can configure active geo-replication, so that
your primary database is replicated to up to four readable secondary databases. You might configure
these secondary databases to be in different regions, so that you can use them for querying, and for failover if the
primary database becomes unavailable.
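As an illustration, active geo-replication can also be configured with T-SQL, in addition to the portal and PowerShell. The following is a minimal sketch; MyDb and SecondaryServer are hypothetical names:

-- Run in the master database of the primary server. Creates a readable
-- secondary copy of MyDb on the server named SecondaryServer.
ALTER DATABASE MyDb
    ADD SECONDARY ON SERVER SecondaryServer
    WITH (ALLOW_CONNECTIONS = ALL);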
SQL Data Warehouse supports distributed data, where the data is stored across multiple locations that
each act as independent storage and processing units. This means that, in addition to providing resilience,
distributed data enables very high query performance through running queries in parallel across locations.
Within any big data processing system, there will almost certainly be custom code; you should ensure that
such code includes robust failure handling procedures.
Data Lake Store requires a slightly different approach to disaster recovery (DR). Because Data Lake Store uses automated
replicas within a region, you need to maintain your own copies of critical data in a different store account
or location to protect against issues such as accidental deletions. You maintain these copies using tools
such as ADLCopy, Azure PowerShell or Azure Data Factory (Data Factory is particularly useful for
managing recurring data copying and mirroring).
Data sources
Source data might come from devices, sensors, data feeds, databases, flat files, other Azure services, and other on-premises or cloud services. For example, in Azure, you might use IoT hubs or event hubs to collect data, or the data may already be online or in other databases, such as Azure SQL Database.
Data destinations
After processing, data might be sent to dashboards, such as Power BI, or be consumed by services, such as
Azure Analysis Services. The data might also be copied for long-term storage to locations such as SQL
Data Warehouse or Data Lake Store.
The system is based on fixed roadside traffic cameras with automatic number plate recognition (ANPR)
software built in. These cameras are positioned at strategic junctions and locations on the road network,
and capture information about each vehicle that passes.
Information from the ANPR cameras needs to be passed to the Police Central Communications Center
(PCCC). Software at the PCCC determines whether vehicles are speeding, and can trigger functionality that
raises fines or summonses (depending on how much over the speed limit a vehicle is travelling). The PCCC
can also detect whether a vehicle is reported as stolen, and alert a nearby police patrol car to try to
intercept the vehicle.
Police patrol cars are equipped with devices that communicate with the PCCC, transmitting their current
location and speed. These devices can also receive alerts from the PCCC concerning the location of nearby
suspect and stolen vehicles.
Objectives
After completing this lab, you will have considered the requirements for a system that:
Captures a stream of vehicle details from each ANPR camera and sends it to the PCCC for processing.
Tracks the locations of police patrol cars, and communicates real-time information about suspect
vehicles to these patrol cars.
Generates real-time reports and other statistical information about vehicle speeds.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Lab Setup
Estimated Time: 60 minutes
Username: Admin
Password: Pa55w.rd
Generate real-time reports about the speeds captured by each camera (such as the average speed
during the last 30 seconds).
Detect whether a traffic incident has occurred that could require a patrol car to attend.
You need to determine the most appropriate technologies to implement these features. The solution must
be scalable to handle data from many hundreds of cameras, and be capable of processing data very
efficiently to enable the timely interception of stolen vehicles. It must also be able to detect anomalies,
such as the same car appearing to be in more than one place simultaneously; a car might be fitted with
false registration plates, for example.
Results: At the end of this exercise, you will have selected the technologies that support the stream
processing requirements listed in the scenario.
Creating reports that show how traffic flows vary over time at each speed camera.
Performing general analyses, such as the likelihood of speeding vehicles also being recorded as
stolen.
You need to determine the most appropriate technologies to implement these features.
Results: At the end of this exercise, you will have selected the technologies that support the batch
processing requirements listed in the scenario.
You have been asked how to store and structure this information.
Results: At the end of this exercise, you will have selected the technologies that support the data storage
and detailed analytical processing requirements of the system.
Review Question(s)
Question: What are likely to be your biggest challenges when planning for big data
processing in your organization?
Module 2
Processing Event Streams using Azure Stream Analytics
Contents:
Module Overview 2-1
Lesson 1: Introduction to Stream Analytics 2-2
Module Overview
In today’s business world, enterprises receive larger amounts of data from different applications and
devices faster than ever before. Being able to process this data as it flows into an organization’s data
platform is crucial to uncover real-time insights, like sales performance or maintenance issues. You use
Microsoft® Azure® Stream Analytics to achieve this real-time analysis on streaming data.
Stream Analytics is a managed service in Azure that you use to create analytic computations on streaming
data. You can connect a Stream Analytics job to different streaming inputs, transform the data using a
SQL-like query language to join, aggregate, sort, and filter data over a given time interval, then output the
data to one or many destinations—including relational and nonrelational data stores, Service Bus queues
or topics, and Power BI™. Customers use Stream Analytics to create a real-time stream processing
workflow quickly to gain insights from streaming data in minutes.
Objectives
By the end of this module, you will be able to:
Describe Stream Analytics and its purpose.
Lesson 1
Introduction to Stream Analytics
Stream Analytics is used to process large amounts of streaming data coming from devices, applications, or
processes. Stream Analytics jobs are configured to query the streaming data to filter the output, look for
patterns, and control the flow of data to the destination. You use Stream Analytics to create automation
workflows, perform real-time reporting, store data for batch processing, or raise alerts based on the
analytics performed on the data.
Lesson Objectives
By the end of this lesson, you should be able to:
Inputs: Event Hubs
Outputs: Power BI, Azure Cosmos DB
Event hubs have a concept known as a consumer group that provides separate views of the event hub to
enable multiple applications to connect and read the stream independently. It’s a best practice to create a
separate consumer group for each Stream Analytics job that connects to the event hub, though there is a
limit of 20 consumer groups per event hub. If your Stream Analytics job contains multiple SELECT
statements that connect to the same event hub consumer group, you should consider creating multiple
consumer groups and Stream Analytics inputs—one for each reference to the input data stream in your
query.
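For example, a query that refers to the same event hub stream in two SELECT statements could be split across two inputs, each bound to its own consumer group. The following is a minimal sketch; the input and output names are hypothetical:

-- [OrdersCG1] and [OrdersCG2] are two job inputs that point at the same
-- event hub but use separate consumer groups, so each SELECT reads the
-- stream independently.
SELECT ProductName, Price
INTO [HighValueOutput]
FROM [OrdersCG1]
WHERE Price >= 200

SELECT COUNT(*) AS OrderCount
INTO [OrderCountOutput]
FROM [OrdersCG2]
GROUP BY TumblingWindow(minute, 1)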
Types of output
Stream Analytics integrates seamlessly with many types of output. These outputs might be persistent storage, queues for further data processing, or Power BI for reporting of the streaming dataset. Outputs can also be in CSV, JSON, or Avro format.
The following outputs are commonly used, with their typical scenarios:
Azure Data Lake Store: used to store data for further batch processing by Azure Data Lake Analytics or HDInsight®.
Azure SQL DB: used to store data that is relational in nature and needs to be accessible via a SQL query.
Azure Blob storage: used to store data in a cost-effective and scalable manner that provides access to applications and batch processing solutions like Azure Data Lake Analytics and HDInsight.
Event Hubs: used to ingest data from the output of one Stream Analytics job to send to another streaming job for further processing.
Azure Table storage: used to store data in a low latency, highly scalable manner for integration with downstream applications.
Service Bus queues: used as a first in, first out (FIFO) message delivery service that has competing consumers. This can be used for round robin processing of the data by downstream applications.
Service Bus topics: used as a one-to-many message delivery service to send events to be processed by multiple downstream applications.
The following query selects all fields from the input and sends them to the output:
SELECT
*
INTO [SampleOutput]
FROM [SampleInput]
You might also select specific columns and filter the data based on a condition, as follows:
SELECT
ProductName, ProductCategory, Price
INTO [SampleOutput]
FROM [SampleInput]
WHERE Price>=200
The following is an example of how to use multiple outputs in one Stream Analytics job:
SELECT
*
INTO [SampleOutput1]
FROM [SampleInput]
WHERE Price>=200
SELECT
*
INTO [SampleOutput2]
FROM [SampleInput]
WHERE Price<200
Note that, while you can have multiple inputs and multiple outputs in a single Stream Analytics job, it’s
best practice to split unrelated queries into multiple Stream Analytics jobs. This helps optimize the
performance of each Stream Analytics job by reducing complexity and processing steps.
You use aggregate functions in the SELECT list of a SELECT statement, or in a HAVING clause.
The following is a list of aggregate functions you use with Stream Analytics:
Collect: returns an array with all values from the time window.
Percentile_Disc: returns a percentile based on the entire dataset sorted by the ORDER BY clause. The result will be equal to a value in the input dataset.
STDEVP: returns the standard deviation for the population for a set of values in the input.
TopOne: returns the top result from a set of values based on the ORDER BY clause.
VARP: returns the variance for the population of a set of values in the input.
The GROUP BY clause is typically used when aggregate functions are used in the SELECT statement.
GROUP BY clauses are used in Stream Analytics in the same way as a GROUP BY clause is used in T-SQL.
For example, to select the top 10 products by sales quantity over the last three minutes (windowing
functions are described in more detail in the next section):
SELECT
ProductName,
CollectTop(10) OVER (ORDER BY QuantitySold DESC)
FROM [SampleInput] TIMESTAMP BY time
GROUP BY TumblingWindow(minute, 3), ProductName
Sliding window
Sliding windows consider all possible windows of
the given length. However, to make the number
of windows manageable for Stream Analytics,
sliding windows produce an output only when an event enters or exits the window. Every window has at
least one event, and each event can be in more than one sliding window.
For example, you have the following input dataset, where each row is a time (in seconds) and a value:
Time 12: value 1
Time 19: value 2
Time 24: value 3
A sliding window of 10 seconds will produce windows ending at the following times:
19—value 2 enters the window; two values in window (value 1, value 2).
22—value 1 exits the window; one value in window (value 2).
24—value 3 enters the window; two values in window (value 2, value 3).
A window is not created at time 34, because this would create an empty window (value 3 exits the
window).
Tumbling window
Tumbling windows are fixed-size windows that do not overlap and are contiguous. When the timespan of
the window size has passed, a new window is immediately created with the same duration.
For example, to calculate the average temperature by device for each five-minute period:
SELECT
TimeStamp,
DeviceId,
AVG(Temp) AS AvgTemp
FROM [SampleInput] TIMESTAMP BY TimeStamp
GROUP BY DeviceId, TumblingWindow(mi, 5)
Hopping window
Hopping windows specify windows that overlap and recur on a fixed schedule. Hopping windows are
defined with a windowsize, a hopsize, and a timeunit.
For example, you have the following set of data, where each row is a timestamp (in seconds) and an event
name:
Time 3: event A
Time 5: event B
Time 8: event C
Time 12: event D
Time 16: event E
Time 20: event F
You specify a hopping window with a windowsize of 10, a hopsize of 5, and seconds for the time unit. This
will create the following windows:
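Working through the definition (each window covers the 10 seconds up to its end time, and a new window
ends every five seconds), the windows contain the following events:
Window ending at 5: A, B
Window ending at 10: A, B, C
Window ending at 15: C, D
Window ending at 20: D, E, F
Window ending at 25: E, F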
Notice that the windows are inclusive of the end of the window and exclusive of the beginning of the
window. You use the offsetsize parameter to change this behavior.
TIMESTAMP BY
There’s a timestamp associated with each event that is processed by Stream Analytics. Timestamps are
generally created based on the arrival time to the input source. For example, events in Blob storage have
a timestamp based on the blob’s last modified time; events captured with event hubs are given a
timestamp of when the event arrives. An event timestamp is retrieved by using the System.Timestamp
property in any part of the query.
Many streaming scenarios require processing events based on the time they occur, rather than the time
they are received. For example, point-of-sale systems typically need to process the data based on the
timestamp when the transaction occurs.
To process events based on a custom timestamp or the time they occur, you use the TIMESTAMP BY
clause in the SELECT statement:
SELECT
TransTime,
RegisterId,
TransId,
TransAmt
FROM [SampleInput] TIMESTAMP BY TransTime
You also use TIMESTAMP BY when joining multiple inputs. For example, the following query joins two
streams and a reference input:
SELECT
i1.StoreId,
i1.CustEntryTime,
i2.CustExitTime,
r.StoreDescription
FROM [SampleInput1] i1 TIMESTAMP BY CustEntryTime
JOIN [SampleInput2] i2 TIMESTAMP BY CustExitTime
ON DATEDIFF(minute, i1, i2) BETWEEN 0 AND 5
AND i1.StoreId=i2.StoreId
JOIN [ReferenceInput] r
ON i1.StoreId=r.id
4. View the output of the query in the browser and download output files if necessary.
After you have successfully tested your query in the portal, you save the query and start the Stream
Analytics job to begin processing events.
You can also take advantage of multiple SELECT INTO statements to output intermediate data when
testing your data streams and joins.
For example, you have the following query that is producing zero events:
SELECT
r1.StoreName,
r2.ProductName,
i1.ProductId,
i1.ProductQty
INTO [SampleOutput]
FROM [SampleInput] i1
JOIN [ReferenceInput1] r1
ON i1.StoreId=r1.StoreId
JOIN [ReferenceInput2] r2
ON i1.ProductId=r2.ProductId
You can rewrite the query with multiple outputs to test each step and data stream:
WITH StepOne AS
(
SELECT
r1.StoreName,
i1.ProductId,
i1.ProductQty
FROM [SampleInput] i1
JOIN [ReferenceInput1] r1
ON i1.StoreId=r1.StoreId
),
StepTwo AS
(
SELECT
s1.StoreName,
r2.ProductName,
s1.ProductId,
s1.ProductQty
FROM [StepOne] s1
JOIN [ReferenceInput2] r2
ON s1.ProductId=r2.ProductId
)
--Regular output to SampleOutput
SELECT
*
INTO
[SampleOutput]
FROM
StepTwo
--Log input data
SELECT
*
INTO
[TestOutput1]
FROM
[SampleInput]
--Log data from first join
SELECT
*
INTO
[TestOutput2]
FROM
StepOne
Now you can test and view the output for each step of the query to see why your data might not be
joining successfully.
1. On the Stream Analytics job blade in the Azure portal, click Job diagram in the SUPPORT +
TROUBLESHOOTING section on the left pane.
2. View the metrics for each query step by selecting the query step in the diagram.
3. To view the metrics for partitioned inputs or outputs, you select the ellipses (…) on the input or
output then select Expand partitions.
4. Click a single partition node to view the metrics for that partition.
5. Click the merger node to view the metrics for the merger.
The job diagram gives a helpful visual representation of your job that you use to identify issues and
bottlenecks quickly.
Lesson 2
Configure Stream Analytics jobs
It’s quick and easy to create a Stream Analytics job and begin to process incoming streaming data. To
create a scalable and reliable solution, you need to plan the design and configuration for Stream Analytics
deployments.
Lesson Objectives
By the end of this lesson, you should be able to:
Event hubs have between two and 32 partitions. These are specified at creation time and cannot be
changed, so you should consider the long term when you design the streaming solution. The number of
partitions in an event hub should correlate directly with the number of concurrent readers—in this case,
Stream Analytics jobs—that you have in your solution. For example, if you plan to have four Stream
Analytics jobs reading from one event hub, you should create the event hub with four partitions.
Azure blobs are stored in a container within a storage account. The partition key for a given blob is the
Azure storage account name, plus the container name and the blob name. In this scenario, each blob can
have its own partition.
o Simple queries, like a select-project-filter query, might not require the same query instance to
process the same key, and you can ignore this requirement.
Just as the input source must be partitioned, the query must also be partitioned. The PARTITION BY
clause must be used in each step of the query, and every step must be partitioned by the same key. To
be fully parallel, the partitioning key must be set to PartitionId.
Only event hubs and Blob storage output types allow partitioned output. The partition key must be
set to PartitionId for event hub outputs, but you don’t have to do anything for Blob storage outputs
because of how they are partitioned.
The number of input partitions must equal the number of output partitions. However, this doesn’t
matter when using Blob storage as an output, because it inherits the partitioning scheme of the
query. The following are example scenarios that allow for fully parallel jobs:
o Four input event hub partitions and four output event hub partitions.
o Four input Blob storage partitions and four output event hub partitions.
For example, the following query is partitioned by PartitionId and counts events by device over 30-second
tumbling windows:
SELECT
COUNT(*) AS Count,
DeviceId
FROM [SampleInput] PARTITION BY PartitionId
GROUP BY TumblingWindow(second, 30), DeviceId, PartitionId
Because the query has a grouping key (DeviceId), that key needs to be processed by the same query
instance each time. That means the input events must be partitioned when being sent to the event hub
input. Because DeviceId is the grouping key, the PartitionKey value of the input event data should use
DeviceId.
Even though the ideal scenario is to have embarrassingly parallel jobs, sometimes you can’t prevent jobs
that don’t fall into this category. The following are some cases where the Stream Analytics job is not
embarrassingly parallel:
Multistep queries that use a different partition key for the Partition By clause.
You’ll find the maximum number of SUs that your job can use by looking at the input, output, and query.
The amount of SUs depends on the number of steps in the query and the number of partitions for each
step. For queries that do not have any partitioned steps, the maximum number of SUs is six. For
partitioned queries, the maximum number of SUs for the job is the number of partitions, multiplied by the
number of partitioned steps, multiplied by six SUs per step.
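As a worked illustration of this formula: a job whose query has one partitioned step, reading from an event
hub with six partitions, can scale up to 6 partitions × 1 step × 6 SUs = 36 SUs.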
To handle events that are out of order or late arriving, you set Event ordering policies, which consist of a
late arrival tolerance window, an out of order tolerance window, and an action.
Late arrival tolerance window—the Stream Analytics job will accept late events with a timestamp
that is in the specified window.
Out of order tolerance window—the Stream Analytics job will accept out of order events with a
timestamp that is in the specified window.
Action—the Stream Analytics job will either Drop an event that occurs outside the acceptable
window, or Adjust the timestamp to the latest acceptable time.
When using the out of order tolerance window, Stream Analytics will buffer the events up to that window
then reorder the events to make sure they are back in the correct order before processing. The output of
the Stream Analytics job is delayed by the same amount of time as the out of order buffer.
Error policy
When processing streaming data, there might be several reasons why a Stream Analytics job sometimes
fails to write to the output. To remedy this, you specify how errors are handled in the Error policy blade.
You set the Action to one of two settings:
Drop—drops any events that cause errors when writing to the output.
Retry—retries writing the event until it succeeds (output processing is blocked while the write is retried).
The following roles can be assigned to control access to a Stream Analytics job:
Contributor: Provides access to manage everything about the resource except for access.
Reader: Provides access to view all information about the resource, but not change anything.
Log Analytics Contributor: Provides access to read all monitoring data and edit monitoring settings,
including settings for Azure Log Analytics and Diagnostics.
Log Analytics Reader: Provides access to read all monitoring data, including settings for Azure Log
Analytics and Diagnostics.
Monitoring Contributor: Provides access to read all monitoring data and edit monitoring settings.
User Access Administrator: Provides access to manage user and group access to the resource.
1. On the Stream Analytics blade, click Access Control (IAM) in the left pane.
2. Click Add to create a new role assignment.
3. Select the role for the user or group to be added to, then use the search box to select the user or
group.
4. Click the Save button at the bottom of the blade to complete adding the user or group to the role.
The following shows a list of Stream Analytics metrics you can view in the Azure portal:
Out-of-order events: Number of events that were received out of order and processed based on the event
ordering policy.
Data conversion errors: Number of errors that the Stream Analytics job encountered when attempting to
convert data types.
Runtime errors: Number of errors during the execution of the Stream Analytics job.
Late input events: Number of late-arriving events processed by the event ordering policy.
Failed function requests: Number of failed Azure Machine Learning function calls.
Function events: Number of events sent to the Azure Machine Learning function.
Input event bytes: Amount of data in bytes received by the Stream Analytics job.
2. Click the Add alert button on the ribbon on the Metric blade.
3. Enter a name and a description for the alert and choose a metric that is used to trigger the alert.
An email will be sent to the email address(es) specified when the metric hits the threshold you provided.
1. On the Stream Analytics job blade, click Diagnostic logs in the left panel.
4. Click Configure then select the storage account to use for diagnostic log collection, and click OK.
5. Under LOG, check the boxes for Execution and Authoring and set the retention policy.
6. Under METRIC, check the box for 1 minute and set the retention policy.
7. Click Save on the top ribbon to save the diagnostic logging settings.
You will now be able to view, search, and filter through the diagnostic logs in the Activity log blade to
perform troubleshooting and auditing on your Stream Analytics job. You can also download the
diagnostic data from your specified storage account if you want to examine and process it locally.
It’s also recommended to keep the solution in one Azure region to prevent data egress from the Azure
datacenter. You are not charged for data that is streaming into the Azure datacenter, but if your
streaming solution sends data from one datacenter to another, you will be charged for that data egress
between Azure datacenters. It’s also typically recommended to create compute solutions where the data
rests, instead of moving data to different regions to be analyzed. This will reduce the latency, complexity,
and cost of your solution.
You can also use the Stream Analytics .NET SDK to create and manage Stream Analytics jobs. For more
information, see:
Management .NET SDK: Set up and run analytics jobs using the Azure Stream Analytics API
for .NET
https://aka.ms/quwsdf
For the first phase of the project, you will use Stream Analytics, together with Event Hubs, IoT Hubs,
Service Bus, and custom applications to:
Objectives
After completing this lab, you will be able to:
Reconfigure a Stream Analytics job to send output through a Service Bus queue.
Reconfigure a Stream Analytics job to process both event hub and static file data.
Use multiple Stream Analytics jobs to process event hub, IoT hub and static file data, and output
results using a Service Bus and custom application.
Use the Azure portal and PowerShell to manage and scale Stream Analytics jobs.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
Task 8: Generate event hub data for processing with Stream Analytics
Results: At the end of this exercise, you will have created an Azure Data Lake Store, an event hubs
namespace, and a Stream Analytics job. You will then use Stream Analytics to process event hubs data,
and view the results in a Power BI dashboard and in Data Lake Store.
Task 7: Generate IoT hub data for processing with Stream Analytics
Results: At the end of this exercise, you will have created a Data Lake Store, an IoT hub, and a new Stream
Analytics job. You will then use Stream Analytics to process IoT hub data, and view the results in a Power
BI report and in Data Lake Store.
Task 5: Prepare an application to receive Stream Analytics data using a Service Bus
Task 6: Generate IoT hub data for processing with Stream Analytics
Results: At the end of this exercise, you will have created an Azure Service Bus namespace, and
reconfigured an existing IoT hub and an existing Stream Analytics job. You will then use Stream Analytics
to process IoT hub data
and to send results to the Service Bus. Finally, you will use a custom Visual Studio application to view the
output of the Service Bus.
4. Update the event hub and add two more consumer groups
Task 1: Create a Blob storage account for holding stolen vehicle data
Task 4: Update the event hub and add two more consumer groups
Task 9: Generate event hub data for processing with Stream Analytics
Results: At the end of this exercise, you will have uploaded data to a new Blob storage container, updated
your event hub with new consumer groups, and reconfigured your TrafficAnalytics Azure Stream Analytics
job to use these new inputs. You will then use Stream Analytics to process the event hubs data, and view
the results in a Power BI dashboard, and in Data Lake Store.
Exercise 5: Use multiple Stream Analytics jobs to process event hub, IoT
hub and static file data, and output results using a Service Bus and custom
application
Scenario
For the final part of this initial phase in the development of the traffic surveillance system, you have been
asked to add the ability to determine the nearest patrol car to a speeding vehicle or stolen vehicle, send a
dispatch alert to the nearest patrol car, and then show the dispatched patrol car locations on a map.
Specifically, the system must be able to identify the nearest patrol car to a speeding or stolen vehicle, and
then send a message (using Service Bus) to that patrol car. The message would contain details about the
vehicle’s registration number, location, and speed. Any patrol car situated within five kilometers of the
stolen or speeding vehicle’s most recently reported location could then be dispatched to that location.
The message should contain the ID of the patrol car, the registration number of the stolen vehicle, and
the coordinates of the location where the vehicle was observed. In this exercise, you will create a new
Service Bus topic, and add a subscription to the topic. You will use this topic to send alert messages to
patrol cars about stolen vehicles. Patrol car devices will subscribe to the subscription in this topic.
8. Generate event hub and IoT hub data for processing with Stream Analytics
Task 2: Reconfigure the IoT hub and add a new consumer group
Task 3: Reconfigure the event hub and add a new consumer group
Task 7: Start the TrafficAnalytics Azure and PatrolCarAnalytics Stream Analytics jobs
Task 8: Generate event hub and IoT hub data for processing with Stream Analytics
Task 9: Start an application to receive Stream Analytics data using a Service Bus
Created a new Service Bus topic, and added a subscription to this topic.
Reconfigured the IoT and event hubs, and added a new consumer group to each hub.
Reconfigured the TrafficAnalytics Azure Stream Analytics job to use these new inputs, and to use the new
Service Bus topic as a job output.
Updated the job query to send data to the Service Bus topic, by using a Visual Studio application.
Exercise 6: Use the Azure portal and Azure PowerShell to manage and scale
Stream Analytics jobs
Scenario
You have been asked how the traffic surveillance system could cope with any large-scale incident or event
that requires additional police resources being brought on stream. You have also been asked how the
system might be monitored and managed, and to demonstrate any potential for automation. In this
exercise, you will monitor a Stream Analytics job, and create an alert when the job uses more than a
threshold number of streaming units. You will then use the Azure portal to scale up the job, and review
the streaming unit utilization. You will use the Azure PowerShell cmdlets for Stream Analytics to stop the
Stream Analytics job, to scale the job back down, and then to restart the job. Finally, you will use job
diagrams to visualize the configurations of your two Stream Analytics jobs.
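The PowerShell steps in this exercise use the AzureRM Stream Analytics cmdlets. As a minimal sketch of
stopping and restarting a job (using the resource group and job names from this lab's setup):
# Stop the Stream Analytics job
Stop-AzureRmStreamAnalyticsJob -ResourceGroupName "CamerasRG" -Name "TrafficAnalytics"
# Restart the Stream Analytics job
Start-AzureRmStreamAnalyticsJob -ResourceGroupName "CamerasRG" -Name "TrafficAnalytics"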
4. Use Azure PowerShell to scale down and restart a Stream Analytics job
6. Lab closedown
Task 4: Use Azure PowerShell to scale down and restart a Stream Analytics job
Used Azure PowerShell to scale down and restart a Stream Analytics job.
Question: What data types would you process using Stream Analytics within your
organization?
Question: How might you use multiple Stream Analytics jobs within your organization?
Review Question(s)
Question: How might you implement Stream Analytics within your organization?
Module 3
Performing Custom Processing in Azure Stream Analytics
Contents:
Module Overview 3-1
Lesson 1: Implementing custom functions and debugging jobs 3-2
Module Overview
This module describes how to use custom functions in Microsoft® Azure® Stream Analytics and includes
an understanding of how to use Microsoft Azure Machine Learning with Stream Analytics. It also covers
how to test and debug Stream Analytics jobs.
Objectives
By the end of this module, you will be able to:
Use custom functions that are implemented by using JavaScript in Stream Analytics jobs.
Integrate Machine Learning models into a Stream Analytics job.
Lesson 1
Implementing custom functions and debugging jobs
This lesson explains how to create user-defined functions (UDFs) and use them in Stream Analytics. It also
explains how to test and debug Stream Analytics jobs.
Lesson Objectives
By the end of this lesson, you should be able to:
o Function type. This is either JavaScript UDF or machine learning. For this purpose, select
JavaScript UDF.
o Output type. Select the required output type. The supported data types are array, bigint,
datetime, float, nvarchar(max), and record. The “Any” type can be used if you want a function to
return different data types in different scenarios; you then decode the result at the point of the
function call.
7. This UDF can now be called from the query within a Stream Analytics job just like any scalar function.
Collect() function. Collect() is a built-in function in the Azure Stream Analytics Query Language and is
used to return an array containing all the records from a specific time window.
It’s important to note that, even though UDFs normally work on a row-by-row basis, they also take an
entire dataset for a specific time window as an input if you use the Collect() function. The following
example shows a simple row-by-row UDF call:
SELECT
time,
UDF.Text2Int(offset_parameter) AS Text2IntOffset
INTO
ASAOutput
FROM
ASAInputStream
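The windowed pattern mentioned above might look like the following sketch, where UDF.SummarizeWindow
is a hypothetical JavaScript UDF that receives the array produced by Collect() for each window:
SELECT
System.Timestamp AS WindowEnd,
UDF.SummarizeWindow(Collect()) AS WindowSummary
INTO
ASAOutput
FROM
ASAInputStream
GROUP BY TumblingWindow(second, 30)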
It’s important to note certain conversions between Stream Analytics object data types and JavaScript data
types. For example, JavaScript can only represent integers exactly up to 2^53, and the JavaScript Date data
type only supports millisecond precision. The following table provides a mapping between them:
Stream Analytics type: JavaScript type
bigint: Number
double: Number
nvarchar(MAX): String
DateTime: Date
Record: Object
Array: Array
NULL: Null
Testing jobs
After you’ve created a Stream Analytics job, it’s
important to test it against a sample dataset. The
Azure portal provides functionality to upload a
sample file to test the Stream Analytics job. You
should use the following steps:
3. Right-click on the input where the dataset needs to be uploaded for testing. You will see two options:
o Upload sample data from file. Use this option to upload an input file for test purposes.
o Sample data from input. This option extracts a portion of data based on Start date time and
duration of data collection.
4. When the test data is either uploaded or extracted, you click the Test button. This triggers the
execution of the Stream Analytics job with the test data.
5. After you execute the Stream Analytics job, the results are displayed on screen and available for
download.
It’s important to test your Stream Analytics job with sample data, to check that the function or query works
as expected, before executing the job against the full dataset.
Debugging jobs
Stream Analytics provides a highly efficient
platform for quickly creating large-scale data
processing applications. As the application/query
logic becomes more complex, it can become
difficult to debug jobs when you run into issues.
For this purpose, it’s important to understand
how to debug Stream Analytics jobs.
The Azure portal provides an activity log for Stream Analytics jobs for events like Critical, Error, Warning
and Informational. It’s useful to review the activity log to understand the issues and resolve them
accordingly. Note that informational events often contain the actual details of the cause of a failure,
whereas the corresponding error events might state only that an error has occurred.
It’s important to note that the Azure portal also provides a capability to add an activity log alert when
errors occur. You can configure the Azure portal to send alerts via an email message, SMS or webhook.
This demonstration analyzes the prices of stocks in a stock market, and attempts to work out which stocks
are more volatile than others.
The MostVolatile UDF returns the stock ticker, the number of price changes, and the maximum
and minimum prices for the most volatile stock item from all the stock market price changes recorded in a
given time window.
The UDF applies the following rules:
The price must have changed at least 10 times in the time window.
The difference between the maximum and minimum prices must be greater than that of all other
stocks that have changed prices at least five times.
In the event of a tie, the number of price changes decides which is the most volatile.
Question: Can you use Stream Analytics to process data for fraud or threat detection?
Lesson 2
Incorporating Machine Learning into a Stream Analytics
job
This lesson describes how to use Machine Learning models in Stream Analytics jobs.
Lesson Objectives
At the end of this lesson, you should be able to:
Azure Machine Learning Studio. This is a graphical tool you use to design, implement, test, and deploy
machine learning functions. Machine Learning Studio provides a large library of preprocessing routines
you use to prepare data for machine learning from the raw data, run experiments on the prepared data
using machine learning algorithms, and test the model. After an effective model is found, Machine
Learning Studio also helps to deploy the model.
Data preprocessing modules. Machine Learning provides a large library of data preprocessing modules
you use to process and prepare raw data into processed data on which machine learning algorithms are
executed.
Machine Learning algorithms. Machine Learning provides many algorithms such as Multiclass Decision
Jungle, Two-Class Boosted Decision Tree, One-vs-All Multiclass, Bayesian Linear Regression, Boosted
Decision Tree Regression, Neural Network Regression.
Machine Learning API. After an effective model is deployed, Machine Learning provides a rich API for
downstream applications to consume this model—for example RESTful web services.
1. Log in to the Azure portal and open the specific Stream Analytics job where you need to add a UDF.
Records
A record is a collection of name and value pairs
that JSON uses extensively. A typical JSON file
has a structure similar to that of name and value
pairs. For example, consider the following sensor
information for a single event in JSON format:
{
"DeviceIdentification":"ABC123",
"LocationInformation":{"Lat":"100", "Long":"200"},
"SensorInformation":{
"Temperature": "70",
"Humidity": "50",
"CustomSensor01": "10",
"CustomSensor02": "20",
"CustomSensor03": "30"
}
}
Many events similar to this are produced by the sensors and are then sent as a data stream to Stream
Analytics. The Stream Analytics job processes this information to either store it in a database or send it to
another application for further processing.
To access this information in a Stream Analytics query, you use the dot notation to reference each specific
field. For example:
SELECT
DeviceIdentification,
LocationInformation.Lat,
LocationInformation.Long,
SensorInformation.Temperature,
SensorInformation.Humidity
FROM input
Arrays
Arrays are an ordered collection of values. Arrays are particularly useful when you don’t know in advance
how many sets of name and value pairs will arrive for each event, so you use an array to represent that
information. Consider the same sensor information as in the preceding example, but with many point
measurements included in the form of an array, as follows:
{
"DeviceIdentification": "ABC123",
"LocationInformation": {"Lat": "100", "Long": "200"},
"SensorInformation":
[
{
"Temperature": "50",
"Humidity": "50",
"CustomSensor01": "10",
"CustomSensor02": "20",
"CustomSensor03": "30"
},
{
"Temperature": "60",
"Humidity": "60",
"CustomSensor01": "20",
"CustomSensor02": "30",
"CustomSensor03": "40"
},
{
"Temperature": "70",
"Humidity": "70",
"CustomSensor01": "30",
"CustomSensor02": "40",
"CustomSensor03": "50"
}
]
}
In this example, SensorInformation is an array structure as denoted between [ and ] brackets and has three
array elements—each element has five name/value pairs.
Stream Analytics provides functionality like CROSS APPLY that makes it easy to extract information from
such complex array data types:
SELECT
t.DeviceIdentification,
t.LocationInformation.Lat,
t.LocationInformation.Long,
flat.ArrayValue.Temperature,
flat.ArrayValue.Humidity,
flat.ArrayValue.CustomSensor01,
flat.ArrayValue.CustomSensor02,
flat.ArrayValue.CustomSensor03
FROM
Input t
CROSS APPLY GetArrayElements(t.SensorInformation) AS flat
The preceding query produces a flat table that has three records, as follows:
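Based on the sample event shown earlier, the three flattened records are:
ABC123, 100, 200, 50, 50, 10, 20, 30
ABC123, 100, 200, 60, 60, 20, 30, 40
ABC123, 100, 200, 70, 70, 30, 40, 50
(The columns are DeviceIdentification, Lat, Long, Temperature, Humidity, and CustomSensor01 through
CustomSensor03.)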
You can also call an Azure Machine Learning model from a query as a UDF. For example (note that
weatherForecast names both the Machine Learning UDF and the CTE that wraps it):
WITH weatherForecast AS (
SELECT
inputParam,
weatherForecast(inputParam) AS result
FROM input
)
SELECT
inputParam,
result.[expectedTemperature]
INTO output
FROM weatherForecast
weatherForecast(inputParam) calls the Machine Learning UDF with the inputParam and receives the result.
This demonstration continues the stock market scenario. Stream Analytics captures the tickers and new
price each time a price change occurs. The Stream Analytics job uses machine learning to detect whether
a price is unusual for the stock (much higher than might be expected given the price history of the stock).
For the second phase of the project, you will use Stream Analytics, together with Event Hub, IoT Hubs,
Service Bus, Machine Learning, and custom applications to:
Post messages to a Service Bus queue for all vehicles that are speeding, by using a simple JavaScript
UDF to determine whether a vehicle’s speed is above a particular limit.
Identify vehicles that appear to be using the same registration number, by using a JavaScript UDF to
determine whether a vehicle with the same registration (not necessarily speeding) has
been spotted at two locations that are an impossible distance apart within a given timeframe.
Identify traffic flow issues, such as road blockages, or excessive speeds, by using Machine Learning
with Stream Analytics to detect consistent speed anomalies from a speed camera; for example, if
speeds are consistently very low for a period, the cause could be a traffic accident or incident.
Objectives
After completing this lab, you will be able to:
Use a Stream Analytics UDF to identify specific data points.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Lab Setup
Estimated Time: 90 minutes
Virtual machine: 20776A-LON-DEV
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
This lab uses the following resources from Lab 2, all in resource group CamerasRG:
TrafficAnalytics
PatrolCarAnalytics
Results: At the end of this exercise, you will have added a new consumer group to your event hub, added
a new queue to your Service Bus, reconfigured an existing Stream Analytics job to use these resources,
and added a UDF that returns an integer value indicating whether a vehicle is speeding. You will also have
tested this logic using Visual Studio apps.
Results: At the end of this exercise, you will have added a new consumer group to your event hub, and
created a new Stream Analytics job that uses a UDF to identify duplicate vehicle registrations. You will also
have tested this logic using a Visual Studio app.
4. Stop the Speed Camera app and the CaptureTrafficData Stream Analytics job
10. Add another consumer group to the Speed Camera event hub
11. Create a new Service Bus queue
16. Start the Location Alerts app and view the results
17. Lab closedown
Task 4: Stop the Speed Camera app and the CaptureTrafficData Stream Analytics job
Task 10: Add another consumer group to the Speed Camera event hub
Task 15: Start the Patrol Car and Speed Camera apps
Task 16: Start the Location Alerts app and view the results
Results: At the end of this exercise, you will have added new consumer groups to your event hub, and
created a new Stream Analytics job that works with a Visual Studio app to generate training data. You will
then create a machine learning experiment to detect anomalous data, train your model using the
generated training data, and then deploy the trained model as a web service. You will also have created a
second new Stream Analytics job that uses the Machine Learning web service, and tested the model using
Visual Studio apps.
Question: How might you use Stream Analytics UDFs to identify specific data points
within your organization?
Question: What requirements does your organization have for deduplicating information?
Review Question(s)
Question: What requirements does your organization have for implementing custom
functions and debugging jobs?
Module 4
Managing Big Data in Azure Data Lake Store
Contents:
Module Overview 4-1
Lesson 1: The Azure Data Lake Store 4-2
Module Overview
Microsoft® Azure® Data Lake Store is a hyperscale distributed file service that is part of the Azure Data
Lake collection of services. Data Lake Store plays a key role in the process of the management and
analysis of big data. Data Lake Store, sometimes referred to as an analytics store, provides a staging area
for substantial amounts of data where transformation or preparation and other analytics jobs are
performed. Data Lake Store fully incorporates tools and interfaces—such as Visual Studio®, PowerShell™,
and U-SQL™—that are commonly used by developers, data scientists, engineers, and architects. It is
extensible through its compatibility with open-source big data solutions like those found within the
Hadoop ecosystem.
Objectives
By the end of this module, you will be able to:
Lesson 1
The Azure Data Lake Store
This lesson describes how to use the Data Lake Store service to create and manage large-scale storage
structures. You use a Data Lake Store to hold vast amounts of data, unconstrained by the storage capacity
limits of a single computer. However, to utilize large-scale storage effectively, you have to understand
how to structure and manage the data so that you can find it quickly.
Lesson Objectives
At the end of this lesson, you should be able to:
Populate the Data Lake Store with data using the Azure portal and other methods.
Describe how to optimize access to a Data Lake Store.
There are no limits for account sizes, file sizes, or the amount of data stored in a Data Lake store. It holds
files ranging in size from kilobytes to petabytes, with no limit on the duration during which a file is stored.
Examples of information typically held in a Data Lake store include CSV or text files, data streamed from
medical devices or automobiles, video and audio files, databases, and much more.
Data Lake Store is an analytics store that is primed for high capacity I/O operations and compute
functions. This distinguishes Data Lake Store from other cloud-based storage solutions that are typically
designed for relatively static data, usually stored as blobs (such as in OneDrive® or Azure storage
containers). However, the complexity and design of the underlying storage system comes at a premium
when compared to other Azure storage solutions, and might make it more expensive to use. Therefore, it
is important to store data efficiently to minimize these costs.
Name: The name of your Data Lake Store must be lowercase and contain between three and 24
alphanumeric characters. The name is also given a suffix, so that the full name resembles:
myadsl.azuredatalakestore.net
Subscription: Any available subscription that is assigned to your Azure Active Directory (Azure AD)
account.
Resource Group: All resources within Azure must reside in a Resource Group (RG). You have the option to
create one or to select an existing one from a drop-down menu.
Location: At the time of writing, the available regions for Data Lake Stores are East US 2, Central US, and
North Europe. Other regions are expected to come online.
Encryption: Encryption is enabled by default, using keys that are managed by the Data Lake Store service.
Other options are no encryption, or encryption using keys from a personal Key Vault account.
For detailed information on using the Azure portal to create a Data Lake Store account, see:
Get started with Data Lake Store using the Azure portal
https://aka.ms/Yixagp
You can also create a Data Lake Store programmatically, using interfaces that are available for PowerShell,
Azure CLI, and Visual Studio. Modules are also available for open source languages, such as Python. You
use these to automate many common Data Lake Store tasks.
The following example shows how to create a Data Lake Store using the PowerShell interface (the account
creation step itself is a reconstruction; the resource group and location values are placeholders):
# Specify a subscription
Set-AzureRmContext -SubscriptionId <subscription ID>
# Create the Data Lake Store account
$dataLakeStoreName = "<store name>"
New-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName -ResourceGroupName "<resource group>" -Location "East US 2"
# Verify that the Data Lake Store account has been created
Test-AzureRmDataLakeStoreAccount -Name $dataLakeStoreName
This example uses the default settings for encryption and the pricing tier, but you can modify these values
by using the –Encryption and –Tier parameters to the New-AzureRmDataLakeStoreAccount command.
Some services, such as Stream Analytics, can specify a Data Lake Store as an output sink. However,
whichever scheme you use, you should ensure that
you can easily locate data quickly—don’t just create folders in which to “dump” data and depend on the
file name to determine what the file contains. Additionally, analytics processors such as Data Lake
Analytics work better with a few large data files, rather than many small ones—you should structure
datasets into large chunks, each of which is processed as a single item. Microsoft recommends that you
organize datasets into files of 256 MB or larger.
Note: If you have many small files, consider using a preprocessor that combines these files
into larger pieces before passing them to an analytics processor.
The simplest way to create folders, upload files, and navigate the structure of a Data Lake Store is to use
the Data Explorer that is available on the Data Lake Store blade in the Azure portal. However, if you are
automating tasks, you use the programmatic interfaces that are available to the many programming
languages supported by Azure.
The following code shows how to create a new folder and view the contents of a folder by using
PowerShell:
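A minimal sketch, assuming the AzureRM.DataLakeStore module and a placeholder account name:
# Create a new folder in the store
$dataLakeStoreName = "<store name>"
New-AzureRmDataLakeStoreItem -AccountName $dataLakeStoreName -Path "/mynewfolder" -Folder
# View the contents of the folder
Get-AzureRmDataLakeStoreChildItem -AccountName $dataLakeStoreName -Path "/mynewfolder"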
Use the Azure portal to create Data Lake Store folders and upload data.
Use the Azure portal to download data from Data Lake Store.
Whilst the Import/Export service provides a fast way to onboard large amounts of data, for smaller
datasets you will likely use one of several software-based tools, such as PowerShell, AzCopy, AdlCopy,
Visual Studio, or Azure Data Factory.
You upload and download individual files programmatically using PowerShell. You use the Import-
AzureRmDataLakeStoreItem cmdlet to upload a file, and the Export-AzureRmDataLakeStoreItem cmdlet
to download a file.
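As a sketch (the account, file, and path names are placeholders):
# Upload a local file to the store
Import-AzureRmDataLakeStoreItem -AccountName "<store name>" -Path "C:\Data\sales.csv" -Destination "/data/sales.csv"
# Download the file back to the local machine
Export-AzureRmDataLakeStoreItem -AccountName "<store name>" -Path "/data/sales.csv" -Destination "C:\Data\sales-copy.csv"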
If you need to transfer multiple files quickly, there’s a two-stage process. The first stage involves moving
the data into Blob storage, typically using the AzCopy utility. The second stage involves moving the data
across Azure’s data plane from Blob storage to your Data Lake Store; the AdlCopy utility is ideal for this
part of the process.
The following commands show how to use AzCopy and AdlCopy to transfer all CSV files in a specified
folder on an on-premises computer into a Data Lake Store:
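A sketch of the two stages (the storage account, container, key, folder, and store names are placeholders,
and the AzCopy syntax shown is the 5.x Windows release):
Stage 1, AzCopy (on-premises folder to Blob storage):
AzCopy /Source:C:\Data /Dest:https://<storage account>.blob.core.windows.net/<container> /DestKey:<storage key> /Pattern:"*.csv"
Stage 2, AdlCopy (Blob storage to Data Lake Store):
AdlCopy /Source https://<storage account>.blob.core.windows.net/<container>/ /Dest swebhdfs://<store name>.azuredatalakestore.net/<folder>/ /SourceKey <storage key>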
Note that AdlCopy operates in two modes: standalone and by using a Data Lake Analytics account. In
standalone mode, the AdlCopy utility uses resources provided by the Data Lake Store service, and
performance might be unpredictable, especially if you are transferring large files. Using a Data Lake
Analytics account causes AdlCopy to run as an analytics job using the resources that you specify (and are
billed for). This mode of operation is more predictable. For more detailed information on using AzCopy
and AdlCopy, see:
Note: Utilities are also available for transferring data from HDInsight cluster storage
(Distcp), and Azure SQL Database (Sqoop).
You might also use the Cloud Explorer extension in Visual Studio—this is useful if you’re working on
complex projects. Visual Studio gives access to Blob storage and Data Lake Stores, providing the capability
for you to transfer data in and out of Data Lake Store. You also use Cloud Explorer to transfer data
between stores held in separate Azure accounts. However, Visual Studio enforces a queuing mechanism
for requests, so the rate of transfer in and out of Data Lake Store will depend on how many operations
you are attempting to perform from Visual Studio at any given time.
ExpressRoute
https://aka.ms/Huxmw2
You should also ensure that the tools you are using (AzCopy, AdlCopy, PowerShell, and so on) are making
the best use of resources. Maximize parallelization wherever possible. For example, if you are using
AdlCopy, run using a Data Lake Analytics account and specify an appropriate number of Data Lake
Analytics units. If you are using the PowerShell Import-AzureRmDataLakeStoreItem cmdlet, specify the
PerFileThreadCount and ConcurrentFileCount parameters appropriately. Use the PerFileThreadCount
parameter to set the number of threads that are used in parallel for uploading each file. The
ConcurrentFileCount parameter indicates the maximum number of files that might be uploaded
concurrently.
It’s also important to understand that you are responsible for handling disaster recovery. Data Lake Store
is robust and managed by Microsoft, but you should always ensure that you have at least two copies of
critical data stored in separate regions to protect you from unscheduled outages and other regional
disasters. You automate this process by using scripts that run AdlCopy to copy data from one store to
another, or by using Data Factory to perform these tasks according to a regular schedule. This approach is
discussed further in Module 9: Automating the Data Flow with Azure Data Factory.
You should also protect your data to ensure that it can’t be overwritten or deleted (accidentally or
maliciously) by applying the appropriate security controls, as discussed in Lesson 2. Additionally, you
should consider applying an Azure resource lock over a Data Lake Storage account to prevent the entire
account from being removed.
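As a sketch of the resource lock approach in PowerShell (the lock, store, and resource group names are
placeholders):
# Prevent the Data Lake Store account from being deleted
New-AzureRmResourceLock -LockName "ProtectDataLake" -LockLevel CanNotDelete -ResourceName "<store name>" -ResourceType "Microsoft.DataLakeStore/accounts" -ResourceGroupName "<resource group>"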
Part 2—upload a set of files to Data Lake Store from Blob storage:
o Use AdlCopy to transfer files from Blob storage to Data Lake Store.
Question: What makes a Data Lake Store an unlikely choice for the replacement of
corporate file shares?
Lesson 2
Monitoring and protecting data in Data Lake Store
In this lesson, you will learn about the techniques that are available for protecting the data held in Data
Lake Store, and how to prevent unauthorized access to this data. You will also see how to monitor data
and track attempts to access that data.
You will consider the Access Control List (ACL) model employed by Data Lake Store. Most of the
techniques and concepts will be familiar to those with a Microsoft background but one subject in
particular—the POSIX ACL model—might be entirely new. You will also learn about additional features
such as network security and encryption, which typically work in the same way as they do in other Azure
services.
Lesson Objectives
By the end of this lesson, you should be able to:
Describe how to encrypt data in Data Lake Store, and manage the encryption keys.
Prevent access to Data Lake Store requests that originate from unknown sites.
Explain how to apply authentication and security in applications that use a Data Lake Store.
Encrypting data
All interactions with a Data Lake Store take place
over an HTTPS connection. This protocol helps to
ensure that all data that enters or exits the
service is encrypted. However, a Data Lake Store
also encrypts data as it is stored—even if a
successful attempt is made to break into the
service, the data itself is unusable without the
appropriate encryption keys.
You use Azure’s Key Vault service for the management of encryption keys. There are two modes of master
encryption key (MEK) management in Data Lake Store: keys managed for you by the Data Lake Store
service, and customer-managed keys held in your own Key Vault.
Access to the MEK, or MEKs, is required to access data within a store. The following table lists the
comparison of capabilities between the two methods. Note that, after you choose the method at the time
of creation, it can’t be changed unless data is migrated to a new store.
When is my data encrypted? In both modes, it is encrypted prior to being stored.
Are encryption keys stored in the clear anywhere outside of the Key Vault? No, in neither mode.
Is it possible to retrieve my MEK from Key Vault? No. After the MEK is stored in Key Vault, it is locked and
is then only used for encryption and decryption.
Who owns the Key Vault instance and the MEK? With service-managed keys, the Data Lake Store service.
With customer-managed keys, the owner of the Azure subscription also owns the instance of Key Vault
that stores the MEK; note that a Hardware Security Module (HSM) can also be used, as provided by the
service.
The following table summarizes the three key types used within the design of Data Lake encryption:
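In outline (reconstructed here from the public Data Lake Store encryption documentation): the Master
Encryption Key (MEK) is stored in Key Vault and encrypts the Data Encryption Key (DEK); the DEK is held,
encrypted, in persistent storage and acts as the parent key for the Block Encryption Keys (BEKs), which are
derived from the DEK for each block of data and used to encrypt that block.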
Encryption in Data Lake Store is transparent. This feature means that data is encrypted before being
persisted, and then decrypted prior to retrieval. This approach is especially important to note in the
context of applications that are called by APIs. The good news is that, by using this model, no special
consideration is required for interaction with Data Lake.
https://aka.ms/Unxahe
Note that the firewall is disabled by default, but you enable it and set client IP address ranges using the
Azure portal. You can also perform these tasks programmatically. The following example uses PowerShell:
# Log in to Azure
Login-AzureRmAccount
# Specify a subscription
Set-AzureRmContext -SubscriptionId <subscription ID>
# Enable the firewall for the specified Data Lake Store account
$accountName = "<account name>"
Set-AzureRmDataLakeStoreAccount -Name $accountName -FirewallState Enabled
# Enable Azure services and applications to connect (including Data Lake Analytics)
Set-AzureRmDataLakeStoreAccount -Name $accountName -AllowAzureIpState Enabled
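You can also add firewall rules for client IP address ranges programmatically; for example (the rule name
and address range are placeholders):
# Allow a range of client IP addresses through the firewall
Add-AzureRmDataLakeStoreFirewallRule -AccountName $accountName -Name "OfficeClients" -StartIpAddress "<start IP>" -EndIpAddress "<end IP>"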
End-user authentication
Service-to-service authentication
Use the portal to create an Azure AD application and service principal that accesses resources
https://aka.ms/Kk4w9t
Service-to-service authentication occurs when a custom web application, which operates on behalf of
users, needs to access a resource, such as Data Lake Store. The web application runs using its own identity
rather than impersonating a user, and must therefore be authenticated before it retrieves or modifies data
in the store. This approach requires you to create an Azure AD web application to perform the
authentication and issue OAuth 2.0 access tokens. Your custom web application passes identity
information to the Azure AD web application, which validates the credentials and returns access tokens
that your web application uses to access the Data Lake Store. For more information, see:
Authorizing users
User authorization operates at two levels in Data
Lake Store. You use RBAC to specify or limit the
operations that a user performs across the entire
store; you apply Access Control Lists (ACLs) to
control the operations that users perform over
specific files and folders in the store.
For example, a user in the Contributor role creates folders and uploads files to the store, and the Reader
role accesses files held in the store but
does not modify them. You assign users from your Azure AD directory to one or more of these roles to
give them the corresponding access rights. Note that RBAC is not specific to Data Lake Store, and can be
applied across many Azure services. For more information about RBAC, visit:
ACLs give you a finer-grained degree of control over individual files and folders. They define three
privileges: Read (R), Write (W), and Execute (X). You assign these privileges to different sets of users:
owners, named users and groups, and everyone else. An owner is the user who creates a file or folder.
Named users and groups refer to identities held in your Azure AD directory (this includes system-defined
groups and identities, such as “Azure Key Vault” and “Azure Machine Learning”). You assign ACLs to files
and folders using the Access tool in Data Explorer in the Azure portal. You also assign ACLs
programmatically when you upload files and create folders. For more information, see:
Assign users or a security group as ACLs to the Data Lake Store file system
https://aka.ms/Yqqghf
Note: Users who have the Owner RBAC role for the Data Lake Account are known as
“superusers” for ACL purposes. This means that they are not subject to any of the restrictions
imposed by ACLs, and always have full control over all files and folders, regardless of whether
they have been assigned Read, Write, or Execute privileges.
In the context of ACLs, the terms Read, Write, and Execute are a result of history (they have been inherited
from POSIX), and don’t necessarily mean exactly what their names imply. Read and Write permissions over
a file enable the specified user or group to read or write (append) the contents of that file. Execute
permission has no meaning over files in a Data Lake Store and is ignored. Read and Write permissions
over a folder work in conjunction with the Execute privilege and enable a user to read the contents of a
folder (this requires Read and Execute permissions), or to write to a folder (this requires Write and Execute
privileges). Execute permission by itself gives the user the ability to access a file in that folder and traverse
through the folder into subfolders—but not to actually list the contents of the folder. Only the owner of a
file or folder, or a superuser, can set the permissions for that file or folder.
For example, to read the file mydata.txt, located in the folder /folder1/folder2, you must have:
Execute permission on the root folder, folder1, and folder2 (to traverse the path).
Read permission on the mydata.txt file.
To list all the files in the folder2 folder, you must have:
Execute permission on the root folder and folder1.
Read and Execute permissions on the folder2 folder.
Note that, to list the folder contents, you don’t require any permissions on the mydata.txt file itself.
For a more detailed discussion on ACLs with Data Lake Store, see:
Like logging in other Azure services, Data Lake Store logging enables you to capture auditing and request
log information to three different sinks:
Storage account. This is useful for the batch processing of historic logs. A Blob storage account
needs to be either created or defined. When logging has been enabled, you download the log data
from the Diagnostics Log blade for the Data Lake Store in the Azure portal, or retrieve it directly from
the Blob storage account.
Event hub. This is useful in cases where you need to raise alerts for specific events in real time. For
example, you might run a Stream Analytics job that filters and processes this data, and perhaps incorporate
Machine Learning to spot unusual access patterns that signify an attempt to break in to the system
and steal data.
Log analytics. This destination is useful for operations teams who use Azure Operations Manager and
its dashboard capabilities.
For more information, see:
https://aka.ms/Pxudma
For the next phase of the project, you will use a range of tools to enable batch mode, and automated
operations, with Data Lake Store. You will also add security to your Data Lake Store, using custom ACLs
and data encryption that uses your own managed key.
Objectives
After completing this lab, you will be able to:
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
2. Install AdlCopy
Task 5: Use PowerShell to manage files and folders in Data Lake Store
Results: At the end of this exercise, you will have used PowerShell cmdlets to:
4. Use AdlCopy to copy the files from Blob storage to Data Lake Store
Task 4: Use AdlCopy to copy the files from Blob storage to Data Lake Store
14. Use PowerShell to attempt to overwrite a file after a folder permissions change
Task 3: Set guest user permissions for a Data Lake Store folder
Task 7: Use PowerShell to list the folder contents after a permissions change
Task 14: Use PowerShell to attempt to overwrite a file after a folder permissions
change
Results: At the end of this exercise, you will have created a guest user account in Azure AD, and then
tested the ability of this account to view folder contents, open files, and upload and update files in Data
Lake Store, depending on the specific permissions that are set. You will use the Azure portal to manage
the permissions, and use PowerShell as the guest user environment.
Exercise 4: Encrypt Data Lake Store data using your own key
Scenario
As you have already seen, Adatum are building a traffic surveillance system that will use Data Lake Store
for primary data storage. Therefore, you will also configure additional security for your Data Lake Store by
setting up data encryption using your own managed key. In this exercise, you will create a new Key Vault
and key, use this key to protect a new Data Lake Store, and then investigate the effects of key deletion
and key restoration on the ability to access data.
2. Configure a new Data Lake storage account to use Key Vault encryption
4. Back up the Key Vault key, and then delete the key
8. Lab closedown
Task 2: Configure a new Data Lake storage account to use Key Vault encryption
Task 3: Use Data Explorer to upload a file and verify its contents
Task 4: Back up the Key Vault key, and then delete the key
Results: At the end of this exercise, you will have created a new Key Vault and key, used this key to
encrypt a new Data Lake Store, uploaded data to the store, created a key backup and then deleted the
key. You will also have attempted to access and upload data after key deletion, and restored the deleted
key and verified data access.
Question: Why might you set a default permission entry on a folder in Data Lake Store?
Question: Is encryption using Key Vault the only way to encrypt data at rest properly in Data
Lake Store?
Module 5
Processing big data using Azure Data Lake Analytics
Contents:
Module Overview 5-1
Lesson 1: Introduction to Azure Data Lake Analytics 5-2
Module Overview
Microsoft® Azure® Data Lake Analytics (ADLA) provides a framework and set of tools that you use to
analyze data held in Microsoft Azure Data Lake Store, and other repositories. This module describes in
detail how ADLA works, and how you use it to create and run analytics jobs.
Objectives
At the end of this module, you will be able to:
Describe the purpose of Azure Data Lake Analytics, and how you create and run jobs.
Describe how to use windowing to sort data and perform aggregated operations, and how to join
data from multiple sources.
Lesson 1
Introduction to Azure Data Lake Analytics
Azure Data Lake Analytics (ADLA) is a platform-as-a-service (PaaS) offering that enables advanced
analytics and batch processing at scale on big data. At first glance, it might appear that ADLA and Azure
Stream Analytics fulfil similar roles. However, they are designed and optimized for different scenarios;
Stream Analytics is intended for processing large-scale streaming data sources, whereas ADLA is designed
to analyze big data at rest, ideally held in a Data Lake Store (although you can also retrieve data from
other sources).
Lesson Objectives
In this lesson, you will learn:
How to use Visual Studio® tools to package and submit ADLA jobs.
ADLA typically forms part of a larger solution in which data is ingested into storage, processed, stored in
an optimized form, and then delivered to consumers:
“Processing” defines the initial analytics and other transformations performed on the ingested data.
“Storage” describes the way in which the transformed data is held, optimized in structures to support
the operations and queries required by the business.
“Delivery” covers the way in which the data is used; it could be subjected to further detailed analysis,
joined with other datasets to provide additional insights, summarized to generate an overall view of
the system, and presented graphically, possibly with drill-down capabilities to support detailed
investigation. Some of these tasks might be implemented by performing an additional “Processing”
iteration.
ADLA is concerned with the processing phase, taking data that has been ingested into large-scale storage
(such as Data Lake Store), transforming it, and then processing it for analytics and data storage purposes.
Within the data processing phase, there are the concepts of “hot” and “cold” data streams. While Stream
Analytics is the best choice for real-time, inbound data analysis (hot data), ADLA is optimized for jobs that
take minutes, hours, or even days to run against cold data.
ADLA behavior is not dissimilar to that of a MapReduce programming model that you would find in
traditional Hadoop environments. In this model, low-level algorithms typically authored in Java create a
framework by which input data is separated into “chunks” that are processed independently and in
parallel—this reduces the overall execution time of a given job or jobs. ADLA tracks these process
mappings and can “stitch” them back together for the final output. This can also be thought of as its own
pipeline and represents a typical ADLA job at a high level. However, unlike many of the other options for
batch processing—for example, SQL Data Warehouse and Hadoop—ADLA is a true PaaS solution because
the interface is abstracted from the underlying distributed architecture. With ADLA, you focus on running
jobs, writing scripts, and managing processing tasks, while Microsoft provides the infrastructure for
building and executing jobs; you are insulated from the low-level details of ensuring that sufficient
resources are available to perform your tasks.
The tasks that run in parallel in each stage are referred to as “vertices” by ADLA. Each vertex is managed
and scheduled by using a distributed component called Yet Another Resource Negotiator (YARN). This is
the same component that is used by many Hadoop installations. YARN takes the responsibility for
handing tasks to the available computing resources, and retrieving the results from those resources. This
enables the system to decouple the resource management responsibilities of the platform from the
processing components; Microsoft might modify or replace the management component at any time
without affecting any existing U-SQL jobs.
Computing resources are allocated to jobs in Analytics Units (AUs). Each AU represents the processing
power of two CPU cores and 6 GB of RAM. The more AUs you have allocated to a job, the greater the
potential degree of parallelization, and the faster the job will run. However, each AU has an associated
cost, so increasing the number of AUs will have a financial impact.
Although ADLA splits jobs into vertices for the purposes of parallelism, individual vertices might take
longer to complete than one another. Therefore, if one aspect of the job—an extremely complex
calculation, for example—inherently takes 10 minutes to complete, then adding more AUs might not
decrease the time to completion. In this instance, you might need to take a closer look at algorithm
optimization. Adding additional AUs will enable ADLA to break up the totality of a job, but only if the
characteristics of the job allow for this.
For more information about how AUs are applied to vertices, see Module 6: Implementing Custom
Operations and Monitoring Performance in Azure Data Lake Analytics, which describes the tools that are
available for examining the way in which a job has been broken down into vertices.
Note: Important: The Data Lake Store account must be located in the same region as the
ADLA account. This is to minimize the time and costs required to access data.
Additionally, you specify the pricing tier for the account. The pricing tier determines how many AUs will
be available to run jobs.
You then create an ADLA job as a U-SQL script. The following example shows a simple U-SQL script that
reads stock market price data (tickers and prices) from a CSV file, calculates the maximum price for each
ticker, and saves the results to another CSV file.
U-SQL job that calculates the maximum price for a stock market ticker
@priceData =
EXTRACT Ticker string,
Price int,
HourOfDay int
FROM "/StockMarket/StockPrices.csv"
USING Extractors.Csv(skipFirstNRows: 1);
@maxPrices =
SELECT Ticker, MAX(Price) AS MaxPrice
FROM @priceData
GROUP BY Ticker;
OUTPUT @maxPrices
TO "/output/MaxPrices.csv"
USING Outputters.Csv(outputHeader: true);
After you have defined the work for the job, you submit it for processing. Again, the simplest way to
achieve this is to use the Azure portal. At this point, the U-SQL code is compiled and broken down into
stages and vertices. The vertices are then queued ready for submission by YARN to the ADLA processors.
You give the job a priority to determine which of your jobs should run ahead of others that you have
submitted. When a job reaches the front of the queue, the vertices are run. The number of ADLA
processors available to execute the vertices is governed by the number of AUs you assign to the job (you
specify this number when you submit the job—up to the value permitted by the pricing plan for the ADLA
account). The higher the number of AUs, the greater the parallelism and the faster the job runs. When the
vertices have completed their work, the results are combined and aggregated into the final result. The job
pane in the Azure portal enables you to track these tasks, view how the job is being processed, and
examine the results.
The following PowerShell script shows how to run the U-SQL job shown in the previous topic. The
StockPriceJob.usql file referenced by this example contains the U-SQL code:
# Specify a subscription
Set-AzureRmContext -SubscriptionId <Subscription ID>
# Submit the job (StockPriceJob.usql contains the U-SQL code shown above)
Submit-AzureRmDataLakeAnalyticsJob -Account "<ADLA account name>" -Name "StockPriceJob" -ScriptPath ".\StockPriceJob.usql"
Visual Studio provides another environment for building and running ADLA jobs, if you have the Data
Lake and Stream Analytics Tools for Visual Studio installed. These tools are available as part of Visual
Studio 2017—you can also download these tools for older versions of Visual Studio from the following
page:
Plug-in for Data Lake and Stream Analytics development using Visual Studio
https://aka.ms/Wds9j1
The Azure Data Lake Tools for Visual Studio provides a number of templates for building Data Lake
applications, including a U-SQL project template. Use this template to incorporate user-defined functions
and other code items into your jobs, and debug them more easily than you can by using the Azure portal. You
submit jobs directly from this template, single-step through your code, and track the progress of a job as
it runs. You can also run and debug jobs locally on your own computer rather than in the cloud. This
process is described in Module 6.
Run a job using the Data Lake Tools for Visual Studio.
Lesson 2
Analyzing data with U-SQL
U-SQL is the language that you use to describe an ADLA job. It is a nonprocedural language, and you
write U-SQL code that specifies the results that you want to see rather than the process to be performed
to obtain these results—that is the purpose of the ADLA compiler and the YARN scheduler. This lesson
describes how to use U-SQL to implement a job.
Lesson Objectives
After completing this lesson, you will be able to:
What is U-SQL?
U-SQL is the language that you use to implement
ADLA jobs. It represents a hybrid language that
takes features from both SQL and C#, and
provides declarative and procedural capabilities. If
you are from a development background, U-SQL
will present many concepts, constructs, and a
syntax with which you will be familiar. If you are
from an engineering or DBA background, you will
also benefit from U-SQL’s origins, which are based
on SQL and T-SQL. U-SQL abstracts the parallelism
and the distributed architecture from the scripts
that you create, making it simpler to write scripts
that perform complex tasks.
The intention of U-SQL is to provide a simple way to describe complex processing using SQL-like syntax,
combined with the ability to customize the way in which this processing works and transform data by
using procedural code. You extend the capabilities of U-SQL by implementing your own user-defined
functions, operators, and aggregators. This gives you the ability to implement analytical processes that
would be difficult to achieve using SQL alone. Module 6: Implementing Custom Operations and
Monitoring Performance in Azure Data Lake Analytics describes how to implement custom functionality in
more detail.
The hybrid nature of U-SQL demands careful attention when defining your jobs. U-SQL has the case
sensitivity of C#, even for keywords—for example, you must be careful to use “SELECT” and “FROM” rather
than “Select” and “From”. Additionally, U-SQL uses C# comparison operators and expressions. This means
that you must use operators such as “==” to test for equality, and not “=”.
The key abstraction used by U-SQL is the rowset. A rowset is a tabular structure containing unordered
rows of data. Each row in a rowset has the same schema that defines a set of columns, and each row can
be up to 4 MB in size. However, the number of rows in a rowset is potentially unlimited, and it is the task
of ADLA to determine how to divide up a rowset into manageable subsets for processing, and then
combine the results.
The columns in a rowset might be simple types, such as int, float, double, string, char, or DateTime, or
they can be complex structures such as maps and arrays (a map is a collection that provides key/value
lookup functionality).
Note: The simple data types available correspond to those used by C#, and include the
nullable types (types that hold null values).
Note: All statements in U-SQL code must finish with a semicolon (;) terminator.
EXTRACT <schema>
FROM <source>
USING <extractor>
The <schema> part lists the individual fields in the input data and the types of data that they contain.
The input data will typically comprise schema-less text information (in the form of CSV, TSV, or
possibly JSON text files). The <schema> section is intended to try to map the individual elements in
the data into identifiable items in each row in a rowset, and then convert the input data into types
that can be processed. The data types are specified as C# types, as described earlier. The following
code fragment shows an example, describing the fields in a company’s personnel data:
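@personnelData =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM "/PersonnelData/Personnel.csv"
USING Extractors.Csv(skipFirstNRows: 1);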
For detailed information about the type conversions performed by the EXTRACT section, see:
U-SQL built-in extractors
https://aka.ms/Su2pbo
The <source> part lists the input data files. There might be one or more of these files. In its simplest
form, the FROM clause names a file in the default Data Lake Store associated with the ADLA account.
For example:
FROM "/PersonnelData/PersonnelFile.csv"
However, you can also specify a file in another Data Lake Store account or Blob storage (you must
first register the Data Lake Store account or Blob storage account with the ADLA account, to provide
the necessary security settings and keys). Here are some examples:
// myadlsaccount is a separate ADLS account
FROM "adl://myadlsaccount.azuredatalakestore.net/Data/MoreData.csv"
// Blob storage
FROM "wasb://mycontainer@mystorageaccount/Data.csv"
The USING clause specifies an extractor you use to read the data from the input file. ADLA provides
three built-in extractors: Extractors.Csv (which handles CSV files), Extractors.Tsv (for TSV files), and
Extractors.Text (for reading more generalized text files). If you need to retrieve data held in a different
format (such as JSON), you implement your own custom extractor. Module 6 covers this topic in more
detail.
The built-in extractors take parameters that modify the way in which they work. For example, you can
switch the row delimiters used when an extractor parses a file, change the encoding in use, and
specify that certain rows (such as headers) should be skipped. For more details, see:
Extractor parameters (U-SQL)
https://aka.ms/K1pza1
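For example, the following fragment is a minimal sketch (the file path is illustrative) that uses the
generalized text extractor to read pipe-delimited data and skip a header row:
@data =
EXTRACT Ticker string,
Price int
FROM "/Input/PipeDelimitedPrices.txt"
USING Extractors.Text(delimiter: '|', skipFirstNRows: 1);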
The value returned by the EXTRACT statement is a reference to the rowset generated from the data.
You then use this reference to specify the processing to perform. The processing typically takes the
form of a series of SQL SELECT statements that incorporate user-defined functions, operators, and
aggregators. You might also use common SQL clauses, such as WHERE, ORDER BY, and GROUP BY.
There is also an IF … ELSE construct available. As before, you assign the rowset generated
as the result of the processing to a reference variable, which you then pass in to further SQL
statements. For example, the following code fragment finds the number of employees in each
department, and then refines this rowset to find all departments with more than 100 employees:
@personnelData =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM "/PersonnelData/Personnel.csv"
USING Extractors.Csv(skipFirstNRows: 1);
@numInEachDepartment = SELECT COUNT(*) AS Num, Department
FROM @personnelData
GROUP BY Department;
@bigDepartments = SELECT *
FROM @numInEachDepartment
WHERE Num > 100;
The OUTPUT statement describes where to send the results of the processing, and in what format. You
reference an outputter to save the data. You specify the location of the data with the TO clause—as with
the EXTRACT statement, this might specify a file in the default Data Lake Store, a separate Data Lake
Store, or Blob storage. ADLA provides built-in outputters for CSV, TSV, and generalized text data but, if
you require a different format, you should create your own custom outputters. Module 6 describes this
topic in more detail. Like extractors, the built-in outputters take parameters that modify the format of the
data, such as adding column headers. For further information, see:
The following example shows an outputter that sends the data to a CSV file in the default Data Lake Store.
Note that the folder mentioned does not have to already exist—the outputter will create it if necessary:
OUTPUT @bigDepartments
TO "/Departments/BigDepts.csv"
USING Outputters.Csv();
Note: Important: An outputter will overwrite a file that already exists, so be careful that
you don’t lose any important results. Additionally, outputters are atomic; they either write the
entire results rowset (on success) or they don’t write anything (if an error occurs during
processing). You should never get partial results.
The following example uses a scalar variable to avoid repeating the same location for Data Lake Store
storage in a job:
DECLARE @myAdlAccountString string = "adl://myadlsaccount.azuredatalakestore.net/Data/";
@data =
EXTRACT EmployeeID int,
ForeName string,
LastName string,
Department int
FROM @myAdlAccountString + "Personnel.csv"
USING Extractors.Csv();
@results = SELECT …;
OUTPUT @results
TO @myAdlAccountString + "Results.csv"
USING Outputters.Csv();
Note: Unlike T-SQL, you must create and initialize a variable in the same statement; you
can’t declare a variable and use the T-SQL SET statement to assign it a value later. Additionally,
you can’t declare two variables with the same name.
You create arrays and maps using the generic structured types SQL.ARRAY and SQL.MAP. The SQL.ARRAY
type holds a list of values (you specify the type), and you use subscript notation to store and retrieve data.
A SQL.MAP object holds a list of key/value pairs. When you add an item to the map, you assign it a
unique key. You look up items by providing the key.
Although you might define SQL.ARRAY and SQL.MAP variables with DECLARE statements, it’s more
common to create and populate them by using SELECT statements. The following code creates a
SQL.MAP object from the personnel data shown in the previous example. The map uses the employee ID
as the key, and the department number as the value. The second SELECT statement counts the number of
items in the map where the value is 1. This yields the number of employees who work in department 1:
@empMap =
SELECT new SQL.MAP<int, int?>{{EmployeeID, Department}} AS emp_dept
FROM @personnelData;
@query =
SELECT COUNT(*) AS Num
FROM @empMap
WHERE emp_dept.First().Value == 1;
For further information on using the SQL.ARRAY and SQL.MAP types, see:
An array or map is a single column that contains a collection of values. This is different from rowset data
where each column contains a single value, and you use multiple rows if you need to store multiple
values. If you combine the data in an array or map with a rowset, you must first convert the array or map
data into a set of separate rows (one for each value in the array or map). You achieve this using the
EXPLODE function.
The EXPLODE function takes an array or map as an argument, and generates an exploded rowset where
each row contains a single value from the array or map. However, you can’t use the rowset generated in
this way as an ordinary rowset; you only work with it using the CROSS APPLY operator. The CROSS APPLY
operator takes each item in the rowset on its left side and combines it with the corresponding values in
the rowset on its right side. Each row in the rowset from the left side is repeated for each matching row
on the right side. If there are no corresponding values in the rowset on the right side, the row from the
left side is omitted from the results (there’s also an OUTER APPLY operator that includes rows with missing
rowsets and combines them with a null value instead).
In the following example, the test data in the PriceMovement rowset represents the prices for different
stock market items over time (for simplicity, the time is not included in the data). The prices are held as a
single string for each item. The @result variable generates a new rowset where the prices for each item
are converted into an array. The @exploded rowset uses the EXPLODE function to convert the array in
each row into a new rowset called Temp, with a single column named Price. The CROSS APPLY operator
then combines the data in the Temp rowset with the data in the @result variable. The final rowset
contains the ticker (from the @result rowset) repeated for each price from the Temp rowset:
@result =
SELECT Ticker,
new SQL.ARRAY<string>(Prices.Split(',')) AS PricesArray
FROM @stockPrices;
@exploded =
SELECT Ticker, Price.Trim() AS Price
FROM @result
CROSS APPLY
EXPLODE(PricesArray) AS Temp(Price);
OUTPUT @exploded
TO "/Output/StockPrices.csv"
USING Outputters.Csv();
The output looks like this:
"AAAA","20"
"AAAA","21"
"AAAA","20"
"AAAA","19"
"AAAA","18"
"AAAA","19"
"AAAA","20"
"AAAA","22"
"AAAA","25"
"AAAA","22"
"AAAA","28"
"AAAA","27"
"BBBB","56"
"BBBB","58"
"BBBB","60"
"BBBB","65"
"BBBB","64"
"BBBB","63"
"BBBB","62"
"CCCC","77"
"CCCC","76"
"CCCC","74"
"CCCC","72"
"CCCC","68"
"CCCC","65"
"CCCC","67"
"CCCC","68"
"DDDD","45"
"DDDD","46"
"DDDD","45"
"DDDD","44"
"DDDD","43"
"DDDD","45"
"DDDD","44"
"DDDD","46"
"DDDD","47"
"DDDD","45"
"EEEE","1"
"EEEE","3"
"EEEE","6"
"EEEE","11"
"EEEE","15"
"EEEE","12"
"EEEE","15"
"EEEE","14"
"EEEE","15"
The FROM clause of an EXTRACT statement can also specify multiple input files. For example:
// Multiple files (explicit list)
FROM "/PersonnelData/PersonnelFile.csv",
"adl://myadlsaccount.azuredatalakestore.net/Data/MoreData.csv"
// Multiple files (using file-name pattern matching – all CSV files in the specified directory,
// beginning with a "P")
FROM "/PersonnelData/P{*}.csv"
However, note that all files must have the same input format (such as CSV, TSV) and should contain data
that maps to the same set of fields. For more information, see:
If you have multiple input files of different formats, you must use multiple EXTRACT statements, and
process the files for each EXTRACT statement separately.
A useful feature of the way in which input filename matching operates concerns the ability to reference
virtual columns in the EXTRACT statement. For example, consider that Data Lake Store might hold a vast
number of files, and that one way to organize these files is to split them out into directories based on the
year, month, day, and hour in which the files were created (this approach is commonplace for systems that
generate files by streaming, such as Stream Analytics). If you need to process a set of files for a specific
period, you add a virtual column to the EXTRACT statement that identifies the period in question, and
then populate this column in the SELECT statement that processes the data. The EXTRACT statement
parses the value of this virtual column, and uses it to identify the directories and files from which it
should fetch the data.
The following example shows an EXTRACT statement that fetches sales data held in CSV format from the
folder for a specific date. The CSV file only holds the productID, numSold, and pricePerItem fields; the
values for the date and filename virtual columns are determined by the WHERE clause of the SELECT
statement:
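// The date and fileName virtual columns are populated from the directory and file names
// (the folder layout shown here is illustrative)
@salesData =
EXTRACT productID int,
numSold int,
pricePerItem decimal,
date DateTime,
fileName string
FROM "/Sales/{date:yyyy}/{date:MM}/{date:dd}/{fileName}"
USING Extractors.Csv();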
@data =
SELECT productID, numSold, pricePerItem
FROM @salesData
WHERE date == DateTime.Today AND fileName LIKE "%.csv";
The following example generates a series of files containing the results of a U-SQL job. Each file contains
the data from a single processing node:
@data =
SELECT …
FROM @salesData
WHERE year == DateTime.Today.Year
AND month == DateTime.Today.Month
AND day BETWEEN 20 AND 30
AND fileName LIKE "%.csv";
OUTPUT @data
TO "/SalesOutputs/{*}.csv"
USING Outputters.Csv();
Creating and populating a U-SQL Catalog database and table in a U-SQL job
// Recreate the database for holding Personnel data, removing any previous version
DROP DATABASE IF EXISTS Personnel;
CREATE DATABASE IF NOT EXISTS Personnel;
// Create table
CREATE TABLE IF NOT EXISTS Personnel.dbo.PersonnelDept
(
EmpID int,
FName string,
LName string,
Dept int,
INDEX EmpIdx CLUSTERED (EmpID ASC) DISTRIBUTED BY HASH (EmpID)
);
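You then populate the table from a rowset by using an INSERT statement. The following is a minimal
sketch that writes the @personnelData rowset from the earlier examples to the new table:
INSERT INTO Personnel.dbo.PersonnelDept
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData;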
Note: Important: You can’t INSERT into a table and read from it within the same U-SQL job.
This is because the INSERT and read operations could be spread across different vertices and be
performed in parallel; the results would be unreliable due to the possibility of data mutating as it
is read.
A table can have an unlimited number of rows, in addition to multiple indexes over various columns. To
help maintain performance, ADLA uses statistics to optimize queries performed against tables. These
statistics enable ADLA to determine the most efficient way to retrieve data from a table using the indexes
available; one index might be more suitable for satisfying a query than another, or ADLA might choose to
use multiple indexes to retrieve data and merge the results together, for example. You create the statistics
for a table using the CREATE STATISTICS command. The statistics are static, and if you change the
distribution of data in a table (by dropping or adding a large number of rows), they might become out of
date, resulting in poorer query performance. In this case, you use the UPDATE STATISTICS command to
refresh the statistics. The “WITH INCREMENTAL = ON” clause causes the statement to recompute the
statistics only for new rows that have been added since the statistics were created or last updated. If you
specify “WITH INCREMENTAL = OFF”, the command will drop the existing statistics for the table and
generate a completely new set. This could take a significant time over a large table.
The following examples illustrate the syntax for the CREATE STATISTICS and UPDATE STATISTICS
commands:
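// The statistics name (DeptStats) is illustrative
CREATE STATISTICS IF NOT EXISTS DeptStats
ON Personnel.dbo.PersonnelDept (Dept)
WITH FULLSCAN;
UPDATE STATISTICS DeptStats
ON Personnel.dbo.PersonnelDept
WITH INCREMENTAL = ON;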
It’s possible to create views in the catalog, in addition to tables. A view is a persistent, named query. A
view doesn’t hold any data of its own; instead, it references a SELECT statement that retrieves data from
one or more tables. You retrieve data from a view using a SELECT statement, in much the same way as
you fetch data from a table. Note that a VIEW is different from a rowset variable because you can reuse it
across multiple jobs without having to define it every time.
You create a view using the CREATE VIEW command. This example creates a simple view that limits the
employees to those who work in Department 2:
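// The view name is illustrative
CREATE VIEW IF NOT EXISTS Personnel.dbo.Dept2Employees
AS SELECT EmpID, FName, LName, Dept
FROM Personnel.dbo.PersonnelDept
WHERE Dept == 2;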
Note: There are currently no specific security commands for the U-SQL Catalog. Instead,
you should use ADLA access control lists (ACLs) and RBAC to protect the data in the /catalog
folder in the ADLA account. For more information, see Module 4: Managing Big Data in Azure
Data Lake Store.
The following example illustrates the use of the string Split function, to parse a string that contains
elements separated by the semicolon (;) character into its individual components. Note that the Split
function returns an array that is stored in a SQL.ARRAY object. The second SELECT statement retrieves the
rows from the array and outputs the individual elements as fields:
// Split the data into an array, with each row containing title, author, and year of publication for each book
@splitData =
SELECT new SQL.ARRAY<string>(Books.Split(';')) AS BookData
FROM @someBooks;
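// The field names (Title, Author, YearPublished) are illustrative
@bookDetails =
SELECT BookData[0] AS Title,
BookData[1] AS Author,
BookData[2] AS YearPublished
FROM @splitData;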
All C# operators, with the exception of the assignment operators, are available. This includes the more
esoteric but extremely powerful items such as the ternary conditional operator (?:), the null coalescing
operator (??), and the Lambda expression operator (=>).
The ternary conditional operator is a succinct form of the if..else construct; the first operand, which must
be a Boolean expression, is evaluated—if it is true, the value of the second expression is returned as the
result; otherwise the value of the third expression is calculated and used. The following code shows an
example:
@result =
SELECT Name, Grade >= 10 ? "Senior Management" : "Worker" AS JobLevel
FROM @data;
The null coalescing operator examines its first operand, and if it is null, returns the value of the second
operand—otherwise, it uses the value of the first operand:
@result =
SELECT Name, Grade ?? 0 AS JobGrade
FROM @data;
You use Lambda expressions to create custom delegate functions for methods that support late-binding
of functions as arguments to expressions. Some library functions in the .NET Framework expect you to
provide your own C# code, which is run as part of the function call. An example of this is the Max function
of the SQL.ARRAY type. The Max function returns the biggest item in the array, but you have to provide
the code that defines how the value used for the comparison is obtained. In the following example, the Max
function is used with a Lambda expression to convert a string containing a numeric value into a number.
The Max function will then return the item with the highest numeric value. Without performing this
conversion, the maximum prices for each stock would be calculated based on the character values of each
string rather than the numbers that they represent:
@result =
SELECT Ticker,
new SQL.ARRAY<string>(Prices.Split(',')) AS PricesArray
FROM @stockPrices;
@highestPrices =
// Use a Lambda expression to convert the string value of each element in the array to a number
SELECT Ticker, PricesArray.Max(p => Convert.ToInt32(p)) AS HighestPrice
FROM @result;
OUTPUT @highestPrices
TO "/Output/HighestPrices.csv"
USING Outputters.Csv();
The results look like this:
"AAAA",28
"BBBB",65
"CCCC",77
"DDDD",47
"EEEE",106
For a complete list of the operators (C# and SQL) that you might use in U-SQL code, see:
Operators (U-SQL)
https://aka.ms/E1c5o8
Occasionally, there might be some ambiguity between a C# function and a SQL function with the
same name. Additionally, some C# constants might cause the U-SQL compiler to fail. This is
because the U-SQL compiler assumes all uppercase words in a statement are U-SQL reserved
words; you can’t create your own functions or variables that use completely uppercase letters.
However, C# defines some of its own constants as uppercase, the most common example being
Math.PI. A SELECT statement such as the following—that calculates the volume of a sphere with a
given radius—will fail to compile:
SELECT 4.0/3 * Math.PI * Math.Pow(radius, 3) AS Volume
FROM ...
You prevent the U-SQL parser from parsing an expression by applying the CSHARP operator to
the expression, as shown here:
SELECT 4.0/3 * CSHARP(Math.PI) * Math.Pow(radius, 3) AS Volume
FROM ...
Use an EXTRACT statement to combine multiple job inputs using virtual fields.
Perform parallel data saving to a Lake Store catalog and CSV file.
Lesson 3
Sorting, grouping, and joining data
Many analyses require you to aggregate and summarize data. This also frequently means that you
combine data from different sources. These sources could be unstructured files, structured databases, or a
mixture of both. In this lesson, you will learn how to use U-SQL to sort, group, and combine data.
Lesson Objectives
After completing this lesson, you will be able to:
Sorting data
Sorting is a fundamental data processing operation you use to present and handle data in a specific
sequence. U-SQL provides the ORDER BY clause to perform this task. The following example sorts
personnel data by department when writing the results:
@result =
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData;
OUTPUT @result
TO "/Output/PersonnelFile.csv"
ORDER BY Department DESC
USING Outputters.Csv();
The sorted output looks like this:
102,"xzcvxvn","bemnyu",3
103,"iuywe","bvche",3
101,"slkjlkjlkj","nbmnbmn",2
99,"jhgjhgjh","iuoioui",1
100,"oiuoiu","poipoi",1
104,"quytue","lkjo",1
Apart from not guaranteeing to preserve the sequence of the data, using ORDER BY in a SELECT
statement has other implications. For example, if you have a large set of data, ordering the data while
processing it can become a very time consuming operation. To limit the effects of such a potentially costly
task, you can only use ORDER BY in a SELECT statement in conjunction with the OFFSET/FETCH clauses,
which restrict the sort to a defined subset of the data.
The FETCH clause specifies a number of rows to retrieve, and the optional OFFSET clause indicates from
where to start fetching the data (the default OFFSET is 0). The following code sorts the data in the
@personnelData rowset from the previous example, but only fetches three rows, starting at the third row
(OFFSET 2 means discard the first two rows):
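@result =
SELECT EmployeeID, ForeName, LastName, Department
FROM @personnelData
ORDER BY Department DESC
OFFSET 2 ROWS
FETCH 3 ROWS;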
Note that the FETCH clause curtails the SELECT operation; after the specified number of rows has been
fetched, no further data is retrieved from the rowset or passed on to subsequent steps in the processing.
For example, if the @result rowset is
output, it will look like this:
102,"xzcvxvn","bemnyu",3
101,"slkjlkjlkj","nbmnbmn",2
104,"quytue","lkjo",1
For further information on using ORDER BY with OFFSET and FETCH, see:
Grouping data
U-SQL supports the GROUP BY and HAVING clauses of SQL to calculate aggregate values across a set of
rows, and limit the output based on the results of an aggregated calculation.
Note: You create your own custom aggregate functions with U-SQL—this is described in
Module 6.
Remember that, when you use an aggregate function over one or more columns in a rowset, you must
use GROUP BY over any remaining scalar columns. The following example uses the previously used
personnel dataset to calculate the number of employees in each department. Note that you must provide
an alias, using the AS clause, for expressions calculated by using aggregate functions:
@result =
SELECT Department, COUNT(EmployeeID) AS NumEmployees
FROM @personnelData
GROUP BY Department;
OUTPUT @result
TO "/Output/DepartmentFile.csv"
USING Outputters.Csv();
If you need to limit which groups are returned, you use the HAVING clause to provide a filter. HAVING
acts like a WHERE clause, but it only applies to aggregates, as shown in the next example:
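@result =
SELECT Department, COUNT(EmployeeID) AS NumEmployees
FROM @personnelData
GROUP BY Department
HAVING COUNT(EmployeeID) > 100;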
By default, the aggregate functions include duplicate values in their calculations. You exclude duplicates
by including the DISTINCT keyword, like this:
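@result =
SELECT Department, COUNT(DISTINCT LastName) AS NumDistinctSurnames
FROM @personnelData
GROUP BY Department;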
Windowing data
Windowing in U-SQL gives you a way to partition data for processing. A window is a set of data, a little
like that defined by a GROUP BY clause—you use the same aggregate functions over a window as you do
over a group. However, unlike GROUP BY, which reduces data into groups for calculating aggregate
values, the aggregated results in a window are available to every row in the window; the window defines
the scope of the aggregation.
Using an ordinary GROUP BY statement to calculate average prices per stock item
@stockPriceData =
EXTRACT Ticker string,
Price int,
Time int
FROM "/Input/StockPrices.csv"
USING Extractors.Csv(skipFirstNRows: 1);
@avgPrices =
SELECT Ticker, AVG(Price) AS AvgPrice
FROM @stockPriceData
GROUP BY Ticker;
The output looks like this:
"MQTZ",21
"NARN",19
"NBG5",4
"NEQ4",5
"NMIN",56
The first column contains the ticker, and the second is the average price.
Now suppose you need to display the ticker, current price, and average price. To do this, you generate an
intermediate rowset with the average prices, perform another query to find all the current prices, and then
join the two rowsets over the ticker. However, using a window, you perform all of this processing in a
single query.
To define a window, you use the OVER operator when using an aggregate function. This operator
specifies how you should partition the data for the aggregation. The function is performed once for each
partition, but the results are made available to every row in the partition. The result is that you combine
aggregated values and nonaggregated columns together in the same query. This example uses the same
data as before, but defines a window over the ticker column, so data is grouped by ticker and the average
price is calculated for each ticker. This time, however, the average price is displayed as part of the data
for each row:
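@pricesWithAvg =
SELECT Ticker,
Price,
AVG(Price) OVER (PARTITION BY Ticker) AS AvgPrice
FROM @stockPriceData;
The output now includes the price from each row, together with the average for the ticker: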
"MQTZ",7,21
"MQTZ",0,21
"MQTZ",26,21
"MQTZ",16,21
"MQTZ",42,21
"MQTZ",27,21
"NARN",25,19
"NARN",17,19
"NARN",18,19
"NARN",21,19
"NARN",19,19
"NARN",0,19
"NARN",17,19
"NMIN",21,56
"NMIN",69,56
"NMIN",25,56
"NMIN",0,56
"NMIN",49,56
"NMIN",66,56
"NMIN",29,56
"NMIN",20,56
"NMIN",85,56
You modify the PARTITION BY clause of the OVER operator to change the extent of the window. For
example, you might reduce the size of the window around each row as it is processed by using the ROWS
clause. The next example partitions the data by ticker, but calculates a rolling average of the price over
the current row and the preceding 10 rows in the partition. Note that, if you window data like this, you
must also order the data in some way:
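@rollingAvgPrices =
SELECT Ticker,
Price,
AVG(Price) OVER (PARTITION BY Ticker
ORDER BY Time
ROWS BETWEEN 10 PRECEDING AND CURRENT ROW) AS RollingAvgPrice
FROM @stockPriceData;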
There are other options available; for instance, you might specify following rather than preceding rows.
For more information, see:
U-SQL supplies ranking functions, RANK, ROW_NUMBER, NTILE, and DENSE_RANK, that you use to
determine the ranking value of each row in a partition. For example, the RANK function returns the order
in which an item in the partition is ranked according to the order of the data in the partition.
The following code sorts stock records by price, and displays the rank for each row. Rows with the same
price will have the same ranking value.
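// The third and fourth columns in the output below are produced by RANK and DENSE_RANK
@rankedPrices =
SELECT Ticker,
Price,
RANK() OVER (PARTITION BY Ticker ORDER BY Price) AS PriceRank,
DENSE_RANK() OVER (PARTITION BY Ticker ORDER BY Price) AS DensePriceRank
FROM @stockPriceData;
The output for the NARN ticker looks like this: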
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",0,1,1
"NARN",1,19,2
"NARN",1,19,2
"NARN",1,19,2
"NARN",1,19,2
"NARN",2,23,3
"NARN",2,23,3
"NARN",2,23,3
"NARN",2,23,3
"NARN",3,27,4
"NARN",4,28,5
"NARN",4,28,5
"NARN",5,30,6
"NARN",5,30,6
"NARN",5,30,6
...
Notice that rows with the same stock price receive the same ranking value. Subsequent ranks take into
account the number of items already ranked; if there are 18 items with rank 1, the 19th item has rank 19. If
you want to rank data without any gaps in the ranking sequence, use the DENSE_RANK function instead.
LEAD and LAG. Enable access to a row that either follows (LEAD) or precedes (LAG) the current row.
You specify the number of rows ahead (or behind) to look from the current row. These functions are
useful if you need to examine the data in the next or previous rows while examining the current row.
The following example uses the LAG function to retrieve the data in the Price column from the previous
row in the window together with the current row. If there is no previous row, it returns the default
value -1. Note that, in this case, you do not need to explicitly limit the data referenced by the ORDER BY clause
(usually, you would need to provide a FETCH/OFFSET clause when using ORDER BY in a SELECT statement,
as described earlier in this lesson).
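@pricesWithPrevious =
SELECT Ticker,
Price,
LAG(Price, 1, -1) OVER (PARTITION BY Ticker ORDER BY Time) AS PreviousPrice
FROM @stockPriceData;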
U-SQL supports several types of join between rowsets, including the following:
Inner join. Only rows that have matching keys in both rowsets are included in the results.
Outer joins. In an outer join, if a row has a key with no corresponding row in the other rowset, it is
combined with a dummy row containing null values and output. There are three variants:
Left (in which the rowset that is mentioned first is joined with null rows, but not the other way
around).
Right (in which the rowset that is mentioned second is joined with null rows, but not the other way around).
Full (in which both rowsets are joined with null rows).
Cross join. This type of join generates the Cartesian product of both rowsets. Each row in the first
rowset is combined with every row in the other. If the first rowset contains N rows, and the second
contains M rows, the resulting rowset contains N * M rows.
This example shows an inner join between product and sales data for a retail outlet. The product rowset
contains information such as the product name, description, and a unique identifier (productID; the key).
The sales rowset contains information about each purchase made for the product. The ON clause joins the
two rowsets over the productID column.
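// Test product data (values match the results shown below)
@productData =
SELECT * FROM
( VALUES
(1, "sprocket", 10),
(2, "flange", 15),
(3, "widget", 26),
(4, "dibber", 31),
(5, "grommet", 35),
(6, "baldrick", 18)
) AS Products(ProductID, ProductName, UnitPrice);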
@salesData =
SELECT * FROM
( VALUES
(100, 1, 20),
(101, 4, 2),
(102, 4, 3),
(103, 5, 10)
) AS Purchases(PurchaseID, ProductID, NumSold);
@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice, S.NumSold
FROM @productData AS P
INNER JOIN @salesData AS S
ON P.ProductID == S.ProductID;
OUTPUT @result
TO "/Output/ProductSalesFile.csv"
USING Outputters.Csv();
The joined results look like this:
1,"sprocket",10,20
4,"dibber",31,3
4,"dibber",31,2
5,"grommet",35,10
This next example uses a left outer join to find all products that are yet to be purchased (there are no sales
records for them). This type of query could be used to identify products that might be worth discontinuing. In
this case, the two rowsets are joined over the ProductID column as before but, for products that have no
sales, the value in the ProductID column in the @salesData rowset will be null.
Using a left outer join to find products that have never been purchased
@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice, S.NumSold
FROM @productData AS P
LEFT JOIN @salesData AS S
ON P.ProductID == S.ProductID
WHERE S.ProductID == null;
This time, the results look like this. Note that there is no NumSold data:
2,"flange",15,
3,"widget",26,
6,"baldrick",18,
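Semijoin. This type of join returns the rows from one rowset that have a matching key in the results of
a subquery; each matching row appears only once, however many matches it has. There are two
variants, left and right.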
Antisemijoin. This is similar to the semijoin except that it finds all rows in the rowset that don’t have
a match in the results of a subquery. Again, there are two variants—left and right.
This example shows a left semijoin to find all products that have been purchased at least once:
Using a left semijoin to find all products that have been purchased at least once.
@result =
SELECT P.ProductID, P.ProductName, P.UnitPrice
FROM @productData AS P
LEFT SEMIJOIN (SELECT ProductID
FROM @salesData) AS S
ON P.ProductID == S.ProductID;
Note that, in contrast with the first example, each product appears only once regardless of how many
sales it is responsible for:
1,"sprocket",10
4,"dibber",31
5,"grommet",35
For more information about joins, see:
U-SQL also supports PIVOT and UNPIVOT operations. The PIVOT operation transforms a set of values in a
specified column in a rowset into a series of columns, and uses a specified aggregation to calculate the
values for the data for these columns. UNPIVOT performs the opposite task, taking the data for a set of
columns and converting them into rows. For detailed information, see:
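To run a federated query against a remote Azure SQL Database, you first create objects in the ADLA
catalog that describe how to connect to the remote database:
1. Using PowerShell, create a credential in the ADLA catalog that holds the login details for the remote
database (for example, by using the New-AzureRmDataLakeAnalyticsCatalogCredential cmdlet).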
2. Using U-SQL, create a data source that references the remote SQL Server database using this
credential. The following code illustrates the syntax:
CREATE DATA SOURCE IF NOT EXISTS <Name of new data source>
FROM AZURESQLDB
WITH (
PROVIDER_STRING = "Database=<name of SQL Server database on the remote server>;Trusted_Connection=False;Encrypt=True;",
CREDENTIAL = <Name of the credential created by using PowerShell>,
REMOTABLE_TYPES = (bool, byte, sbyte, short, int, long, decimal, float, double, string, DateTime)
);
3. Retrieve data from the remote database by using the FROM EXTERNAL clause:
@results =
SELECT <column 1>, <column 2>, …
FROM EXTERNAL <Name of data source> LOCATION "<Table in remote database>";
https://aka.ms/Sqx2d7
https://aka.ms/Iy8870
Use a federated query to join data from SQL Database with data retrieved from the ADLA catalog.
For this phase of the project, you are going to use Data Lake Analytics to calculate the average speeds
detected by speed cameras, use data joins within a Data Lake Analytics job to generate speeding notices
linked to vehicle owner information, scale up the system to use Data Lake Analytics with speed camera
data stored in the cloud in Azure SQL Database, and use Power BI to present the average speeds using
map visualizations.
Objectives
After completing this lab, you will be able to:
Use Data Lake Analytics to categorize data and present results using Power BI.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
Task 1: Prepare data files for a local Data Lake Analytics instance
Task 5: Prepare data files for a cloud Data Lake Analytics instance
Results: At the end of this exercise, you will have prepared data files for a local Data Lake Analytics
instance, created a new Data Lake project, and run this Data Lake Analytics job locally. Also, you will have
created a Data Lake Analytics account, prepared data files for a cloud Data Lake Analytics instance, run
this Data Lake Analytics job in the cloud—then edited the job to add rolling averages, and run the
updated job.
Task 1: Prepare data files for a local Data Lake Analytics instance
Task 6: Run the updated Data Lake Analytics job in the cloud
Results: At the end of this exercise, you will have prepared data files for a local Data Lake Analytics
instance, created a new Data Lake project and tested this job locally, then modified the job to use joins,
and tested the job locally before running it in the cloud.
4. Configure a Visual Studio-based Data Lake Analytics job to use stored credentials
5. Run a Data Lake Analytics job using stored credentials to access data in SQL Database
Task 3: Store a SQL Database credential in the Data Lake Analytics catalog using
PowerShell
Task 4: Configure a Visual Studio-based Data Lake Analytics job to use stored
credentials
Task 5: Run a Data Lake Analytics job using stored credentials to access data in SQL
Database
Results: At the end of this exercise, you will have created a SQL Database, uploaded data to SQL Database
and added an index. You will also have stored a SQL Database credential in the Data Lake Analytics
catalog using PowerShell, configured a Data Lake Analytics job to use stored credentials, and run this job
using stored credentials to access data in SQL Database.
Exercise 4: Use Data Lake Analytics to categorize data and present results
using Power BI
Scenario
You need to be able to present average speed data on a digital map, so that it’s easy to see traffic
patterns across a city area. In this exercise, you will generate analytics for speed cameras, showing the
number of cars that passed each camera in different speed “buckets” (< 10 mph, 10-29 mph, 30-49 mph,
50-69 mph, 70-99 mph, and 100+ mph); this analysis should incorporate current and historical data from
the speed cameras. You will then visualize the data using the ArcGIS map control in Power BI.
2. Test the Data Lake Analytics job locally then run it in the cloud
4. Lab closedown
Task 2: Test the Data Lake Analytics job locally then run it in the cloud
Results: At the end of this exercise, you will have created a new Data Lake project, tested the job locally,
run it in the cloud, and then used Power BI to visualize the data.
The purpose of Azure Data Lake Analytics, and how you create and run jobs.
Module 6
Implementing Custom Operations and Monitoring
Performance in Azure Data Lake Analytics
Contents:
Module Overview 6-1
Module Overview
The built-in functionality available with U-SQL with Microsoft® Azure® Data Lake Analytics (ADLA) acts
as a powerful platform for performing common analytical operations. Additionally, the ability to use C#
functions and operators inline within U-SQL expressions adds further flexibility. However, there might be
times when you need functionality that cannot be easily implemented by using U-SQL or simple inline C#
expressions. Perhaps your data is in a format that is different from that used by the built-in extractors—or
maybe you need to implement a custom analytics function. ADLA has extension points that you use to
incorporate your own features into jobs.
Another key aspect of ADLA concerns managing and optimizing jobs. An ADLA job can consume
considerable resources, so you might need to control which users actually use an ADLA account. You
should also ensure that your jobs run in as optimal a manner as possible, both in terms of time taken and
resources utilized. Therefore, you need to understand how to monitor jobs and the options available for
tuning them.
Objectives
In this module, you will learn how to:
Lesson 1
Incorporating custom functionality into analytics jobs
You should consider the functionality built into ADLA through U-SQL as a starting point for performing
analytics operations rather than a definitive set of tools. Think of ADLA as a framework rather than a
solution (a little like Hadoop or Spark). You then use this framework to implement the analytical features
and processes required by your organization.
Lesson Objectives
After completing this lesson, you will be able to:
Describe how to create custom extensions that incorporate code written using the .NET Framework.
Create user-defined operators that perform various tasks during the processing life cycle of a job.
ADLA incorporates the CLR, enabling you to run .NET Framework code as part of a U-SQL job. However,
you must compile your code into an assembly, and arrange for the assembly to be deployed as part of the
job. There are two ways to do this:
1. By using a code-behind file that contains your .NET Framework code (typically written using C#). The
Azure Data Lake Tools for Visual Studio support this development technique—they automatically
create an assembly from your code, add it to the job, and then arrange for the assembly to be
removed from memory when the job has completed. The advantage of this approach is that it is
quick and easy. The primary disadvantage is that it limits the reusability of your .NET Framework
code; if you want to include the same routine in several U-SQL jobs, you must copy the source code
for that routine into each job. This can make maintenance difficult.
2. By using Visual Studio® to write your code and compiling it into a separate assembly. You manually
arrange for the assembly to be uploaded to your ADLA account, and add U-SQL meta-statements
that reference the assembly, so that the job knows where to find it. In this way, you reuse the same
assembly across multiple jobs.
Assemblies are stored in the U-SQL catalog held in the underlying Data Lake Store account. You secure
them as you would any other item stored in the U-SQL catalog.
For more information, see Module 5: Processing Big Data using Azure Data Lake Analytics.
For more information about building and deploying assemblies for U-SQL jobs, see:
Using assemblies
https://aka.ms/Rry3zd
Although the built-in extractors cover many basic cases, you will likely find that your data is frequently held in a
different format, such as JSON or XML. In this case, you will need to create a custom extractor. To do this,
you implement your own class that inherits from the IExtractor abstract class. The IExtractor class defines a
single abstract method named Extract that you should override:
public override IEnumerable<IRow> Extract(IUnstructuredReader input, IUpdatableRow output)
The purpose of this method is to retrieve the data from the input source, parse it, and pass it back
one row at a time. The value returned should be the data for the next available row. This method is called
by an enumerator in the U-SQL runtime, which is responsible for requesting each row and passing the
rows off for processing.
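The Extract method takes two parameters: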
A reference to an IUnstructuredReader object that is used to read the raw input data, in whatever
format it is supplied (JSON, XML, and so on). This object supplies a property named BaseStream that
represents the input data as a stream—you use a StreamReader object to read the data from this
stream. Your code should then parse this data and use it to create a series of rows that will be passed
to U-SQL.
An IUpdatableRow object that represents a row of data that has been read in using the
IUnstructuredReader object. The value returned is an enumerable collection of these objects.
You are free to implement the Extract method in your own way, according to the format of the data files
from which you are extracting data. For more detailed information and examples on creating custom
extractors, see:
User-defined Extractors
https://aka.ms/Qgy6cb
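After you have registered the assembly containing your extractor, a U-SQL job references the assembly
and invokes the extractor in the USING clause of an EXTRACT statement. The following is a minimal
sketch; the CustomExtractors assembly and JsonExtractor class are hypothetical names:
REFERENCE ASSEMBLY CustomExtractors;
@data =
EXTRACT Ticker string,
Price int
FROM "/Input/StockPrices.json"
USING new CustomExtractors.JsonExtractor();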
Optimizing an extractor
If the format of the input file contains multiple distinct input records, with each record on a single line, the
U-SQL runtime can split the file into pieces and read in each piece in parallel using multiple vertices.
However, if your input file consists of a single item (such as a JSON array or XML document), then the
data can only be extracted by a single vertex. You specify whether the file format supports splitting by
applying the [SqlUserDefinedExtractor(AtomicFileProcessing = false)] attribute to your extractor class
(setting this property to true informs the U-SQL runtime that it should treat the file as a single,
indivisible item).
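To create a custom outputter, you implement a class that inherits from the IOutputter abstract class, and
override its Output method. This method takes two parameters: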
An IRow object that contains the data for a row to be output. You retrieve the individual columns
from the IRow object by using the Get<> generic method, and specifying the name of the column.
You obtain the details of each column (such as its name and type) by reading the Schema property of
the IRow object. This property contains a collection of IColumn objects, and each IColumn object
describes a single column.
An IUnstructuredWriter object that represents the destination. The BaseStream property of this
object contains an output stream that you use to write data to the destination.
The IOutputter class also provides a Close method that you can override to close the destination and
release resources, if necessary.
The following code shows an example of a simple outputter that saves data as an XML document. The
outputter has been simplified; a production version would use techniques such as reflection to support
any data type rather than the limited set available here:
// The constructor specifies the XML tags to use to wrap the data
// (defaults to <Data><Row>...</Row><Row>...</Row>...</Data>)
public SimpleXMLOutputter(string xmlDocType = "Data", string xmlRowType = "Row")
{
this.xmlDocType = xmlDocType;
this.xmlRowType = xmlRowType;
this.outputWriter = null;
}
// Iterate through the columns in the row and convert the data to XML encoded strings
StringBuilder rowData = new StringBuilder($@"<{xmlRowType}>");
foreach (var column in columnSchema)
{
rowData.Append($@"<{column.Name}>");
// This outputter currently only recognizes int, double, string, and DateTime data.
if (column.Type == typeof(int))
rowData.Append($@"{input.Get<int>(column.Name)}");
if (column.Type == typeof(double))
rowData.Append($@"{input.Get<double>(column.Name)}");
if (column.Type == typeof(string))
rowData.Append($@"{input.Get<string>(column.Name)}");
if (column.Type == typeof(DateTime))
rowData.Append($@"{input.Get<DateTime>(column.Name)}");
rowData.Append($@"</{column.Name}>");
}
rowData.Append($@"</{xmlRowType}>");
rowData.Append(Environment.NewLine);
// Write the document end tag, flush any remaining buffered output, and then close the destination
public override void Close()
{
@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 100000, 1),
(100, "oiuoiu", "poipoi", 115000, 2),
(101, "slkjlkjlkj", "nbmnbmn", 250000, 3),
(102, "xzcvxvn", "bemnyu", 33800, 1),
(103, "qutii", "uyfiu", 0, 2),
(104, "sakak", "lkpoi", 190000, 3),
(105, "kjsakjk", "cvnbmnx", -500, 1),
(106, "wqytyr", "psagdj", 101000, 2)
) AS Employees(EmployeeID, ForeName, LastName, Salary, Dept);
Running the outputter over this test data produces the following XML document:
<Employees>
<Employee><EmployeeID>99</EmployeeID><ForeName>jhgjhgjh</ForeName><LastName>iuoioui</LastName><Salary>100000</Salary><Dept>1</Dept></Employee>
<Employee><EmployeeID>100</EmployeeID><ForeName>oiuoiu</ForeName><LastName>poipoi</LastName><Salary>115000</Salary><Dept>2</Dept></Employee>
<Employee><EmployeeID>101</EmployeeID><ForeName>slkjlkjlkj</ForeName><LastName>nbmnbmn</LastName><Salary>250000</Salary><Dept>3</Dept></Employee>
<Employee><EmployeeID>102</EmployeeID><ForeName>xzcvxvn</ForeName><LastName>bemnyu</LastName><Salary>33800</Salary><Dept>1</Dept></Employee>
<Employee><EmployeeID>103</EmployeeID><ForeName>qutii</ForeName><LastName>uyfiu</LastName><Salary>0</Salary><Dept>2</Dept></Employee>
<Employee><EmployeeID>104</EmployeeID><ForeName>sakak</ForeName><LastName>lkpoi</LastName><Salary>190000</Salary><Dept>3</Dept></Employee>
<Employee><EmployeeID>105</EmployeeID><ForeName>kjsakjk</ForeName><LastName>cvnbmnx</LastName><Salary>-500</Salary><Dept>1</Dept></Employee>
<Employee><EmployeeID>106</EmployeeID><ForeName>wqytyr</ForeName><LastName>psagdj</LastName><Salary>101000</Salary><Dept>2</Dept></Employee>
</Employees>
Optimizing an outputter
If the output file will contain multiple distinct records, you can arrange for the U-SQL runtime to write
groups of records in parallel using separate vertices. You specify whether the file format supports
splitting by applying the [SqlUserDefinedOutputter(AtomicFileProcessing = false)] attribute to your
outputter class.
For more information, see:
In this example, the function takes three parameters and returns a Boolean value; it’s intended to be used
in the WHERE clause of a U-SQL statement, as shown here. The function runs once for each row
processed. Notice that you prefix the function call with the namespace and class in which the function is
defined:
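A sketch of such a call (the namespace and class names are assumptions; the SuspiciousMovement
function is the one described in the note that follows):
@suspicious =
    SELECT Ticker, Price, QuoteTime
    FROM @stockData
    WHERE StockAnalysis.StockFunctions.SuspiciousMovement(Ticker, Price, QuoteTime);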
Note: User-defined functions enable you to maintain state between function calls. The class
in which the function is defined creates its own state cache that you use if you need to compare
the data in different rows. The SuspiciousMovement function illustrated in Demo 1 uses this
technique to store the price and quote time of a stock item in a List collection for comparison the
next time another row for the same stock item appears.
Init. This method runs once, at the very beginning of the aggregation process. You use this method
to initialize any variables required to calculate the aggregate value.
Accumulate. This method is executed for each row being aggregated. The parameters contain the
data for the row, passed in by the U-SQL runtime.
Terminate. This method runs once, at the end of the aggregation process. The value returned by this
method is passed to the U-SQL runtime as the result of the aggregation.
The following example shows a version of the AVG (average) aggregate function (called PositiveAVG) that
discards zero and negative values from its calculations. The value returned is the average of all the
positive data values only:
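A minimal sketch of how such an aggregator might be written (the namespace and type choices are
assumptions):
using Microsoft.Analytics.Interfaces;
namespace CustomAggregations
{
    public class PositiveAVG : IAggregate<int, double>
    {
        private long sum;
        private long count;
        // Init runs once, before any rows in the group are processed
        public override void Init()
        {
            sum = 0;
            count = 0;
        }
        // Accumulate runs once per row; zero and negative values are discarded
        public override void Accumulate(int value)
        {
            if (value > 0)
            {
                sum += value;
                count++;
            }
        }
        // Terminate returns the average of the positive values only
        public override double Terminate()
        {
            return count == 0 ? 0.0 : (double)sum / count;
        }
    }
}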
To run a user-defined aggregator in a U-SQL script, you call the AGG generic function, specifying the
aggregator as the function type parameter, together with the data parameters expected by the
aggregator. Note that, as with any aggregation, you should use a GROUP BY clause to specify the
groupings for the aggregator.
The following example shows how to call the PositiveAVG function from U-SQL:
@testData =
SELECT * FROM
( VALUES
(99, "jhgjhgjh", "iuoioui", 100000, 1),
(100, "oiuoiu", "poipoi", 115000, 2),
(101, "slkjlkjlkj", "nbmnbmn", 250000, 3),
(102, "xzcvxvn", "bemnyu", 33800, 1),
(103, "qutii", "uyfiu", 0, 2),
(104, "sakak", "lkpoi", 190000, 3),
(105, "kjsakjk", "cvnbmnx", -500, 1),
(106, "wqytyr", "psagdj", 101000, 2)
) AS Employees(EmployeeID, ForeName, LastName, Salary, Dept);
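// A sketch of the aggregation itself (the namespace is an assumption). Calling the
// aggregator without a GROUP BY clause produces the single overall average shown first in
// the results below; grouping by Dept produces the per-department averages.
@averageSalaryByDepartment =
    SELECT AGG<CustomAggregations.PositiveAVG>(Salary) AS AverageSalary,
           Dept
    FROM @testData
    GROUP BY Dept;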
OUTPUT @averageSalaryByDepartment
TO "/AverageSalaryByDepartment.csv"
USING Outputters.Csv(quoting: false, outputHeader: true);
AverageSalary
131633.33333333334
AverageSalary Dept
66900 1
108000 2
220000 3
You should note that not all aggregations are associative, and applying the
SqlUserDefinedReducer(IsRecursive = true) attribute to an inherently nonassociative operation will likely
lead to incorrect results.
Note: An associative operator returns the same value regardless of the order in which its
operands are evaluated. For example, the expression a + b + c can be evaluated as (a + b) + c (a
is added to b, and the result is then added to c), or a + (b + c) (b is added to c, and the result is
then added to a). This associativity is used by the built-in SUM aggregate function in U-SQL, and
enables the optimizer to split the data into subsets. The sum of each subset can be computed in
parallel, and the results then added together in a recursive operation. In pseudocode, you might
think of the situation like this:
SUM(a, b, c, d, e) = SUM(SUM(a, b, c), SUM(d, e))
You invoke R code from U-SQL by creating an instance of the Extension.R.Reducer class using the REDUCE
statement. The reducer passes the data held in a U-SQL rowset to the R script as a data frame named
inputFromUSQL. You manipulate the data in this data frame in the same way that you would any regular
R data frame. You pass the results of the processing back to U-SQL by storing them in another data frame
and returning this data frame, or by writing the results to a data frame named outputToUSQL (this is the
default data frame returned by the reducer if you don’t specify otherwise). The reducer converts the data
frame back into a U-SQL rowset.
Note: Lesson 2 describes reducers and the REDUCE statement in more detail.
There are some limitations in the conversion process between a U-SQL rowset and an R data frame; the
rowset or data frame can only contain columns with the double, string, bool, integer, or byte types,
although you can pass byte arrays by serializing them as strings. You should also be aware that the R
Factor type is not available in U-SQL, but you can set the Boolean stringsAsFactors parameter to true
when you call the R reducer—this will convert string values in the U-SQL rowset into factors in the R data
frame. In your R code, you can transform the data into any valid R data type for processing, providing you
convert the data into a type recognized by U-SQL at the end of the routine.
You can include your R code inline in a U-SQL script. The following example shows a U-SQL script that
calls an R routine to create a basic statistical summary of stock price movement data. The R code
generates a data frame listing the ticker, opening price, closing price, lowest price, and highest price for
each stock:
@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);
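// A sketch of the R invocation (the assembly reference normally appears at the top of the
// script; the R code and parameter usage are illustrative assumptions):
REFERENCE ASSEMBLY [ExtR];
DECLARE @myRScript string = @"
quotes <- inputFromUSQL[order(inputFromUSQL$QuoteTime), ]
outputToUSQL <- data.frame(
    Ticker = quotes$Ticker[1],
    OpeningPrice = quotes$Price[1],
    ClosingPrice = quotes$Price[nrow(quotes)],
    LowestPrice = min(quotes$Price),
    HighestPrice = max(quotes$Price),
    stringsAsFactors = FALSE)
";
// Each reducer instance receives the rows for a single ticker
@RScriptOutput =
    REDUCE @stockPriceMovements
    ON Ticker
    PRODUCE Ticker string, OpeningPrice int, ClosingPrice int, LowestPrice int, HighestPrice int
    USING new Extension.R.Reducer(command : @myRScript, rReturnType : "dataframe");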
OUTPUT @RScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);
You can also deploy your R code as a separate script—you must upload it to the ADLS store associated
with your ADLA account, and use the DEPLOY RESOURCE U-SQL command to enable the U-SQL runtime
to locate the script and include it with the job when it is executed. The reducer references the R code
using the scriptFile parameter. The following example shows how to do this:
@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);
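// A sketch of the corresponding invocation using a deployed script (the script path is an
// assumption):
REFERENCE ASSEMBLY [ExtR];
DEPLOY RESOURCE "/Scripts/StockSummary.R";
@RScriptOutput =
    REDUCE @stockPriceMovements
    ON Ticker
    PRODUCE Ticker string, OpeningPrice int, ClosingPrice int, LowestPrice int, HighestPrice int
    USING new Extension.R.Reducer(scriptFile : "StockSummary.R", rReturnType : "dataframe");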
OUTPUT @RScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);
U-SQL currently supports R 3.2.2, and the R runtime includes the base R package and standard R modules,
together with the Microsoft ScaleR package. You can also install and reference other packages (provided
they run on R 3.2.2). To do this, you upload the zip file containing the package to ADLS, add a DEPLOY
RESOURCE statement that references this zip file to the U-SQL script, and then use the R install.packages
command in your R code to load the package. Note that, for security reasons, your R code cannot
download packages from other locations (such as the CRAN repository), and you must set the package
repository (repos) parameter of the install.packages function to NULL. For further details and examples
showing how to use R with U-SQL, see:
The following code shows an example, calling a Python function that generates the same statistical
summary as the R example in the previous topic:
@stockPriceMovements =
EXTRACT Ticker string,
Price int,
QuoteTime string
FROM @stockData
USING Extractors.Csv(skipFirstNRows : 1);
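// A sketch of the start of the Python section (assumed, following the usual U-SQL Python
// extension pattern: the script defines a usqlml_main function that receives each group of
// rows as a pandas DataFrame):
REFERENCE ASSEMBLY [ExtPython];
DECLARE @myPythonScript string = @"
import pandas as pd

def usqlml_main(df):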
    df = df.sort_values(['QuoteTime'])
    prices = df.groupby(['Ticker'])
    openingPrices = prices['Price'].first()
    lowestPrices = prices['Price'].min()
    highestPrices = prices['Price'].max()
    closingPrices = prices['Price'].last()
    result = pd.concat([openingPrices, lowestPrices, highestPrices, closingPrices], axis = 1).reset_index()
    result.columns = ['Ticker', 'OpeningPrice', 'LowestPrice', 'HighestPrice', 'ClosingPrice']
    return result
";
OUTPUT @RScriptOutput
TO @resultsData
USING Outputters.Csv(outputHeader : true, quoting : false);
U-SQL currently supports Python version 3.5.1. The Python extensions include all the standard Python
modules plus Pandas, Numpy, and Numexpr.
Cognitive Services
https://aka.ms/Q1j77v
Imaging—facial detection: Detects one or more human faces in an image. Rectangles show where
the faces are in the image, along with face attributes that contain machine learning-based predictions
of facial features.
Imaging—emotion detection: Takes a facial expression in an image as an input, and returns the
confidence across a set of emotions for each face in the image.
Imaging—object detection and tagging: Returns information about the different items detected in
an image and attempts to label them.
Text—key phrase extraction: Performs textual analysis and identifies key phrases in blocks of text.
U-SQL provides a series of user-defined extractors, appliers, and processors for calling the supported
Cognitive Services APIs:
Cognition.Text.KeyPhraseExtractor. This is a processor that identifies the key phrases from a rowset
containing string data.
Cognition.Text.Splitter. This is an applier that you use to cross-correlate the results of the
KeyPhraseExtractor with the original text to tokenize the key phrases.
These user-defined extractors, appliers, and processors are supplied in a series of assemblies that are
installed as part of the U-SQL extensions. You must add the appropriate REFERENCE ASSEMBLY
statements in your code to load the corresponding assembly into your job. You invoke a user-defined
processor by using the PROCESS command in a U-SQL script. The results of the processing are returned in
the output rowset.
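As an illustration, a key phrase extraction might be invoked like this (a sketch; the input rowset and
column names are assumptions):
REFERENCE ASSEMBLY [TextCommon];
REFERENCE ASSEMBLY [TextKeyPhrase];
@keyPhrases =
    PROCESS @documents
    PRODUCE DocID,
            Text,
            KeyPhrases string
    READONLY DocID, Text
    USING new Cognition.Text.KeyPhraseExtractor();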
Note: Lesson 2 covers user-defined processors and the PROCESS command in more detail.
For more information on using Cognitive Services with U-SQL, and links to sample code, see:
Lesson 2
Optimizing jobs
ADLA is designed to process massive volumes of data as quickly as possible. However, to achieve these
aims, you need to structure your data and implement your processing to take full advantage of the
scalability available to ADLA jobs. This lesson describes how to design jobs that meet these requirements,
together with some issues of which you should be aware when customizing jobs.
Lesson Objectives
After completing this lesson, you will be able to:
Explain how best to partition data held in the ADLA catalog to improve scalability.
Monitor how ADLA jobs are parallelized by examining the vertices created to run these jobs.
You classify the input data for U-SQL jobs into two primary types:
Unstructured files, such as CSV, Text, JSON, XML, and other forms of data input where U-SQL has no
prior knowledge of the data schema. U-SQL requires additional information to partition this data
effectively.
Structured data held in tables in the ADLA catalog. In these cases, the schema is defined by the
definitions of the tables, and the catalog can also contain statistical information about the distribution
and range of values in each column in a table. U-SQL utilizes this information to help optimize jobs.
U-SQL cannot easily partition unstructured data that is held in a single file. Even if a file is splittable, when
the data has been read in, the processing performed by the job might necessitate copying rows between
vertices. This is potentially time-consuming and expensive.
For example, the following U-SQL job reads sales data from a CSV file, and then performs an aggregation
that calculates the total sales value of all items sold in New York:
@nyData =
SELECT state, SUM(numSold * pricePerItem) AS value
FROM @salesData
WHERE state == "NY"
GROUP BY state;
…
Although this approach works, it is very inefficient. Every row has to be retrieved and then grouped by
state before the aggregation can be performed, yet only the results for New York are actually required;
the remainder are discarded. This technique wastes time and compute resources on unnecessary I/O and
processing.
A better approach is to partition data manually by using file groups. Module 5 described how it’s possible
to read multiple files, and how you generate filenames for the EXTRACT statement dynamically by using
virtual columns. In the following example, the sales data is now held in separate CSV files, one for each
state:
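A sketch of the corresponding EXTRACT statement, using a virtual filename column (the path pattern and
column names are assumptions):
@salesData =
    EXTRACT numSold int,
            pricePerItem decimal,
            state string,
            filename string
    FROM "/Sales/{filename}.csv"
    USING Extractors.Csv();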
@nyData =
SELECT state, SUM(numSold * pricePerItem) AS value
FROM @salesData
WHERE filename == "NY"
GROUP BY state;
The U-SQL runtime applies the predicate in the WHERE clause of the SELECT statement to the EXTRACT
statement, effectively eliminating all irrelevant data from the input before it is read. This job now only
retrieves and processes data for New York.
Note: Hint: If you are aiming to achieve maximum efficiency with unstructured data files,
you should implement manual partitioning, and use a file format that supports parallel
processing by being splittable.
You also have the option to partition tables. A partition ensures that related data is held in the same
locality in the catalog. Careful partition design can eliminate significant overhead when performing
queries by reducing the amount of I/O required.
For example, consider the sales data scenario from the previous section. You could store the sales data in
a table in the catalog rather than as a set of CSV files. The following CREATE TABLE statement shows one
possible implementation of this table:
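A sketch of such a table (names are assumptions; the clustered index over ProductID matches the
discussion that follows):
CREATE TABLE SalesData
(
    ProductID int,
    State string,
    NumSold int,
    PricePerItem decimal,
    INDEX idx_SalesData CLUSTERED (ProductID ASC)
    DISTRIBUTED BY HASH (ProductID)
);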
Now consider the following ADLA job. This job generates a report summarizing the total value of all sales
for all products sold in New York:
@nySales =
SELECT ProductID, SUM(NumSold * PricePerItem) AS TotalSalesValue
FROM @salesData
WHERE State == "NY"
GROUP BY ProductID;
OUTPUT @nySales
TO "/NYSales.csv"
USING Outputters.Csv(outputHeader: true, quoting: false);
The index over the ProductID column helps to optimize the GROUP BY clause in the query run by this job,
but the data is potentially spread across the entire database. To satisfy this query, the U-SQL runtime will
need to examine the entire table.
Partitioning the data by state can alleviate the situation. The version of the table shown here has the same
set of columns and index as the previous example, but the data is partitioned by state. This ensures that
the data for a given state is held in the same partition in the database. In this case, performing the
preceding query will only require that the U-SQL runtime retrieves data from the partition holding data
for New York; it can ignore all other partitions.
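A sketch of the partitioned version (note that, for a partitioned U-SQL table, you must also add the
individual partitions with ALTER TABLE ... ADD PARTITION before loading data):
CREATE TABLE SalesDataByState
(
    ProductID int,
    State string,
    NumSold int,
    PricePerItem decimal,
    INDEX idx_SalesData CLUSTERED (ProductID ASC)
    PARTITIONED BY (State)
    DISTRIBUTED BY HASH (ProductID)
);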
How best to spread the effort required across resources to minimize the response time. To
continue the previous example, it might be possible to retrieve and process rows independently,
generate a set of intermediate results, and then combine these results together to generate the final
output (much like the map/reduce strategy of systems such as Hadoop). The optimizer will determine
whether it’s possible to parallelize the tasks for each stage of the processing, and allocate each task to
a vertex for execution.
Note: This is a simplified description of the work performed by the U-SQL runtime. For
complex jobs, the runtime might have to perform several iterations of reading, processing, and
combining data as it processes and refines results.
A vertex represents the resources used to perform a given task. Each vertex runs using an Analytics Unit
(AU), and each AU provides the processing power of two CPU cores and 6 GB of RAM. Additionally, to
avoid runaway costs, a vertex is allotted a maximum of five hours of runtime before it is forcibly
terminated.
Note: The time and memory limits apply to each individual vertex. A job that
comprises many stages, each running its own set of vertices, can run for much longer than five
hours and consume more than 6 GB of memory in total.
If you find that a vertex terminates before completion, either by timing out or requiring too much
memory, you should look at the work it’s being asked to do. Typically, this is a result of skewed data;
perhaps there are a vast number of records in one partition compared to others, for example—in which
case, you might need to adjust your partitioning or distribution strategies. It might also be necessary to
rephrase your job to break up the processing into smaller, less resource-hungry steps.
Resolve data-skew problems by using Azure Data Lake Tools for Visual Studio
https://aka.ms/Umgfek
The U-SQL optimizer depends on several sources of information when deciding how best to run a job.
These sources include the structure of the data (the optimizer can generate an execution plan more easily
for a table in the ADLA catalog than it can for a data file), and estimates of the amount of data likely to be
retrieved and processed, based on the statistics for each table. These statistics are not maintained
automatically, so if the distribution of data in a table changes significantly, you should regenerate them
by using the UPDATE STATISTICS command.
Vertex Execution View. This view contains a series of bar charts illustrating the time spent creating,
queuing, and running each vertex. The runtime for each vertex is arguably the item of most
significance. You also use this view to examine the relationships between vertices. For example, one
vertex (referred to as a downstream vertex) might depend on the work performed by another
(referred to as an upstream vertex). If you notice that one vertex is taking much longer than others,
this could be the result of data skew.
Stage Scatter View. This view displays the amount of I/O performed by each vertex, and the time
spent performing this I/O. Again, this is a useful tool for spotting skewed data.
Vertex Operator View. Use this view to examine the way in which the various U-SQL operators were
called to process the data in that stage. An operator is an item such as an extractor, outputter, or
aggregator (including the user-defined implementations described in Lesson 1), and also other
features such as processors, appliers, combiners, and reducers (described later in this lesson).
Use Job Browser and Job View for Azure Data Lake Analytics jobs
https://aka.ms/Kjg61l
You use the PROCESS U-SQL command to invoke a user-defined processor. You provide a rowset
containing the data to be processed, and the definition of the expected result set, as shown here:
// Retrieve the data from the StockPriceData.csv file in Data Lake Storage
@stockData = EXTRACT
Ticker string,
Price int,
QuoteTime DateTime
FROM "/StockPriceData.csv"
USING Extractors.Csv(skipFirstNRows: 1);
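The invocation might look like this (a sketch; the processor class name is an assumption, based on the
SuspiciousMovement scenario described in Lesson 1):
@processedData =
    PROCESS @stockData
    PRODUCE Ticker string,
            Price int,
            QuoteTime DateTime,
            Suspicious string
    USING new CustomOperators.DetectSuspiciousMovement();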
Each row produced by the processor will either have an “X” in the Suspicious column, or it will be blank.
Those with an “X” are suspect:
Ticker Price QuoteTime Suspicious
...
SKGG 34 2017-08-27T15:46:05.7051528+01:00
HXA2 11 2017-08-27T15:45:05.7051528+01:00
BW5P 59 2017-08-27T15:45:05.7051528+01:00 X
YVGA 73 2017-08-27T15:49:05.7051528+01:00
SUR1 27 2017-08-27T15:48:05.7051528+01:00
GHRC 82 2017-08-27T15:48:05.7051528+01:00
DBWM 14 2017-08-27T15:48:05.7051528+01:00
1PZ2 25 2017-08-27T15:45:05.7051528+01:00 X
SWDN 7 2017-08-27T15:45:05.7051528+01:00
CU14 26 2017-08-27T15:46:05.7051528+01:00
KV5X 39 2017-08-27T15:45:05.7051528+01:00
UYP5 2 2017-08-27T15:48:05.7051528+01:00 X
ALNY 91 2017-08-27T15:45:05.7051528+01:00 X
Q3DH 92 2017-08-27T15:46:05.7051528+01:00
LJU5 11 2017-08-27T15:46:05.7051528+01:00
...
A user-defined applier extends the Microsoft.Analytics.Interfaces.IApplier abstract base. This class provides
the Apply method that you should override in your own code:
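A skeleton override looks like this:
public override IEnumerable<IRow> Apply(IRow input, IUpdatableRow output)
{
    // Read columns from the input row, set columns on the output row, and yield
    // zero or more output rows for this single input row
    yield break;
}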
The following example shows an applier for the stock market scenario. The input rows contain the ticker,
current price, and quote time. The applier finds the opening price for the stock (code not shown) and the
percentage difference between the opening price and the current price. The output row generated
contains the ticker, opening price, and percentage change in price. You then use this data in a U-SQL job
with the CROSS APPLY operator to join each stock price change with its opening price and percentage
change. An analyst sees at a glance whether a stock is performing poorly or well.
// Retrieve the data from the StockPriceData.csv file in Data Lake Storage
@stockData = EXTRACT
Ticker string,
Price int,
QuoteTime DateTime
FROM "/StockPriceData.csv"
USING Extractors.Csv(skipFirstNRows: 1);
// Use the custom applier to add the opening price and percentage price change to the
data
@result =
SELECT S.Ticker, S.Price, S.QuoteTime, OpeningPrice, PercentChange
FROM @stockData AS S
CROSS APPLY
USING new CustomOperators.GetStockAnalytics() AS PriceData(Ticker string,
OpeningPrice int, PercentChange double);
Ticker Price QuoteTime OpeningPrice PercentChange
DBZG 8 2017-08-28T12:07:05.7051528+01:00 14 -42.857142857142854
UH1P 3 2017-08-28T11:44:05.7051528+01:00 58 -94.827586206896555
FFAU 1 2017-08-28T11:41:05.7051528+01:00 7 -85.71428571428570
H3XQ 35 2017-08-28T11:40:05.7051528+01:00 48 -27.08333333333333
HXA2 4 2017-08-28T12:09:05.7051528+01:00 11 -63.636363636363633
4U0Z 4 2017-08-28T12:16:05.7051528+01:00 51 -92.156862745098039
BW5P 4 2017-08-28T12:51:05.7051528+01:00 59 -93.220338983050837
The first version provides references to the rowsets on the left and right sides of the join operation. You
iterate through these rowsets to combine rows in whatever manner you require, and then return each row
one at a time using the C# yield statement. Note that enumerating either rowset is a forward-only action
that you can perform only once, so it’s common practice to cache the data from these rowsets in a list or
similar collection before processing them.
The second version takes the left and right rowsets as a list, but otherwise its purpose is the same as the
first version. It’s more common to override the first version and leave the second with its default
implementation in the ICombiner base class.
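For reference, a skeleton override of the first version might look like this (a sketch):
public override IEnumerable<IRow> Combine(IRowset left, IRowset right, IUpdatableRow output)
{
    // Cache one rowset (each rowset can only be enumerated once), iterate the other,
    // populate the output row for each combined pair, and yield the results
    yield break;
}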
As an example, consider the following sample data containing a list of employees in an organization, the
departments in which they work, and the roles that they have fulfilled over time in that department. The
Role History column is a comma separated list of varying length, depending on the roles that the
employee has had:
Department Name
100 Sales
101 Marketing
102 Accounts
103 Personnel
104 Engineering
105 Purchasing
106 Manufacturing
107 Design
You can join the Employee and Department data easily on the department ID column to combine this
data. However, suppose you want to generate a report that lists each employee and their roles in a
department in a tabular format. It might look like this:
…
100043 Dgssbyfz Product Support associate
…
This type of transformation is not easy to generate from the specified input formats using the built-in U-
SQL operators. However, this is the type of operation for which you can create a user-defined combiner.
The following code shows one possible solution:
// Iterate through this list of roles for the employee and output each one in turn
foreach (string role in rolesForEmployee)
{
output.Set<int>("EmpID", leftRow.Get<int>("EmployeeID"));
output.Set<string>("EmpName", leftRow.Get<string>("EmployeeName"));
output.Set<string>("DeptName", department.name);
output.Set<string>("Role", role);
yield return output.AsReadOnly();
}
}
}
}
}
The following U-SQL script shows how to call this combiner using the COMBINE statement. Note that the
ON clause still specifies how to join data in both rowsets, but the FindDepartmentRoles combiner
performs the actual join operation.
@departmentDetails =
EXTRACT
DepartmentID int,
DepartmentName string
FROM "/Departments.tsv"
USING Extractors.Tsv(skipFirstNRows: 1);
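A sketch of the COMBINE invocation (the employee rowset and its columns are assumptions; the output
columns match the combiner code shown earlier):
@departmentRoles =
    COMBINE @employeeDetails AS E WITH @departmentDetails AS D
    ON E.DepartmentID == D.DepartmentID
    PRODUCE EmpID int,
            EmpName string,
            DeptName string,
            Role string
    USING new CustomOperators.FindDepartmentRoles();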
You can give the U-SQL runtime a hint about how a combiner uses its input rowsets by applying the
SqlUserDefinedCombiner attribute with a CombinerMode value:
Full. This value indicates that the combiner performs a full join (Cartesian product), and that every
row in the left rowset will be joined with every row in the right rowset. This causes the entire left and
right rowsets to be passed to a single instance of the combiner, and results in the lowest degree of
parallelism.
Inner. This value specifies that the combiner performs an inner join. Rows in the left rowset will only
be joined with the corresponding rows in the right rowset. Depending on the data sources, the U-SQL
runtime uses this information to partition the rowsets and arrange for each partition to be processed
in parallel.
Left. This value indicates that the combiner implements a left outer join, and that all rows in the left
rowset will be utilized by every instance of the combiner, even if there are no corresponding rows in
the right rowset.
Right. This value specifies that the combiner performs a right outer join, and that each instance of the
combiner is passed all of the data in the right rowset.
However, you should only do this if the groups processed by the reducer are independent of each other.
The pattern is similar to that of combiners and appliers; the input variable contains the original rows, and
you filter (reduce) or transform these rows, write them to the output variable, and return each instance of
the output variable using the C# yield statement. There is, however, an important difference in the way in
which the data is presented by the U-SQL runtime to the reducer. To optimize the reduction process, the
U-SQL runtime will break the data down into groups, according to the values in a column or list of
columns that you specify at runtime. Each instance of the reducer receives the data for a single group.
This approach enables the U-SQL runtime to parallelize the process; each group can be handled by a
separate vertex. When you iterate through the data in the input rowset, you should therefore remember
that the rowset will contain the data for a single group only.
The following example shows a reducer that finds the number of employees for each role in each
department. It’s assumed that the data is grouped by department (you use the REDUCE statement, shown
later, to do this), so each input rowset will contain the data for a single department:
// Iterate through the list of roles for the employee and aggregate the role counts
foreach (string role in rolesForEmployee)
{
roles[role]++;
}
}
You run the reducer from a U-SQL script by using the REDUCE statement. The following code sorts
employee data by department, and then splits the data into groups based on the department values. Each
group is passed to an instance of the ReduceByRole reducer that returns the department ID and the
number of instances of each role in that department.
// Use the custom reducer to find the number of records for each role in each department
@rolesByDepartment =
REDUCE @employeeWorkHistoryData
PRESORT DepartmentID
ON DepartmentID
PRODUCE DepartmentID int,
NumberOfAssociates int,
NumberOfEmployees int,
NumberOfTeamLeaders int,
NumberOfManagers int,
NumberOfVicePresidents int,
NumberOfPresidents int
USING new CustomOperators.ReduceByRole();
Lesson 3
Managing jobs and protecting resources
ADLA jobs utilize resources that might be expensive, or access data that should remain confidential.
Therefore, you need to protect the data and resources available to an ADLA account to prevent
unauthorized use and access. This lesson describes how to perform these tasks.
Note: Many of the concepts that apply for protecting ADLA jobs are similar to those of
ADLS. For example, you protect an ADLA account at the network level by setting up firewall rules,
and you use Role-Based Access Control (RBAC) to control how users interact with an account.
Lesson Objectives
After completing this lesson, you will be able to:
As with ADLS, you authorize the access to resources in ADLA at two levels—by using RBAC to control the
operations that users can perform, and by using Access Control Lists (ACLs) to specify which files and
catalogs they use.
You assign users to roles using the Access Control blade for the ADLA account in the Azure portal. The set
of roles, and the permissions that they enable, are the same as for other services in Azure (Owner,
Contributor, Reader, and so on). For more information about RBAC, see:
ADLA uses a subset of the ACLs available to other services for controlling access to resources in catalogs.
Specifically, you use the Data Explorer for the ADLA account in the Azure portal to assign read/write
permissions at the catalog level; execute permission is not applicable to a catalog and is not an available
option. Note that these permissions apply across an entire catalog. You cannot currently grant or deny
access to individual objects in a catalog at the user level.
The ADLA Overview blade includes the Add User Wizard that you use to step through the tasks of adding
a user to the account, assigning roles, and setting file permissions in a systematic manner.
Note: ADLA is provided with a simple software-based firewall that enables you to restrict
access to users connecting from known locations only. The firewall is disabled by default, but you
can enable it by using the setting in the ADLA blade in the portal. If you have other Azure services
that access the ADLA account, you should also enable access to Azure services through the
firewall (this is a separate switch in the same blade).
Job level policies, which apply to jobs run by specific users of the account.
You create a policy using the Properties section in the ADLA blade, in the Azure portal. An account level
policy enables you to specify:
The maximum number of AUs that an account can use when submitting jobs. These AUs will be
shared by all users running jobs concurrently. For example, if you set the limit to 200, a single user
could execute a job that consumes 200 AUs, or 10 users could run jobs that consume 20 AUs each.
The maximum number of concurrent jobs that an account can execute. When this limit is reached,
jobs are queued until other jobs have finished running.
The number of days to save U-SQL job resources (such as scripts and job graphs) in the ADLS account
associated with the ADLA account.
A job level policy is applied to specific users or groups in an ADLA account. It determines:
Auditing jobs
The data used by an ADLA job might be sensitive.
Additionally, the resources consumed by a job
can be expensive if the job is long-running.
Therefore, it’s vital that you maintain an audit
trail of jobs, so that you can trace back to the
sources of any security and resource issues.
Monitoring jobs
The ADLA monitoring tools provide a low-level
insight into the details of jobs, both as they are
running and after they have completed. The
ADLA blade contains three utilities:
The Duplicate Script button opens an editor window with a copy of the U-SQL script for the job. You can
edit and save this script, and also resubmit it to run the job again.
If a job has failed, you will see the error that caused the failure. You click the error message to obtain
additional information.
Job Insights. This utility displays a graph showing the number of jobs that have been submitted, and
indicating which ones failed, succeeded, or were cancelled. Currently running jobs are also included.
You will also see whether jobs are being queued before execution. A long queue could be the result
of a series of resource-hogging jobs currently being executed, or might be an indication that the
ADLA policies being applied are too limiting.
If you switch the graph to display Compute Hours, you will see how many AU hours have been consumed
recently.
Metrics. This utility gives quick access to a set of graphs showing the numbers of jobs or compute
hours for tasks that have succeeded, failed, or been cancelled.
For more information about monitoring jobs using the Azure portal, see:
https://aka.ms/Hvh5lq
For this phase of the project, you are going to use ADLA to perform a range of analyses on data that has
been captured and saved into the traffic surveillance system by using tools such as Azure Stream
Analytics.
Objectives
After completing this lab, you will be able to:
Use a custom extractor to read JSON file data, and use a custom outputter to write XML.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
Exercise 1: Use a custom extractor to read JSON file data, and a custom
outputter to write XML
Scenario
The Azure Stream Analytics job that you worked with in previous labs saves speed camera data to ADLS in
JSON format. You want to be able to read and analyze this data using ADLA. You also want to be able to
write a summary of the data to XML files for use by other tools.
In this exercise, you will create and use a custom extractor to read JSON file data into ADLA, and a
custom outputter to write data to XML.
Results: At the end of this exercise, you will have deployed and tested a custom extractor, and deployed
and tested a custom outputter.
In this exercise, you will first create a table in the ADLA catalog, then use a U-SQL job to analyze data at a
specific camera location; finally, you will redistribute the data in the catalog to optimize query
performance.
Task 2: Create a U-SQL job that analyzes data for a specific camera
Results: At the end of this exercise, you will have created a table in the ADLA catalog, analyzed data in
this table, and redistributed data in the catalog for optimal retrieval.
You have stolen vehicle data covering eight years, organized in folders by year/month/day, with a total of
2,914 separate CSV files. The data contains the vehicle registration, date stolen, and date recovered (which
could be empty). The same vehicle could be reported stolen and recovered several times in these records.
Additionally, the police aren't always informed when a vehicle is recovered, so some records might have
an empty date recovered, even if the vehicle is no longer missing. This means that a vehicle could be
reported as stolen on several dates, but not recovered in the intervening period. The vehicle might be
recovered later. Therefore, to determine whether a vehicle should be considered as stolen, you need to
look at the most recent record in the history. If the vehicle has a recovery date, it is not missing, regardless
of what any previous records might infer. If no recovery date is shown on this record, the date stolen
should be used.
Because of these peculiarities in the data concerning stolen and recovery dates, it’s difficult to process this
data using regular U-SQL operators; you will, therefore, use a custom reducer that performs the necessary
magic.
In this exercise, you will first upload the dataset to ADLS. You will deploy and test the custom reducer, and
then perform some analyses to identify stolen vehicles in the speed camera data.
1. Preparation: upload stolen vehicle data to ADLS (using AzCopy and Adlcopy)
2. Examine and deploy a custom reducer
Task 1: Preparation: upload stolen vehicle data to ADLS (using AzCopy and Adlcopy)
Task 4: Analyze the speed camera data to check for stolen vehicles
Results: At the end of this exercise, you will have uploaded speed camera data to ADLS, examined the
code in the custom reducer, deployed and tested the custom reducer, and then used the reducer to
attempt to identify stolen vehicles in the speed camera data.
In this exercise, you will call an R script from a U-SQL job to determine if there is any correlation between
a vehicle being identified as speeding and being identified as being stolen; you will then repeat this
analysis using a Python script in a U-SQL job.
Results: At the end of this exercise, you will have used an R script and a Python script in U-SQL jobs.
Question: Why do you need to deploy your own custom extractor for JSON file data?
Question: Why is it important to optimize the table structure in your ADLA catalog?
Module 7
Implementing Azure SQL Data Warehouse
Contents:
Module Overview 7-1
Lesson 1: Introduction to SQL Data Warehouse 7-2
Module Overview
This module describes how to utilize the power of Microsoft® Azure® SQL Data Warehouse to store and
analyze large volumes of data.
Objectives
By the end of this module, you will be able to:
Use tools and techniques for importing data into SQL Data Warehouse at scale.
Lesson 1
Introduction to SQL Data Warehouse
Historically, companies had to invest heavily in infrastructure up front to store and process massive data
volumes. By using SQL Data Warehouse, companies can now store large volumes of data and process
them efficiently, in terms of cost and effort, without having to maintain an expensive infrastructure.
SQL Data Warehouse is a massively parallel processing (MPP) cloud-based platform that uses distributed
database technology. SQL Data Warehouse is one of the many pay-as-you-go services within Azure that
you can use to store and process large volumes of data, and produce rich analytics and insights very
quickly.
Lesson Objectives
By the end of this lesson, you should be able to:
Compute nodes
Storage
Control node—the control node receives SQL queries from users and optimizes them. All applications
and connections communicate via the control node because it is the user-facing part of SQL Data
Warehouse. In fact, the control node is a SQL Server® Database, so connecting to it is exactly like
connecting to a SQL Server Database. The control node receives a SQL query and converts it into multiple
SQL queries that execute on multiple compute nodes. The control node then coordinates the data
movement and computation required to execute those parallel queries over the distributed data.
Compute nodes—the compute nodes are the brains behind the SQL Data Warehouse. Compute nodes
receive the required data and the query from the control node, then execute the query in parallel with
other compute nodes and produce output that is sent back to the control node. The control node
aggregates all the results from various compute nodes and produces the results that are sent back to the
user. Each compute node is a SQL Server Database.
Storage—all data within SQL Data Warehouse is stored in Microsoft Azure Blob storage. Compute nodes
read and write data directly from Blob storage. In SQL Data Warehouse, storage and compute are
independent so they can be scaled up or down separately as per the user’s needs.
Data Movement Service (DMS)—this is an internal Windows® service that is not exposed to the end
user. DMS helps to move data between nodes so that queries execute in parallel.
SQL Data Warehouse is suitable for big data scenarios that involve working with large datasets:
Long running queries that analyze a large dataset (for example, big data).
Short running queries that you use for reporting and generating aggregations.
Performing analytics over large volumes of historical data that doesn’t change.
SQL Data Warehouse is less well suited to operational workloads that have a high frequency of small reads and writes.
Attribute Description
Database name A unique name for the SQL Data Warehouse within your database server.
Resource group Name of the resource group where the SQL Data Warehouse will reside—
this provides a way to group resources together in Azure so it’s easy to
locate them.
Source Three options that could be used as a source are currently available:
Blank database—creates a blank SQL Data Warehouse.
Sample—creates a SQL Data Warehouse based on the
AdventureWorks database.
Backup—creates a SQL Data Warehouse from an existing backup.
Server Name of the database server that contains the SQL Data Warehouse. You
could create a new server that hosts this or choose an existing database
server.
Location The location of the database server that hosts the SQL Data Warehouse.
Collation The collation you use when you create the SQL Data Warehouse. The
default collation is SQL_Latin1_General_CP1_CI_AS. This cannot be
changed after the database is created.
Performance level This is the compute power required for running the SQL Data Warehouse.
It’s good practice to start with 400 DWUs, and then scale up or down as
required. If needed, you can increase this to improve performance.
You could also create a SQL Data Warehouse using PowerShell™ commands. Remember the following key
information when you create SQL Data Warehouses using PowerShell:
You must always set the Edition parameter to “DataWarehouse” to create a SQL Data Warehouse.
The CollationName parameter is optional. If not specified, the SQL Data Warehouse will use the
default collation (SQL_Latin1_General_CP1_CI_AS).
The MaxSizeBytes parameter is also optional. If not specified, the database maximum size is set as 10
GB.
For example, the following PowerShell command will create a new SQL Data Warehouse, called DWDB1,
on a server called DWSERVER1, in a resource group called DWRG1, and with 400 DWUs:
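A sketch of the command (in practice, Azure server names are lowercase; "DW400" is the service
objective name corresponding to 400 DWUs):
New-AzureRmSqlDatabase -ResourceGroupName "DWRG1" `
    -ServerName "dwserver1" `
    -DatabaseName "DWDB1" `
    -Edition "DataWarehouse" `
    -RequestedServiceObjectiveName "DW400"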
Azure portal
Visual Studio
PowerShell
The fully qualified name of the server that hosts the SQL Data Warehouse (you get this information
from the Azure portal).
Server name: the fully qualified name of the SQL Data Warehouse server. This will be of the form
<server name>.database.windows.net.
Get-AzureRmSqlDatabase
Get-AzureRmSqlDeletedDatabaseBackup
Get-AzureRmSqlDatabaseRestorePoints
New-AzureRmSqlDatabase
Remove-AzureRmSqlDatabase
Restore-AzureRmSqlDatabase
Resume-AzureRmSqlDatabase
Select-AzureRmSubscription
Set-AzureRmSqlDatabase
Suspend-AzureRmSqlDatabase
Question: Would you use SQL Data Warehouse for long running queries that analyze a large
dataset?
Connect to the SQL Data Warehouse using Visual Studio and SQL Server Management Studio.
Access the SQL Data Warehouse using PowerShell.
Lesson 2
Designing tables for efficient queries
SQL Data Warehouse is a very large scale distributed database system, and the way in which you design
tables has a direct effect on query performance. Choosing an appropriate data distribution strategy helps
you achieve optimal query performance; it's also vital to minimize data movement between distributions.
Lesson Objectives
By the end of this lesson, you should be able to:
In the sales database for a global organization, the details of customers and the purchases that they
have made are most likely to be used in the same locality (for example, US East, US West, UK,
Australia, and so on). The system is unlikely to perform queries that require combining the details of
sales made in the US East territory with customers located in Australia with any frequency. Therefore,
it would make sense to organize the data according to this locality.
Queries that report on products sold by the organization might often need to fetch the data for items
that are used together. For example, customers might frequently purchase hammers and nails
together, so it could make sense to ensure that the sales records for hammers and nails are stored in
the same database.
The corporation performs daily financial analyses of sales, calculating the revenue for the previous
day. In this case, it would make sense to ensure that the financial records for a given day are all held
together.
If the corporation also performs analyses by week, month, quarter, or year (to help spot trends), the
historical summary information generated by the daily analysis—and for previous weeks, months, and
years—can be stored in lookup tables. These lookup tables are likely to be small compared to the rest
of the financial data. The information that the tables contain is static, so they could be copied to
every database in the data warehouse to reduce the costs of retrieving this data.
Round robin
Hashing
Replication
If your data doesn’t frequently join with data from other tables.
The following example shows how to create a table using the round robin distribution method:
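A sketch (the table and column names are assumptions, chosen to match the catalog listing shown later
in this lesson):
CREATE TABLE dbo.CustomerPortfolio
(
    CustomerID int NOT NULL,
    Ticker varchar(10) NOT NULL,
    Holding int NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);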
Hashing is a very common and effective data distribution method. The data is distributed based on the
hash value of a single column that you select, according to some hashing algorithm. This distribution
column then dictates how the data is spread across the underlying distributions. Items that have the same
data value and data type for the hash key always end up in the same distribution.
The following example creates a table using the hash distribution method:
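A sketch, hashing on the Ticker column (names are assumptions):
CREATE TABLE dbo.StockPriceMovement
(
    Ticker varchar(10) NOT NULL,
    Price int NOT NULL,
    QuoteTime datetime NOT NULL
)
WITH (DISTRIBUTION = HASH(Ticker));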
Note the following key points you need to consider to identify the best distribution column:
Choose an appropriate distribution column that reduces the data movement activity between
different distributions.
For more information about using the round robin and hash methods of distributing data, see:
Distributing tables in SQL Data Warehouse
https://aka.ms/S5fh5a
Replication is very good for small lookup tables or dimension tables that are frequently joined with other
big tables. Instead of distributing small sections of data across the underlying distributions for small
tables, the replication data strategy creates a copy of the entire table in each of the underlying
distributions.
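A sketch of a replicated lookup table (names are assumptions):
CREATE TABLE dbo.Stocks
(
    Ticker varchar(10) NOT NULL,
    CompanyName varchar(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);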
For further details and considerations for replicating data in SQL Data Warehouse, see:
Design guidance for using replicated tables in Azure SQL Data Warehouse
https://aka.ms/Wkdg8p
You examine the system catalog views to see the distribution policy for each table in the data warehouse,
as follows:
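A query along these lines (a sketch) produces the results shown below:
SELECT t.name, d.distribution_policy_desc
FROM sys.tables AS t
JOIN sys.pdw_table_distribution_properties AS d
    ON t.object_id = d.object_id;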
Results
--------
Name Distribution_policy_desc
CustomerPortfolio ROUND_ROBIN
StockPriceMovement HASH
Stocks REPLICATE
Clustered indexes
Nonclustered indexes
By contrast, nonclustered indexes do not alter the way in which the data rows are stored in a table.
Nonclustered indexes are created as separate objects from the database table and have pointers back to
the data rows in the table. You create a nonclustered index by using the CREATE INDEX construct without
specifying any other keywords. You can have more than one nonclustered index in a table. Nonclustered
indexes are good for queries that filter on a column. If you create a nonclustered index on that filtered
column, you will improve query performance.
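For example (a sketch, using the table defined earlier):
CREATE INDEX ix_StockPriceMovement_Ticker
ON dbo.StockPriceMovement (Ticker);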
As a rule, if a table has less than 100 million rows, it’s advisable to create it as a heap table within SQL
Data Warehouse. To create a heap table, you specify the HEAP keyword in the WITH clause. The following
example shows the syntax you need to create a heap table:
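A sketch (table and column names are assumptions):
CREATE TABLE dbo.StockQuotesStaging
(
    Ticker varchar(10) NOT NULL,
    Price int NOT NULL,
    QuoteTime datetime NOT NULL
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);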
If you have an existing table with indexes into which you need to load additional data, you drop the
indexes first, upload the data, and then rebuild the indexes. To do this, you use the DROP INDEX and
CREATE INDEX statements:
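A sketch of this pattern, using the nonclustered index created earlier:
DROP INDEX ix_StockPriceMovement_Ticker ON dbo.StockPriceMovement;
-- Load the new data here (for example, with bcp or PolyBase), then rebuild the index
CREATE INDEX ix_StockPriceMovement_Ticker
ON dbo.StockPriceMovement (Ticker);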
A clustered columnstore index physically reorganizes a table. The data is divided into a series of
rowgroups of up to 1 million rows (approximately) that are compressed to improve I/O performance; the
greater the compression ratio, the more data is retrieved in a single I/O operation. Each rowgroup is then
divided into a set of column segments, one segment for each column. The contents of each column
segment are stored together. When querying data by column, the data warehouse simply needs to read
the column segments for that column. Decompression is performed quickly in memory, and the results
returned to the query. Note that, when you create a clustered columnstore index over a table, you don’t
specify which columns to index; the entire table is indexed.
You create a clustered columnstore index by using CLUSTERED COLUMNSTORE INDEX in the WITH clause
of the CREATE TABLE command:
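A sketch (table and column names are assumptions):
CREATE TABLE dbo.FactStockTrades
(
    TradeID bigint NOT NULL,
    Ticker varchar(10) NOT NULL,
    Price int NOT NULL,
    TradeDate date NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH(Ticker));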
When you create a table in SQL Data Warehouse, a clustered columnstore index is built by default if no
other index is specified. Clustered columnstore indexes are usually the preferred option when you don’t
know what kind of queries users are likely to run.
It’s important to note that a table created with a clustered columnstore index can also have other
nonclustered indexes defined on the table.
Columnstore indexes—overview
https://aka.ms/Csd3kb
Partitioning tables
You use partitioning to group the data in a table
into a series of chunks according to the value
held by a specified column. The data for each
chunk is stored together. The primary intent
behind partitioning in SQL Data Warehouse is to
improve the performance of bulk load
operations; if the data being uploaded conforms
to the partitioning scheme, SQL Data Warehouse
quickly determines where data should be stored.
Partitioning is also useful with queries that filter
data based on the partition column because data
in partitions that don’t match the filter is quickly
eliminated.
Partitioning works on top of the distribution mechanism implemented by a table—you apply partitioning
to round robin and hashed tables. Bear in mind that, before partitioning, the data is already spread across
60 distributions, and partitioning operates at the database (distribution) level. You should be mindful of
how many partitions you create and the size of each one—dividing the data into a large number of small
partitions might hinder performance rather than improve it. The recommendation is to avoid having
fewer than one million rows per partition per distribution.
Generally, you partition tables based on a date column. For example, you might partition sales data by
month. When you query data for a specific month or a few days of data within a month, the database
retrieves that specific month partition instead of performing a full table scan. Similarly, when you want to
delete a month’s sales data because it is too old, it’s easy to delete a single partition instead of deleting
row by row.
Note that the partitioning scheme used by SQL Data Warehouse is simpler than that implemented by SQL
Database—you specify the partition column and the ranges for each partition. You can’t create your own
partition functions.
The following example creates a table with data distributed by using a hash function, organized as a
columnstore, and partitioned by month. This is a common approach for defining large fact tables with
many millions of rows that are time-oriented:
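A sketch (the table, column names, and boundary dates are assumptions):
CREATE TABLE dbo.FactSales
(
    SaleID bigint NOT NULL,
    ProductID int NOT NULL,
    SaleDate date NOT NULL,
    NumSold int NOT NULL,
    PricePerItem money NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductID),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (SaleDate RANGE RIGHT FOR VALUES
        ('20180101', '20180201', '20180301', '20180401'))
);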
For more information about partitioning tables in SQL Data Warehouse, see:
Query the system catalog to examine the structure and indexes of tables.
Question: When you are not sure about which queries are run against a table in SQL Data
Warehouse, what type of index will you use?
Lesson 3
Importing data into SQL Data Warehouse
When you create a SQL Data Warehouse, the next logical step is to import data into it. This lesson explains
the various ways in which you can import data into SQL Data Warehouse.
Lesson Objectives
By the end of this lesson, you should be able to:
Import data into a SQL Data Warehouse from a SQL Server database using the bulk copy program
(bcp).
Import data into a SQL Data Warehouse from a SQL Server database using the AzCopy utility.
Import data into a SQL Data Warehouse from a SQL Server database using SQL Server Integration
Services (SSIS).
Import data into a SQL Data Warehouse from various sources using PolyBase.
Import data into a SQL Data Warehouse from Blob storage.
Import data into a SQL Data Warehouse from an Azure Data Lake Store.
Import data into a SQL Data Warehouse from Azure Stream Analytics.
Describe best practices for loading data into a SQL Data Warehouse.
bcp utility
https://aka.ms/Px5ae0
Use the following two-stage process to import data from SQL Server to SQL Data Warehouse using bcp:
1. Export the data from SQL Server to a flat file using bcp.
2. Import the data from the flat file to SQL Data Warehouse using bcp.
The following code example illustrates the syntax for exporting data from SQL Server to flat file
(DimProduct.txt) using bcp:
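A sketch of the export command (the database, server, and file path are assumptions; -T uses a trusted
connection, -c character format, and -t sets the field terminator):
bcp AdventureWorksDW.dbo.DimProduct out "C:\Data\DimProduct.txt" -S localhost -T -c -t ","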
To import the data from the text file to SQL Data Warehouse, you use a command similar to that shown
here:
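For example (the SQL Data Warehouse server name, database, and credentials are placeholders):

bcp dbo.DimProduct in DimProduct.txt -S <servername>.database.windows.net -d <databasename> -U <username> -P <password> -q -c -t ','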
Note that, while loading into a columnstore index table, you can execute multiple concurrent bulk loads
from separate flat files using bcp.
The source and destination for AzCopy can be a local folder, or the name of a container in Blob storage (if
you are uploading files to Blob storage, the container will be created if it doesn’t already exist). The
options available enable you to specify the storage keys to use, in addition to filename-matching patterns,
data delimiters, and other attributes. The following example uploads CSV files from a local folder named
StockData to a container called stocknames in Blob storage:
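A representative command (the storage account name and key are placeholders) is:

AzCopy /Source:C:\StockData /Dest:https://<storageaccount>.blob.core.windows.net/stocknames /DestKey:<storage-account-key> /Pattern:*.csv /S

The /Pattern option restricts the upload to CSV files, and /S includes files in subfolders.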
When the file is in Blob storage, you can use PolyBase to transfer the data from this file into the SQL Data
Warehouse. The process is described later in this lesson.
Use the following steps to create an import job—to transfer data from on-premises to the Azure datacenter:
1. Identify the data you need and the hard disk drives that are required.
2. Identify the destination Blob storage location within your Azure account.
3. Use the WAImportExport tool to transfer your data to one or more hard disk drives and encrypt them
using BitLocker.
4. Create an import job using the Azure portal within your destination Blob storage account.
5. Provide the carrier details and return address for Microsoft to ship the drives back to you.
6. Ship the hard disk drives to the shipping address provided during the import job creation.
7. Update the tracking information in the import job and submit the job.
Hard disk drives are received at the destination Azure datacenter and processed as per the instructions. When the
data transfer is successful, the hard disk drives are shipped back to the customer.
Use the following steps to create an export job—to transfer data from Azure back to the customer:
1. Identify the data that needs to be exported and the number of hard disk drives that are required.
2. Identify the source Blob storage location within your Azure account.
3. Create an export job using the Azure portal within your source Blob storage account.
4. Provide the carrier details and return address for Microsoft to ship the drives back to you.
5. Ship the empty hard disk drives to the shipping address provided during the export job creation.
6. Update the tracking information in the export job and submit the job.
Hard disk drives received at the destination Azure datacenter are processed as per the instructions. The
drives are encrypted using BitLocker and the keys are uploaded to the Azure portal. When the data
transfer is successful, the hard disk drives are shipped back to the customer.
Use the Microsoft Azure import/export service to transfer data to Blob storage
https://aka.ms/yeyh37
3. Create a scoped database credential using the following syntax to access the external data source—in
this case, Blob storage:
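For example (the credential name and identity string are illustrative, and the database must already have a master key):

-- A master key is required before a scoped credential can be created
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH
    IDENTITY = 'blobuser',                   -- any string; not used for authentication
    SECRET = '<azure-storage-account-key>';  -- the access key for the storage account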
4. Create an external data source in SQL Data Warehouse to point to the Blob storage. Note that
PolyBase uses the Hadoop API with a wasbs URL to access Blob storage:
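For example (the data source, container, and account names are illustrative):

CREATE EXTERNAL DATA SOURCE AzureBlobStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<storageaccount>.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);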
5. Define the external data file format within Blob storage. The following example shows how to define
a comma separated data format:
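A representative definition is:

CREATE EXTERNAL FILE FORMAT CommaSeparatedFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',
        USE_TYPE_DEFAULT = TRUE
    )
);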
6. Create the external table that references Blob storage. The following example assumes that the data
in Blob storage has four columns—id, firstName, lastName and zipCode. In this example, FileLocation
is the name of the CSV file held in the container in the Blob storage account:
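A representative definition (the table name and column sizes are illustrative) is:

CREATE EXTERNAL TABLE dbo.CustomerExternal
(
    id int,
    firstName varchar(50),
    lastName varchar(50),
    zipCode varchar(10)
)
WITH
(
    LOCATION = '/FileLocation',
    DATA_SOURCE = AzureBlobStorage,
    FILE_FORMAT = CommaSeparatedFormat,
    REJECT_TYPE = VALUE,    -- fail the load after a fixed number of rejected rows
    REJECT_VALUE = 100
);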
It’s important to note that PolyBase is strongly typed and any records that do not honor the schema
format will be rejected. You control the number or percentage of rows rejected by adjusting
REJECT_TYPE and REJECT_VALUE parameters before the load fails.
You can now access the data held in Blob storage by using the external table created in the SQL Data Warehouse.
At this point, the data is not physically copied from Blob storage to the SQL Data Warehouse but it is
accessible to the data warehouse through T-SQL SELECT commands.
Note: External tables defined using PolyBase do not support INSERT, UPDATE, or DELETE
operations.
The next logical step is to use this external table to load the data into a table within the data
warehouse. There are two ways to do this in T-SQL:
SELECT…INTO
CREATE TABLE AS SELECT (CTAS)
CTAS is more configurable than SELECT…INTO. The SELECT…INTO command uses the default
ROUND_ROBIN distribution method with a clustered columnstore index; you cannot change the
distribution method or index. With CTAS, you specify the distribution method to use and define the
indexes to create.
The following example uses CTAS to create a ROUND_ROBIN distributed table that has a clustered
columnstore index from the external table pointing to the Blob storage:
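A representative statement, using the illustrative external table defined earlier, is:

CREATE TABLE dbo.Customer
WITH
(
    DISTRIBUTION = ROUND_ROBIN,
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.CustomerExternal;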
It’s important to note that, even though you use PolyBase to execute T-SQL queries over large volumes of
data (big data) in a nonrelational data store, it’s recommended to use PolyBase with CTAS to import the
data into the SQL Data Warehouse for running analytical queries.
PolyBase Guide
https://aka.ms/V7vd02
For more information about using SSIS with SQL Data Warehouse, see:
Load data from SQL Server into SQL Data Warehouse (SSIS)
https://aka.ms/Lyyhuk
The Azure SQL DW Upload Task is part of the Azure Feature Pack for SSIS. For more information about
the SSIS tasks that this pack contains, see:
Azure Feature Pack for Integration Services (SSIS)
https://aka.ms/Nck4p0
2. Create a database scoped credential for accessing the ADLS account. Note that the credential is a
combination of your client ID and an OAuth 2.0 token endpoint. You can obtain these items by
creating an Azure Active Directory application using the Azure portal:
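For example (the credential name and the placeholder values are illustrative):

CREATE DATABASE SCOPED CREDENTIAL ADLSCredential
WITH
    IDENTITY = '<client-id>@<OAuth-2.0-token-endpoint>',
    SECRET = '<azure-ad-application-key>';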
3. Create an external data source in SQL Data Warehouse to point to the Data Lake Store. As before,
note that PolyBase uses the Hadoop API to access ADLS, but this time using a Data Lake URL:
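For example (the data source and account names are illustrative):

CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH
(
    TYPE = HADOOP,
    LOCATION = 'adl://<datalakestorename>.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);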
4. Define the external data file format for the files within the Data Lake Store. The following example
specifies that data fields are separated by using the '|' character. The DATE_FORMAT parameter
indicates how date values are encoded in the data files:
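A representative definition is:

CREATE EXTERNAL FILE FORMAT PipeSeparatedFormat
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS
    (
        FIELD_TERMINATOR = '|',
        DATE_FORMAT = 'yyyy-MM-dd',
        USE_TYPE_DEFAULT = FALSE
    )
);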
5. Create the external table to access the Data Lake Store. The following code assumes that the Data
Lake Store has four columns in the CSV file—id, firstName, lastName and zipCode. PolyBase uses
recursive folder traversal to read all the files within the folder, in addition to the subfolders specified
in the LOCATION parameter:
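A representative definition (the table and folder names are illustrative) is:

CREATE EXTERNAL TABLE dbo.CustomerAdlsExternal
(
    id int,
    firstName varchar(50),
    lastName varchar(50),
    zipCode varchar(10)
)
WITH
(
    LOCATION = '/Customers/',    -- files in this folder and its subfolders are read recursively
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = PipeSeparatedFormat
);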
6. Use the external table to access the data within the Data Lake Store from the SQL Data Warehouse.
7. Copy the data from the Data Lake Store to SQL Data Warehouse via the external table using a CTAS
command. The following example creates a table that is hash distributed across the data warehouse:
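A representative statement is:

CREATE TABLE dbo.CustomerFromAdls
WITH
(
    DISTRIBUTION = HASH(id),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT * FROM dbo.CustomerAdlsExternal;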
Load data from Data Lake Store into SQL Data Warehouse
https://aka.ms/Ltxp1c
3. Specify the output for the job to the SQL Data Warehouse you have created. You need the following
details to configure this:
o Output alias—the name to identify the output.
o Database—the name of the SQL Data Warehouse database.
o Server name, username, and password—the connection details for the database server.
o Table—the name of the table that will receive the streamed data.
4. Create a streaming query that redirects the data from input stream to the output.
5. Run the job, and start the streaming source that provides the input. The data will appear in the data
warehouse as it is processed by the Stream Analytics job.
https://aka.ms/D2oudf
For better query performance, always reindex your tables after loading the data.
Question: Can you execute multiple concurrent bulk loads from separate files into a
columnstore index table using bcp?
In this phase of the project, you will consolidate the data storage for the traffic surveillance system by
using SQL Data Warehouse as a single data location for static, or rarely updated, information including
stolen vehicle data, vehicle owner data, and speed camera location data. You will also configure the traffic
surveillance system to use the same SQL Data Warehouse to hold dynamic data streamed live from the
speed cameras.
Objectives
After completing this lab, you will be able to:
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Lab Setup
Estimated Time: 90 minutes
Virtual machine: 20776A-LON-DEV
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
This lab uses the following resources from Lab 5 and earlier:
Resource group: CamerasRG
4. Explore the SQL Data Warehouse using SQL Server Management Studio
Task 4: Explore the SQL Data Warehouse using SQL Server Management Studio
Results: At the end of this exercise, you will have created a new database server, created a new data
warehouse, explored the data warehouse using SQL Server Management Studio, and used scaling with
the data warehouse.
2. Use SQL Server Management Studio to create data warehouse tables and indexes
Task 1: Design tables and indexes for a SQL Data Warehouse application
Task 2: Use SQL Server Management Studio to create data warehouse tables and
indexes
Results: At the end of this exercise, you will have designed tables and indexes for a data warehouse
application, and used SQL Server Management Studio to create the required data warehouse tables and
indexes.
You will first upload the stolen vehicle data to Data Lake Store (using AzCopy and AdlCopy), as a
temporary staging location. You will then create a local SQL Server database for holding vehicle owner
data, again as a staging location. You will then upload speed camera location data directly into the data
warehouse from a local CSV file using AzCopy, PolyBase, and CTAS (dropping the existing table first, and
using CTAS with the same options that were used to create the table in the first place). You will then
import the staged stolen vehicle data from Data Lake Store into SQL Data Warehouse—leaving the
existing table in place—and then use INSERT INTO to append data to the table. Finally, you will import
the staged vehicle/owner data from the on-premises SQL Server database by using an ADO.NET source
and destination with SQL Server Integration Services—again leaving the existing table in place.
1. Stage data in Data Lake Store prior to SQL Data Warehouse import
2. Stage data in an on-premises SQL Server database prior to SQL Data Warehouse import
3. Import data from a local CSV file into SQL Data Warehouse
4. Import data from Data Lake Store into SQL Data Warehouse
5. Import data from an on-premises SQL Server database into SQL Data Warehouse
Task 1: Stage data in Data Lake Store prior to SQL Data Warehouse import
Task 2: Stage data in an on-premises SQL Server database prior to SQL Data
Warehouse import
Task 3: Import data from a local CSV file into SQL Data Warehouse
Task 4: Import data from Data Lake Store into SQL Data Warehouse
Task 5: Import data from an on-premises SQL Server database into SQL Data
Warehouse
Staged data in Data Lake Store prior to SQL Data Warehouse import.
Staged data in an on-premises SQL Server database prior to SQL Data Warehouse import.
Imported data from a local CSV file directly into SQL Data Warehouse.
Imported data from Data Lake Store into SQL Data Warehouse.
Imported data from an on-premises SQL Server database into SQL Data Warehouse.
4. Lab cleanup
Task 1: Configure an Azure Stream Analytics job to output to SQL Data Warehouse
Task 2: Configure a Visual Studio app to use the Stream Analytics job
Results: At the end of this exercise, you will have configured a Stream Analytics job to output to SQL Data
Warehouse, configured a Visual Studio app to use the Stream Analytics job, and viewed Stream Analytics
job data in SQL Data Warehouse.
Question: In the table design, why is it recommended that the vehicle speed data is hashed
by camera ID?
Question: What are two of the most important management options for SQL Data
Warehouse that help control your Azure costs?
Module 8
Performing Analytics with Azure SQL Data Warehouse
Contents:
Module Overview 8-1
Lesson 1: Querying data in SQL Data Warehouse 8-2
Module Overview
This module describes how to query data that is stored in Microsoft® Azure® SQL Data Warehouse and
how to secure this data. It also describes the various ways in which you monitor the SQL Data Warehouse
to maintain good performance.
Objectives
By the end of this module, you will be able to:
Describe the features of Transact-SQL that are available for use with SQL Data Warehouse.
Configure and monitor the SQL Data Warehouse to maintain optimal performance.
Describe how to protect data and manage security in a SQL Data Warehouse.
Lesson 1
Querying data in SQL Data Warehouse
SQL Data Warehouse uses a subset of the Transact-SQL (T-SQL) language to perform database operations.
This lesson summarizes the common features of T-SQL that are and are not available for use with SQL
Data Warehouse. This lesson also explains how SQL Data Warehouse works with machine learning, and
how you generate Power BI reports using the information stored in SQL Data Warehouse.
Lesson Objectives
By the end of this lesson, you should be able to:
List the features and limitations of T-SQL within SQL Data Warehouse.
List the features of views that are not available within SQL Data Warehouse.
SQL Data Warehouse does not support a number of SQL Server table features, including:
Unique indexes
Sparse columns
User-defined types
Sequences
Synonyms
SQL Data Warehouse supports the following T-SQL language elements:
Data types including bigint, numeric, bit, smallint, decimal, int, float, date, char, and varchar. SQL Data
Warehouse does not currently support blob types, including varchar(max) and nvarchar(max).
Control flow elements such as BEGIN...END, BREAK, IF…ELSE, THROW, TRY…CATCH, and WHILE.
Operators such as Add (+), Subtraction (-), Multiply (*), and Divide (/).
Wildcard characters.
For the full list of SQL Data Warehouse language elements, see:
Language elements
https://aka.ms/xuijkv
For more information on supported T-SQL constructs in SQL Data Warehouse, see:
Transact-SQL topics
https://aka.ms/fumw8l
For more information on catalog views, and dynamic management views (DMVs), see:
System views
https://aka.ms/ifb8qm
SQL Data Warehouse does not support the following GROUP BY options; you emulate them by combining
the results of several queries with UNION ALL:
ROLLUP
GROUPING SETS
CUBE
To achieve ROLLUP functionality across Country and Region, you might use the following three SQL
statements:
A SQL statement that produces the aggregated data at Country and Region.
A SQL statement that produces the aggregated data at Country.
A SQL statement that produces the grand total across all countries.
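A sketch of this approach, assuming an illustrative dbo.SalesData table with Country, Region, and Sales columns, is:

SELECT Country, Region, SUM(Sales) AS TotalSales
FROM dbo.SalesData
GROUP BY Country, Region
UNION ALL
SELECT Country, NULL, SUM(Sales)    -- subtotal per country
FROM dbo.SalesData
GROUP BY Country
UNION ALL
SELECT NULL, NULL, SUM(Sales)       -- grand total
FROM dbo.SalesData;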
For more information on implementing queries that emulate these GROUP BY operations, see:
You can’t create updateable views—that is, underlying base tables can’t be updated through views.
Performing transactions
SQL Data Warehouse supports transactions for
data workloads but is limited when compared to
SQL Server. SQL Data Warehouse uses ACID
(Atomicity, Consistency, Isolation and Durability)
transactions—the isolation level is limited to
READ UNCOMMITTED and can’t be modified.
THROW and RAISERROR are supported in SQL Data Warehouse but you should note the following:
User-defined error message numbers using THROW can’t be in the range between 100,000 and
150,000.
Programmatic constructs
Using variables
You create variables in SQL Data Warehouse by using a DECLARE or SET statement—you should note the following points:
A single DECLARE statement can declare and initialize multiple variables.
A SET statement assigns a value to one variable at a time.
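For example (the variable names and values are illustrative):

DECLARE @rowLimit int = 100, @cityName varchar(20) = 'London';  -- one DECLARE, two variables
SET @rowLimit = @rowLimit * 2;                                  -- SET assigns one variable at a time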
Using IF...ELSE
You use the IF…ELSE construct in the same way as you would in SQL Server—as a conditional construct,
with the IF keyword followed by a Boolean expression. Based on the outcome of the Boolean expression,
the IF clause executes when the condition is true; the ELSE clause executes when the condition is false.
The IF…ELSE construct can be nested; the depth of nesting is limited only by available memory. The syntax for the IF…ELSE clause
is as follows:
IF Boolean_expression
{ sql statement | statement block }
[ ELSE
{ sql statement | statement block } ]
The WHILE construct repeatedly executes a statement or statement block for as long as the Boolean
expression remains true; BREAK exits the loop. The syntax for the WHILE loop is as follows:
WHILE boolean_expression
{ sql statement | statement block | BREAK }
Dynamic SQL
Dynamic SQL helps to make your code more generic, readable, and flexible. However, SQL Data Warehouse
doesn't support blob data types, so the varchar(max) and nvarchar(max) data types aren't available. Therefore,
to build a long SQL statement dynamically, you need to break the code into multiple shorter strings,
then concatenate them and use the EXEC command to execute the combined SQL statement.
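A minimal sketch of this technique (the table and column names are illustrative) is:

DECLARE @sql1 varchar(8000) = 'SELECT CameraID, COUNT(*) AS Readings ';
DECLARE @sql2 varchar(8000) = 'FROM dbo.VehicleSpeed GROUP BY CameraID';

-- Concatenate the fragments and execute the combined statement
EXEC (@sql1 + @sql2);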
Stored procedures
Stored procedures provide a great way to modularize many lines of code, helping developers to organize
code into logical chunks. Stored procedures can also take parameters as input, making them more flexible,
and they promote code reuse, saving you from developing the same code repeatedly.
Stored procedures in SQL Data Warehouse behave like stored procedures in SQL Server. However, there
are some subtle differences:
Stored procedures in SQL Data Warehouse are not precompiled (unlike SQL Server, where all stored
procedures are precompiled).
SQL Data Warehouse supports nesting up to eight levels whereas SQL Server supports up to 32 levels.
SQL Data Warehouse doesn’t support @@NESTLEVEL, so you must keep a manual count of the
nested levels.
It’s important to be aware of certain features in SQL Server that are not currently implemented in SQL
Data Warehouse. They are as follows:
Execution contexts.
The RETURN statement.
Username and password: these are the credentials you need to connect to the SQL Data
Warehouse.
Database query: this is the query that should be run to fetch the data. This can be any valid T-SQL
SELECT statement supported by SQL Data Warehouse.
You also use SQL Data Warehouse as the repository for results from a machine learning model. To do this,
add an Export Data module to your experiment. Connect this module to the Results dataset output of a
module that generates a dataset. Set the following properties of the Writer module:
Database server name: specify the name of the server hosting your SQL Data Warehouse
(<servername>.database.windows.net.).
Username and password: the credentials for connecting to the SQL Data Warehouse.
Comma separated list of columns to be saved: specify the fields in the results dataset that you wish
to write to the data warehouse.
Data table name: the name of the table in which to save the data in the data warehouse. You should
create this table first.
Comma separated list of datatable columns: the columns to populate in the table (the number
and type of columns must match those in the list of columns to be saved).
Number of rows written per SQL Azure operation: the batch size of write operations to the data
warehouse.
Note: If you try to export data into a table that does not already exist, the Export Data
module will attempt to create it for you. However, the Export Data module will try to implement
a unique constraint that is not supported by SQL Data Warehouse, and the export operation will
fail.
Using this technique, Power BI uses DirectQuery to communicate with the SQL Data Warehouse. For every
operation performed in Power BI, a query is sent to SQL Data Warehouse in real time, utilizing
the power of SQL Data Warehouse. The data is aggregated in SQL Data Warehouse and sent back to
Power BI for display to the user.
To connect to the SQL Data Warehouse, you need to provide the following details:
Server name: this is the name of the SQL Data Warehouse server (fully qualified name).
When Power BI Service is connected and linked to the SQL Data Warehouse, you select the tables and
columns you require to create the visualizations that form part of reports or dashboards. All of the queries
sent back to the SQL Data Warehouse are DirectQuery based, utilizing the full potential of SQL Data
Warehouse.
1. In Power BI Desktop, click Get Data, and then select Azure SQL Data Warehouse.
2. You need to provide the following information to connect to SQL Data Warehouse:
o Server: this is the name of the server where the SQL Data Warehouse is hosted.
o Data connectivity mode: this is either Import or DirectQuery. Choose Import if the data
required is a small subset, because you are limited by the processing and storage capabilities of
the PC where you are downloading the data. It’s advisable to choose DirectQuery to utilize the
full computing power of SQL Data Warehouse.
Click OK.
3. Provide user authentication details like Username and Password to connect to SQL Data Warehouse.
4. When the connection is successful, you can perform data analytics on the data stored within SQL Data
Warehouse.
Utilize data held in SQL Data Warehouse in a Machine Learning predictive model.
Lesson 2
Maintaining performance
This lesson describes how to configure and monitor the performance of a SQL Data Warehouse. This
lesson also gives details of best practices for maintaining performance.
Lesson Objectives
By the end of this lesson, you should be able to:
Understand how to update database statistics, and rebuild and optimize indexes.
Describe the various monitoring tools that are available in the Azure portal.
Updating statistics
Database statistics play a vital role in optimizing
user queries. It’s important that table statistics are
kept up to date because, generally, the optimizer
produces the most optimal plans under these
conditions. You collect statistics on a single
column or a set of columns or indexes. It's advisable to create or update statistics after a data load, or at
least once each day. If the statistics on a table are stale, the execution plan produced by the optimizer for
a given query might not be optimal.
The following explains the various Transact-SQL (T-SQL) statements you use to collect statistics, update
statistics, and so on:
Use the following syntax to create statistics on a single column of a table using default options. By
default, SQL Data Warehouse uses a 20 percent sample size when creating these statistics:
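For example (the table, column, and statistics names are illustrative):

CREATE STATISTICS stat_CameraID
ON dbo.VehicleSpeed (CameraID);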
If you want to specify the sample size instead of using the default 20 percent, use the following
syntax:
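For example, to sample half of the rows:

CREATE STATISTICS stat_CameraID
ON dbo.VehicleSpeed (CameraID)
WITH SAMPLE 50 PERCENT;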
Managing workloads
SQL Data Warehouse is designed to deliver
predictable performance when you scale up or
down. SQL Data Warehouse also has features
that control concurrency and resource allocation
(CPU and memory).
A SQL Data Warehouse supports up to 1,024
connections at the same time. These 1,024
connections submit queries concurrently but
queries might be queued based on key
parameters like concurrent queries and
concurrency slots.
Concurrency slots: for every 100 DWUs, four concurrency slots are allocated—for example, 1,000 DWUs
mean 40 concurrency slots. Each query requires a set number of concurrency slots based on the resource
class of the query. Queries executed in the smaller resource class only require one concurrency slot,
whereas queries executed in the higher resource class require more concurrency slots. There are four
resource classes:
smallrc
mediumrc
largerc
xlargerc
The following table summarizes the number of concurrent queries that are available, in addition to the
number of concurrency slots allocated for a given number of DWUs:
DWU      Max concurrent   Max concurrency   Slots used per query, by resource class
         queries          slots             smallrc   mediumrc   largerc   xlargerc
DW100    4                4                 1         1          2         4
DW200    8                8                 1         2          4         8
DW300    12               12                1         2          4         8
DW400    16               16                1         4          8         16
DW500    20               20                1         4          8         16
DW600    24               24                1         4          8         16
DW1000   32               40                1         8          16        32
DW1200   32               48                1         8          16        32
DW1500   32               60                1         8          16        32
DW2000   32               80                1         16         32        64
DW3000   32               120               1         16         32        64
Depending on the kind of queries you are running, you can run many more smallrc queries concurrently,
because they consume fewer concurrency slots than largerc or xlargerc queries for the same number of
allocated DWUs.
Memory allocation: the resource class of a query dictates the amount of memory allocated for a query.
Because memory is a fixed resource, it’s important to understand how a SQL Data Warehouse provisions
memory based on the resource class. The following table summarizes the amount of memory allocated in
MB, based on the resource class:
The preceding table shows details of every distribution within the SQL Data Warehouse. There are 60
distributions for a SQL Data Warehouse instance. Therefore, each instance of allocated memory shown in
the table must be multiplied by 60 to understand the overall memory that is allocated for the SQL Data
Warehouse.
It’s important to note that not all queries are constrained by concurrency limits. Queries that access only
the metadata—like queries that access dynamic management views or catalog views—wouldn’t fall under
the concurrency limits.
The following table shows how the number of compute nodes, and the number of distributions per node,
varies with the DWU setting:
DWU    Compute nodes   Distributions per node
100    1               60
200    2               30
300    3               20
400    4               15
500    5               12
600    6               10
1000   10              6
1200   12              5
1500   15              4
2000   20              3
3000   30              2
6000   60              1
It’s important to note that the total number of distributions remains at 60, irrespective of the DWU size.
Resume compute: when the SQL Data Warehouse is required, you resume compute. Resuming compute
allocates the required CPU and memory based on the allocated DWUs for that instance. It also makes the
data available for end-user queries. You resume compute using Azure portal, PowerShell or REST APIs.
Scale compute: you scale up or down, according to user requirements. To improve performance and
execute queries faster, you scale up the compute power to have more CPU and memory resources
allocated to your instance. Similarly, when you do not require the capacity, you scale down the compute.
It’s vital to remember that either scaling up or down will terminate all incoming queries until the system is
reconfigured accordingly. You scale compute by using Azure portal, PowerShell, REST APIs or T-SQL.
Rebuilding indexes provides an efficient way to improve performance by optimizing rowgroups. You
should ensure that you rebuild indexes as a user who has a large enough resource class. The
recommended minimum resource classes, based on DWUs, are as follows:
To rebuild an entire clustered columnstore index, you execute the following statement:
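For example (the table name is illustrative):

ALTER INDEX ALL ON dbo.FactSales REBUILD;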
Monitoring tools
SQL Data Warehouse comes packed with many
tools and techniques that you use to monitor
running queries, queued queries, and so on.
Dynamic management views (DMVs) provide
much of this functionality with database views
that encapsulate the logic of joining different
logging and auditing tables.
Monitoring executing queries: every query to the SQL Data Warehouse is logged in
sys.dm_pdw_exec_requests—this contains the previous 10,000 queries that have been executed.
Request_id is unique for each query executed. The following query retrieves all the previous 10,000
queries, apart from the one that is currently running this query:
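A representative query is:

SELECT request_id, session_id, status, command, submit_time, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE session_id <> SESSION_ID()    -- exclude the session running this query
ORDER BY submit_time DESC;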
Monitoring queued queries: if a query is waiting for a resource or resources, it will be in the state
“acquire resources”. You will find this information in the DMV sys.dm_pdw_waits.
Monitoring data movement and tempdb usage: queries that join or compare data held in different
nodes will need to move that data around as part of the join or comparison operations. This is expensive.
Additionally, these queries are likely to construct temporary tables on each node to hold the data
retrieved from other nodes. If you are performing a large number of queries, this might cause contention
in tempdb on each node. You query the sys.dm_pdw_nodes_db_session_space_usage,
sys.dm_pdw_nodes_exec_sessions, and sys.dm_pdw_nodes_exec_connections DMVs to monitor how queries
are being performed, or use the monitoring feature in the Azure portal that presents this data in a more
convenient manner.
Using labels in SQL Data Warehouse: when you have many queries running on the SQL Data
Warehouse, it can be tedious to work out which query belongs to which project or ETL routine. SQL Data
Warehouse provides a concept named "query labels"—you tag each query that you execute with an
optional string, such as OPTION (LABEL = 'Project 1: ETL Routine 1: Step 1').
This makes it much easier to find where a particular long-running query belongs and to pinpoint the
problem location.
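For example, the following (illustrative) query is labeled and then located through sys.dm_pdw_exec_requests:

SELECT CameraID, AVG(Speed) AS AverageSpeed
FROM dbo.VehicleSpeed
GROUP BY CameraID
OPTION (LABEL = 'Project 1: ETL Routine 1: Step 1');

SELECT request_id, status, total_elapsed_time
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'Project 1: ETL Routine 1: Step 1';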
Using the Azure portal: use the data warehouse blade in the Azure portal to monitor and analyze the
performance of operations. Click Monitoring, and you will see the volume of recent activity. An important
part of this blade is the DWU Usage chart, which enables you to ascertain whether you are close to the
resource limits currently allocated and indicates that you should consider scaling by adding more DWUs.
You tailor the graph by clicking it to see additional information. You also configure alerts to inform an
operator if resources are close to their limit; the operator then decides whether or not to scale the data
warehouse.
The Monitoring blade also shows the workload imposed in the system by query activity. If you click this
graph, you see the individual queries that have been performed recently. You drill down into the details to
see the query execution plans and use this information to assess whether to rewrite queries, add or
remove indexes, or completely restructure a table by changing its distribution policy.
Before you use the pause or scale functionality, it’s important to make sure that there are no major
transactions in progress within the data warehouse, because this might affect the time it takes to
pause or scale. Always complete your transactions before starting the pause or scale procedures.
Always maintain up-to-date database statistics with the entities of SQL Data Warehouse. An older
version of the statistics will affect performance and queries will take longer to execute. Always update
the statistics after a big data load.
Always use bulk insert instead of individual insert statements because this will improve load
performance.
You use SQL Data Warehouse to load data using methods like bcp, Azure Data Factory or PolyBase—
it’s best to use PolyBase for very large datasets.
By default, tables are distributed by using the round robin method. However, you use a hash
distribution over a selected column to ensure that your data is organized by that column instead. This
arrangement helps to make join operations over that column with other tables faster.
Don't overuse partitions. Remember that there are 60 distributions for each SQL Data Warehouse
instance—creating partitions creates those partitions on each individual distribution.
When queries require more memory to execute, you should always use the higher resource class to
maximize performance.
Lesson 3
Protecting data in SQL Data Warehouse
Authentication, authorization, encryption and auditing are the key pillars for securing data stored in any
system. SQL Data Warehouse provides many ways in which to secure stored data and protect it from
malicious access.
Lesson Objectives
By the end of this lesson, you should be able to:
Create users and understand how firewall rules need to be configured to give them access.
When the user login is created, you create a user based on that login. Use the following syntax to create
an AppUser user for the login you created earlier:
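For example (assuming a login named AppLogin was created earlier on the master database):

CREATE USER AppUser FOR LOGIN AppLogin;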
To configure Azure AD to authenticate users for SQL Data Warehouse, use the following steps:
1. Create a new directory using Azure AD, and add users that require access to SQL Data Warehouse.
Authorizing users
Authorization centers around what a user can do
within the SQL Data Warehouse, including the
objects they see, what data they see, and so on.
This is managed by using the following roles and
permissions:
The following code shows how to enable a user named MyUser to run SELECT statements over
MyTable:
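A representative statement is:

GRANT SELECT ON dbo.MyTable TO MyUser;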
You can also use the REVOKE command to remove privileges from a user.
Database roles: SQL Data Warehouse provides many built-in database roles, which you use to grant users
entire levels of access. Typical database roles include db_datareader and db_datawriter. A database role
has an associated set of privileges. For example, the db_owner role has complete access to the contents of
a database, whereas the db_datareader role reads data from any table in the database but does not
modify it. You can also create custom, user-defined database roles.
You use the sp_addrole stored procedure to define a new role, and sp_addrolemember to assign a user to
a role. The sp_droprolemember stored procedure removes a user from a role, revoking the role's privileges
from that user.
The following example creates a new role named query_role that encapsulates SELECT privilege over three
tables (MyTable, MyTable1, and MyTable2). The code then grants this role to a user named MyUser2:
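A sketch of this example, using the stored procedures described above, is:

EXEC sp_addrole 'query_role';

GRANT SELECT ON dbo.MyTable TO query_role;
GRANT SELECT ON dbo.MyTable1 TO query_role;
GRANT SELECT ON dbo.MyTable2 TO query_role;

EXEC sp_addrolemember 'query_role', 'MyUser2';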
Stored procedures: stored procedures provide another way to control access to users’ actions in SQL
Data Warehouse. You can hide tables behind the code in a stored procedure. You only need to grant
EXECUTE permission on the stored procedure to a user; the user does not need to have direct access to
any tables used by the stored procedure.
Encrypting data
SQL Data Warehouse uses Transparent Data
Encryption (TDE) for encrypting and decrypting
data at rest in real time. All data, including
backups and transaction log files, is encrypted.
TDE uses a symmetric key (database encryption
key) to encrypt the whole database, and each
instance of SQL Data Warehouse has a unique
server certificate that protects the database
encryption key. These certificates are rotated
automatically by Microsoft at least every 90 days,
to keep the data safe and secure from malicious
activities. SQL Data Warehouse implements the
AES-256 encryption algorithm to encrypt the data. You use the Azure portal or T-SQL to encrypt data in
SQL Data Warehouse.
1. Open the selected SQL Data Warehouse instance in the Azure portal.
2. Click Transparent data encryption.
3. Set Data encryption to On.
4. Click Save.
To disable encryption using the Azure portal, follow the same steps but select the Off setting.
Using T-SQL
To enable encryption using T-SQL, the user should be an administrator or have a “dbmanager” role in the
database. You should connect to the master database within the SQL Data Warehouse instance and use
the following T-SQL syntax to enable encryption:
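For example (the data warehouse name is illustrative):

ALTER DATABASE MyDataWarehouse SET ENCRYPTION ON;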
To disable encryption using T-SQL, the user should be an administrator or have a “dbmanager” role in the
database. You should connect to the master database within the SQL Data Warehouse instance and use
the following T-SQL syntax to disable encryption:
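For example:

ALTER DATABASE MyDataWarehouse SET ENCRYPTION OFF;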
It’s important to enable auditing and threat detection so that you can monitor who is using the database,
in addition to identifying potential threats that might cause a financial loss or a major disruption to
services.
For this phase of the project, you are going to use data output from Azure Stream Analytics into SQL Data
Warehouse to produce visually-rich reports from the stored data, as a first step in the identification of any
patterns and trends in vehicle speeds at each camera location. You are then going to use Machine
Learning against the data in SQL Data Warehouse to try to predict speeds at a given camera location at a
given time; this information could be used to deploy patrol cars to potential hotspots ahead of time. Next,
you want to be able to query the data in SQL Data Warehouse to find information such as the registration
numbers of all cars that have never been caught speeding. Because you will be performing such queries at
regular intervals, and the datasets are very large, it’s important that the data is properly structured to
optimize the query processes. Finally, because a lot of the traffic data stored in SQL Data Warehouse
includes personal and confidential details, it’s essential that the databases are protected from both
accidental and malicious threats. You will, therefore, configure SQL Data Warehouse auditing and look at
how to protect against various threats.
Objectives
After completing this lab, you will be able to:
Assess SQL Data Warehouse query performance and optimize database configuration.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Lab Setup
Estimated Time: 90 minutes
Virtual machine: 20776A-LON-DEV
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
This lab uses the following resources from Lab 7 and earlier:
Note: In this exercise, you will upload the speed data from a CSV file rather than streaming it from the
cameras; the CSV file contains data for a longer period of time than is feasible by using streaming.
1. Install AzCopy
Results: At the end of this exercise, you will have uploaded data to Blob storage, and then used PolyBase
to transfer this blob data to SQL Data Warehouse. You will then use Power BI to visualize this data, and
look for patterns in the data.
1. Create experiment
Results: At the end of this exercise, you will have created a trained model, deployed this model as a web
service, and then used this service in an application to generate traffic speed predictions for particular
camera locations, at particular times of day, and for particular days of the week.
Task 5: Assess query performance when distributing linked data to the same node
Results: At the end of this exercise, you will have run a series of queries against SQL Data Warehouse,
assessed how each query is processed, and reconfigured the data structure several times to see the impact
on the query process.
5. Lab cleanup
Results: At the end of this exercise, you will have enabled auditing, and used a sample application that
includes a SQL injection vulnerability to attempt to attack the data warehouse. You will also have
examined the audit logs, including identifying login failures.
Question: If you have an example big data application in your own organization, can you list
any strategies that are currently used to minimize data movement?
Question: Based on your own experience, and your organization’s data structures, can you
think of scenarios where you would not enable auditing and/or threat detection?
Use the features of Transact-SQL that are available for SQL Data Warehouse.
Module 9
Automating Data Flow with Azure Data Factory
Contents:
Module Overview 9-1
Lesson 1: Introduction to Data Factory 9-2
Lab: Automating the Data Flow with Azure Data Factory 9-38
Module Overview
Microsoft® Azure® Data Factory is a data orchestration service that ties together many different
compute and storage services to help you build powerful data pipelines. This orchestration service builds
on top of individual data analytics and storage components by integrating each into an overall data flow
application. Data Factory integrates with many different storage services and orchestrates transformations
on many compute services, including Azure Batch, Azure Data Lake Analytics, HDInsight® (Hadoop), and
Azure Machine Learning.
Objectives
By the end of this module, you will be able to:
Describe how to create Data Factory pipelines that transfer data efficiently.
Describe how to monitor Data Factory pipelines and how to protect the data flowing through these
pipelines.
Lesson 1
Introduction to Data Factory
Enterprise data continues to grow with regards to volume, velocity, and variety. It’s crucial for enterprises
to be able to integrate these disparate data sources and data types easily into one cohesive data driven
application. Data Factory is an orchestration tool that provides the integration between each of these
systems to build a set of robust and cohesive data integration units called pipelines. You use Data Factory
to automate and schedule all data movement and transformation activities, all from a single set of
interfaces, where you monitor and manage the pipelines as your data is flowing through the application.
Lesson Objectives
By the end of this lesson, you should be able to:
Understand what Data Factory workflows, datasets and linked services are.
Describe scheduling pipelines, controlling processing windows, and data slices.
Data Factory is automated by scheduling the execution of each copy or transformation activity in a
pipeline. The schedule is broken down into activity units known as time slices. These time slices provide a
recurring schedule for an activity to run. For example, you might schedule an activity to copy data from
an on-premises database to Azure Data Lake Store to run every hour.
A data factory brings together four key concepts:
Linked services
Datasets
Activities
Pipelines
Linked services. Linked services define the connection information that Data Factory needs to connect to
external resources, such as data stores and compute services—much like connection strings.
Datasets. Datasets are used to define the location and structure of a particular set of data. Data Factory
uses connection information from linked services to connect to a data source or data sink, and uses
information in the dataset definition to find the exact location of the data to be copied. Datasets are the
link between activities and are used to define dependencies in data availability for each activity. The
frequency of the time slice is also defined in the output dataset for each activity.
Activities. Activities define the processing that is performed on the data and require at least one input
and one output dataset. There are two types of activities—copy activities and transformation activities.
Copy activities are used to copy data from the location defined in the input dataset to the location
defined in the output dataset. Transformation activities are used to submit a script file or trigger a job to a
compute resource using the connection information in the compute linked service.
Pipelines. Pipelines, which can have one or more activities, are used to group similar activities into an
overall task. Grouping these activities into a set helps to simplify the data factory by managing the set at a
pipeline level instead of managing each task individually.
Data Management Gateway. The Data Management Gateway (DMG) is used to connect Data Factory
with on-premises data sources. You might install DMG on a machine in your network so it has access to
the on-premises datasets, and you connect DMG to your data factory using a uniquely generated key.
Each key concept of Data Factory is deployed by creating JSON files that define the object’s properties.
When deploying a data factory, linked services are deployed first, then datasets, then pipelines with one
or many activities defined. This order is required because of the dependency between the objects. For
example, you can’t create an Azure Blob dataset without having an Azure Blob linked service with the
connection information for the storage account. Similarly, you can’t create a pipeline with an activity to
copy data from Azure Blob to Data Lake Store before the input and output datasets are created.
The following JSON example defines an Azure Blob dataset:
{
"name": "WasbInput",
"properties": {
"type": "AzureBlob",
"linkedServiceName": "WasbLinkedService",
"typeProperties": {
"fileName": "input.txt",
"folderPath": "data/raw/",
"format": {
"type": "TextFormat",
"columnDelimiter": ","
}
},
"availability": {
"frequency": "Hour",
"interval": 1
},
"external": true,
"policy": {}
}
}
Many storage linked service types are available, including on-premises and cloud stores. However, some
data stores are only used as a source for a copy activity and some are only used as a sink. For example,
Data Lake Store datasets are used as a source and sink, MongoDB datasets are only used as a source and
Azure Search Index datasets are only used as a sink.
Structure
You use the structure of a Data Factory dataset to define the schema of the dataset. The structure is a
combination of field names and data types for the dataset. The dataset structure is optional, but it’s
necessary in two scenarios: mapping columns of the source dataset to columns in the sink dataset, and
converting data types of the source dataset to native data types of the sink dataset.
To define the structure of a dataset, add the structure property to the properties section of the dataset
definition in the JSON file. For example, the following structure has five columns: guid of type guid,
storeId of type String, productId of type int32, quantity of type int32, and price of type Decimal.
structure:
[
{ "name": "guid", "type": "Guid"},
{ "name": "storeId", "type": "String"},
{ "name": "productId", "type": "int32"},
{ "name": "quantity", "type": "int32"},
{ "name": "price", "type": "Decimal"}
]
Data Factory supports the following data types for use in defining dataset structure:
Int16, Int32, Int64, Single, Double, Decimal, Byte[], Bool, String, Guid, Datetime, Datetimeoffset, and
Timespan.
External datasets
External datasets in Data Factory are any datasets that are not included as the output of a data factory
activity. This means that the dataset is generated by an external source, and this dataset will be used as an
input dataset to the first activity in a pipeline. To mark a dataset as external, you set the external tag to
true in the properties section of the dataset JSON file.
Scoped datasets
You set Data Factory pipelines to run in one of two modes, either Scheduled or OneTime. Pipelines that
are set to OneTime mode are defined with scoped datasets, or datasets that are only scoped to that
specific pipeline. Therefore, these datasets can only be accessed by activities that are defined in the same
pipeline as the scoped datasets.
Dataset availability properties work with the scheduler property of an activity to determine which time
slices you should create for the activity.
Activity schedule
The scheduler property of an activity defines the duration of the time slices that you will create and the
frequency for the time slices that will be created. For example, if you create a copy activity with the
scheduler frequency set to Hour and the interval set to 4, your activity will run every four hours.
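For example, the scheduler property for such an activity might look like this in the pipeline JSON:

"scheduler": {
    "frequency": "Hour",
    "interval": 4
}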
The activity scheduler property is optional. However, when you define the property, it must match the
availability property of the output dataset (the dataset availability property has the same frequency and
interval properties).
The schedule of each activity is defined by the output dataset. For this reason, the schedule for the activity
must match the schedule of the output dataset. For example, you might define a Hive activity with a
frequency of Hour and an interval of 6. The output dataset must also have a frequency of Hour and an
interval of 6 to create the hour-long time slices every six hours.
The following table shows the possible dataset availability properties. For more examples of these
properties, see:
Dataset availability
https://aka.ms/Vqhvfv
Name Description
anchorDateTime States the absolute datetime position used by the activity and dataset
schedulers to create time slice boundaries.
You set Dataset validation policy properties to perform simple validation on the data before you begin an
activity. If the validation is successful, the slice status is changed to Ready. If the validation is unsuccessful,
the slice status is changed to Failed. The following table shows possible dataset policy properties:
To create the correct activity schedule that has the desired time slices, you must bring the pipeline start
and end times together with the activity and output dataset schedule.
For example, you want to run a copy activity from Azure Blob to Data Lake Store once daily for the entire
month of July 2017.
1. In the output dataset JSON file, create the availability property with a frequency of Day and an interval
of 1.
2. In the activity definition, in the pipeline JSON file, create the scheduler property with a frequency of
Day and an interval of 1 to reflect the schedule of the output dataset.
3. In the pipeline JSON file, set the start property to 2017-07-01T00:00:00Z and the end property to
2017-08-01T00:00:00Z.
This procedure will create daily time slices for each of the 31 days in the pipeline availability interval.
For example, you are copying reports from Blob storage to Data Lake Store once daily. However, you want
to build weekly aggregate reports of these daily reports. You might create an activity and output dataset
with a weekly frequency to process the aggregates, but the input dataset would have a daily frequency.
In the preceding example, the input is processed on a daily schedule. Because the activity frequency is
weekly, the activity would wait for seven input time slices to complete successfully before running the
single weekly time slice for all seven days of input data.
You create multiple activities in a pipeline that will run in parallel if the input of one activity is not
dependent on the output of another activity within the same pipeline.
You set time slices to run in parallel by setting the concurrency property in the policy section of the
activity in the pipeline JSON file. The concurrency setting states how many time slices are processed
simultaneously. The default value is 1 and the maximum value is 10. Running past time slices in parallel
speeds up the processing of the activity.
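For example, the following policy section (the values are illustrative) processes up to four time slices
simultaneously:

"policy": {
    "concurrency": 4,
    "executionPriorityOrder": "OldestFirst",
    "retry": 2,
    "timeout": "01:00:00"
}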
Installation options
There are two options to install the DMG. You might install directly from the Microsoft Download Center
and manually register with your data factory. The alternative is to follow the EXPRESS SETUP directly in the
Data Factory authoring tool in the Azure portal.
1. Browse to the Data Management Gateway download page in the Microsoft Download Center:
https://aka.ms/G4rtki
2. Click the Download button and select either the 32-bit or 64-bit version, in addition to the optional
release notes.
3. Run the downloaded installer and complete the setup wizard.
4. Provide a name for the data gateway, and then click OK.
Create Data Factory linked services for source and sink data stores.
Create a Data Factory pipeline with a copy activity to move the data.
Question: How might you use Data Factory to extract, transform and load data in to and out
of Azure within your organization?
Lesson 2
Transferring data
One of the strengths of using Data Factory to orchestrate your data workflows is that you utilize copy
activities to collocate all data in one place. You use copy activities to connect to and copy data from on-
premises and cloud data sources. The data is then integrated and transformed with transformation
activities to produce the data products that you need to publish. Usually, Data Factory copy activities are
used at the beginning of a data workflow to retrieve input data that is later transformed and at the end of
a data workflow to publish data products.
Lesson Objectives
By the end of this lesson, you should be able to:
Understand how to copy data to and from different data sources using Data Factory.
The following is a sample JSON file that defines a copy activity from the Blob store to the Data Lake Store:
{
"name": "ExampleCopyPipeline",
"properties": {
"description": "Copy data from Azure blob to Azure Data Lake Store",
"activities": [
{
"name": "CopyFromBlobToAdls",
"type": "Copy",
"inputs": [
{
"name": "BlobInputDataset"
}
],
"outputs": [
{
"name": "AdlsOutputDataset"
}
],
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "AzureDataLakeStoreSink"
},
"executionLocation": "East US 2"
},
"Policy": {
"concurrency": 1,
"executionPriorityOrder": "NewestFirst",
"retry": 0,
"timeout": "01:00:00"
}
}
],
"start": "2017-07-12T00:00:00Z",
"end": "2017-07-13T00:00:00Z"
}
}
Data Factory copy activities support datasets in the following file formats:
Text format
JSON format
AVRO format
ORC format
Parquet format
You can define different properties for each of these file formats that define how the Data Factory copy
activity interacts with the dataset.
Text format
Datasets defined as TextFormat refer to simple text files, like .txt, .csv, and .tsv files. You define the
following properties in the format section of the dataset to define how the text files are read or written.
columnDelimiter Defines the character used as the column separator in the file. Default
value is “,”.
rowDelimiter Defines the character used as the row separator in the file. Default value
is “\r”, “\n”, or “\r\n” on read and “\r\n” on write.
escapeChar Defines the character used to escape the column delimiter in the input
file.
quoteChar Defines the character used to quote string values in both input and
output datasets. The quoteChar property can’t be used with the
escapeChar property for a dataset.
nullValue Defines the character(s) used to represent a null value in the input and
output datasets. Default value is “\N” or “NULL” on read and “\N” on
write.
encodingName Defines the encoding of the text file. Default value is UTF-8.
firstRowAsHeader Defines whether Data Factory should treat the first line of a dataset as the
header row. Data Factory will read the first row as a header for input
datasets and will write the header as the first row for output dataset.
Default is false.
skipLineCount Defines the number of rows that Data Factory should skip when reading
in the dataset. If this property is used with the firstRowAsHeader
property, the number of lines provided is skipped, and the next line is
treated as the header.
treatEmptyAsNull Defines whether Data Factory should treat empty string values as null.
Default is true.
JSON format
Data Factory parses or writes JSON files based on the configuration properties listed in the format section
of the dataset JSON file.
filePattern Defines the pattern of the data stored in the JSON file. Optional values
are setOfObjects or arrayOfObjects. The default value is setOfObjects.
jsonNodeReference Defines the JSON path of an array to iterate and extract from when you
copy data from JSON files. To use fields under the root object of the
JSON file, begin the expression with $ as root.
jsonPathDefinition Defines the JSON path expression to use with the column mapping
feature in a Data Factory copy activity.
encodingName Defines the encoding of the text file. Default value is UTF-8.
nestingSeparator Defines the character that is used to separate the nesting levels of the
JSON document. Default value is ‘.’.
AVRO format
To use AVRO format in Data Factory, set the format type property to AvroFormat. There are no other
type properties to define for the AVRO format. Complex data types (unions, maps, arrays, enums, records,
and fixed) are not supported with the AVRO format.
ORC format
To use ORC format in Data Factory, set the format type property to OrcFormat. There are no other type
properties to define for the ORC format. Complex data types (union, list, map, struct) are not supported
with the ORC format. You use Data Factory to read uncompressed ORC files or ORC files compressed with
zlib or snappy. However, Data Factory only writes ORC files using the default compression zlib.
Parquet format
To use Parquet format in Data Factory, set the format type property to ParquetFormat. There are no other type properties to define for the Parquet format. Complex data types (list, map) are not supported with the Parquet format. You can use Data Factory to read uncompressed Parquet files, or Parquet files compressed with lzo, gzip, or snappy; however, Data Factory only writes Parquet files using snappy, which is the default compression.
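For these three formats, the format section therefore reduces to the type property alone; for example, for Parquet:
"format": {
    "type": "ParquetFormat"
}
Substituting AvroFormat or OrcFormat gives the equivalent definition for the other two formats.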
Compression options
Compressing large datasets helps with performance when copying and transforming data. You use the dataset compression property to read from and write to compressed files during a copy activity. The compression section of a dataset has two properties you can set:
Type: the type of compression to be used in the copy activity. Available compression types are ZipDeflate, BZip2, Deflate, and GZip.
Level: the level of compression to apply. Valid values are Optimal and Fastest.
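For example, a compression section that reads and writes GZip-compressed files with optimal compression might look like this sketch:
"compression": {
    "type": "GZip",
    "level": "Optimal"
}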
You can use Blob storage as a source or sink with the supported source and sink data stores in Data Factory.
For more information on supported source and sink data stores in Data Factory, see:
Supported data stores and formats
https://aka.ms/Hm2orx
Copy Activity supports reading data from block, append, or page blobs only, and supports writing
data to block blobs only.
Azure Premium Storage is currently not supported as a sink data store because it’s backed by page
blobs.
Copy Activity doesn’t delete source data after it is copied from the source data store to the sink data
store.
The following table summarizes the key properties required when you use Blob storage as a source or sink
data store:
(Table: property name, property description, valid values, mandatory (Yes or No), copy activity type.)
Service Principal Authentication. This is the recommended way to authenticate for executing the
scheduled copying of data.
User Credential Authentication (OAuth). You need to be aware that tokens might expire when you use User Credential Authentication.
The following table describes the key JSON properties for linked services, authentication, dataset, and
copy activities:
(Tables: for each service type, the property name, property description, valid values, and whether the property is mandatory (Yes or No).)
The SQL Data Warehouse connector for Data Factory only supports basic authentication. For the key JSON properties for linked services, datasets, and copy activities, together with JSON examples for copying data to and from SQL Data Warehouse, see:
JSON examples for copying data to and from SQL Data Warehouse
https://aka.ms/Qurtuo
The following JSON sample shows how to set the DMUs to 64 when copying data from Blob storage to
SQL Data Warehouse:
"activities":[
{
"name": "Example Copy Activity",
"description": "",
"type": "Copy",
"inputs": [{ "name": "SourceDataset" }],
"outputs": [{ "name": "DestinationDataset" }],
"typeProperties": {
"source": {
"type": "BlobSource",
},
"sink": {
"type": "SqlDWSink"
},
"cloudDataMovementUnits": 64
}
}
]
The parallelCopies parameter is another good way to increase throughput. It indicates the maximum number of threads that a copy activity uses to read data from the source and write data to the sink in parallel. Data Factory automatically decides the number of parallel copies based on the source data store, the sink data store, the load on the host machines, and so on. You can override this with a value between 1 and 32.
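As a sketch based on the previous sample, the following typeProperties fragment sets eight parallel copies for the same Blob storage to SQL Data Warehouse copy:
"typeProperties": {
    "source": { "type": "BlobSource" },
    "sink": { "type": "SqlDWSink" },
    "parallelCopies": 8
}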
The following steps explain how to use the Copy Data Wizard in Data Factory to copy data from source to
destination:
1. Open the Azure portal and go to the specific Data Factory resource.
2. Click the Copy data tile to invoke the Copy Data Wizard within Data Factory.
3. Provide a task name and description for this copy data task.
4. Specify the load schedule. The Data Factory Copy Data Wizard offers two options:
a. Run once now. This option is particularly useful if you need a one-off data copy from source to
destination.
b. Run regularly on schedule. Use this option to schedule the data load on a regular basis as more
data is available in the source system. Data Factory currently has many options to choose from
(hourly, daily, weekly, monthly, and so on). It’s important to note that you can’t select a frequency
of less than 15 minutes.
5. Select the source from where the data is being copied. The Data Factory Copy Data Wizard supports a
wide variety of source systems and provides an easy interface to supply source connection
information.
6. When the source type is selected and connection information is provided, the Copy Data Wizard automatically scans the sample data and its schema definition to give the user a preview. This helps users to verify that the correct source data has been chosen for copying to the destination.
7. If the source is a text file or set of text files, the Data Factory Copy Data Wizard provides options to
select the column delimiter, row delimiter, skip header rows, and so on.
8. Map the source schema to the destination schema. This is a very important step—the Copy Data
Wizard provides drop-down lists to map columns between source and destination.
You can also use the Data Factory Copy Data Wizard to filter data when only a subset of the data is required from the source system. This is particularly helpful for avoiding a copy of the full dataset when the project requires, for example, only data from the previous year.
9. The Data Factory Copy Data Wizard also provides some advanced settings that you tune, such as
setting the number of cloud units required to execute this task, the number of parallel copies to be
used, and so on.
10. When the required options are selected, the Data Factory Copy Data Wizard provides a summary of all the choices. When the user confirms and deploys, the Copy Data Wizard automatically generates the underlying JSON definitions based on the choices made in the wizard.
Hybrid data movement. Hybrid data movement helps you to copy data between on-premises data stores and the cloud. Copying large volumes of data over slow networks is typically time consuming, so it's advisable to compress the data before copying it from on-premises to the cloud, and to load it into an interim staging data store. You then decompress the data in the interim staging data store before you load it into the destination data store.
Firewall restrictions. Typically, it's difficult to open ports other than port 80 for HTTP or port 443 for HTTPS because of corporate IT policies. In such cases, it's advisable to stage data into Blob storage when you copy data from on-premises to the cloud. The Data Management Gateway (DMG) only requires access to port 80 or 443 to copy data from on-premises to the cloud, whereas loading data directly into a SQL Data Warehouse would require port 1433 to be open.
By using staged copy, you have a two-step process for loading data from a source data store to a destination data store. To enable staged copy, you use the enableStaging setting in the copy activity, and the stagingSettings section to provide the data factory with the information it needs to stage the data. The following sample JSON enables staged copy when loading data from an on-premises SQL Server instance to SQL Database:
"activities":[
{
"name": "Example Copy Activity",
"type": "Copy",
"inputs": [{ "name": "OnpremisesSQLServerInputDataStore" }],
"outputs": [{ "name": "AzureSQLDBDestinationDataStore" }],
"typeProperties": {
"source": {
"type": "SqlSource",
},
"sink": {
"type": "SqlSink"
},
"enableStaging": true,
"stagingSettings": {
"linkedServiceName": "StagingBlobStore",
"path": "stagingcontainer/stagingpath",
"enableCompression": true
}
}
}
]
Lesson 3
Transforming data
After the data is copied from the source data stores, the logical next step is to transform it to meet your requirements, so that it's ready to be loaded into the destination data store. This lesson looks at how data transformations are used within Data Factory and how to implement custom transformations. This lesson also considers how to transform data using Data Lake Analytics and SQL Data Warehouse.
Lesson Objectives
By the end of this lesson, you should be able to:
Explain how to transform data using Data Lake Analytics and SQL Data Warehouse.
Defining transformations
To help you to process and transform data, Data Factory provides two different configurations of compute environment: on-demand and bring your own (BYO).
Data Factory can automatically create an on-demand HDInsight cluster, based on Windows or Linux, to process and transform data. This cluster is created in the same region as the storage account associated with the job.
The following JSON linked service snippet configures an on-demand Windows-based HDInsight cluster of
size 1 that is used for running the Data Factory job:
{
"name": "SampleHDInsightOnDemandLinkedService",
"properties": {
"type": "HDInsightOnDemand",
"typeProperties": {
"version": "3.5",
"clusterSize": 1,
"timeToLive": "00:10:00",
"osType": "Windows",
"linkedServiceName": "AzureStorageLinkedService"
}
}
}
It's important to note that the osType property can be changed from Windows to Linux to create a Linux-based HDInsight cluster.
The key properties in the above JSON are the HDInsight version, the clusterSize, the timeToLive (the allowed idle time before the cluster is deleted), the osType, and the linkedServiceName of the storage account that the cluster uses. For more information on the advanced properties, see:
Advanced Properties
https://aka.ms/I6xm5c
Bring your own (BYO) compute environments that you can link to Data Factory include:
Azure HDInsight
Azure Batch
There are two ways to authenticate Data Factory to Data Lake Analytics:
Service Principal Authentication. This is the recommended authentication for Data Lake Analytics. You
will need the following key information:
Application ID
Application Key
Tenant ID
User Credential Authentication. To use user credential authentication, you click the Authorize button in
the Azure portal. However, you should be aware that, depending on the scenario, the tokens might expire
and need to be reauthorized.
Token expiration
https://aka.ms/Jp4g3n
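As a sketch, a Data Lake Analytics linked service that uses the recommended Service Principal Authentication might look like the following (the account name is hypothetical, and the placeholder values should be replaced with your Application ID, Application Key, and Tenant ID):
{
    "name": "AzureDataLakeAnalyticsLinkedService",
    "properties": {
        "type": "AzureDataLakeAnalytics",
        "typeProperties": {
            "accountName": "sampleadlaaccount",
            "dataLakeAnalyticsUri": "azuredatalakeanalytics.net",
            "servicePrincipalId": "<application ID>",
            "servicePrincipalKey": "<application key>",
            "tenant": "<tenant ID>"
        }
    }
}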
To create a Data Factory pipeline with a Stored Procedure Activity, create a new pipeline and use the
following JSON snippet to create a SQL Server Stored Procedure Activity that calls the stored procedure:
{
"name": "storedProcedureActivitySamplePipeline",
"properties": {
"activities": [
{
"type": "SqlServerStoredProcedure",
"typeProperties": {
"storedProcedureName": "sampleStoredProcedure",
"storedProcedureParameters": {
"DateTime": "$$Text.Format('{0:yyyy-MM-dd HH:mm:ss}',
SliceStart)"
}
},
"outputs": [
{
"name": "storedProcedureSampleOut"
}
],
"scheduler": {
"frequency": "Hour",
"interval": 4
},
"name": "storedProcedureActivitySample"
}
],
"start": "2017-09-02T00:00:00Z",
"end": "2017-09-02T05:00:00Z",
"isPaused": false
}
}
"SqlServerStoredProcedure" is set in the type property because the activity calls a SQL Server stored procedure.
The storedProcedureParameters property block contains a single input parameter, of type DateTime.
{
"name": "AzureMachineLearningLinkedService",
"properties": {
"type": "AzureML",
"typeProperties": {
"mlEndpoint": "https://[batch_scoring_endpoint]/jobs",
"apiKey": "<api_key>"
}
}
}
The Batch Execution Activity is used within Data Factory to invoke a Machine Learning web service from a Data Factory pipeline, to make predictions on the data in a batch process. You can use the Batch Execution Activity to invoke both training and scoring Machine Learning web services.
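A minimal sketch of a Batch Execution Activity that uses the linked service above might look like this (the input and output dataset names are hypothetical):
{
    "name": "MLBatchExecutionActivity",
    "type": "AzureMLBatchExecution",
    "linkedServiceName": "AzureMachineLearningLinkedService",
    "inputs": [{ "name": "ScoringInputDataset" }],
    "outputs": [{ "name": "ScoringOutputDataset" }],
    "typeProperties": {
        "webServiceInput": "ScoringInputDataset",
        "webServiceOutputs": {
            "output1": "ScoringOutputDataset"
        }
    }
}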
To create a custom .NET activity, you require Visual Studio 2012 or later, in addition to Azure .NET SDK.
1. Create a .NET Class Library project and add a new class (for example, SampleDotNetActivity) that implements the IDotNetActivity interface. The interface consists of one method, the Execute method, which requires the following four parameters:
o Linked services. The collection of linked services referenced by the activity's input and output datasets.
o Datasets. The collection of the activity's input and output datasets.
o Activity. An object representing the current activity and its properties.
o Logger. An object of class IActivityLogger that logs data for the Data Factory pipeline, for debugging purposes.
2. Implement the Execute method with the custom transformations that need to be performed using
.NET.
3. Make sure you’ve installed the NuGet Packages for Data Factory and Azure Storage because they’re
required for custom .NET activities.
4. To compile the project, click Build from the Visual Studio menu, and then click Build Solution.
5. After the build is successful, zip the files under bin\debug or bin\release and make sure the .pdb file is
included—this helps to debug issues when there’s a failure.
6. Create a blob container and upload the zip file as a blob to that container. Make sure the container is in a general-purpose storage account.
7. You can then use this custom activity in any Data Factory pipeline by specifying the activity type DotNetActivity.
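For illustration, a pipeline activity that runs the zipped custom activity might be defined as follows (the assembly, class, and container names are placeholders, and linkedServiceName refers to the HDInsight or Azure Batch compute that executes the activity):
{
    "name": "RunSampleDotNetActivity",
    "type": "DotNetActivity",
    "linkedServiceName": "HDInsightOnDemandLinkedService",
    "typeProperties": {
        "assemblyName": "SampleDotNetActivity.dll",
        "entryPoint": "SampleNamespace.SampleDotNetActivity",
        "packageLinkedService": "AzureStorageLinkedService",
        "packageFile": "customactivitycontainer/SampleDotNetActivity.zip"
    }
}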
Create and deploy an ML model ready for use with Data Factory.
Question: How will you use Data Factory with Machine Learning in your organization?
Lesson 4
Monitoring performance and protecting data
After creating a data pipeline and activities in Data Factory, it's important to understand how to check the status of pipelines and activities, how to pause and resume pipelines, and how to monitor their performance.
Lesson Objectives
By the end of this lesson, you should be able to:
Monitoring performance
Data Factory provides the tools you use to monitor performance and to understand the status of pipelines and activities that either have been executed or are currently executing. The Azure portal offers these capabilities in an intuitive graphical user interface (GUI) in a web browser. To check the status of pipelines and activities using the Azure portal, use the following steps:
1. Log in to the Azure portal.
2. Go to the data factory that contains the pipelines where you want to check the status.
3. Under Actions, click the Diagram icon. This shows the pictorial view of the pipeline.
4. Right-click the pipeline then click Open pipeline to open the pipeline and reveal the activities it
contains.
5. To check the status of the activity, double-click the dataset that is produced by that activity.
6. The dataset shows the summary of the results along with the data slices that are produced inside the
pipeline.
https://aka.ms/Oisghj
How to examine the activity log of a pipeline and rerun failed pipelines
Data Factory provides capabilities for debugging and troubleshooting pipelines by using the Azure portal and Azure PowerShell.
To find out what errors have occurred using the Azure portal, use the following steps:
1. Log in to the Azure portal.
2. Go to the data factory that contains the pipelines where you want to check the status.
3. Under Actions, click the Diagram icon. This shows the pictorial view of the pipeline.
4. Right-click the pipeline then click Open pipeline to open the pipeline and reveal the activities it
contains.
5. To check the status of the activity, double-click the dataset that is produced by that activity.
6. The dataset shows the summary of the results along with the data slices that are produced inside the
pipeline.
7. Click the data slice that has its status set to Failed, and then click the activity run that failed.
8. You can now view the error details and the log files associated with that error. You can also download
the error information.
After errors are analyzed and fixed, the pipeline can be rerun using the Run button on the command bar
of the data slice that failed.
The Monitoring and Management app provides the following views:
Resource explorer. This is a tree view of all the resources, shown in the left pane of the app.
Diagram view. This is a diagram view at the top, in the middle pane of the app.
Activity windows. The activity windows are present at the bottom, in the middle pane of the app.
Multiple tabs. The right pane of the app consists of the Properties, Activity Window Explorer, and Script tabs.
For more information about the Monitoring and Management app, see:
Monitor and manage Data Factory pipelines by using the Monitoring and Management app
https://aka.ms/Cbr8tq
Copy Activity supports three scenarios for detecting and logging incompatible records when fault
tolerance is enabled:
Incompatible data types between the source data store and the sink data store. If the data types
are different between the source data store and the sink data store, Data Factory rejects the record
and logs it as incompatible—those records are skipped.
Different number of columns between the source data store and the sink data store. If the number of columns in the source data store differs from the number of columns in the sink data store, such records are rejected and logged as incompatible—those records are skipped.
Primary key violation when writing to relational database. If the copy activity encounters a
primary key violation when writing to the sink data store, this is logged as “incompatible records”—
those records are skipped.
The following JSON enables fault tolerance in the copy activity by setting enableSkipIncompatibleRow
to true. The redirectIncompatibleRowSettings property provides the linked service to write the
incompatible records to Blob storage:
"typeProperties": {
"source": {
"type": "BlobSource"
},
"sink": {
"type": "SqlSink",
},
"enableSkipIncompatibleRow": true,
"redirectIncompatibleRowSettings": {
"linkedServiceName": "BlobStorage",
"path": "errorcontainer/output"
}
}
After the copy activity completes, the total number of skipped records appears in the monitoring section. Any incompatible records that have been logged can be found at the following location; the log contains both the incompatible record and the error information:
https://[your_blob_account].blob.core.windows.net/[path_if_configured]/[copy_activity
_run_id]/[auto_generated_GUID].csv
Handling security
Data Factory provides the robust security that you see throughout the various Azure services. Data movement using Data Factory is certified for the following:
HIPAA/HITECH
ISO/IEC 27001
ISO/IEC 27018
CSA STAR
The following are the key points to remember
regarding data security in Data Factory:
All credentials used to connect to data stores are encrypted using certificates managed by Microsoft.
These certificates are rotated every two years.
All data in transit is encrypted via HTTPS or TLS, provided the cloud data store supports them.
o Both SQL Data Warehouse and SQL Database support Transparent Data Encryption (TDE).
o Data Lake Store also encrypts data stored inside. It automatically encrypts data before storing
and decrypts it when it is retrieved.
Security can be further strengthened by using an IPSec VPN or ExpressRoute to transfer data between an on-premises network and Azure.
Configuring corporate firewalls and Windows firewalls further increases data security.
Many Azure services such as SQL Database, SQL Data Warehouse, and Data Lake Store, require the
whitelisting of IP addresses from which the services are accessed.
Question: How would you use the Azure portal Monitoring and Management app to monitor, debug, and manage Data Factory pipelines and activities within your organization?
For this final phase of the project, you are going to use Azure Data Factory to automate the management of data associated with the traffic surveillance system. You will use a Data Factory pipeline to automatically back up stolen vehicle data from one Data Lake Store to another. You will also use Data Factory pipelines to perform batch transformations—using Azure Data Lake Analytics to summarize speed camera data as the data is uploaded from one Data Lake Store to another, and using Azure ML to perform predictive analytics on speed data as it is uploaded from an Azure Storage blob. To ensure the reliability of your Data Factory pipelines, you are also going to test the monitoring and management capabilities provided with Azure Data Factory.
Objectives
After completing this lab, you will be able to:
Use Data Factory to back up data from an Azure Data Lake Store to a second ADLS store.
Note: The lab steps for this course change frequently due to updates to Microsoft Azure.
Microsoft Learning updates the lab steps frequently, so they are not available in this manual. Your
instructor will provide you with the lab documentation.
Lab Setup
Estimated Time: 75 minutes
Username: ADATUM\AdatumAdmin
Password: Pa55w.rd
Exercise 1: Use Data Factory to back up data from an Azure Data Lake
Store to a second ADLS store
Scenario
You’re using Azure Data Factory to automate the management of data associated with the traffic
surveillance system, and will use a Data Factory pipeline to automatically back up stolen vehicle data from
one Data Lake Store to another.
In this exercise, you will check the data in the original ADLS account, and then create a second ADLS
account which will host the backup location. After creating a new Data Factory, you will create and
configure a service principal so that Data Lake Store authorization can occur during the execution of a
Data Factory pipeline. You will then assign permissions to the Service Principal to enable data copying,
and use the Data Factory Copy Wizard to back up the data from one ADLS account to another.
Created a service principal to enable Data Lake Store authorization in a Data Factory pipeline.
In this exercise, you will upload a CSV file containing camera speed data to ADLS ready for processing in
ADLA by using a U-SQL script. To authorize ADLA to process this data in a Data Factory pipeline, you will
add the Service Principal from Exercise 1 as a Contributor to the ADLA account. You will then create Data
Factory linked services for Azure Data Lake Analytics, for Data Lake Store (for the input and output
datasets), and for Azure Storage (as the scripts location for the U-SQL script). You will then create Data
Lake Store input and output datasets, create a script for the Data Lake Analytics U-SQL Activity that
extracts summary data for a specific speed camera. Finally, you will create and deploy a new pipeline to
run this activity, and check the results to verify the U-SQL data transformation.
4. Create a Data Lake Store linked service for input and output datasets
5. Create an Azure Storage Blob linked service for the U-SQL script
Task 4: Create a Data Lake Store linked service for input and output datasets
Task 5: Create an Azure Storage Blob linked service for the U-SQL script
Prepared your environment and uploaded test data to your Data Lake Store.
Created a Data Lake Store linked service for input and output datasets.
Created an Azure Storage Blob linked service for the U-SQL script.
Created Data Lake Store input and output datasets.
In this exercise, you will start your Data Warehouse, because this was the original data source for the
deployed ML model you are going to use—you will then obtain the API key and batch execution URL for
this ML model. You will then create an ML linked service, using these parameters, and then create Azure
Storage input and output datasets that link to live test input data, and an output results location
respectively. Finally, you will create and deploy a new pipeline, and check the results in order to verify the
ML data transformation.
2. Obtain the API key and batch execution URL for a deployed ML model
Task 2: Obtain the API key and batch execution URL for a deployed ML model
Obtained the API key and batch execution URL for the deployed ML model.
In this exercise, you will use the Diagram View, in the monitoring and management app, to see the status
of the Traffic DF Copy Pipeline that you created in Exercise 1. You will then use filters and views to find
specific status information on the inputs and outputs, and on the copy activity in the Traffic DF Copy
Pipeline.
3. Lab clean up
Question: Why might you choose to use Service Principal Authentication with a Data Factory
pipeline?
How to monitor Data Factory pipelines and how to protect the data flowing through these pipelines.
Course Evaluation
Your evaluation of this course will help Microsoft understand the quality of your learning experience.
Please work with your training provider to access the course evaluation form.
Microsoft will keep your answers to this survey private and confidential and will use your responses to
improve your future learning experience. Your open and honest feedback is valuable and appreciated.