
Advanced LabVIEW Programming

Concepts for Multicore Systems

Agenda

Multithreading Overview
Parallel Programming Techniques
Multicore Programming Challenges
Application Examples
Conclusions

Multithreading Overview

Impact on Engineers and Scientists


Engineering and scientific applications typically run on
dedicated systems (i.e., little multitasking).

Creating Multithreaded Applications


Engineers and scientists must use threads to benefit
from multicore processors.

What are Processes and Threads?


Process
Every program executes in a process
Processes provide resources needed to execute
Each process has at least one thread

Thread
Entity within a process that can be executed
Shares resources of the process
Has individual thread resources
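Because LabVIEW is graphical, the process/thread relationship above cannot be shown as LabVIEW code in text, but the same idea can be sketched in Python. This is a minimal illustration only; the names (`worker`, `shared_results`) are hypothetical, not part of any LabVIEW API.

```python
import threading

# Threads share the resources of their process: this list lives in
# process memory and is visible to every thread below.
shared_results = []
lock = threading.Lock()

def worker(task_id):
    local = task_id * task_id      # thread-local work on the thread's own stack
    with lock:                     # serialize access to the shared resource
        shared_results.append(local)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared_results))  # → [0, 1, 4, 9]
```

Each thread has its own stack (the `local` variable) but all threads see the one `shared_results` list owned by the process, which is why the lock is needed.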

How the OS Handles Thread Scheduling


Different OSs use different thread-scheduling techniques
With a multicore CPU, the OS will simultaneously execute
different threads on different cores when possible

[Diagram: example of different thread states. Threads move between Waiting, Ready to Run, and Running; the Running threads execute on Core 1 and Core 2.]

How LabVIEW Implements Multithreading


Implicit Parallelism / Threading
Automatic multithreading using the LabVIEW execution system
Parallel code paths on the block diagram can execute in unique threads

Explicit Parallelism / Threading
Timed structures spawn a single thread

Automatic Multithreading in LabVIEW


LabVIEW automatically divides each application into multiple
execution threads (originally introduced in 1998 with LabVIEW 5.0)
Original goal of multithreading was to more elegantly handle
hardware interrupts and allow for a responsive UI

Automatic Multithreading in LabVIEW


An oversimplification of threading in LabVIEW is shown below
The main idea is that parallel code paths will execute in unique threads

[Diagram: parallel code paths on the block diagram mapped to separate threads]
How the LabVIEW Execution System Works


[Diagram: a block diagram divided into numbered clumps and assigned to execution system threads]
1. The LabVIEW compiler analyzes the diagram and assigns code pieces to clumps
2. Information about which pieces of code can run together is stored in a run queue
3. If the block diagram contains enough parallelism, it will execute simultaneously in all system threads

Note: the number of threads scales based on the number of CPUs

Multithreading Features in LabVIEW 8.5


Scale # of execution system threads based on
available cores
Improved thread scheduling for LabVIEW timed
loops (to allow multicore support)
Processor Affinity capability with Timed Structures
Real-Time Features:
Support for Real-Time targets with Symmetric Multiprocessing
Real-Time Execution Trace Toolkit 2.0

Deterministic Real-Time Systems


LabVIEW 8.5 adds Symmetric Multiprocessing
(SMP) for real-time systems.

Assigning Tasks to Specific Cores


In LabVIEW 8.5, users can assign code to specific
processor cores using the LabVIEW Timed Loop.

Explicit Threading with Timed Structures


Code within timed structures will execute in precisely one thread (no more)
Can be assigned a relative priority
Can be used to set processor affinity

DEMOS
Multithreading in LabVIEW
Implicit Parallelism (Automatic
Multithreading in LabVIEW)
Explicit Parallelism (Control of threads
with Timed Loops)

Parallel Programming Techniques

Parallel Programming in LabVIEW vs. C


LabVIEW Advantages
Inherent parallelism; automatic multithreading
LabVIEW is cross-platform (Windows, Mac, Linux), so there is no need to learn different threading APIs
Support for parallel libraries such as the Intel Math Kernel Library (MKL)

LabVIEW Disadvantages
Just like C, optimization techniques must be applied to take full advantage of multicore; there is no silver bullet
No support for compiler optimizations with OpenMP (but NI is working to add similar capability)

Application Example: Control System for Autonomous Vehicle

Chose LabVIEW for its multithreading approach over C
LabVIEW running on two quad-core HP servers

"The ability of LabVIEW to automatically multithread our application, in addition to the optimizations we performed in the language itself, drastically reduced our development time."
- Michael Fleming, President of Torc Technologies

Multicore Programming Goals


1. Increase code execution speed (# of FLOPS)
2. Maintain rate of execution but increase data throughput
3. Evenly balance tasks across available CPUs (fair distribution of
processing load)
4. Dedicate time-critical tasks to a single CPU (Real-time use-case)

What are the Trade-offs?

Parallel programming has greater overhead than sequential programming
Optimizing for speed can come at the cost of higher memory utilization

Example: Unbalanced Parallel Tasks

[Diagram: Tasks A through F distributed unevenly across four CPU cores over time, leaving some cores idle while others are overloaded]

Goal: Balanced Parallel Tasks

[Diagram: Tasks A through F spread evenly across the four CPU cores so all cores finish at about the same time]

Application Example: Real-Time Control

Wind Tunnel Safety-of-Flight system at NASA Ames Research Center

Benchmark Results
Ran on a PXI-8106 RT controller
Time-critical loop CPU load was reduced from 43% to 30% on one CPU of the PXI-8106 RT
This leaves a core free for processing noncritical tasks

Image Source: http://windtunnels.arc.nasa.gov

Application Decomposition
The first step is to break the program down into its core components:

[Diagram: the Application decomposes into Tasks, Data, and Data Flow]

Choosing the Right Parallel Strategy


Task Paradigm
Best suited for programs with independent functionalities
that have minimal data dependencies

Data Paradigm
Best suited for data operations that are completely
independent

Data Flow Paradigm


Best suited for data that have dependencies and require
prior computation

Parallel Strategies

[Diagram: the Application maps onto three paradigms and their strategies]
Task Paradigm: Task Parallelization, Divide & Conquer
Data Paradigm: Geometric, Recursive
Data Flow Paradigm: Pipeline, Wave Front
Task Parallelism
Tasks
Code is composed of logically independent blocks of functionality

Divide & Conquer
A section of code that can be decomposed into parallelized subsections; once completed, the results are merged together

Task Parallelism
Not all code requires sequential execution
Isolate independent chunks of code and
mark them as tasks

[Diagram: the same code partitioned into independent chunks labeled Task A, Task B, and Task C]

Divide & Conquer


Useful for problems that can be broken into
subsections
Recursive algorithms such as quick sort and
merge sort
Break the problem into manageable segments
according to your resources
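Merge sort, named above, can illustrate divide & conquer in Python: split near the top of the recursion across threads, solve subsections, then merge. The depth cutoff and pool size are illustrative choices, not prescribed values.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    # merge two sorted sublists into one sorted list
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def merge_sort(data, pool, depth=0):
    if len(data) <= 1:
        return data
    mid = len(data) // 2
    if depth < 2:  # split across threads only near the top of the recursion
        future = pool.submit(merge_sort, data[:mid], pool, depth + 1)
        right = merge_sort(data[mid:], pool, depth + 1)
        left = future.result()
    else:          # below the cutoff, solve sequentially
        left = merge_sort(data[:mid], pool, depth + 1)
        right = merge_sort(data[mid:], pool, depth + 1)
    return merge(left, right)

with ThreadPoolExecutor(max_workers=4) as pool:
    result = merge_sort([5, 3, 8, 1, 9, 2, 7], pool)
print(result)  # → [1, 2, 3, 5, 7, 8, 9]
```

Limiting the parallel depth keeps the number of in-flight tasks below the pool size, so a worker never waits on a subtask that cannot be scheduled.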

Divide & Conquer

[Diagram: a problem is split into subproblems, the subproblems are split again and solved in parallel, and the subsolutions are merged back into the final solution]

Data Parallelism
Geometric Decomposition
No dependencies in data
Could be completely parallelized if enough
resources were available

Recursive Structure
Similar to Divide & Conquer strategy. Data is
inherently recursive and can be split up into
parallelized subsets

Data Parallelism
You can speed up processor-intensive operations
on large data sets by segmenting the data.
[Diagram: the full data set is processed by a single signal-processing step on one CPU core to produce the result; the other cores are idle]

Data Parallelism
You can speed up processor-intensive operations on large data sets by segmenting the data.

[Diagram: the data set is split into four segments, each CPU core runs the signal-processing step on its own segment, and the partial results are combined]
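A text-based sketch of the segmenting strategy above, in Python: split the data set into one chunk per core, process the chunks in parallel, then combine the partial results in order. The `process_chunk` body is a stand-in for the real signal-processing step.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # stand-in for the signal-processing step (e.g. a filter or FFT)
    return [x * x for x in chunk]

def split(data, n):
    # divide the data set into n roughly equal segments
    size = (len(data) + n - 1) // n
    return [data[i:i + size] for i in range(0, len(data), size)]

data = list(range(8))
chunks = split(data, 4)                      # one chunk per core
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_chunk, chunks))
result = [x for part in partials for x in part]  # combine results in order
print(result)  # → [0, 1, 4, 9, 16, 25, 36, 49]
```

Because `pool.map` preserves input order, the combine step is a simple flatten; no extra bookkeeping is needed to reassemble the result.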

Application Example: High-Speed Control

Max Planck Institute (Munich, Germany)
Plasma control in a nuclear fusion tokamak with LabVIEW on an 8-core system using the data parallelism technique

"With LabVIEW, we obtained a 5X processing speed-up on an octal-core processor machine over a single-core processor."
- Dr. Louis Giannone, Lead Project Researcher, Max Planck Institute

Data Flow Parallelism


Pipelining
Data must go through a series of computations
(think of an automobile assembly line)

Wave Front
Data has dependencies, but elements can be computed once the prior elements they depend on are computed

Pipelining
Many applications involve sequential, multistep algorithms
Applying pipelining can increase performance

[Diagram: an Acquire → Filter → Analyze → Log sequence shown for two iterations on a timeline from t0 to t7; with pipelining, iteration 2 begins before iteration 1 finishes]
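The Acquire → Filter → Analyze → Log pipeline can be sketched in Python with one thread per stage and a queue between stages, echoing the note later in this deck that queues can pipeline data between loops. The stage functions here are toy placeholders.

```python
import queue
import threading

SENTINEL = None  # marks the end of the data stream

def stage(fn, q_in, q_out):
    # each pipeline stage runs in its own thread, like a parallel loop
    while True:
        item = q_in.get()
        if item is SENTINEL:
            if q_out is not None:
                q_out.put(SENTINEL)   # propagate shutdown downstream
            return
        result = fn(item)
        if q_out is not None:
            q_out.put(result)

acquired, filtered, analyzed = queue.Queue(), queue.Queue(), queue.Queue()
log = []

threads = [
    threading.Thread(target=stage, args=(lambda x: x + 100, acquired, filtered)),   # Filter
    threading.Thread(target=stage, args=(lambda x: x * 2, filtered, analyzed)),     # Analyze
    threading.Thread(target=stage, args=(lambda x: log.append(x), analyzed, None)), # Log
]
for t in threads:
    t.start()
for sample in range(3):          # Acquire: feed samples into the pipeline
    acquired.put(sample)
acquired.put(SENTINEL)
for t in threads:
    t.join()
print(log)  # → [200, 202, 204]
```

While sample 0 is being analyzed, sample 1 can already be filtered, which is exactly the overlap the timeline diagram illustrates.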

Pipelining Strategy

[Diagram, shown in three build steps: the Acquire, Filter, Analyze, and Log stages are mapped to four CPU cores. At t0 only Acquire runs; at t1 Acquire and Filter overlap; once the pipeline fills, all four stages execute simultaneously on successive data blocks at t0, t1, t2, t3, ...]

Pipelining in LabVIEW

[Diagram: a sequential block diagram compared with pipelined versions of the same code]
Note: queues may also be used to pipeline data between different loops

Key Considerations for Pipelining

Consider the number of processor cores when determining the number of pipeline stages
Be sure to balance stages, since the longest stage is the limiting factor in performance speed-up

[Diagram: an unbalanced pipeline in which Task B takes longer than Tasks A and C]

Application Example: Communications Test

AmFax Ltd. (United Kingdom)
Created wireless test systems for next-generation phones using a pipelined LabVIEW architecture

"With LabVIEW and the dual-core embedded controller we are achieving up to 5x speed savings."
- Mark Jewell, BDM Wireless, AmFax Ltd.

Tips for Balancing Pipeline Stages

Use LabVIEW benchmarking techniques
Perform basic benchmarking with timestamps and the VI Profiler

Wave Front
Dependencies exist between elements of the data structure
For instance, the value at (i,j) requires the computed value at (i-1,j-1)

Wave Front
As long as the dependencies are satisfied, multiple operations can be carried out on the data in parallel

Wave Front
The wave front effect appears as parallel executions iterate over the data

[Diagram: computation sweeping across the data structure along anti-diagonals]

Wave Front
Practical Applications
Error Diffusion for Black & White Printers
Image Processing & Filtering
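A Python sketch of the wave front pattern: cell (i,j) of a grid depends on its up, left, and diagonal neighbors, so cells on the same anti-diagonal have no mutual dependencies and can be computed in parallel, one wave at a time. The recurrence here is a toy; error diffusion and similar filters follow the same dependency shape.

```python
from concurrent.futures import ThreadPoolExecutor

N = 4
grid = [[0] * N for _ in range(N)]

def compute(i, j):
    # toy recurrence depending on previously computed neighbors
    up = grid[i - 1][j] if i > 0 else 0
    left = grid[i][j - 1] if j > 0 else 0
    diag = grid[i - 1][j - 1] if i > 0 and j > 0 else 0
    grid[i][j] = up + left + diag + 1

with ThreadPoolExecutor(max_workers=4) as pool:
    # anti-diagonal d contains cells (i, d - i); everything it depends on
    # lies on diagonals d-1 and d-2, which earlier waves already computed
    for d in range(2 * N - 1):
        wave = [(i, d - i) for i in range(N) if 0 <= d - i < N]
        list(pool.map(lambda ij: compute(*ij), wave))

print(grid[3][3])  # → 160
```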

Predicting Speed Increase: Amdahl's Law

Theory used to calculate the expected speedup of a parallelized implementation (relative to serial):

Stotal = 1 / ( Σ over k = 1..N of (Pk / Sk) )

Pk = fraction of instructions affected (the Pk sum to 1)
Sk = speed increase factor for section k
k = section of code label
N = total number of sections
Stotal = total speed change factor
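The generalized Amdahl's Law formula above translates directly into a small Python helper; the 60%/40% example split is illustrative, not from the slides.

```python
def amdahl_speedup(sections):
    # sections: list of (Pk, Sk) pairs; the Pk fractions must sum to 1
    assert abs(sum(p for p, _ in sections) - 1.0) < 1e-9
    return 1.0 / sum(p / s for p, s in sections)

# Example: 60% of the program parallelized across 4 cores (Sk = 4),
# 40% left serial (Sk = 1) -- overall speedup is capped by the serial part.
speedup = amdahl_speedup([(0.6, 4.0), (0.4, 1.0)])
print(round(speedup, 2))  # → 1.82
```

Even a 4x speedup on most of the code yields well under 2x overall, which is why balancing and minimizing the serial fraction matters.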

DEMOS
Parallel Programming Techniques
Divide and Conquer/Recursion
Data Parallelism
Pipelining
Amdahl's Law

Multicore Programming
Challenges

Multicore Programming Challenges

Programming Caveats
Thread Synchronization
Race Conditions
Deadlocks

Potential Code Bottlenecks

Debugging
More things happening at one time

Memory
Data transfer between processor cores and cache considerations

Thread Synchronization
With OS scheduling, there is no guarantee when threads will execute unless synchronization primitives are used
The order of events may change on each execution due to the way the threads are scheduled

Example orderings across runs:
First execution: Thread 1 → Thread 2 → Thread 3
Second execution: Thread 2 → Thread 3 → Thread 1
Third execution: Thread 3 → Thread 1 → Thread 2

Race Conditions
This issue occurs when threads manipulate shared resources simultaneously
A common problem when code is migrated from a single-CPU system to multicore (software that was not originally created for multicore)
With no synchronization between threads, the result is anomalous behavior
Example: two threads simultaneously writing to one memory location

[Diagram: Thread 1 Data and Thread 2 Data colliding in the same memory location]
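The fix for the shared-write race above is to make the read-modify-write atomic. A Python sketch (illustrative names): without the lock, four threads incrementing a shared counter can intermittently lose updates; with it, every update survives.

```python
import threading

counter = 0
lock = threading.Lock()

def safe_increment(n):
    global counter
    for _ in range(n):
        with lock:        # the read-modify-write is now atomic
            counter += 1

threads = [threading.Thread(target=safe_increment, args=(10_000,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # → 40000
```

Removing the `with lock:` line turns this into exactly the race described above: two threads read the same old value, both add one, and one increment is lost.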

Deadlocks and Thread Starvation

Deadlock occurs when two or more threads wait for resources held by another thread in the same group
Thread deadlock is analogous to traffic gridlock: nothing can proceed
Thread starvation occurs when a thread is perpetually denied the resources it needs to execute
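One standard way to prevent the deadlock described above is to acquire locks in a single global order, so two threads can never hold one resource each while waiting on the other. A hedged Python sketch (the ordering-by-`id` convention is an illustrative choice):

```python
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

def with_both(first, second, work):
    # acquire locks in one global order (here: by object id), so threads
    # that need both resources can never wait on each other forever
    lo, hi = sorted((first, second), key=id)
    with lo:
        with hi:
            work()

results = []
t1 = threading.Thread(target=with_both,
                      args=(lock_a, lock_b, lambda: results.append("t1")))
t2 = threading.Thread(target=with_both,
                      args=(lock_b, lock_a, lambda: results.append("t2")))
t1.start(); t2.start()
t1.join(); t2.join()
print(sorted(results))  # → ['t1', 't2']
```

If each thread instead acquired its locks in the order given by the caller, t1 holding A while t2 holds B would gridlock exactly like the traffic analogy.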

Addressing Multicore Challenges with LabVIEW

The dataflow paradigm is a key benefit for multicore programming
It helps synchronize threads and prevent race conditions and deadlocks
Parallel code paths are synchronized, and the order of execution is determined by LabVIEW wires

Synchronization in LabVIEW
When more synchronization
is required, use
synchronization mechanisms:

Notifiers
Queues
Semaphores
Rendezvous
Occurrences
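These LabVIEW mechanisms have close analogs in most threading libraries. As an illustration only (not LabVIEW APIs), Python's `queue.Queue` plays the role of a LabVIEW queue and `threading.Event` resembles a notifier/occurrence:

```python
import queue
import threading

q = queue.Queue()            # analog of a LabVIEW queue
done = threading.Event()     # analog of a notifier/occurrence
received = []

def consumer():
    while True:
        item = q.get()       # blocks until an element is enqueued
        if item is None:
            break
        received.append(item)
    done.set()               # notify the other loop that we finished

t = threading.Thread(target=consumer)
t.start()
for i in range(3):
    q.put(i)                 # enqueue elements
q.put(None)                  # sentinel: no more data
done.wait()                  # block until the notifier fires
t.join()
print(received)  # → [0, 1, 2]
```

`threading.Semaphore` and `threading.Barrier` similarly correspond to LabVIEW's semaphores and rendezvous.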

Potential Bottlenecks: Shared Resources


1. Data Dependencies
Example: Data stored in global and shared
variables that need to be accessed by different VIs
would be a shared resource

2. Hard Disk
Example: A computer can only read from or write to the hard disk one item at a time (file I/O cannot be parallelized)

3. Blocking Items
Example: Non-reentrant VIs

4. Entire software stack must support multicore

Non-reentrant VIs

VIs that are non-reentrant cannot be called simultaneously: one call runs and the other call waits for the first to finish before running

To make a VI reentrant, select File»VI Properties, select the Execution category, and choose Reentrant execution
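The blocking behavior of a non-reentrant VI can be mimicked in Python by wrapping a function in a single shared lock, so concurrent callers serialize. The counter bookkeeping below exists only to demonstrate that calls never overlap; all names are illustrative.

```python
import threading
import time

call_lock = threading.Lock()   # shared by every call: "non-reentrant"
stats = {"active": 0, "max_active": 0, "guard": threading.Lock()}

def non_reentrant_vi(x):
    with call_lock:            # a second caller waits for the first to finish
        with stats["guard"]:
            stats["active"] += 1
            stats["max_active"] = max(stats["max_active"], stats["active"])
        time.sleep(0.01)       # simulate the VI doing work
        with stats["guard"]:
            stats["active"] -= 1
    return x

threads = [threading.Thread(target=non_reentrant_vi, args=(i,))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stats["max_active"])  # → 1 (calls never overlapped)
```

Dropping `call_lock` is the analog of marking the VI reentrant: each call then proceeds with its own state and `max_active` can exceed 1.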

Multithreaded Software Stack Support

Software Stack

Development tool: Support provided on the operating system of choice; the tool facilitates correct threading and optimization. Example: the multithreaded nature of LabVIEW and structures that provide optimization

Libraries: Thread-safe, reentrant libraries. Example: BLAS libraries

Device drivers: Drivers designed for optimal multithreaded performance. Example: NI-DAQmx driver software

Operating system: Supports multithreading and multitasking and can load-balance tasks. Example: support for Windows, Mac OS, Linux, and real-time operating systems

Application Example: Eaton Corporation

Eaton created a portable in-vehicle test system for truck transmissions using LabVIEW
Acquired and analyzed 16 channels on a single core using the multithreaded NI-DAQmx driver
Now acquires and analyzes 80+ channels on multicore

"There was no need to rewrite our application for the new multicore processing platforms."
- Scott Sirrine, Lead Design Engineer, Eaton Truck Division

Debugging Methods
Functional Debugging
Trace Debugging
Performance Counters

Functional Debugging
LabVIEW supports debugging parallel code for
functional correctness
Use basic LabVIEW debugging tools (highlight
execution, probes, etc.) to ensure code is
functionally correct

Trace Debugging
On real-time systems, trace debugging can show thread
activity at the OS level
Thread activity on each core is displayed by selecting a
particular CPU

Performance Counters
Performance counters provide
detailed system information such as
CPU usage, memory usage, and cache
hits/misses
LabVIEW does not natively support
performance counters but can call
Windows counters programmatically
Example utilities for performance counting include:
Windows Perfmon
Intel's VTune

DEMOS
Debugging Methods
Functional Debugging
Trace Debugging
Performance Counters

Memory Considerations
Data transfer between cores
Cache considerations

Data Transfer Between Cores

The physical distance between processors and the quality of the processor interconnects can have a large effect on execution speed

Cache Considerations
Multicore processors typically utilize a shared cache
A common cache problem is false sharing, in which two cores write to the same cache line and cause performance degradation
For cache optimization, use processor affinity

Cache Optimization with Processor Affinity


Setting processor affinity tells the OS which processor to
execute the code on
Processor affinity can prevent OS from scheduling threads in a
configuration that hurts cache usage

Application Examples

Example #1: Multichannel Acquisition and Signal Processing

Overview: Two channels are read from a digitizer and an FFT operation is performed on the data.

[Diagram: 2 channels from a digitizer feeding parallel FFT operations]

Recommendation
1. Read the data channels separately
2. Perform the FFT operations in parallel
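A Python sketch of the recommendation: each channel is handed to its own worker so the per-channel analysis runs in parallel. The `analyze` function computes an RMS value as a stand-in for the FFT, and the channel data is synthetic; none of this is digitizer API code.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def analyze(channel):
    # stand-in for the FFT step: compute the signal's RMS value
    return math.sqrt(sum(x * x for x in channel) / len(channel))

# two channels "read from the digitizer" separately (synthetic data here)
ch0 = [math.sin(2 * math.pi * k / 32) for k in range(64)]  # sine wave
ch1 = [0.5] * 64                                           # DC level

# one worker per channel: the two analyses proceed in parallel
with ThreadPoolExecutor(max_workers=2) as pool:
    rms0, rms1 = pool.map(analyze, [ch0, ch1])

print(round(rms1, 3))  # → 0.5
```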

Example #2: Operations on Large Data Sets

Overview: Multiplication of two matrices puts a heavy load on the processor, especially when the matrices are large.

[Diagram: Matrix 1 and Matrix 2 feeding a Multiply Matrices step that produces the Result]

Recommendation
Split the data into subsets and then perform the operation on each subset in parallel.
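One way to apply that recommendation, sketched in Python: split matrix A into horizontal row slices and multiply each slice by B on a separate worker. The tiny matrices (and the identity B, chosen so the expected result is obvious) are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

def matmul_rows(rows, b):
    # multiply a horizontal slice of A by the whole of B
    return [[sum(r[k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for r in rows]

a = [[1, 2], [3, 4], [5, 6], [7, 8]]
b = [[1, 0], [0, 1]]   # identity matrix, so the result should equal A

half = len(a) // 2
# each worker handles one row-slice of A; B is shared read-only data
with ThreadPoolExecutor(max_workers=2) as pool:
    top, bottom = pool.map(matmul_rows, [a[:half], a[half:]], [b, b])

print(top + bottom)  # → [[1, 2], [3, 4], [5, 6], [7, 8]]
```

Row slicing works because each output row depends only on its own slice of A, so the subsets are fully independent, matching the data paradigm described earlier.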

Example #3: Multi-loop Producer/Consumer Architecture with Queues

Multiloop architectures use queues to pass data between parallel loops
Queues allow each loop to run at an optimal rate
In this example, the acquisition is not slowed by the write-to-disk task

Recommendation
Balance the acquisition rate and processing rate for maximum throughput

[Block diagram: an acquisition loop (Acquire from Scope → Enqueue Element) feeding a consumer loop (Dequeue Element → Data Decomposition → 7th-Order Low-pass Filter → Digital Output)]
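The producer/consumer pattern in the block diagram above maps onto two Python threads joined by a bounded queue; the producer stands in for the acquisition loop, the consumer for the processing/logging loop. All names and data are illustrative.

```python
import queue
import threading

data_q = queue.Queue(maxsize=8)   # bounded queue decouples the two loops
saved = []

def producer():
    # acquisition loop: never blocked by the disk-write task
    for sample in range(5):
        data_q.put(sample * 10)   # enqueue element
    data_q.put(None)              # tell the consumer to stop

def consumer():
    # processing/logging loop runs at its own rate
    while True:
        item = data_q.get()       # dequeue element
        if item is None:
            break
        saved.append(item)        # stand-in for the write-to-disk task

p = threading.Thread(target=producer)
c = threading.Thread(target=consumer)
p.start(); c.start()
p.join(); c.join()
print(saved)  # → [0, 10, 20, 30, 40]
```

The `maxsize` bound is the balancing knob: if the consumer falls too far behind, the producer blocks rather than buffering unboundedly, which mirrors the rate-balancing recommendation above.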

DEMO
Multiloop Producer / Consumer

Conclusions
Multithreading - LabVIEW offers implicit parallelism (automatic multithreading) and explicit parallelism (timed structures)
Parallel Programming Techniques - There is no silver bullet; the advantage of LabVIEW is that parallelism is much more easily expressed in the language

Conclusions (continued)
Multicore Programming Challenges - Debugging and
memory considerations have evolved with multicore
and play an important role
Application Examples - With minor modifications,
typical LabVIEW applications can be optimized for
multicore

Resources
www.ni.com/multicore

How to Develop
your LabVIEW skills

Fast-Track to Skill Development

Core Courses
New User - Begin here: LabVIEW Basics I, LabVIEW Basics II
Experienced User - LabVIEW Intermediate I, LabVIEW Intermediate II
Advanced User - LabVIEW Advanced I

Certifications
Certified LabVIEW Associate Developer Exam
Certified LabVIEW Developer Exam
Certified LabVIEW Architect Exam

If you are unsure:
- Quick LabVIEW quiz
- Fundamentals exam

ni.com/training

Training Membership: the Flexible Option

Ideal for developing your skills with NI products
12 months to attend any NI regional or online course and take certification exams
$4,999 for a 12-month membership in the USA

Next Steps
Visit ni.com/training
Identify your current expertise level and
desired level
Register for appropriate courses
$200 discount for attending LV Dev Day event!
