
CSC4305: Parallel Programming

Lecture 2: Parallel Hardware

Sana Abdullahi Mu’az & Ahmad Abba Datti


Bayero University, Kano
Part 1: Foundations

• Introduction

• Parallel Hardware

• Parallel Software
Roadmap

• Some background

• Modifications to the von Neumann model

• Parallel hardware
Some background
Serial hardware and software

[Diagram: several programs, each with its input and output, run on one
computer.]

• The computer runs one program at a time.
The von Neumann Architecture
Main memory
• This is a collection of locations, each of which is capable of
storing both instructions and data.

• Every location consists of an address, which is used to access the
location, and the contents of the location.
Central processing unit (CPU)

Divided into two parts:

• Control unit - responsible for deciding which instruction in a
program should be executed. (the “oga at the top”, i.e. the boss)

• Arithmetic and logic unit (ALU) - responsible for executing the
actual instructions. (the worker)
Key terms
• Register – very fast storage, part of the CPU.
• Program counter – stores the address of the next instruction to be
executed.

• Bus – the wires and hardware that connect the CPU and memory.
von Neumann bottleneck

[Diagram: the CPU fetches/reads instructions and data from memory, and
writes/stores results back to memory.]

• The interconnection between CPU and memory limits the rate at which
instructions and data can be accessed, and hence how fast the machine
can run: this is the von Neumann bottleneck.
An operating system “process”

• An instance of a computer program that is being executed.

• Components of a process:
• The executable machine language program.
• A block of memory.
• Allocated resources, security information, and the state of the process.
Multitasking

• Gives the illusion that a single-processor system is running multiple
programs simultaneously.

• Each process takes turns running. (time slice)

• After its time is up, it waits until it has a turn again. (blocks)
Threading
• Threads are contained within processes.

• They allow programmers to divide their programs into (more or less)
independent tasks.

• The hope is that when one thread blocks because it is waiting on a
resource, another will have work to do and can run.
A process and two threads

[Diagram: the “master” thread of a process starts two threads and later
waits for them to finish.]

• Starting a thread is called forking.
• Terminating a thread is called joining.
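A minimal fork/join sketch in C, assuming the POSIX threads (Pthreads)
API (an assumed choice; the slide does not name a threading library):

#include <stdio.h>
#include <pthread.h>

/* Work carried out by each forked thread. */
void *hello(void *rank) {
    long my_rank = (long) rank;   /* thread number passed by the master */
    printf("Hello from thread %ld\n", my_rank);
    return NULL;   /* returning terminates the thread */
}

int main(void) {
    pthread_t threads[2];

    /* The master thread forks (starts) two threads... */
    for (long t = 0; t < 2; t++)
        pthread_create(&threads[t], NULL, hello, (void *) t);

    /* ...and then joins each one (waits for it to terminate). */
    for (long t = 0; t < 2; t++)
        pthread_join(threads[t], NULL);

    return 0;
}

Compile with gcc -pthread.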
Modifications to the von Neumann model
Basics of caching

• A collection of memory locations that can be accessed in less time
than some other memory locations.

• A CPU cache is typically located on the same chip as the CPU, or on
a separate chip that can be accessed much faster than ordinary memory.
Levels of Cache

• L1 – smallest & fastest
• L2
• L3 – largest & slowest
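To see why this matters to a programmer, here is a minimal sketch in C
(the matrix size N is an illustrative assumption) contrasting two loop
orders over the same matrix. C stores 2-D arrays row by row, so the
first loop uses every element of each cache line it fetches, while the
second jumps between lines and misses far more often, typically running
several times slower:

#include <stdio.h>

#define N 1000
double A[N][N];

int main(void) {
    double sum = 0.0;

    /* Row-major traversal: consecutive accesses are adjacent in
       memory, so each cache line fetched is fully used (fast). */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += A[i][j];

    /* Column-major traversal: consecutive accesses are N doubles
       apart, so the cache is used poorly (slow). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += A[i][j];

    printf("sum = %f\n", sum);
    return 0;
}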


Instruction Level Parallelism (ILP)

• Attempts to improve processor performance by having multiple
processor components or functional units simultaneously executing
instructions.
Instruction Level Parallelism

• Pipelining - functional units are arranged in stages.

• Multiple issue - multiple instructions can be simultaneously
initiated.
Pipelining
Suppose that assembling one car requires three tasks that take 20, 10,
and 15 minutes, respectively.

Then, if all three tasks were performed by a single station, the factory
would output one car every 45 minutes.

By using a pipeline of three stations, the factory would output the first
car in 45 minutes, and then a new one every 20 minutes.

As this example shows, pipelining does not decrease the latency, that
is, the total time for one item to go through the whole system. It does
however increase the system's throughput, that is, the rate at which
new items are processed after the first one.
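In general, with these stage times, the pipeline outputs n cars in
45 + 20(n - 1) minutes (the 20-minute task is the slowest stage, so it
sets the rate), versus 45n minutes for a single station: for n = 100
cars, that is 2,025 minutes instead of 4,500.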
Multiple Issue
• Multiple issue processors replicate functional units and
try to simultaneously execute different instructions in a
program.

for (i = 0; i < 1000; i++)
    z[i] = x[i] + y[i];

With two adders, the additions can proceed two at a time: adder #1
computes z[1], z[3], …, while adder #2 computes z[2], z[4], ….
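As a sketch of how a compiler might expose this to a dual-issue
processor, the loop can be unrolled by two so that each iteration
contains two independent additions, one per adder:

for (i = 0; i < 1000; i += 2) {
    z[i]     = x[i]     + y[i];      /* can go to adder #1 */
    z[i + 1] = x[i + 1] + y[i + 1];  /* can go to adder #2, simultaneously */
}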
So far...
 Classical von Neumann architecture
 Serial execution
 von Neumann bottleneck
 OS level
 Processes
 Threads
 Multitasking
 Modified von Neumann architecture
 Caches
 ILP
 Pipelining
 Multiple issue
A programmer can write code to exploit these features.

Parallel hardware
Flynn’s Taxonomy

SISD – Single instruction stream, Single data stream
SIMD – Single instruction stream, Multiple data stream
MISD – Multiple instruction stream, Single data stream
MIMD – Multiple instruction stream, Multiple data stream
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• Single instruction:
• Only one instruction stream is being acted on by the CPU during
any one clock cycle.
• Single data:
• Only one data stream is being used as input during any one clock
cycle.
• Deterministic execution.
• This is the oldest and, until recently, the most prevalent form of
computer.
• Examples: relatively old PCs.
SIMD
• Parallelism achieved by dividing data among the processors.
• Applies the same instruction to multiple data items.

• Goes with data parallelism.


SIMD example

[Diagram: a single control unit broadcasts each instruction to n ALUs
(ALU1, ALU2, …, ALUn), each of which operates on its own data item
(x[1], x[2], …, x[n]).]

for (i = 0; i < n; i++)
    x[i] += y[i];
SIMD
• What if we don’t have as many ALUs as data items?
• Divide the work and process iteratively.
• Ex. m = 4 ALUs and n = 15 data items.
Round  ALU1   ALU2   ALU3   ALU4
1      x[0]   x[1]   x[2]   x[3]
2      x[4]   x[5]   x[6]   x[7]
3      x[8]   x[9]   x[10]  x[11]
4      x[12]  x[13]  x[14]  (idle)
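A plain-C sketch simulating this schedule (ordinary serial code, not
actual SIMD hardware; the loop just prints which item each ALU would
handle in each round):

#include <stdio.h>

int main(void) {
    int n = 15, m = 4;   /* n data items, m ALUs */

    /* In each round, ALU k handles item round*m + k, if one remains. */
    for (int round = 0; round * m < n; round++)
        for (int alu = 0; alu < m; alu++) {
            int i = round * m + alu;
            if (i < n)
                printf("round %d: ALU%d handles x[%d]\n",
                       round + 1, alu + 1, i);
            else
                printf("round %d: ALU%d is idle\n", round + 1, alu + 1);
        }
    return 0;
}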
SIMD drawbacks
• All ALUs are required to execute the same instruction, or remain
idle.

• In the classic design, they must also operate synchronously (at the
same time).

• The ALUs have no instruction storage.

• Efficient for large data parallel problems, but not other types of
more complex parallel problems.
Graphics Processing Units (GPU)
• Real-time graphics application programming interfaces (APIs) use
points, lines, and triangles to internally represent the surface of an
object.
GPUs
• A graphics processing pipeline converts the internal representation
into an array of pixels that can be sent to a computer screen.

• Several stages of this pipeline (called shader functions) are
programmable.
• Typically just a few lines of C code.
GPUs

• Shader functions are also implicitly parallel, since they can be
applied to multiple elements in the graphics stream.

• GPUs can often optimize performance by using SIMD parallelism.

• The current generation of GPUs uses SIMD parallelism, although they
are not pure SIMD systems.
MIMD

• Supports multiple simultaneous instruction streams operating on
multiple data streams.

• Typically consists of a collection of fully independent processing
units or cores, each of which has its own control unit and its own ALU.
MIMD Subcategories: Shared Memory System

• A collection of autonomous processors is connected to a memory
system via an interconnection network.

• Each processor can access each memory location.

• The processors usually communicate implicitly by accessing shared
data structures.
Shared Memory System

• The most widely available shared memory systems use one or more
multicore processors (multiple CPUs or cores on a single chip).
Shared Memory System
• UMA multicore system: the time to access all the memory locations
will be the same for all the cores.

• NUMA multicore system: a memory location that a core is directly
connected to can be accessed faster than a memory location that must
be accessed through another chip.
Cache coherence

• Programmers have no control over caches and when they get updated.

[Diagram: a shared memory system with two cores, each with its own
private cache.]
Cache coherence
Suppose y0 is privately owned by Core 0, and y1 and z1 are privately
owned by Core 1, while

x = 2; /* shared variable */

Core 0 executes y0 = x; and later x = 7;
Core 1 executes y1 = 3*x; and later z1 = 4*x;

y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ??? If Core 1 sees the updated x, z1 = 4 * 7 = 28; but if Core 1
still has the stale value x = 2 in its cache, z1 = 4 * 2 = 8.
Shared Memory: UMA vs. NUMA
• Uniform Memory Access (UMA):
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor
updates a location in shared memory, all the other processors know about the update.
Cache coherency is accomplished at the hardware level.

• Non-Uniform Memory Access (NUMA):
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA
Shared Memory: Pros and Cons
• Advantages
• Global address space provides a user-friendly programming perspective to
memory
• Data sharing between tasks is both fast and uniform due to the proximity of
memory to CPUs
• Disadvantages:
• Primary disadvantage is the lack of scalability between memory and CPUs.
Adding more CPUs can geometrically increase traffic on the shared
memory-CPU path, and for cache coherent systems, geometrically increase
traffic associated with cache/memory management.
• Programmer responsibility for synchronization constructs that ensure
"correct" access of global memory.
• Expense: it becomes increasingly difficult and expensive to design and
produce shared memory machines with ever increasing numbers of
processors.
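As an illustration of such a synchronization construct, here is a
minimal sketch using a Pthreads mutex (Pthreads is an assumed choice;
the slides do not prescribe a library). Without the lock, the two
threads' increments of the shared counter could interleave and some
updates would be lost:

#include <stdio.h>
#include <pthread.h>

int counter = 0;   /* global shared memory */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void *increment(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* ensure "correct" access... */
        counter++;                     /* ...to the shared variable */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, increment, NULL);
    pthread_create(&t1, NULL, increment, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* With the lock this always prints 200000; without it, the
       result is unpredictable. */
    printf("counter = %d\n", counter);
    return 0;
}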
Distributed Memory System

• Clusters (Tightly Coupled)
• A collection of commodity systems (nodes).
• Connected by a commodity interconnection network.

• The Grid (Loosely Coupled)
• Turns large networks of geographically distributed computers into
a unified distributed memory system; the nodes are often
heterogeneous.
Distributed Memory: Pros and Cons
• Advantages
• Memory is scalable with number of processors. Increase the number of
processors and the size of memory increases proportionately.

• Each processor can rapidly access its own memory without interference and
without the overhead incurred with trying to maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors and
networking.

• Disadvantages
• The programmer is responsible for many of the details associated with data
communication between processors.
• Memory access time is not uniform
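To give a flavor of the communication details the programmer must
manage, here is a minimal sketch assuming MPI (a common message-passing
library for distributed memory systems; the slides do not name one):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each process has its own private memory; data moves between
       processes only by explicit messages. */
    if (rank == 0) {
        x = 42;   /* lives in process 0's local memory */
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received x = %d\n", x);
    }

    MPI_Finalize();
    return 0;
}

Run with two processes, e.g. mpiexec -n 2 ./a.out.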
Hybrid Distributed-Shared Memory
• The largest and fastest computers in the world today employ both shared and distributed
memory architectures.

• The shared memory component is usually a cache coherent SMP machine. Processors on a
given SMP can address that machine's memory as global.
• The distributed memory component is the networking of multiple SMPs. SMPs know only
about their own memory - not the memory on another SMP. Therefore, network
communications are required to move data from one SMP to another.
• Current trends seem to indicate that this type of memory architecture will continue to prevail
and increase at the high end of computing for the foreseeable future.
• Advantages and Disadvantages: whatever is common to both shared and distributed memory
architectures.
Interconnection networks

• Affect the performance of both distributed and shared memory
systems.

• Two categories:
• Shared memory interconnects (Buses and Crossbars)
• Distributed memory interconnects (Ethernet etc.)
Shared memory interconnects
Bus interconnect
• A collection of parallel communication wires together with some
hardware that controls access to the bus.

• The communication wires are shared by the devices that are connected
to the bus.

• As the number of devices connected to the bus increases, contention
for use of the bus increases, and performance decreases.
More definitions

• Any time data is transmitted, we’re interested in how long it will
take for the data to reach its destination.
• Latency
• The time that elapses between the source’s beginning to
transmit the data and the destination’s starting to receive the
first byte.
• Bandwidth
• The rate at which the destination receives data after it has
started to receive the first byte.
Message transmission time = l + n/b

where l is the latency (seconds), n is the length of the message
(bytes), and b is the bandwidth (bytes per second).
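For example (illustrative numbers): sending an n = 10^6 byte message
over a link with latency l = 5 × 10^-6 seconds and bandwidth b = 10^9
bytes per second takes 5 × 10^-6 + 10^6 / 10^9 ≈ 1.005 × 10^-3 seconds,
about a millisecond, dominated here by the bandwidth term.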


Parallel Software
…next
