
COMPUTER ORGANIZATION

-by
RAMA KRISHNA THELAGATHOTI
(M.Tech CSE from IIT Madras)
Distribution of Marks
Total marks: 100
Chapter – Marks
1. CO1 – 11
2. CO2 – 19
3. CO3 – 28
4. CO4 – 14
5. CO5 – 28
CHAPTER 1
Organization and Architecture -1
• Is there any distinction between Computer Organization and
Architecture?
• Computer Architecture
• refers to those attributes of a system visible to the programmer
• those attributes that have a direct impact on the logical execution of a
program
• e.g., instruction set, number of bits used for data representation, I/O
mechanisms, addressing techniques
• Computer Organization
• refers to the hardware details transparent to the programmer,
such as control signals, interfaces, memory technology
• the operational units and their interconnections that realize the architectural
specifications
Organization and Architecture - 2
• For example
• It is an architectural design issue
• whether a computer will have a multiply instruction.
• It is an organizational issue
• whether that instruction will be implemented by a special
multiply unit or by a mechanism that makes repeated use
of the add unit of the system.
Organization and Architecture - 3
• There are families of computer models, all with the same
architecture but with differences in organization.
• Consequently, the different models in the family have
different price and performance characteristics.
• e.g., the IBM System/370 architecture
• introduced in 1970 with a number of models.
• The customer with modest requirements could buy a cheaper, slower model
and, later upgrade to a more expensive, faster model without having to
abandon software that had already been developed.
• Over the years, IBM has introduced many new models with improved
technology to replace older models, offering the customer greater speed,
lower cost, or both.
Processor Organization -1
Processor Requirements -

• Fetch instruction
• The processor reads an instruction from memory (register, cache, main memory)
• Interpret instruction
• The instruction is decoded to determine what action is required
• Fetch data
• The execution of an instruction may require reading data from memory or an I/O
module
• Process data
• The execution of an instruction may require performing some arithmetic or logical
operation on data
• Write data
• The results of an execution may require writing data to memory or an I/O module
Processor Organization - 2
CPU with system bus -
• This figure shows the major units of
the processor and their connection to
the rest of the system via the system bus.
• Major components of the
processor are
• ALU – Arithmetic Logic Unit
• It does the actual computation or
processing of data.
• CU – Control Unit
• controls the movement of data and
instructions into and out of the
processor and controls the operation of
the ALU.
• Registers - minimal internal memory,
consisting of a set of storage
locations, called registers
Processor Organization - 3
CPU Internal structure -
• This is a slightly more detailed view
of the processor.
• Internal CPU bus - transfers data
between the various registers and the
ALU because the ALU in fact operates
only on data in the internal processor
memory.
• The figure also shows typical basic
elements of the ALU
Instruction Cycle
Typical instruction cycle includes the following stages:
• Fetch: Read the next instruction from memory into
the processor.
• Execute: Interpret the opcode and perform the
indicated operation.
• Interrupt: If interrupts are enabled and an interrupt
has occurred, save the current process state and
service the interrupt.
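As a sketch of how these stages connect (a toy machine, not any real ISA — the opcodes and names here are invented for illustration):

#include <stdio.h>
#include <stdbool.h>

/* A toy machine: opcode 0 = halt, 1 = add the next word to the accumulator */
enum { HALT = 0, ADD = 1 };

int main(void) {
    int memory[] = { ADD, 5, ADD, 7, HALT };   /* a tiny program            */
    int pc = 0, acc = 0;
    bool running = true;

    while (running) {                  /* the instruction cycle             */
        int instr = memory[pc++];      /* fetch the next instruction        */
        switch (instr) {               /* interpret (decode) the opcode     */
        case ADD:
            acc += memory[pc++];       /* fetch data, process, write result */
            break;
        case HALT:
            running = false;
            break;
        }
        /* interrupt stage would go here: if interrupts are enabled and one
           is pending, save the process state and service the interrupt     */
    }
    printf("acc = %d\n", acc);         /* prints: acc = 12                  */
    return 0;
}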
Introduction to parallel processing (trends
towards parallel processing)
• Single instruction, single data stream - SISD
• Single instruction, multiple data stream - SIMD
• Multiple instruction, single data stream - MISD
• Multiple instruction, multiple data stream- MIMD
parallel computer structures -1
What is Parallel Processing ?
• Execution of several activities at the same time.
• Two multiplications at the same time on two different
processors,
• Printing a file on two printers at the same time.
• Parallel computers can be divided into 3 architectural
configurations
• Pipeline computers
• Array Processors
• Multiprocessor Systems
parallel computer structures -2
• A pipeline computer performs overlapped computations to
achieve temporal parallelism.
• An array processor uses multiple synchronized arithmetic
logic units to achieve spatial parallelism.
• A multiprocessor system achieves asynchronous parallelism
through a set of interactive processors with shared
resources (memories, databases etc).
• Fundamental difference between array processors and
multiprocessors – processing elements in array processors operate
synchronously, but in multiprocessors they operate asynchronously.
parallel computer structures -3
Pipeline computers
• Instruction execution involves 4 major steps
• Instruction fetch from main memory [IF]
• Instruction decoding [ID]
• Operand Fetch [OF]
• Execution [EX] – optional and depends on instruction type
parallel computer structures -4
Pipeline computers
[Figure: sequential (non-overlapped) execution vs. pipelined (overlapped)
execution of the IF, ID, OF and EX stages]
parallel computer structures -5
Pipeline computers
• An instruction cycle involves multiple pipeline cycles
• Flow of data (input operands, intermediate results & output results) is
triggered and synchronized under a common clock control.
• In non-pipelined computers, it takes 4 cycles to complete each
instruction.
• The instruction cycle is reduced to roughly ¼ in overlapped execution.
Theoretically, a k-stage linear pipeline processor is up to k times
faster. However, due to memory conflicts, data dependencies, branches
and interrupts, the ideal speedup may not be achieved.
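As a worked form of that claim (the standard linear-pipeline analysis): n tasks need n·k cycles without pipelining, but only k + (n − 1) cycles on a k-stage pipeline, so the speedup is

S_k = \frac{n \cdot k}{k + (n - 1)} \longrightarrow k \quad (n \to \infty)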
parallel computer structures -6
Array computers
• An array computer is a synchronous parallel computer with
multiple arithmetic logic units called processing elements
(PEs)
• An array of n processing elements
• All PEs perform computation in parallel; all PEs are
synchronised under one control unit.
• Array processors are also called SIMD array computers [Single
Instruction Multiple Data]
parallel computer structures -7
Array computers
• Control Unit (CU) - All PEs
are under the control of one
control unit. The CU controls the
inter-communication
between the PEs. There is also a
local memory of the CU,
called CU memory
• Instruction fetch and
decoding are done by the CU
only
parallel computer structures -8
Array computers
• Processing elements (PEs) - Each processing element consists
of ALU, its registers and a local memory for storage of
distributed data. These PEs have been interconnected via an
interconnection network.
• all PEs perform the same function synchronously in a lock-step
fashion under the control of the CU
• Inter PE connection network - IN performs data exchange
among the PEs, data routing and manipulation functions. This
IN is under the control of CU.
parallel computer structures -9
Multi processor computers
• Aimed at improving throughput, reliability, flexibility and
availability
• All processors share access to common sets of memory
modules, I/O channels and peripheral
devices.
Parallel Architectural Classification Schemes -1
Basic types of architectural classification
Flynn classification: (1966) is based on multiplicity of
instruction streams and the data streams in computer
systems.
Feng’s classification: (1972) is based on serial versus
parallel processing.
Handler’s classification: (1977) is determined by the
degree of parallelism and pipelining in various subsystem
levels.
Parallel Architectural Classification Schemes -2
Multiplicity of Instruction-Data streams
• Flynn's classification is based on the notion of a ‘stream’ –
a sequence of instructions or data operated on by the processor.
• Two types of information flow into a processor:
instructions and data
• The instruction stream is defined as the sequence of
instructions executed by the processing unit.
• The data stream is defined as the
data traffic exchanged between the memory and the
processing unit.
Parallel Architectural Classification Schemes -3
Multiplicity of Instruction-Data streams
• Flynn’S FOUR machine organizations
• Single instruction, single data stream - SISD
• Single instruction, multiple data stream - SIMD
• Multiple instruction, single data stream - MISD
• Multiple instruction, multiple data stream- MIMD
Parallel Architectural Classification Schemes -4
Multiplicity of Instruction-Data streams
Parallel Architectural Classification Schemes -5
Multiplicity of Instruction-Data streams
Parallel Architectural Classification Schemes -6
Multiplicity of Instruction-Data streams
• SISD – a single processor executes a single instruction stream on
data held in a single memory (the conventional uniprocessor).
• SIMD – a single instruction stream is broadcast to multiple processing
elements, each of which operates on its own data (array processors).
• MISD – multiple instruction streams operate on a single data stream;
few practical machines of this type have been built.
• MIMD – multiple processors execute different instruction streams on
different data streams (multiprocessors and multicomputers).
Parallel Architectural Classification Schemes -7
Serial vs. Parallel Processing
• Feng classified computer architectures based on the
degree of parallelism
• The maximum number of binary digits that can be
processed within a unit time by a computer system is
called the maximum parallelism degree P.
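In the standard form of Feng's scheme (assumed here, and consistent with the (n, m) pairs used below), a computer C is characterized by its word length n and bit-slice length m, and the maximum parallelism degree is

P(C) = n \cdot m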
Parallel Architectural Classification Schemes -8
Serial vs. Parallel Processing
• A bit slice is a string of bits one from each of the words
at the same vertical position.
• under above classification there are 4 types of
processing methods
• Word Serial and Bit Serial (WSBS)
• Word Parallel and Bit Serial (WPBS)
• Word Serial and Bit Parallel(WSBP)
• Word Parallel and Bit Parallel (WPBP)
Parallel Architectural Classification Schemes -9
Serial vs. Parallel Processing
• WSBS has been called bit-serial processing because
one bit is processed at a time (n = m = 1).
• WPBS has been called bit-slice processing because an m-
bit slice is processed at a time.
• WSBP is found in most existing computers and has been
called word-slice processing because one word of n
bits is processed at a time.
• WPBP is known as fully parallel processing, in which an
array of n x m bits is processed at one time.
Parallel Architectural Classification Schemes -10
Serial vs. Parallel Processing
Example (representative entries; the original table is reconstructed here)
Mode - Computer Model - Degree of Parallelism (n, m)
WSBS - MINIMA - (1, 1)
WPBS - STARAN - (1, 256)
WSBP - CDC 6600 - (60, 1)
WPBP - ILLIAC IV - (64, 64)
Parallel Architectural Classification Schemes -11
Parallelism vs pipelining
• Wolfgang Handler has proposed a classification scheme
for identifying the parallelism degree and pipelining
degree built into the hardware structure of a computer
system.
• He considers three subsystem levels:
• Processor Control Unit (PCU)
• Arithmetic Logic Unit (ALU)
• Bit Level Circuit (BLC)
Parallel Architectural Classification Schemes -12
Parallelism vs pipelining
• Each PCU corresponds to one processor or one CPU.
The ALU is equivalent to a processing element (PE). The
BLC corresponds to the combinational logic circuitry
needed to perform 1-bit operations in the ALU.
• Example 
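The slide's example figure is not reproduced here. For reference (standard background, not from the slide), Handler describes a computer C by the triple

T(C) = < K x K', D x D', W x W' >

where K is the number of PCUs, K' the number of PCUs that can be pipelined, D the number of ALUs under the control of one PCU, D' the number of ALUs that can be pipelined, W the word length of an ALU, and W' the number of pipeline stages in the ALU.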
END of CHAPTER 1
CHAPTER 2
Instruction level parallelism
• ILP is exploiting parallelism among instructions.
• Parallelism within a basic block – a straight-line code sequence with
no branches in or out – is quite small. Hence, to obtain substantial
performance enhancements, we must exploit ILP across multiple basic
blocks.
• loop-level parallelism
• common way to increase the ILP among iterations of a loop
• Example 
for (i=0; i<=999; i=i+1)
x[i] = x[i] + y[i];
• Unroll loop statically or dynamically
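A sketch of static unrolling of the loop above (illustrative C; the function name is invented):

void add_vectors(double x[1000], const double y[1000]) {
    /* original: for (i = 0; i <= 999; i = i + 1) x[i] = x[i] + y[i];   */
    for (int i = 0; i <= 996; i += 4) {  /* unrolled by 4; 1000 % 4 == 0 */
        x[i]     += y[i];                /* the four statements are      */
        x[i + 1] += y[i + 1];            /* independent of one another,  */
        x[i + 2] += y[i + 2];            /* so the hardware or compiler  */
        x[i + 3] += y[i + 3];            /* can overlap them             */
    }
}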
Data dependences and Hazards -1
Data Dependences
• Challenges in loop-level parallelism
• Dependent instructions cannot be executed simultaneously
• 3 types of dependencies
• Data dependencies (true data dependencies)
• Name dependencies
• Control dependencies
Data dependences and Hazards - 2
Data Dependences
• Instruction j is data dependent on instruction i if
• Instruction i produces a result that may be used by instruction j, or
• Instruction j is data dependent on instruction k and instruction k is
data dependent on instruction i – this is called a chain of dependences
• A dependence within a single instruction (such as ADDD R1,R1,R1) is not
considered a dependence
• For example 
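The slide's figure is missing; an illustrative C fragment showing a chain of true data dependences:

int chain(int b, int c) {
    int a = b + c;   /* instruction i: produces a                          */
    int d = a * 2;   /* instruction j: uses a -> j is data dependent on i  */
    return d - 1;    /* instruction k: uses d -> by the chain, depends on i */
}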
Data dependences and Hazards - 3
Data Dependences
Data dependences and Hazards - 4
Data Dependences
• If two instructions are data dependent, they must execute in
order and cannot execute simultaneously or be completely
overlapped. The dependence implies that there would be a
chain of one or more data hazards between the two
instructions.
• Dependences are a property of programs.
• The pipeline organization determines whether a given
dependence results in an actual hazard and whether that
hazard actually causes a stall
Data dependences and Hazards - 5
Data Dependences
• Data dependence conveys:
• Possibility of a hazard
• Order in which results must be calculated
• Upper bound on exploitable instruction level parallelism
• A dependence can be overcome in two different ways:
• (1) maintaining the dependence but avoiding a hazard,
• (2) eliminating a dependence by transforming the code
• Dependences that flow through memory locations are more difficult to
detect, since two addresses may refer to the same location but look
different: For example, 100(R4) and 20(R6) may be identical memory
addresses.
Data dependences and Hazards - 6
Name Dependences
• Two instructions use the same register or memory location (“name”),
but there is no flow of data
between these instructions.
• Two types of name dependences between an instruction i that
precedes instruction j in program order:
• Antidependence: instruction j writes a register or memory location that
instruction i reads
• The initial ordering (i before j) must be preserved to ensure that i reads the correct value
• Ex: S.D and DADDIU on register R1
• Output dependence: instruction i and instruction j write the same register or
memory location; the ordering must be preserved so that the value finally
written corresponds to instruction j
Data dependences and Hazards - 7
Name Dependences
• A name dependence is not a true dependence; the instructions
involved in a name dependence can execute simultaneously or
be reordered. To resolve name dependences, use register renaming techniques
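A small illustrative sketch (assumed example) of an antidependence removed by renaming:

void rename_example(int a, int b, int c, int d) {
    int t  = a + b;    /* i: reads a                                      */
    int a1 = c * d;    /* j: renamed write (was "a = c * d")              */
    /* with j's write renamed to a1, i and j no longer share a name and   */
    /* may execute simultaneously or be reordered; later uses of the old  */
    /* a are rewritten to read a1                                         */
    (void)t; (void)a1;
}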
Data Hazards
• A hazard exists whenever there is a name or data dependence
between instructions
• 3 types of data hazards
Data dependences and Hazards - 8
• 3 types of data hazards (with instruction i preceding instruction j):
• RAW (read after write): j tries to read a source before i writes it; the most
common hazard, corresponding to a true data dependence.
• WAW (write after write): j tries to write an operand before it is written by i;
corresponds to an output dependence.
• WAR (write after read): j tries to write a destination before it is read by i;
corresponds to an antidependence.
Data dependences and Hazards - 9
Control Dependences
• A control dependence determines the ordering of an instruction i
with respect to a branch instruction, so that instruction i is
executed in correct program order and only when it should be.
• Example 
• S1 is control dependent on p1, and S2 is control
dependent on p2 but not on p1.
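The classic form of this example, reconstructed in C (S1 and S2 stand for arbitrary statements):

void control_dep(int p1, int p2, int *s1, int *s2) {
    if (p1) { *s1 = 1; }   /* S1: control dependent on p1                */
    if (p2) { *s2 = 1; }   /* S2: control dependent on p2 but not on p1  */
}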
Data dependences and Hazards - 10
Control Dependences
• two constraints are imposed by control dependences
1. An instruction that is control dependent on a branch cannot
be moved before the branch so that its execution is no longer
controlled by the branch.
2. An instruction that is not control dependent on a branch
cannot be moved after the branch so that its execution is
controlled by the branch.
Data dependences and Hazards - 11
• Example 1: the OR instruction is dependent on the DADDU and DSUBU
    DADDU R1,R2,R3
    BEQZ  R4,L
    DSUBU R1,R1,R6
L:  …
    OR    R7,R1,R8

• Example 2:
    DADDU R1,R2,R3
    BEQZ  R12,skip
    DSUBU R4,R5,R6
    DADDU R5,R4,R9
skip:
    OR    R7,R8,R9
• Assume R4 isn’t used after skip
• Possible to move DSUBU before the branch
Pipelining, Principles of Linear pipelining -1
Pipelining: an overlapped parallelism
• Pipelining is similar to the concept of assembly lines in an
industrial plant
Pipelining, Principles of Linear pipelining -2
Pipelining: an overlapped parallelism
• To achieve pipelining
• Divide the process into a sequence of subtasks, which can be processed
concurrently.
• Successive tasks are streamed into the pipe and executed in an overlapped
fashion.
Pipelining, Principles of Linear pipelining -3
Principles of Linear Pipelining
• Consider assembly lines in industrial plants as an example
• With more assembly lines, productivity increases
• Stations should have equal speeds; otherwise the slowest station becomes a
bottleneck, or congestion due to improper buffering may leave many stations
idle waiting for a new task.
• The precedence relation of a set of subtasks {T1, T2,…, Tk} for
a given task T implies that a subtask Tj cannot start until
some earlier subtask Ti finishes.
• The interdependencies of all subtasks form the precedence
graph.
Pipelining, Principles of Linear pipelining -4
Principles of Linear Pipelining
• With a linear precedence relation, subtask Tj cannot start until all
earlier subtasks {Ti} (i < j) finish.
• A linear pipeline can process subtasks with a linear precedence
graph.
Basic Linear pipeline
[Figure: a cascade of pipeline stages S1, S2, …, Sk separated by latches L]
• L: latches, the interface between different stages of the pipeline
• S1, S2, etc.: pipeline stages
Pipelining, Principles of Linear pipelining -5
• Basic Linear pipeline
• It consists of a cascade of processing stages.
• Stages: pure combinational circuits performing arithmetic or
logic operations over the data flowing through the pipe.
• Stages are separated by high speed interface latches.
• Latches: fast registers holding intermediate results between
stages
• Information flow is under the control of a common clock
applied to all latches
Pipelining, Principles of Linear pipelining -6
• Basic Linear pipeline
Pipelining, Principles of Linear pipelining -7
• Basic Linear pipeline
• The figure above shows the space-time diagram of a 4-stage pipeline
processor.
• A linear pipeline with k stages can process n tasks in
k + (n - 1) clock periods.
• k cycles are used to fill up the pipeline, i.e. to complete the execution of the first task.
• n-1 further cycles complete the remaining n-1 tasks.
• In a non-pipelined processor the same work takes n x k cycles.
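A sketch of the corresponding space-time diagram for k = 4 stages and n = 5 tasks, which finishes in k + (n − 1) = 8 clock periods:

Clock:   1    2    3    4    5    6    7    8
S1:     T1   T2   T3   T4   T5
S2:          T1   T2   T3   T4   T5
S3:               T1   T2   T3   T4   T5
S4:                    T1   T2   T3   T4   T5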
Pipelining, Principles of Linear pipelining -8
• Basic Linear pipeline
• The maximum speedup is never fully achieved because of data
dependencies, interrupts and other factors.
• Understanding operational principles of pipeline
computation
• Design of a pipelined floating point adder
• Constructed with 4 functional stages
• Input is two normalized floating point numbers, a x 2^p and b x 2^q
• a, b are fractions (mantissas) and p, q are their exponents
• Purpose is to compute the sum
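As a worked form of what the stages compute (the four-stage split below is the common textbook decomposition and is assumed here), for p ≥ q:

a \cdot 2^{p} + b \cdot 2^{q} = \left(a + b \cdot 2^{-(p-q)}\right) \cdot 2^{p}

• S1: compare the two exponents and find the difference p − q
• S2: shift fraction b right by p − q bit positions to align the fractions
• S3: add the aligned fractions
• S4: normalize the result and adjust the exponent accordingly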
Pipelining, Principles of Linear pipelining -9
• The figure shows how these four stages are connected by latches to form the adder pipeline
Classification of pipeline processors - 1
• Händler (1977) proposed the following classification scheme based
on levels of processing.
• Arithmetic pipelining – the arithmetic logic units of a computer can be
segmented for pipelined operations in various data formats
• Ex  the 4-stage pipes used in the Star-100 computer
• Instruction pipelining – the execution of a stream of instructions can be
pipelined by overlapping the execution of the current instruction with the fetch,
decode and operand fetch of subsequent instructions. Also known as
instruction lookahead
• Ex  all modern computers support this type
• Processor pipelining – pipelined processing of the same data stream by a
cascade of processors, each of which processes a specific task. It is not yet
widely accepted.
Classification of pipeline processors - 2
Classification of pipeline processors - 3
• Ramamoorthy and Li (1977) proposed the following classification
scheme based on pipeline configuration and control strategies.
• Uni-function v/s Multi-function Pipelines
• A pipeline unit with a fixed and dedicated function is called uni-functional.
• Example: Cray-1 (Supercomputer - 1976)
• It has 12 uni-functional pipeline units for various scalar, vector, fixed-point and
floating-point operations.
• A multifunction pipe may perform different functions, either at different times or at the
same time, by interconnecting different subsets of stages in the pipeline.
• Example: TI-ASC (Supercomputer - 1973), which has four multifunction pipes

• Static v/s Dynamic Pipelines
• A static pipeline may assume only one functional configuration at a time
• It can be either unifunctional or multifunctional
• Static pipelines are preferred when instructions of the same type are to be executed
continuously
Classification of pipeline processors - 4
• A dynamic pipeline permits several functional configurations to exist simultaneously
• A dynamic pipeline must be multi-functional
• The dynamic configuration requires more elaborate control and sequencing
mechanisms than static pipelining
• Scalar v/s Vector Pipelines
• A scalar pipeline processes a sequence of scalar operands under the control of a DO loop
• Instructions in a small DO loop are often prefetched into the instruction buffer.
• The required scalar operands are moved into a data cache to continuously supply
the pipeline with operands
• Example: IBM System/360 Model 91
• Vector pipelines are specially designed to handle vector instructions over vector operands.
• Computers having vector instructions are called vector processors.
• The handling of vector operands in vector pipelines is under firmware and hardware
control.
• Example: Cray-1
General Pipelines and Reservation tables
• Pipeline with
• feedback connections – outputs of the linear pipeline
are fed back as future inputs.
• feedforward connections – outputs are fed forward
to non-adjacent successive stages.
• A pipeline with feedback may have a nonlinear flow
of data
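An illustrative reservation table (an assumed example, not taken from the slides) for a 3-stage pipeline with feedback, evaluating one function in five clock cycles; rows are stages, columns are clock cycles, and an X marks when a stage is busy:

        t1   t2   t3   t4   t5
S1      X                   X
S2           X         X
S3                X

The repeated Xs in rows S1 and S2 come from the feedback connections: data revisits earlier stages before the result leaves the pipe. Such tables are used to derive the latencies at which new tasks may or may not be initiated.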
CHAPTER 3
STRUCTURES AND ALGORITHMS
FOR ARRAY PROCESSORS
SIMD Computer Organizations - 1
• A synchronous array of parallel processors is called an array
processor.
• Consists of
• a Control Unit (CU)
• multiple Processing Elements (PEs)
• Array processors are also called SIMD computers.
• There can be two slightly different configurations as shown in the
diagram
SIMD Computer Organizations - 2
• This configuration is structured
with N synchronized PEs, all
under the control of one CU.
• each PE is ALU with attached
registers and local memory PEM
for data storage.
• CU also has its own memory –
CU memory
• CU is the main control point for
all programs
SIMD Computer Organizations - 3
• Working principle
• User programs are loaded into the CU memory
• The CU decodes each instruction and determines whether it is a
scalar, control-type or vector instruction
• Scalar and control-type instructions are executed directly inside the CU itself
• Vector instructions are broadcast to the PEs for parallel execution
• All PEs perform the same function synchronously under the control of the CU
• Vector operands are distributed to the PEMs before parallel execution by the array of PEs
• Masking schemes are used to control the status of each PE during the execution of
vector instructions
• PEs can be in an active or disabled state; only the PEs that are
required are enabled, by using masking
• Data exchange happens through the inter-PE communication network
• The CU directly supervises the execution of programs
SIMD Computer Organizations - 4
• This configuration differs from
configuration 1 in two aspects
• Local memories attached to PEs are
replaced by parallel memory modules
shared by all PEs through Alignment
network.
• Inter PE connection network is
replaced by Inter PE alignment
network.
• Alignment network allows conflict
free resource sharing by all PEs.
SIMD Computer Organizations - 5
• Formally, an SIMD computer is characterized by the 4-tuple
C = <N, F, I, M>, where
• N is the number of PEs in the machine
• F is the set of data-routing functions provided by the interconnection network
• I is the instruction set (scalar, vector, data-routing and masking instructions)
• M is the set of masking schemes that divide the PEs into enabled and disabled subsets
Masking and Data routing mechanisms - 1
• Here each PE is a processor with its
own memory PEM
• a set of working registers and flags ,
namely A, B, C and S
• an ALU
• a local index register I
• an address register D and a data
routing register R.
• The R of each PE is connected to the R
of other PE via the interconnection
n/w.
• When data transfer among PEs occur
contents of R registers are being
transferred.
Masking and Data routing mechanisms - 2
• The D register is used to hold the m-bit address of the PE.
• Each PE is either in active or inactive mode during an instruction cycle.
• If a PE is active, it executes the instruction broadcast to it by the
CU; otherwise it does not.
• Masking schemes are used to set the status flag S of each PE. S=1
indicates an active PE and S=0 an inactive PE.
• The CU holds a global index register I and a masking register M. If M has N
bits, then the ith bit of M is denoted Mi.
• The collection of Si flags for i=0,1,2,…,N-1 forms a status register
S for all the PEs.
Masking and Data routing mechanisms - 3
• Write down one example from text book (example 5.1 or
example 5.2)
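Not the textbook's Example 5.1/5.2, but a generic sketch of how masking works: to execute a vector operation only where A > 0, each PE first sets its status flag, and then only the active PEs execute the broadcast instruction.

#include <stddef.h>

#define N 64                      /* number of PEs (illustrative) */

/* simulate one masked SIMD step: B = A, only in PEs where A > 0 */
void masked_copy(const int A[N], int B[N], int S[N]) {
    for (size_t i = 0; i < N; i++)        /* set status flags S_i  */
        S[i] = (A[i] > 0);
    for (size_t i = 0; i < N; i++)        /* broadcast instruction */
        if (S[i])                         /* only active PEs (S=1) */
            B[i] = A[i];                  /* execute it            */
}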
Inter PE communications -1
• Network design is the fundamental decision in determining
appropriate architecture of an interconnection network for an
SIMD machine.
• Decisions are to be made between
• Operation mode
• Control strategies
• Switching methodologies
• Network topologies
Inter PE communications -2
• Operation Mode
• 2 types of communication: synchronous and asynchronous.
• Synchronous communication is needed to establish communication paths
synchronously, either for a data manipulation function or for a data/instruction broadcast.
• Asynchronous communication is needed for multiprocessing, in which connection
requests are issued dynamically.
• Control Strategy
• A typical interconnection n/w consists of a no. of switching elements and
interconnecting links.
• Interconnection functions are realized by properly setting control of the switching
elements.
Inter PE communications -3
• Switching Methodologies
• circuit switching and packet switching.
• circuit switching: physical path is established between source and destination.
• packet switching: data is put in a packet and routed through the interconnection
n/w without establishing a physical connection path.
• Network Topologies
• A network is depicted by a graph in which nodes represent switching points and
edges represent communication links.
• Topologies are grouped into 2 categories: static and dynamic.
• Static topology: links between 2 processors are passive, and dedicated buses
cannot be reconfigured for direct connections to other processors.
• Links in the dynamic category can be reconfigured by setting the n/w's active
switching elements.
SIMD Interconnection Networks
Static vs. Dynamic networks
• Topological structure of SIMD array processor is characterized
by the data-routing n/w used in interconnecting PEs.
• Static networks
• Topologies in the static n/w s can be
classified according to the dimensions (1D,
2D, 3D and hypercube) required for layout.
1D : linear array
2D: ring, star, tree, mesh, and systolic array
3D: Completely connected
SIMD Interconnection Networks
Static vs. Dynamic networks
• Dynamic networks
• 2 classes  Single stage v/s multistage
Single Stage & Multi Stage networks
• Single stage n/w: A single stage n/w is a
switching n/w with N i/p selectors (IS) and N o/p
selectors (OS).
• Each IS is a 1-to-D demultiplexer and each OS is an M-to-1
multiplexer, where 1<=D<=N and 1<=M<=N.
• The crossbar switching n/w is a single stage n/w with
D=M=N.
• To establish a connecting path, different path
control signals will be applied to all IS and OS.
[Figure: conceptual view of a single stage interconnection network]
SIMD Interconnection Networks
Single Stage & Multi Stage networks
• Single stage networks also called recirculating networks.
• Data items may have to recirculate through the single stage several times
before reaching their final destinations.
• The number of recirculations needed depends on the connectivity in the
single stage n/w.
• Multi stage n/w:
• Many stages of interconnected switches form a multistage SIMD network.
• Multistage n/ws are described by 3 characterizing features
• switch box
• network topology
• control structure.
SIMD Interconnection Networks
Multi Stage networks
• Many switch boxes are used in a
multistage n/w.
• Each box is an interchange device with 2
i/ps and 2 o/ps.
• 4 states of switch box
• Straight
• Exchange
• upper broadcast
• lower broadcast.
• A 2 function switch box can assume either
the straight or the exchange states.
• A 4 function switch box can be in any one
of the 4 states.
SIMD Interconnection Networks
Multi Stage networks
• A multistage n/w is capable of connecting an arbitrary i/p terminal to an
arbitrary o/p terminal.
• A multistage n/w can be 1-sided or 2-sided. A 1-sided n/w, called a full
switch, has i/p-o/p ports on the same side.
• A 2-sided multistage n/w usually has an i/p side and an o/p side, and can be
divided into 3 classes:
• Blocking
• Rearrangeable
• Nonblocking
SIMD Interconnection Networks
Multi Stage networks
• Blocking:
• Simultaneous connections of more than
one terminal pair may result in conflicts in
the use of n/w communication links.
• e.g.: data manipulator, omega, flip, n-cube
and baseline networks.
• Rearrangeable:
• A n/w that can perform all possible connections
between i/ps and o/ps by rearranging its
existing connections, so that a connection
path for a new i/p-o/p pair can always be
established.
• e.g.: the Benes n/w
SIMD Interconnection Networks
Multi Stage networks
• NonBlocking:
• N/w which can handle all possible
connections without blocking is called a
nonblocking n/w.
• 2 cases
• First  Clos n/w, a one to one connection is
made between an i/p and o/p.
• Second  one to many connections.
SIMD Interconnection Networks
Multi Stage networks
• A multistage n/w consists of n stages, where N = 2^n is the no. of i/p and o/p
lines. Therefore each stage may use N/2 switch boxes.
• The interconnection patterns from stage to stage determine the n/w
topology.
• Each stage is connected to the next stage by at least N paths.
• Control structure of a network determines how the states of the switch
boxes will be set. There are 2 types
• Individual stage control uses the same control signal to set all switch boxes in the
same stage.
• Individual box control a separate control signal is used to set the state of each
switch box.
Mesh Connected Illiac Network
• ILLIAC - Illinois Automatic Computer was a series of supercomputers.
• Single stage recirculating n/w is implemented in Illiac-IV array processor
with N=64 PEs.
• Each PEi is allowed to send data to any one of PEi+1, PEi-1, PEi+r, and PEi-r,
where r = √N, in one circulation step through the n/w.
• Illiac n/w is characterized by the following 4 routing functions:
• R+1(i) = (i+1)mod N
• R-1(i) = (i-1)mod N
• R+r(i) = (i+r)mod N
• R-r(i) = (i-r)mod N
• Where 0<=i<=N-1. N is commonly a perfect square
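A minimal C sketch of these routing functions (illustrative; N = 64 and r = 8, as on the Illiac-IV):

#include <stdio.h>

#define N 64            /* number of PEs; a perfect square */
#define R 8             /* r = sqrt(N)                     */

/* the four Illiac routing functions; i is a PE address    */
int r_plus1(int i)  { return (i + 1) % N; }
int r_minus1(int i) { return (i - 1 + N) % N; }   /* +N keeps the result non-negative */
int r_plusr(int i)  { return (i + R) % N; }
int r_minusr(int i) { return (i - R + N) % N; }

int main(void) {
    int i = 0;
    printf("PE%d routes to PE%d, PE%d, PE%d, PE%d\n",
           i, r_plus1(i), r_minus1(i), r_plusr(i), r_minusr(i));
    return 0;   /* prints: PE0 routes to PE1, PE63, PE8, PE56 */
}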
Mesh Connected Illiac Network
• The real Illiac n/w has a similar structure, except
larger in size
• The o/ps of ISi are connected to the i/ps of
OSj for j=i+1, i-1, i+r, i-r. On the other hand,
OSj gets i/ps from ISi for i=j-1, j+1, j-r and j+r.
• Each PEi is directly connected to its 4 nearest
neighbours in the mesh n/w.
• Other topics in chapter 3 - TBD
CHAPTER 5
MULTIPROCESSORS AND
THREAD LEVEL PARALLELISM
SYMMETRIC MULTIPROCESSORS [SMP]
1. SMP Introduction
2. Organization
3. Multiprocessor Operating System Design Considerations
SYMMETRIC MULTIPROCESSOR Introduction
SMP is a stand-alone computer with the following characteristics:
• Two or more similar processors of comparable capacity
• Processors share the same memory and I/O facilities
• Processors are connected by a bus or other internal connection
• Memory access time is approximately the same for each processor
• All processors share access to I/O devices
• Either through the same channels or through different channels giving paths to
the same devices
• All processors can perform the same functions (hence “symmetric”)
• The system is controlled by an integrated operating system
• Provides interaction between processors and their programs at the job, task,
file and data element levels
SYMMETRIC MULTIPROCESSOR Introduction
• Advantages of SMP
• Performance
• If the work can be done in parallel, then a system with multiple processors will yield greater
performance than one with a single processor of the same type, as illustrated by the
multiprogramming vs. multiprocessing figure on the next slide
• Availability
• In SMP, failure of a single processor does not halt the machine; the system continues to
function as the other processors remain available.
• Incremental growth
• A user can enhance the performance of a system by adding an additional processor
• Scaling
• Vendors can offer a range of products with different price and performance characteristics
based on the number of processors configured in the system.
SYMMETRIC MULTIPROCESSOR Introduction
Multiprogramming and Multiprocessing
SMP Organization
• This diagram shows general organization
of Multi Processor System
• There are two or more processors. Each
processor is self-contained, including a
control unit, ALU, registers, and, typically,
one or more levels of cache.
• Each processor has access to a shared
main memory and the I/O devices
through some form of interconnection
mechanism.
• The processors can communicate with
each other through memory
SMP Organization
• This diagram shows general organization of
Multi Processor System with time shared bus.
• The structure and interfaces are basically the
same as for a single-processor system that
uses a bus interconnection. The bus consists
of control, address, and data lines.
• Features provided to facilitate DMA transfers
from I/O subsystems to processors:
• Addressing – using addresses, the bus must be able to
distinguish the source and destination of data
• Arbitration - Any I/O module can temporarily
function as “master”
• Time sharing - When one module is controlling the
bus, other modules are locked out and must, if
necessary, suspend operation until bus access is
achieved
SMP Organization
• The bus organization has several attractive features
• Simplicity -Simplest approach to multiprocessor organization
• Flexibility - Generally easy to expand the system by attaching more processors to
the bus
• Reliability -The bus is essentially a passive medium and the failure of any attached
device should not cause failure of the whole system
• Disadvantages of the bus organization
• Main drawback is performance
• All memory references pass through the common bus
• Performance is limited by bus cycle time
• Each processor should have cache memory
• Reduces the number of bus accesses
• Leads to problems with cache coherence
• If a word is altered in one cache, it could conceivably invalidate a word in another cache.
To prevent this, the other processors must be alerted that an update has taken place
• Typically addressed in hardware rather than by the operating system
Multiprocessor Operating System Design
Considerations
• Simultaneous concurrent processes
• OS routines need to be reentrant to allow several processors to
execute the same OS code simultaneously
• OS tables and management structures must be managed properly
to avoid deadlock or invalid operations
• Scheduling
• Any processor may perform scheduling so conflicts must be avoided
• Scheduler must assign ready processes to available processors
• Synchronization
• With multiple active processes having potential access to shared
address spaces or I/O resources, care must be taken to provide
effective synchronization
• Synchronization is a facility that enforces mutual exclusion and
event ordering
Multiprocessor Operating System Design
Considerations
• Memory management
• In addition to dealing with all of the issues found on uniprocessor
machines, the OS needs to exploit the available hardware
parallelism to achieve the best performance
• Paging mechanisms on different processors must be coordinated to
enforce consistency when several processors share a page or
segment and to decide on page replacement
• Reliability and fault tolerance
• OS should provide graceful degradation in the face of processor
failure
• Scheduler and other portions of the operating system must
recognize the loss of a processor and restructure accordingly
Cache Coherence
• What is the cache coherence problem?
• Multiprocessor systems have one or two
levels of cache associated with each
processor to achieve reasonable
performance.
• Multiple caches often create a cache
coherence problem.
• Cache coherence problem  multiple copies of the
same data can exist in different caches
simultaneously, and if processors are
allowed to update their own copies freely,
an inconsistent view of memory can result.
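A minimal trace of the problem (assumed values):

1. P1 reads x  -> P1's cache holds x = 0
2. P2 reads x  -> P2's cache holds x = 0
3. P1 writes x = 1 (write-back: only P1's cache is updated)
4. P2 reads x and still sees x = 0  -> inconsistent view of memory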
Cache Coherence
Cache Coherence – software solutions
• Attempt to avoid the need for additional hardware circuitry and logic by
relying on the compiler and operating system to deal with the problem
• Attractive because the overhead of detecting potential problems is
transferred from run time to compile time, and the design complexity is
transferred from hardware to software
• However, compile-time software approaches generally must make
conservative decisions, leading to inefficient cache utilization
Cache Coherence – Hardware solutions
• Generally referred to as cache coherence protocols
• These solutions provide dynamic recognition at run time of potential
inconsistency conditions
• Because the problem is only dealt with when it actually arises there is
more effective use of caches, leading to improved performance over a
software approach
• Approaches are transparent to the programmer and the compiler,
reducing the software development burden
• Can be divided into two categories:
• Directory protocols
• Snoopy protocols
Hardware solutions - Directory protocol
• Collect and maintain information about copies of data in caches
• The directory is stored in main memory
• Requests are checked against the directory; appropriate transfers are
performed
• Creates a central bottleneck
• Effective in large-scale systems with complex interconnection schemes
Hardware solutions – Snoopy Protocol
• Distribute the responsibility for maintaining cache coherence among all
of the cache controllers in a multiprocessor
• A cache must recognize when a line that it holds is shared with other caches
• When updates are performed on a shared cache line, it must be announced to
other caches by a broadcast mechanism
• Each cache controller is able to “snoop” on the network to observe these
broadcast notifications and react accordingly
• Suited to bus-based multiprocessor because the shared bus provides a
simple means for broadcasting and snooping
• Two basic approaches have been
explored:
• Write invalidate
• Write update (or write broadcast)
Snoopy Protocol - Write invalidate
• Multiple readers, but only one writer at a time
• When a write is required, all other cached copies of the line are invalidated
• Writing processor then has exclusive (cheap) access until line is required
by another processor
• Most widely used in commercial multiprocessor systems such as the
Pentium 4 and PowerPC
• State of every line is marked as modified, exclusive, shared or invalid
• For this reason the write-invalidate protocol is called MESI
Snoopy Protocol - Write Update
• Can be multiple readers and writers
• When a processor wishes to update a shared line the word to be updated
is distributed to all others and caches containing that line can update it
• Some systems use an adaptive mixture of both write-invalidate and write-
update mechanisms
MESI Protocol - 1
• To reduce bus transactions, add an exclusive state
• Exclusive state indicates that only this cache has clean copy
• Distinguish between an exclusive clean and an exclusive modified state
• A block in the exclusive state can be written without accessing the bus
• M: Modified
• Only this cache has copy and is modified
• Main memory copy is stale
• E: Exclusive or exclusive-clean
• Only this cache has copy which is not modified
• Main memory is up-to-date
• S: Shared
• More than one cache may have copies, which are not modified
• Main memory is up-to-date
• I: Invalid
MESI Protocol - 2
States:
• Invalid
• (Valid) Exclusive (clean, only copy)
• Shared (clean, possibly other copies)
• Modified (modified, only copy)

MESI Cache Line States Table (reconstructed from the state definitions above):

                           M (Modified)  E (Exclusive)  S (Shared)  I (Invalid)
Cache line valid?          Yes           Yes            Yes         No
The memory copy is…        out of date   valid          valid       —
Copies in other caches?    No            No             Maybe       Maybe

• The table above summarizes the meaning of the four states
MESI Protocol - 3
• This picture displays state
diagram for the MESI
protocol
• Figure 17.6a shows the
transitions that occur due to
actions initiated by the
processor attached to this
cache.
• Figure 17.6b shows the
transitions that occur due to
events that are snooped on
the common bus.
• If the next event is from the
attached processor, then the
transition is dictated by
Figure 17.6a
• if the next event is from the
bus, the transition is
dictated by Figure 17.6b
MESI Protocol - 4
• The transitions in more detail (standard MESI behaviour):
• WRITE MISS – the processor issues a read-with-intent-to-modify on the bus, loads
the line and marks it Modified; other cached copies are invalidated (a cache
holding a Modified copy writes it back first).
• WRITE HIT – in the Modified or Exclusive state the write proceeds locally and the
state becomes Modified; in the Shared state an invalidate is broadcast first,
then the line is written and marked Modified.
• READ MISS – the processor issues a bus read; the line is loaded as Exclusive if no
other cache holds it, otherwise as Shared (a Modified copy elsewhere is written
back first).
• READ HIT – the read is satisfied from the local cache with no state change.
Multithreading and Chip Multiprocessors
• The most important measure of performance for a processor is the rate
at which it executes instructions.
• This can be expressed as MIPS rate = f * IPC
• f is the processor clock frequency, in MHz
• IPC (instructions per cycle) is the average number of instructions executed per
cycle
• Hence Increase performance by increasing clock frequency and
increasing instructions that complete during cycle
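A quick worked example with assumed numbers: a 2,000 MHz clock and an average IPC of 1.5 give a MIPS rate of 2,000 × 1.5 = 3,000 MIPS.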
• Multithreading
• Allows for a high degree of instruction-level parallelism without increasing circuit
complexity or power consumption
• Instruction stream is divided into several smaller streams, known as threads, that
can be executed in parallel
Multithreading and Chip Multiprocessors
• Some definitions to understand further
• Process
• Resource ownership
• Scheduling/execution
• Process switch
• Thread
• Thread switch
• explicit multithreading
• Concurrently execute instructions from different explicit threads
• Interleave instructions from different threads on shared pipelines or parallel
execution on parallel pipelines
• Implicit multithreading
• concurrent execution of multiple threads extracted from single sequential
program
• Implicit threads defined statically by compiler or dynamically by hardware
Approaches to Explicit Multithreading
• Interleaved (fine-grained)
• Processor deals with two or more thread contexts at a time
• Switching thread at each clock cycle
• If a thread is blocked it is skipped
• Blocked (coarse-grained)
• Thread executed until an event causes delay
• Effective on an in-order processor
• Avoids pipeline stall
• Simultaneous (SMT)
• Instructions are simultaneously issued from multiple threads to the
execution units of a superscalar processor
• Chip multiprocessing
• Processor is replicated on a single chip
• Each processor handles separate threads
• Advantage is that the available logic area on a chip is used effectively
Clusters
• Clustering is an alternative to SMP as an approach to providing high
performance and high availability
• Particularly attractive for server applications
• Defined as:
• A group of interconnected whole computers working together as a unified computing
resource that can create the illusion of being one machine
• (The term whole computer means a system that can run on its own, apart from the
cluster)
• Each computer in a cluster is called a node
• Benefits:
• Absolute scalability - A cluster can have tens, hundreds, or even thousands of machines,
each of which is a multiprocessor.
• Incremental scalability - A cluster is configured in such a way that it is possible to add new
systems to the cluster in small increments.
• High availability – the failure of a single node does not stop the execution of the cluster
• Superior price/performance - it is possible to put together a cluster with equal or greater
computing power than a single large machine, at much lower cost.
Clusters Configurations
• Cluster computers are classified
based on whether the computers in
a cluster share access to the same
disks
• Figure(a) - a two-node cluster in
which the only interconnection is by
means of a high-speed link that can
be used for message exchange to
coordinate cluster activity.
• Figure(b) - there generally is still a
message link between nodes. In
addition, there is a disk subsystem
that is directly linked to multiple
computers within the cluster; the
common disk subsystem is typically a
RAID system
Clusters Configurations
Operating System Design Issues
• Full exploitation of a cluster hardware configuration requires some
enhancements to a single-system operating system
• Failure Management -How failures are managed depends on the
clustering method used
• Two approaches to deal with failure
• Highly available clusters - offer a high probability that all resources will be in
service. If a failure occurs, the queries in progress are lost
• Fault-tolerant clusters - ensure that all resources are always available
• The function of switching applications and data resources over from a
failed system to an alternative system in the cluster is known as failover
• Restoration of applications and data resources to the original system
once it has been fixed is known as failback
Operating System Design Issues
• Load balancing – A cluster requires an effective capability for balancing the
load among available computers
• When the cluster is scaled incrementally, new computers are automatically included
in scheduling
• parallelizing computation - Effective use of a cluster requires executing
software from a single application in parallel
• Three approaches
• Parallelizing compiler - determines at compile time which parts of an application can be
executed in parallel; these are then split off to be assigned to different computers in the
cluster
• Parallelized application - Application written from the outset to run on a cluster and uses
message passing to move data between cluster nodes
• Parametric computing - Can be used if the essence of the application is an algorithm or
program that must be executed a large number of times, each time with a different set of
starting conditions or parameters
Cluster Computer Architecture
• The individual computers
are connected by some
high-speed LAN or switch
hardware.
• Each computer is capable
of operating independently
• a middleware layer of
software is installed to
enable cluster operation
• Middleware provides a unified system image to the user, known as
a single-system image. It is also responsible for providing high availability, by
means of load balancing and responding to failures in individual
components.
Cluster Computer Architecture
Desirable functions of middleware
Clusters Compared to SMP
• Both provide a configuration with multiple processors to support high
demand applications
• Both solutions are available commercially
SMP
• Easier to manage and configure
• Much closer to the original single-processor model for which nearly all
applications are written
• Less physical space and lower power consumption
• Well established and stable
Clustering
• Far superior in terms of incremental and absolute scalability
• Superior in terms of availability
• All components of the system can readily be made highly redundant