
MODULE I

Chapter 1
Parallel Computer Models
TEXT BOOK: KAI HWANG AND NARESH JOTWANI, ADVANCED COMPUTER
ARCHITECTURE (SIE): PARALLELISM, SCALABILITY, PROGRAMMABILITY, MCGRAW
HILL EDUCATION 3/E. 2015
In this chapter…
OBJECTIVE
INTRODUCTION
• THE STATE OF COMPUTING
• MULTIPROCESSORS AND MULTICOMPUTERS
• MULTIVECTOR AND SIMD COMPUTERS
• PRAM AND VLSI MODELS
OBJECTIVE

The main aim of this chapter is to learn about:
• the evolution of computer systems,
• the various attributes on which the performance of a system is measured,
• the classification of computers based on their ability to perform multiprocessing, and
• the various trends towards parallel processing.
INTRODUCTION

From an application point of view, mainstream computer usage is experiencing a trend of four
ascending levels of sophistication:
Data processing
Information processing
Knowledge processing
Intelligence processing
With more and more data structures developed, many users are shifting computer usage from pure
data processing to information processing. A high degree of parallelism has been found at these
levels. As accumulated knowledge bases expanded rapidly in recent years, there grew a strong
demand to use computers for knowledge processing.
Intelligence is very difficult to create; its processing is even more so.
Today's computers are very fast and obedient and have many reliable memory cells, qualifying them
for data, information and knowledge processing.
THE STATE OF COMPUTING
Computer Development Milestones
500 BC: Abacus (China) – The earliest mechanical computer/calculating
device.
• Operated to perform decimal arithmetic with carry propagation digit
by digit
1642: Mechanical Adder/Subtractor (Blaise Pascal)
1827: Difference Engine (Charles Babbage)
1941: First binary mechanical computer
(Konrad Zuse; Germany)
1944: Harvard Mark I (IBM)
• THE STATE OF COMPUTING
o Evolution of computer system
o Elements of Modern Computers
o Flynn's Classical Taxonomy
o System attributes
• Computer Generations
o 1st 2nd 3rd 4th 5th
o Division into generations marked primarily by
changes in hardware and software technologies.
THE STATE OF COMPUTING
Computer Development Milestones
First Generation (1945 – 54)
o Technology & Architecture:
◦ • Vacuum Tubes (a sealed glass tube containing a near-vacuum which allows the free passage of electric current.)
◦ • CPU driven by PC and accumulator (PC = program counter; the accumulator is a single CPU register used to hold intermediate arithmetic results.)
◦ • Fixed Point Arithmetic
o Software and Applications:
• Machine/Assembly Languages (low-level programming languages)
• Single user
• No subroutine linkage (a subroutine is a set of instructions designed to perform a frequently used operation within a program.)
• Programmed I/O using CPU
o Representative Systems: ENIAC, Princeton IAS, IBM 701
THE STATE OF COMPUTING
Computer Development Milestones
• Second Generation (1955 – 64)
o Technology & Architecture:
◦ • Discrete Transistors (A transistor is a semiconductor device used to amplify or switch electronic signals and electrical power.)
◦ • Core Memories
◦ • Floating Point Arithmetic
◦ • I/O Processors
◦ • Multiplexed memory access
o Software and Applications:
• High level languages used with compilers
• Subroutine libraries
• Batch processing monitor
o Representative Systems: IBM 7090, CDC 1604, Univac LARC
THE STATE OF COMPUTING
Computer Development Milestones
• Third Generation (1965 – 74)
o Technology & Architecture:
◦ • IC Chips
◦ • Microprogramming
◦ • Pipelining
◦ • Cache
◦ • Look-ahead processors
o Software and Applications:
• Multiprogramming and Timesharing OS
• Multiuser applications
o Representative Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8
THE STATE OF COMPUTING
Computer Development Milestones
• Fourth Generation (1975 – 90)
o Technology & Architecture:
◦ • LSI/VLSI (LSI was followed by Very Large Scale Integration (VLSI), with hundreds of thousands of
transistors per chip; for the first time an entire CPU could be fabricated on a single integrated
circuit, creating the microprocessor.)
◦ • Semiconductor memories
◦ • Multiprocessors
◦ • Multi-computers
o Software and Applications:
• Multiprocessor OS
• Languages, Compilers and environment for parallel processing
o Representative Systems: VAX 9000, Cray X-MP, IBM 3090
THE STATE OF COMPUTING
Computer Development Milestones
• Fifth Generation (1991 onwards)
o Technology & Architecture:
◦ • Advanced VLSI processors
◦ • Scalable Architectures
◦ • Superscalar processors
o Software and Applications:
• Systems on a chip
• Massively parallel processing
• Grand challenge applications
• Heterogeneous processing
o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE 6480, MPP, VPP500
THE STATE OF COMPUTING
Elements of Modern Computers
• Computing Problems
• Algorithms and Data Structures
• Hardware Resources
• Operating System
• System Software Support
• Compiler Support
Elements of a modern computer
Computing problems: the problems that the computer system is designed to solve.
Algorithms and data structures: special algorithms and data structures are needed to
specify the computations and communications involved in computing problems.
Hardware resources: processors, memory and peripheral devices.
Operating system: manages the allocation and deallocation of resources during the execution of user programs.
System software support: programs are written in a high-level language; the source
code is translated into object code by a compiler.
Compiler support: 3 compiler approaches:
1. Preprocessor: uses a sequential compiler; a preprocessing pass handles the
inclusion of header files, conditional compilation and line control.
2. Precompiler: requires some program flow analysis and dependence checking
towards parallelism detection.
3. Parallelizing compiler: demands a fully developed parallelizing compiler which
can automatically detect parallelism in the source code (see the sketch below).
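OpenMP is not mentioned in the slides; it is used here only as a widely available illustration of
directive-based parallelism in the spirit of approaches 2-3, where the compiler transforms an
annotated sequential loop into parallel code. The file name and data sizes are assumed for
illustration.

```c
/* Minimal directive-based parallelism sketch (illustrative only).
 * Compile with: cc -O2 -fopenmp saxpy.c
 * Without -fopenmp the pragma is ignored and the loop runs sequentially. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float x[N], y[N];
    float a = 2.0f;

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The directive tells the compiler the iterations are independent,
     * so it may distribute them across threads. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        y[i] = a * x[i] + y[i];

    printf("y[0] = %.1f\n", y[0]);   /* expect 4.0 */
    return 0;
}
```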
Evolution of Computer Architecture
SCALAR, SEQUENTIAL
◦ We start with the von Neumann architecture, built as a sequential machine executing scalar data.
◦ The von Neumann architecture is slow due to the sequential execution of instructions in programs.

LOOKAHEAD, PARALLELISM AND PIPELINING

◦ Lookahead techniques were introduced to prefetch instructions in order to overlap I/E (instruction
fetch/decode and execution) operations and to enable functional parallelism (pipelining).
• Multi vector and SIMD computers
o Vector Supercomputer
o SIMD supercomputers
• PRAM and VLSI model
o Parallel Random Access machines
o VLSI Complexity Model
FLYNN’S CLASSIFICATION
The four classifications defined by Flynn are based upon the number of concurrent
instruction (or control) streams and data streams available in the architecture.

SISD: A single control unit (CU) fetches a single instruction stream (IS) from memory; the CU then
generates appropriate control signals to direct a single processing element.
MISD: Multiple instructions operate on one data stream; heterogeneous systems operate on the same
data stream and must agree on the result.
• Multiprocessor and multicomputer,
o Shared memory multiprocessors
o Distributed Memory Multiprocessors
o A taxonomy of MIMD Computers
FLYNN’S CLASSIFICATION

SISD: Instructions are executed sequentially; overlapped execution can be achieved by pipelining or
by multiple functional units.
MIMD: MIMD architectures include multi-core superscalar processors and distributed systems, using
either a shared memory space or a distributed memory space.
MIMD is the most popular model, SIMD is next, and MISD is the least popular model.
Parallel/Vector computers
Execute programs in MIMD mode.
2 major classes:
1. shared-memory multiprocessors (multiple processors with a shared memory).
2. message-passing multicomputers (architecture in which each processor has its own memory).
Each computer node in a multicomputer system has a local memory, unshared with other nodes.
System attributes to performance
Performance depends on a perfect match between machine capability (MC) and program behavior (PB).

Machine capability (MC): can be enhanced with better hardware technology, architectural features
and efficient resource management.
Program behavior (PB): is difficult to predict due to its dependence on the application and on
runtime conditions. Other factors: algorithm design, data structures, language efficiency,
programmer skill and compiler technology.

Program performance is measured by "turnaround time":

the total time taken between the submission of the program for execution and the return of the
complete output.
1) Clock Rate and CPI
The CPU is driven by a clock with a constant cycle time "t".
The size of a program is determined by its instruction count "Ic".
◦ "Ic" is the number of machine instructions to be executed in the program.
Different machine instructions require different numbers of clock cycles to execute.
CPI (cycles per instruction) is an important parameter for measuring the time needed to execute each
instruction.
2) Performance factor
Thus the CPU time needed to execute the program is found as the product of three factors:
T = Ic * CPI * t
◦ T = CPU time
◦ Ic = instruction count
◦ CPI = cycles per instruction
◦ t = processor cycle time
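A minimal worked sketch of this relation in C; the instruction count, CPI and cycle time used here
are assumed example values, not figures from the text.

```c
/* Illustrative only: Ic, CPI and t below are assumed values. */
#include <stdio.h>

int main(void) {
    double Ic  = 2.0e6;   /* instruction count: 2 million instructions (assumed) */
    double CPI = 1.5;     /* average cycles per instruction (assumed) */
    double t   = 2.5e-9;  /* cycle time: 2.5 ns, i.e. a 400 MHz clock (assumed) */

    double T = Ic * CPI * t;              /* CPU time, T = Ic * CPI * t */
    printf("CPU time T = %.4f s\n", T);   /* 2e6 * 1.5 * 2.5e-9 = 0.0075 s */
    return 0;
}
```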
◦ Execution of an instruction requires going through a cycle of events: instruction fetch, decode,
operand(s) fetch, execution and storing of results. Instruction fetch, operand fetch and storing of
results require memory access, while decode and execution are carried out in the CPU.
CPI (cycles per instruction) can be divided into 2 component terms based on processor cycles and
memory cycles.
Depending on the instruction type, the instruction cycle may involve 1-4 memory references:
◦ 1 for instruction fetch,
◦ up to 2 for operand fetch,
◦ 1 to store the result.
Therefore T = Ic * (p + m*k) * t
Ic = instruction count
p = number of processor cycles per instruction
m = number of memory references per instruction
k = ratio of memory-cycle time to processor-cycle time
t = processor cycle time
T = CPU time
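A short continuation of the previous sketch, this time separating processor and memory cycles; p,
m, k and the other values are again assumed for illustration.

```c
/* Illustrative only: p, m and k below are assumed example values. */
#include <stdio.h>

int main(void) {
    double Ic = 2.0e6;   /* instruction count (assumed) */
    double p  = 4.0;     /* processor cycles per instruction (assumed) */
    double m  = 2.0;     /* memory references per instruction (assumed) */
    double k  = 10.0;    /* memory-cycle time / processor-cycle time (assumed) */
    double t  = 2.5e-9;  /* processor cycle time (assumed) */

    double CPI = p + m * k;         /* effective cycles per instruction */
    double T   = Ic * CPI * t;      /* T = Ic * (p + m*k) * t */
    printf("Effective CPI = %.1f, CPU time T = %.4f s\n", CPI, T);
    return 0;
}
```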
3) System Attributes
The 5 performance factors (Ic, p, m, k, t) are influenced by 4 system attributes:
◦ Instruction-set architecture
◦ Compiler technology
◦ CPU implementation and control
◦ Cache and memory hierarchy
Instruction-set architecture: affects Ic and p (processor cycles per instruction).
Compiler technology: affects Ic, p and m (memory references per instruction).
CPU implementation and control: affects p and t (processor cycle time), i.e. the total processor time needed.
Cache and memory hierarchy: affects memory access latency, i.e. k and t.
4) MIPS (million instructions per second) rate
All 4 system attributes (ISA, compiler technology, CPU implementation, cache and memory hierarchy) affect the MIPS rate.
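The slide does not state the formula, but the MIPS rate is conventionally computed as
Ic / (T x 10^6), which is equivalent to f / (CPI x 10^6), where f is the clock rate. A small
sketch with assumed values:

```c
/* Illustrative only: clock rate and CPI are assumed example values. */
#include <stdio.h>

int main(void) {
    double f   = 400.0e6;  /* clock rate: 400 MHz (assumed) */
    double CPI = 1.5;      /* average cycles per instruction (assumed) */

    /* MIPS rate = f / (CPI * 10^6): millions of instructions per second */
    double mips = f / (CPI * 1.0e6);
    printf("MIPS rate = %.1f\n", mips);   /* 400e6 / 1.5e6 = 266.7 */
    return 0;
}
```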

5) FLOPS (floating-point operations per second)

The rate at which floating-point operations are performed.
FLOPS with SI prefixes: Mega (10^6) = MFLOPS,
Giga (10^9) = GFLOPS,
Tera (10^12) = TFLOPS (teraflops),
Peta (10^15) = PFLOPS (petaflops).
6) Throughput Rate
How many programs a system can execute per unit time is called the "system throughput" (Ws).

7) Programming Environment
• Programmability depends on the programming environment provided to the
user.
• We prefer a parallel environment rather than a sequential environment.
• Factors influencing the programming environment are languages, compilers and the OS.
• The OS must be able to manage resources, parallel scheduling, inter-process
communication, synchronization and shared memory allocation.
Two approaches to Parallel Programming
Implicit and explicit parallelism
Implicit parallelism
Languages such as C, C++, Fortran or Pascal are used to write the source program.
In implicit parallelism, success relies on the compiler.
The sequentially coded source program is translated into parallel object code by a parallelizing
compiler.
This compiler must be able to detect parallelism and assign target machine resources.
It relies heavily on the "intelligence" of the parallelizing compiler,
i.e. less effort is required of the programmer.
Two approaches to Parallel Programming
Implicit and explicit parallelism
Explicit parallelism
Explicit parallelism requires more effort by the programmer to develop a source program using
C, C++, Fortran or Pascal.
Parallelism is explicitly specified in the user program (see the sketch below).
This reduces the burden on the compiler to detect parallelism.
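The following is a minimal sketch of explicit parallelism in C using POSIX threads; the array size,
thread count and helper names are illustrative assumptions, not taken from the text. The programmer,
not the compiler, decides how the work is partitioned across threads.

```c
/* Explicit parallelism sketch: compile with cc -O2 -pthread sum.c */
#include <pthread.h>
#include <stdio.h>

#define N        8000000
#define NTHREADS 4

static double data[N];
static double partial[NTHREADS];

struct range { int lo, hi, id; };

static void *sum_range(void *arg) {
    struct range *r = arg;
    double s = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        s += data[i];
    partial[r->id] = s;          /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct range r[NTHREADS];
    int chunk = N / NTHREADS;

    /* The programmer explicitly creates one thread per chunk of the array. */
    for (int i = 0; i < NTHREADS; i++) {
        r[i] = (struct range){ i * chunk, (i + 1) * chunk, i };
        pthread_create(&tid[i], NULL, sum_range, &r[i]);
    }

    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        total += partial[i];
    }
    printf("sum = %.0f\n", total);   /* expect 8000000 */
    return 0;
}
```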
1.2 Multiprocessors and Multicomputers
2 categories of parallel computers:
◦ Shared-memory multiprocessors
◦ Distributed-memory multicomputers

Shared-memory multiprocessors come in 3 types:

◦ the uniform memory access (UMA) model,
◦ the non-uniform memory access (NUMA) model,
◦ the cache-only memory architecture (COMA) model.
Uniform memory access (UMA)
The physical memory is uniformly shared by all the processors.
Each processor may use its own private cache.
Peripherals are also shared in the same fashion.
Multiprocessors are called tightly coupled systems due to the high degree of resource sharing.
The system interconnect takes the form of a common bus, a crossbar switch or a multistage network.
Disadvantage: maintaining cache coherence.
Cache coherence is the uniformity of shared data that ends up stored in multiple local caches.
Uniform memory access
The UMA model is suitable for general-purpose and time-sharing applications.
When all processors have equal access time to all the peripheral devices, the system is called a
symmetric multiprocessor.
In an asymmetric multiprocessor, only one or a subset of the processors are executive-capable.
An executive or master processor can execute the operating system and handle I/O. The
remaining processors have no I/O capability and are therefore called attached processors.
◦ Attached processors execute user code under the supervision of the master processor.
NUMA model
In the NUMA model the access time varies with the location of the memory word. The shared memory is
physically distributed among the processors as local memories.
It is faster for a processor to access its own local memory.
Access to remote memory attached to other processors takes longer due to the added delay of the
interconnection network.
COMA model
Cache only memory architecture
The COMA model is a special case of a NUMA machine, in which the distributed main memories
are converted to caches.
Remote cache access is assisted by the distributed cache directories.
Cache only memory architecture (COMA)
Distributed memory multicomputer
A distributed-memory multicomputer system consists of multiple computers, known as nodes,
interconnected by a message-passing network.
Each node acts as an autonomous computer having a processor, a local memory and sometimes
I/O devices.
All local memories are private and are accessible only to the local processor.
This is why traditional multicomputers are called no-remote-memory-access (NORMA) machines.
A Taxonomy of MIMD Computers
1.3 Multivector and SIMD Computers
Vector Supercomputers
In a vector computer, a vector processor is attached to the scalar processor as an optional feature.
The host computer first loads the program and data into the main memory.
Then the scalar control unit decodes all the instructions.
◦ If the decoded instructions are scalar operations or program control operations, the scalar processor
executes those operations using the scalar functional pipelines.
◦ On the other hand, if the decoded instructions are vector operations, they are sent to the vector
control unit.
[Figure: The Architecture of Vector Supercomputers — 1: host loads program and data, 2: scalar control
unit decodes instructions, 3: scalar operations execute using scalar pipelines, 4: vector operations
are sent to the vector control unit (VCU).]


SIMD Supercomputers
In SIMD computers, N processors are connected to a control unit and all the
processors have their individual memory units.
All the processors are connected by an interconnection network.
Parallel RAM and VLSI Models
The ideal model gives a suitable framework for developing parallel algorithms without
considering the physical constraints or implementation details.
These models can be applied to obtain theoretical performance bounds on parallel computers or
to evaluate VLSI complexity in terms of chip area and operational time before the chip is fabricated.
Parallel Random-Access Machines
Shepherdson and Sturgis (1963) modeled conventional uniprocessor computers as random-
access machines (RAM). Fortune and Wyllie (1978) developed a parallel random-access machine
(PRAM) model for modeling an idealized parallel computer with zero memory access overhead
and zero synchronization overhead.
An N-processor PRAM has a shared memory unit. This shared memory can be centralized or
distributed among the processors. The processors operate on a synchronized read-memory, write-
memory and compute cycle. These models therefore specify how concurrent read and write operations
are handled.
Following are the possible memory update operations:
Exclusive read (ER) − In each cycle, only one processor is allowed to read from any given
memory location.
Exclusive write (EW) − At most one processor is allowed to write into a memory
location at a time.
Concurrent read (CR) − Allows multiple processors to read the same information from the same
memory location in the same cycle.
Concurrent write (CW) − Allows simultaneous write operations to the same memory location. To
avoid write conflicts, some policies are set up (see the sketch below).
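The slide does not name the conflict policies; common conventions include letting an arbitrary or a
priority (lowest-index) processor win, or combining the written values (e.g. by summation). The
sketch below simulates one synchronous write cycle under an assumed priority rule; the structure
and names are illustrative only.

```c
/* CRCW conflict-resolution sketch: lowest-numbered processor wins. */
#include <stdio.h>

#define NPROC 4
#define MEMSZ 8

struct write_req { int active; int addr; int value; };

/* One synchronous write cycle: scan processors in index order, so the
 * lowest-numbered processor writing to a given address wins. */
static void cw_priority_step(int mem[MEMSZ], const struct write_req req[NPROC]) {
    int written[MEMSZ] = {0};
    for (int p = 0; p < NPROC; p++) {
        if (req[p].active && !written[req[p].addr]) {
            mem[req[p].addr] = req[p].value;
            written[req[p].addr] = 1;
        }
    }
}

int main(void) {
    int mem[MEMSZ] = {0};
    /* Processors 0, 1 and 3 all try to write location 2 in the same cycle. */
    struct write_req req[NPROC] = {
        { 1, 2, 10 }, { 1, 2, 20 }, { 0, 0, 0 }, { 1, 2, 30 }
    };
    cw_priority_step(mem, req);
    printf("mem[2] = %d\n", mem[2]);   /* 10: processor 0 wins under the priority rule */
    return 0;
}
```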
VLSI Complexity Model

Parallel computers use VLSI chips to fabricate processor arrays, memory arrays and large-scale
switching networks.
Nowadays, VLSI technologies are two-dimensional. The size of a VLSI chip is proportional to the amount
of storage (memory) space available on that chip.
We can calculate the space complexity of an algorithm from the chip area (A) of the VLSI chip
implementation of that algorithm. If T is the time (latency) needed to execute the algorithm, then
A·T gives an upper bound on the total number of bits processed through the chip (or I/O). For certain
computations, there exists a lower bound f(s), where s is the problem size, such that
A·T² >= O(f(s)),
where A = chip area and T = execution time (latency).
