Examples of Flynn's Classes:

SISD (one functional unit): IBM 701, IBM 1620
SISD (multiple functional units): IBM 360-91, CDC 6600
SIMD (word-slice processing): Illiac IV
SIMD (bit-slice processing): STARAN
MISD: no real embodiment of this class exists
MIMD (loosely coupled): UNIVAC 1100
MIMD (tightly coupled): Burroughs D-825

[Figures: (a) SISD Computer, (b) SIMD Computer, (c) MISD Computer, (d) MIMD Computer -- block diagrams built from control units (CU), processor units (PU) and memory modules (MM), connected by instruction streams (IS) and data streams (DS); the SIMD and MIMD organizations draw their data streams from a shared memory (SM) subsystem.]

Flynn's Architectural Classification Schemes:


It is introduced by M. J. Flynn. It is based on the multiplicity of instruction and data streams in a computer system. The essential computing process is the execution of instructions on a set of data. The term stream denotes a sequence of items (instructions or data) as executed or operated upon by a single processor. Instructions or data are defined with respect to the referenced machine. An instruction stream is a sequence of instructions as executed by the machine; a data stream is a sequence of data, including input, partial, or temporary results, called for by the instruction stream.

The four machine organizations are:


Single instruction stream-single data stream (SISD)
Single instruction stream-multiple data stream (SIMD)
Multiple instruction stream-single data stream (MISD)
Multiple instruction stream-multiple data stream (MIMD)

The categorization depends on the multiplicity of simultaneous events in the system components. Conceptually, only three types of components are needed. Both instructions and data are fetched from memory modules (MM). Instructions are decoded by the control unit (CU), which sends the decoded instruction stream to the processor unit (PU) for execution. Data streams flow between the processors and the memory bi-directionally. Multiple memory modules may be used in the shared memory subsystem. Each instruction stream is generated by an independent control unit. Multiple data streams originate from the subsystem of shared memory modules.

SISD Computer Organization:


This organization represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining). Most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit in it. All the functional units are under the supervision of one control unit.
SIMD Computer Organization:
This class corresponds to array processors. There are multiple processing elements supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules. SIMD machines are further divided into word-slice and bit-slice modes.

MISD Computer Organization:


There are n processor units, each receiving distinct instructions operating over the same data stream
and its derivatives. The results (output) of one processor become the input (operands) of the next processor in
the macro-pipe. This structure has received much less attention and has been challenged as impractical by
some computer architects. No real embodiment of this class exists.

MIMD Computer Organization:


Most multiprocessor systems and multiple computer systems can be classified in this category. An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems. An intrinsic MIMD computer is tightly coupled if the degree of interaction among the processors is high. Otherwise, we consider them loosely coupled. Most commercial MIMD computers are loosely coupled.

Serial Versus Parallel Processing:

Feng has suggested the use of the degree of parallelism to classify various computer architectures. The maximum number of binary digits (bits) that can be processed within a unit time by a computer system is called the maximum parallelism degree P. Let Pi be the number of bits that can be processed within the ith processor cycle (or the ith clock period). Consider T processor cycles indexed by i = 1, 2, ..., T. The average parallelism degree Pa is defined by

Pa = (P1 + P2 + ... + PT) / T

In general Pi <= P; thus we define the utilization rate u of a computer system within T cycles by

u = Pa / P = (P1 + P2 + ... + PT) / (T * P)

If the computing power of the processor is fully utilized (or the parallelism is fully exploited), then we have Pi = P for all i and u = 1 for 100 % utilization. The utilization rate depends on the application program being executed.

A bit slice is a string of bits, one from each of the words at the same vertical bit position. E.g., the TI-ASC has a word length of 64 and 4 arithmetic pipelines. Each pipe has 8 pipeline stages. Thus there are 8 * 4 = 32 bits per bit slice in the 4 pipes.

[Figure: 64-bit word length; four pipelines (Pipeline no. 1 to 4), each with eight 8-bit stages (Stage-1 to Stage-8); 8 bits x 4 = 32 bits = bit-slice length.]

The maximum parallelism degree P(C) of a given computer system C is represented by the product of the word length n and the bit-slice length m, that is,

P(C) = n * m

It is equal to the area of the rectangle defined by the integers n and m.
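A minimal Python sketch of these measures (not from the text): n, m and P(C) use the TI-ASC figures quoted above, while the per-cycle bit counts Pi are hypothetical values chosen only to exercise the formulas.

```python
# Feng's measures: maximum parallelism degree, average degree, utilization.
n, m = 64, 32                  # TI-ASC word length and bit-slice length
P = n * m                      # maximum parallelism degree P(C) = n * m

Pi = [2048, 1024, 2048, 512]   # hypothetical bits processed in each of T cycles
T = len(Pi)

Pa = sum(Pi) / T               # average parallelism degree
u = Pa / P                     # utilization rate; u = 1 when Pi = P for all i

print(P, Pa, u)                # 2048 1408.0 0.6875
```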
NUMBER REPRESENTATION TECHNIQUES:

Unsigned representation

Signed representation

Unsigned Representation:
In this representation all N bits contribute to represent the +ve magnitude of the number. If N = 4 the different combinations are:

0000 => +0
0001 => +1
.
.
1110 => +14
1111 => +15 => +(2^N - 1)

So the corresponding range-specification graph runs from 0 to +(2^N - 1), with +ve overflow beyond +(2^N - 1).

Signed Representation:
In this representation technique the first bit represents the sign (0 => +ve, 1 => -ve) and the remaining (N - 1) bits represent the magnitude part of the number. The main disadvantage of this technique is that though mathematically +0 and -0 are the same, they have different representations here. If N = 4 the combinations are:

0111 => +7 => +(2^(N-1) - 1)
0110 => +6
.
.
0001 => +1
0000 => +0
1000 => -0
1001 => -1
.
.
1110 => -6
1111 => -7 => -(2^(N-1) - 1)

The range-specification graph runs from -(2^(N-1) - 1) to +(2^(N-1) - 1), with -ve overflow and +ve overflow beyond either end.

The above representation technique is also called sign-magnitude representation. To solve the above-mentioned problem we introduce a different representation technique called 2's complement representation.

Q. How can you represent -12 in 8-bit architecture?

In 8 bits (1 sign bit, N - 1 magnitude bits; 0 => +ve, 1 => -ve):

12 = 0000 1100

1's complement:  1111 0011
            +1
2's complement:  1111 0100

This is the same as (1111 1111 + 1) - (0000 1100) = 2^8 - 0000 1100.

So the 2's complement of a number in N-bit architecture = 2^N - number.

If N = 4 then the combinations are:

0111 => +7 => +(2^(N-1) - 1)
0110 => +6
:
0001 => +1
0000 => 0
1111 => -1
1110 => -2
:
1000 => -8 => -(2^(N-1))

So the range-specification graph runs from -(2^(N-1)) to +(2^(N-1) - 1).

Trick 1: A run of consecutive 1s from bit position j down to bit position i has the value 2^(j+1) - 2^i.

bit position: 6 5 4 3 2 1 0
              1 1 1 0 1 1 0 = 64 + 32 + 16 + 4 + 2 = 118
            = (2^7 - 2^4) + (2^3 - 2^1) = (128 - 16) + (8 - 2) = 118

bit position: 6 5 4 3 2 1 0
              1 1 0 1 1 1 1 = 64 + 32 + 8 + 4 + 2 + 1 = 111
            = (2^7 - 2^5) + (2^4 - 2^0) = (128 - 32) + (16 - 1) = 111

Trick 2: 0.111 (base 2) = 1*2^-1 + 1*2^-2 + 1*2^-3
                        = 0.5 + 0.25 + 0.125
                        = 0.875
Also 1 - 2^-3 = 1 - (1/8) = 7/8 = 0.875. So in general: 0.111...1 (n ones) = 1 - 2^-n.

Class Work: Evaluate 1111011 and 1010101 as above.

The number -12 in 8-bit architecture = 2^8 - 12 = 256 - 12 = 244 = 1111 0100.

If we add this to +12 then we should get 0 as the result:

  1111 0100
+ 0000 1100
-----------
1 0000 0000   (the carry out of 8 bits is discarded, leaving 0000 0000 -- proved)
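The rule "2's complement in N bits = 2^N - number" fits in a few lines of Python; this is a sketch, with the helper names twos_complement and decode being our own, checked against the -12 example above.

```python
def twos_complement(value, n):
    """Encode a (possibly negative) integer into an n-bit pattern."""
    return value & ((1 << n) - 1)          # equals 2**n - |value| for value < 0

def decode(bits, n):
    """Interpret an n-bit pattern as a signed 2's-complement integer."""
    return bits - (1 << n) if bits & (1 << (n - 1)) else bits

enc = twos_complement(-12, 8)
print(format(enc, '08b'))                  # 11110100
print(decode(0b11110100, 8))               # -12
print(format((enc + 12) & 0xFF, '08b'))    # 00000000 -- carry out of 8 bits dropped
```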


Floating Point Number Representation:

The general format of a floating point number is S * B^E, where S = significand, B = base, E = exponent.

The main advantage of using a floating point number is that to represent a huge number or a very small number, a floating point number requires only a few bits.

e.g. 0.00000000005 = 0.5 * 10^-10
     50000000000   = 0.5 * 10^11

A floating point number may have several representations:

0.123 * 10^4 = 0.0123 * 10^5 = 1.23 * 10^3

So to have a fixed form before representing a number in the computer's memory we should make the number normalized. The normalization rules are:

1. The integer part should be zero.

2. If the number is 0.d1 d2 ... dn * B^E then d1 > 0, and all di >= 0 for i >= 2.

So using this rule, 0.123 * 10^4 is the only normalized form.


There are two representations of floating point numbers:

1. Single precision (32 bit) and

2. Double precision (64 bit).

[Figure: 32-bit format -- 1 sign bit for the significand (0 => +ve, 1 => -ve), an 8-bit biased exponent (128 will be added to the exponent, so the range -128 to +127 is shifted to 0 to 255, the +ve half only), and a 23-bit truncated significand.]

[Range graph: -ve overflow below -(1 - 2^-24) * 2^127; -ve underflow between -0.5 * 2^-128 and 0; +ve underflow between 0 and +0.5 * 2^-128; +ve overflow above +(1 - 2^-24) * 2^127.]

Example of Floating Point Number Representation:

+0.110111 * 2^(+100101) => sign bit 0
+0.110111 * 2^(-100101) => sign bit 0
-0.110111 * 2^(+100101) => sign bit 1
-0.110111 * 2^(-100101) => sign bit 1

(8 bits for the exponent, 23 bits for the significand)

IEEE 754 Floating Point Format:

Single precision (SP), 32 bit:  x = (-1)^s * 2^(e-127) * (1.m)
Double precision (DP), 64 bit:  x = (-1)^s * 2^(e-1023) * (1.m)

[32-bit layout: sign (s, 1 bit) | exponent (e, 8 bits) | mantissa (m, 23 bits)]

Here the base (r) = 2, the exponent e is in excess-127, and the mantissa has a hidden 1, thus denoting 1.m.

e   | m    | Inference
255 | != 0 | NaN (not a number) (divide by 0, square root of a -ve number)
255 | 0    | infinity, x = (-1)^s * infinity; +infinity and -infinity are distinguished
0   | != 0 | x = (-1)^s * 2^-126 * (0.m)  (denormalized)
0   | 0    | zero, x = (-1)^s * 0; again +0 and -0 are both possible

Problem:

40400000H = 0100 0000 0100 0000 0000 0000 0000 0000
          = 0 | 1000 0000 | 100 0000 0000 0000 0000 0000

s = 0, e = 1000 0000 = 128, 1.m = 1.1 (base 2) = 1.5

So x = (-1)^0 * 2^(128-127) * 1.5 = 3 (base 10)

Class Work:

A) Evaluate 40A00000H.
B) Express 10 (base 10) in IEEE 754 floating point format.

40A00000H = 0100 0000 1010 0000 0000 0000 0000 0000
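As a cross-check of the format just described, here is a hedged Python sketch that decodes a 32-bit word by the normalized-case rule x = (-1)^s * 2^(e-127) * (1.m); the special cases of the table above (e = 0 or 255) are deliberately ignored, and the function name is ours.

```python
import struct

def decode_ieee754(word):
    s = (word >> 31) & 0x1          # sign bit
    e = (word >> 23) & 0xFF         # biased exponent
    m = word & 0x7FFFFF             # 23-bit mantissa field
    return (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23)  # normalized case only

print(decode_ieee754(0x40400000))   # 3.0, as worked out above
print(decode_ieee754(0x40A00000))   # 5.0, the class-work value
# library cross-check of the same bit pattern:
print(struct.unpack('>f', (0x40400000).to_bytes(4, 'big'))[0])  # 3.0
```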


BOOTH'S ALGORITHM:

[Flowchart: START -> initialize A = 0, Q = multiplier, Q0 = 0, M = multiplicand, n = number of bits. Examine the bit pair q1 q0 (Q's LSB and Q0): 01 -> A = A + M; 10 -> A = A - M; 00 or 11 -> no operation. Then ASR(A Q Q0), n = n - 1; if n != 0 repeat, else the result is in AQ -> STOP.]

Example: (-7) * 3 = -21, with M = 1001 and -M = 0111:

n | A    | Q    | Q0 | Action
4 | 0000 | 0011 | 0  | Initialize
  | 0111 | 0011 | 0  | A = A - M
3 | 0011 | 1001 | 1  | ASR
2 | 0001 | 1100 | 1  | ASR
  | 1010 | 1100 | 1  | A = A + M
1 | 1101 | 0110 | 0  | ASR
0 | 1110 | 1011 | 0  | ASR

Result AQ = 1110 1011 = -21.

Example: (-7) * (-3) = 21, with M = 1001 and -M = 0111:

n | A    | Q    | Q0 | Action
4 | 0000 | 1101 | 0  | Initialize
  | 0111 | 1101 | 0  | A = A - M
3 | 0011 | 1110 | 1  | ASR
  | 1100 | 1110 | 1  | A = A + M
2 | 1110 | 0111 | 0  | ASR
  | 0101 | 0111 | 0  | A = A - M
1 | 0010 | 1011 | 1  | ASR
0 | 0001 | 0101 | 1  | ASR

Result AQ = 0001 0101 = 21.

Class Work: Trace (7) * (3) = 21 the same way, filling in M, -M and the n | A | Q | Q0 | Action table.

ASR means Arithmetic Shift Right.
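The traces above can be reproduced with a short Python sketch of the algorithm; the function name and bit manipulations are ours, not from the text.

```python
def booth_multiply(multiplicand, multiplier, n):
    """Booth's algorithm on n-bit 2's-complement operands."""
    mask = (1 << n) - 1
    M, A, Q, Q0 = multiplicand & mask, 0, multiplier & mask, 0
    for _ in range(n):
        pair = (Q & 1, Q0)
        if pair == (1, 0):                 # 10 -> A = A - M
            A = (A - M) & mask
        elif pair == (0, 1):               # 01 -> A = A + M
            A = (A + M) & mask
        # ASR over the combined register A Q Q0 (sign bit of A replicated)
        Q0 = Q & 1
        Q = ((Q >> 1) | ((A & 1) << (n - 1))) & mask
        A = (A >> 1) | (A & (1 << (n - 1)))
    result = (A << n) | Q                  # result in AQ, 2n bits
    return result - (1 << 2 * n) if result & (1 << (2 * n - 1)) else result

print(booth_multiply(-7, 3, 4))            # -21
print(booth_multiply(-7, -3, 4))           # 21
print(booth_multiply(7, 3, 4))             # 21 (the class-work case)
```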



(f) Displacement: EA = A + (R); D = (EA)

Post Indexing: DA = (A); EA = DA + (R); D = (EA)   (DA is Direct Address)
Pre Indexing:  IA = A + (R); EA = (IA); D = (EA)   (IA is Indexed Address)

(g) Stack Addressing
Instruction Formats:
X = (A+B) * (C+D)

Three-Address Instructions:
ADD R1, A, B     R1 <- M[A] + M[B]
ADD R2, C, D     R2 <- M[C] + M[D]
MUL X, R1, R2    M[X] <- R1 * R2

Two-Address Instructions:
MOV R1, A        R1 <- M[A]
ADD R1, B        R1 <- R1 + M[B]
MOV R2, C        R2 <- M[C]
ADD R2, D        R2 <- R2 + M[D]
MUL R1, R2       R1 <- R1 * R2
MOV X, R1        M[X] <- R1

One-Address Instructions:
LOAD A           AC <- M[A]
ADD B            AC <- AC + M[B]
STORE T          M[T] <- AC
LOAD C           AC <- M[C]
ADD D            AC <- AC + M[D]
MUL T            AC <- AC * M[T]
STORE X          M[X] <- AC

Zero-Address Instructions:
PUSH A           TOS <- A
PUSH B           TOS <- B
ADD              TOS <- (A+B)
PUSH C           TOS <- C
PUSH D           TOS <- D
ADD              TOS <- (C+D)
MUL              TOS <- (A+B) * (C+D)
POP X            M[X] <- TOS
Computer System Architecture, M. Morris Mano (Exercise), Ex. 8.12, Page No. 293:

X = (A + B*C) / (D - E*F + G*H)
X = {A - B + C * (D*E - F)} / (G + H*K)

X = (A + B*C) / (D - E*F + G*H)

Three address instructions:

Two address instructions:

One address instructions:

Zero address instructions:

Post-fix form of the expression:

A B C * + D E F * - G H * + /
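A small Python sketch of how a stack machine would evaluate this post-fix string with PUSH, arithmetic, and POP operations; the variable values in env are arbitrary, chosen only to exercise the expression.

```python
def eval_postfix(tokens, env):
    stack = []
    ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
           '*': lambda a, b: a * b, '/': lambda a, b: a / b}
    for tok in tokens:
        if tok in ops:
            b, a = stack.pop(), stack.pop()   # top of stack is the right operand
            stack.append(ops[tok](a, b))
        else:
            stack.append(env[tok])            # PUSH the variable's value
    return stack.pop()                        # POP the final result

env = dict(A=1, B=2, C=3, D=20, E=2, F=4, G=1, H=2)
print(eval_postfix('A B C * + D E F * - G H * + /'.split(), env))
# (1 + 2*3) / (20 - 2*4 + 1*2) = 7 / 14 = 0.5
```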
X = {A - B + C * (D*E - F)} / (G + H*K)

Three address instructions:

Two address instructions:

One address instructions:

Zero address instructions:

The post-fix expression for the above equation is:
[Figure (A) HORIZONTAL MICRO-INSTRUCTION: fields -- individual control signals for internal CPU control | individual control signals for system bus control | micro-instruction branch address | jump condition (unconditional, zero, overflow, indirect).]

[Figure (B) VERTICAL MICRO-INSTRUCTION: fields -- function codes, each passed through a decoder to produce the individual control signals | micro-instruction branch address | jump condition.]

[Figure (C) HYBRID MICRO-INSTRUCTION (having dual-use bits): fields -- individual control signals | dual-use bits routed through a demultiplexer/decoder so that they serve either as further individual control signals or as the micro-instruction branch address, selected by the jump condition.]

Microinstruction Formats:
The two widely used formats for microinstructions are horizontal and vertical. In the horizontal microinstruction each bit of the microinstruction represents a micro-order or a control signal which directly controls a single bus line, or sometimes a gate, in the machine. However, the length of such a microinstruction may be hundreds of bits.

In vertical microinstructions, many similar control signals can be encoded into a few microinstruction bits. For 16 ALU operations, which may require 16 individual micro-orders in a horizontal microinstruction, only 4 encoded bits are needed in a vertical microinstruction. Similarly, in a vertical microinstruction only 3 bits are needed to select one of 8 registers. However, these encoded bits need to be passed through the respective decoders to get the individual control signals.

Some of the microinstruction bits may be passed through a de-multiplexer, causing selected bits to be used for a few different locations in the CPU. For example, a 16-bit field in a microinstruction can be used as the branch address in a branching microinstruction; however, these bits may be utilized for some other control signals in a non-branching microinstruction. In such a case a de-multiplexer can be used.

The vertical microinstructions are normally of the order of 32 bits. In certain control units, several levels of control are used. For example, a field in the microinstruction or in the machine instruction may hold the address of a read-only memory which holds the control signals. This secondary ROM can hold large address constants such as interrupt service routine addresses.

In general, horizontal control units are faster yet require wide instruction words, whereas vertical microinstructions, although they require decoders, are shorter in length. Most systems use neither purely horizontal nor purely vertical microinstructions.

Example:

Let us consider a hypothetical architecture with 16 General Purpose Registers (GPRs) (e.g. R0, R1, R2, ..., R15) and 4 operation codes/instructions (e.g. addition, subtraction, multiplication and division). Then for the instruction

MUL R3, R4

the horizontal microinstruction (one-hot: 4 opcode bits plus 16 bits per register) will be

0010 0001000000000000 0000100000000000

and the vertical microinstruction (encoded: 2 opcode bits plus 4 bits per register) will be

10 0011 0100
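A hedged Python sketch of the two encodings of MUL R3, R4; the opcode numbering (ADD = 0 ... DIV = 3) and the bit ordering within each one-hot field are assumptions, chosen to reproduce the bit patterns shown above.

```python
opcodes = {'ADD': 0, 'SUB': 1, 'MUL': 2, 'DIV': 3}   # assumed numbering

def horizontal(op, rs, rd):
    # one bit per opcode (4 bits) and one bit per register (16 + 16 bits)
    return (format(1 << (3 - opcodes[op]), '04b'),
            format(1 << (15 - rs), '016b'),
            format(1 << (15 - rd), '016b'))

def vertical(op, rs, rd):
    # encoded fields: 2-bit opcode, two 4-bit register numbers
    return format(opcodes[op], '02b'), format(rs, '04b'), format(rd, '04b')

print(horizontal('MUL', 3, 4))  # ('0010', '0001000000000000', '0000100000000000')
print(vertical('MUL', 3, 4))    # ('10', '0011', '0100')
```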

[Figure: Control Store -- an array of microinstruction words, one row per decode line.]

Decode Line Activated | Control Signals Generated | Address of Next Microinstruction
000 | C1 C3 C5 C7 | 001
001 | C2 C4 C5    | 010
010 | C1 C3       | 011
011 | C2 C5       | (conditional branch -- either of the following)

Either:
011 | C2 C5                | if the external condition is true, the target address is 110
110 | C2 C4 C7             | 111
111 | C0 C1 C2 C3 C4 C5 C7 | load next instruction in IR

Or:
011 | C2 C5                | if the condition is false, the target address is 100
100 | C1 C2 C3 C5          | 101
101 | C1 C6 C7             | 111
111 | C0 C1 C2 C3 C4 C5 C7 | load next instruction in IR
HARDWIRED CONTROL:
Control units use fixed logic circuits to interpret instructions and generate control signals from them.

DESIGN METHODS:

The design of a hardwired control involves various complex tradeoffs between the amount of hardware
used, its speed of operation, and the cost of the design process itself. Because of the large number of control
signals used in a typical CPU and their dependence on the particular instruction set being implemented, the
design methods employed in practice are often ad hoc and heuristic in nature, and therefore cannot easily be
formalized. Three simplified and systematic approaches are:
Method 1: The standard algorithmic approach to sequential circuit design, called the state-table method, since it begins with the construction of a state table for the control unit.
Method 2: A heuristic method based on the use of clocked delay elements for control-signal timing.
Method 3: A related method that uses counters, which we call sequence counters, for timing purposes.

METHOD 1: State Table Method


Let Cin and Cout denote the input and output variables of the control unit. The rows and columns of the state table correspond to the set of internal states {Si} of the machine and the set of external input signals {Ij} to the control unit. The entry in row Si and column Ij has the form (Si,j, zi,j), where Si,j denotes the next state of the control unit and zi,j denotes the set of output signals from Cout that are activated by the application of Ij to the control unit when it is in state Si.
There are practical disadvantages to using state tables:
The number of state and input combinations may be so great that the state table size and
the amount of computation needed become excessive.
State tables tend to conceal useful information about a circuit's behaviour, e.g., the
existence of repeated patterns or loops.
Control circuits designed from state tables also tend to have a random structure, which makes design
debugging and subsequent maintenance of the circuit difficult.

METHOD 2: Delay Element Method

Consider the problem of generating the following sequence of control signals at times t1, t2, ..., tn using a hardwired control unit:

t1: Activate {C1,j}
t2: Activate {C2,j}
......................................
tn: Activate {Cn,j}

Suppose that an initiation signal called START(t1) is available at t1. START(t1) may be fanned out to {C1,j} to perform the first micro-operation. If START(t1) is also entered into a time delay element of delay t2 - t1, the output of that circuit, START(t2), can be used to activate {C2,j}. Similarly, another delay element of delay t3 - t2 with input START(t2) can be used to activate {C3,j}, and so on. Thus a sequence of delay elements can be used to generate control signals in a very straightforward manner.
METHOD 3: Sequence Counter Method

Consider the circuit diagram, which consists of a modulo-k counter whose output is connected to a 1-out-of-k clocked decoder. If the count enable input is connected to a clock source, the counter cycles continually through its k states. The decoder generates k pulse signals {Φi} on its output lines. The {Φi} effectively divide the time required for one complete cycle of the counter into k equal parts; the {Φi} may be called phase signals. The circuit shown may be called a sequence counter. The figure shows a one-loop flowchart containing six steps that describes the behaviour of a typical CPU. Each pass through the loop constitutes an instruction cycle. Assuming that each step can be performed in an appropriately chosen clock period, one may build a control unit for this CPU around a single (modulo-6) sequence counter. Each signal Φi activates some set of control lines in step i of every instruction cycle. It is usually necessary to be able to vary the operations performed in step i depending on certain control signals or condition variables applied to the control unit. These are represented by the signals Cin = {C'in, C''in}. A logic circuit N is therefore needed which combines Cin with the timing signals {Φi} generated by the sequence counter.

{C1, j} C1
Delay
Element
{C1, j} C2

{C2, j}
{C2, j}

C3

__
X X

No
IS X=1?

Rules for transforming a flow-


Yes chart into a control circuit using
delay elements.
A modulo - k Sequence
Counter :
(a) Logic diagram,
(b) Symbol.

A 1
1
Begin Modulo K Delay Modulo-K
End
Clock Sequence element Sequence 2
Reset Counter Counter
2
k
1 2 k
(b)
k B
B
(a) A delay element cascade; (b) The
Transfer Program Counter to equivalent Sequence-Counter Circuit
memory address register.
Step 1

Fetch the instruction from


main memory.
Step 2

Increment program counter


and decode instruction.
Step 3
Transfer operand address to
memory address register.
Step 4
Fetch the operand(s) from
main memory.
Step 5
Perform operation specified
by instruction.
Step 6 A delay - element circuit (ring
counter) that behaves like a Se-
CPU behavior represented as a single quence Counter
closed loop
Problems on design of CPU control circuits

Problem 1: A CPU, in addition to the registers ACC, MAR, PC and IR, contains the registers B, C, D, E, each having the same length as the former ones. Additional instructions provided on these new registers are:

A) ACC <- ACC + X        C) X <- MDR
B) ACC <- X              D) MAR <- Y

where X may be any of B, C, D, E and Y may be either D or E. Give the schematic diagram of the CPU with the control lines shown properly.

Data path                              | Control signal
B -> ACC, C -> ACC, D -> ACC, E -> ACC | C0, C1 select B, C, D, E; C2 enables the data path to ACC
B -> ALU, C -> ALU, D -> ALU, E -> ALU | C0, C1 select B, C, D, E; C3 enables the data path to ALU
ACC -> ALU                             | C4 enables the data path
ALU -> ACC                             | C5 enables the data path
MDR -> B, MDR -> C, MDR -> D, MDR -> E | C6, C7 select B, C, D, E; C8 enables the data path from MDR
D -> MAR, E -> MAR                     | C10 selects D or E; C9 enables the data path

Problem 2: Design a control circuit according to the operations and the control signals shown.

Data path  | Control signal
AC -> ALU  | C0
AC -> DR   | C1
DR -> AC   | C2
ALU -> AC  | C3
IR -> CU   | C4
DR -> IR   | C5
PC -> MAR  | C6
PC -> DR   | C7
DR -> PC   | C8
MAR -> BUS | C9
DR -> MAR  | C10
DR -> BUS  | C11
DR -> ALU  | C12
BUS -> DR  | C13

BUS SYSTEM:

INTRODUCTION:

A computer system contains a number of buses which provide pathways among several devices. A shared bus that connects the CPU, memory and I/O is called the System Bus.
A system bus may consist of 50 to 100 separate lines. These lines can be categorized into 3 functional
groups:

Data bus provides a path for moving data between the system modules. A data bus width limits the
maximum number of bits which can be transferred simultaneously between 2 modules e.g. CPU and memory.
Address bus is used to designate the source or destination of the data on the data bus. The width of the address bus specifies the maximum possible memory supported by a system.
Control bus is used to control the access to data and address bus and for transmission of commands and
timing signals between the system modules.
Physically a bus is a number of parallel electrical conductors. These circuits are normally imprinted on
printed circuit boards.
SOME OF THE ASPECTS RELATED TO THE BUS:
DEDICATED OR MULTIPLEXED BUSES:
A dedicated bus line is permanently assigned either to a function or to a physical subset of the components of the computer. An example of functional dedication is the use of separate dedicated address and data buses. Physical dedication increases the throughput of the bus, as only a few modules contend for it, but it increases the overall size and cost of the system.

In certain computer bus some or all the address lines are also used for data transfer operation, i.e., the
same lines are used for address as well as data lines at different times. This is known as time multiplexing and
the buses are called multiplexed buses.

SYNCHRONOUS OR ASYNCHRONOUS TIMING:

This concerns the timing of data transfers over the bus. In synchronous buses the data is transferred during specific time slots known to both source and destination. This is achieved by using clock pulses. The alternative approach is the asynchronous bus, where each item to be transferred has a separate control signal. This signal indicates the presence of the item on the bus to the destination.

BUS ARBITRATION:

An important aspect of the bus system is the control of the bus. More than one module connected to the bus may want to access the bus for data transfer. Thus there must be some method for resolving simultaneous data transfer requests on the bus. The process of selecting one of the units from the various bus-requesting units is called bus arbitration. There are 2 broad categories of bus arbitration: centralized and distributed.

In the centralized scheme a hardware circuit device, the bus controller or bus arbiter, processes the requests to use the bus. The bus controller may be a separate module or a part of the CPU. The distributed scheme has the access control logic shared among the various modules. All these modules work together to share access.

SOME OF THE ARBITRATION SCHEMES:


DAISY CHAINING:
In daisy chaining the control of the bus is granted to any module by a Bus Grant signal which is chained
through all the contending masters. Two other control signals are the Bus Request and Bus Busy signals.

The Bus Request line, if activated, only indicates that one or more modules require the bus. The Bus Request is responded to by the bus controller only if the Bus Busy line is inactive, that is, the bus is free. The bus controller responds to the Bus Request signal by placing a signal on the Bus Grant line. The Bus Grant signal passes through the modules one by one. On receiving the Bus Grant, the module which was requesting bus access blocks further propagation of the Bus Grant signal, issues a Bus Busy signal, and starts using the bus. If the Bus Grant signal is passed to a module which had not issued a bus request, the Bus Grant signal is forwarded to the next module.

In this scheme the priority is wired in and cannot be changed. Suppose that the assumed priority is (highest to lowest) Module 1, Module 2, ..., Module N. If 2 modules, say 1 and N, request the bus at the same time, then the bus will be granted to Module 1 first, as the signal has to pass through Module 1 to reach Module N. The basic drawback of this simple scheme is that if the bus requests of Module 1 occur at a high rate, then the rest of the modules may not get the bus for quite some time. Another problem can occur when, say, the Bus Grant line between Module 4 and Module 5 fails, or Module 4 is unable to pass the Bus Grant signal; in either case no bus access will be possible beyond Module 4.

POLLING:
In polling, instead of the single Bus Grant line of daisy chaining, we encounter poll count lines. These lines are connected to all the modules connected on the bus.
The Bus Request and Bus Busy are the other 2 control lines for bus control. A request to use the bus is made on the Bus Request line, while the Bus Request will not be responded to as long as the Bus Busy line is active. The bus controller responds to a signal on the Bus Request line by generating a sequence of numbers on the poll count lines. These numbers are normally considered to be unique addresses assigned to the connected modules. When the poll count matches the address of a particular module which is requesting the bus, the module activates the Bus Busy signal and starts using the bus. Polling is basically asking each module, one by one, whether it has something to do with the bus.

Advantages over daisy chaining:

The priority of contending modules can be altered by changing the sequence in which the numbers are generated on the poll count lines.
The failure of one module will not affect any other as far as bus grant is concerned.

Disadvantages over daisy chaining:

* Polling requires more control lines, which adds to cost.
* The maximum number of modules which can share the bus in polling is restricted by the number of poll count lines (e.g., if there are 3 poll count lines, then a maximum of 2^3 = 8 modules can share the bus).

INDEPENDENT REQUESTING:

In this scheme, each module has its own independent Bus Request and Bus Grant lines. The identification of the requesting unit is almost immediate, and requests can be responded to quickly. Priority in such systems can be built into the bus controller and can be changed through a program.
In certain systems a combination of these arbitration schemes is used. The PDP-11 UNIBUS system uses daisy chaining and independent requesting. It has five independent Bus Request lines, and each of these Bus Request lines has a distinct Bus Grant line. Several modules of the same priority may be connected to the same Bus Request line, and the Bus Grant line to these same-priority modules can be daisy chained.

[Figure (a): interfacing device with handshake signals for data input -- the MPU connects over the system data bus to a programmable interfacing device, which connects over data lines to a peripheral such as a keyboard; the handshake lines STB (strobe) and IBF (input buffer full) run between the peripheral and the device, and RD and INTR between the device and the MPU; a pin is provided for status check.]

Interfacing Device with Handshake Signals for Data Input

1. The peripheral strobes or places a data byte in the input port and informs the interfacing device by sending the handshake signal STB (strobe).

2. The device informs the peripheral that its input port is full -- do not send the next byte until this one has been read. This message is conveyed to the peripheral by sending the handshake signal IBF (input buffer full).

3. The MPU keeps checking the status until a byte is available, or the interfacing device informs the MPU, by sending an interrupt, that it has a byte to be read.

4. The MPU reads the byte by sending the control signal RD.

Timing Waveforms of the 8155 I/O Ports with Handshake : Input Mode

[Figure: interfacing device with handshake signals for data output -- the MPU connects over the system data bus to a programmable interfacing device, which connects over data lines to a peripheral such as a printer; the handshake lines OBF (output buffer full) and ACK (acknowledge) run between the device and the peripheral, and WR and INTR between the MPU and the device; a pin is provided for status check.]

Interfacing Device with Handshake Signals for Data Output.

1. The MPU writes a byte into the output port of the programmable device by sending the control signal WR.

2. The device informs the peripheral, by sending the handshake signal OBF (output buffer full), that a byte is on the way.

3. The peripheral acknowledges the byte by sending back the ACK (acknowledge) signal to the device.

4. The device interrupts the MPU to ask for the next byte, or the MPU finds out that the byte has been acknowledged through the status check.

Timing Waveforms of the 8155 I/O Ports with Handshake : Output Mode

Fig 9.3 Four-Segment Pipeline

S = (n * tn) / ((k + n - 1) * tp)

Fig. 9.4 Space-Time Diagram for Pipeline

Consider a k-segment pipeline with a clock cycle time tp executing n tasks, and a non-pipelined unit that performs the same operation and takes a time equal to tn to complete each task. S is called the speed-up ratio. When n >> k - 1, then k + n - 1 ≈ n,

so S = tn / tp.

If the pipelined and non-pipelined units take the same time for a single task, then

tn = k * tp,

so S = (k * tp) / tp = k.
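A short Python sketch of the speed-up formula, with assumed values k = 4 and tp = 20 ns, showing how S approaches k as the number of tasks n grows.

```python
def speedup(k, n, tp, tn=None):
    """S = (n * tn) / ((k + n - 1) * tp); defaults to tn = k * tp."""
    tn = k * tp if tn is None else tn
    return (n * tn) / ((k + n - 1) * tp)

k, tp = 4, 20e-9                             # assumed: 4 segments, 20 ns clock
for n in (1, 4, 100, 10000):
    print(n, round(speedup(k, n, tp), 3))    # 1.0, 2.286, 3.883, 3.999 -> k = 4
```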
ARITHMETIC PIPELINE

Example: floating point addition Z = X + Y, with X = A * 10^a and Y = B * 10^b.

X = 0.9504 * 10^3
Y = 0.8200 * 10^2 = 0.0820 * 10^3

Z = X + Y = 1.0324 * 10^3 = 0.10324 * 10^4

Segment 1: Compare exponents by subtraction: 3 - 2 = 1.
Segment 2: Choose the exponent (3) and align the mantissas: 0.9504 and 0.0820.
Segment 3: Add or subtract mantissas: 0.9504 + 0.0820 = 1.0324.
Segment 4: Normalize the result and adjust the exponent: 0.10324 * 10^4.

Fig 9.6 Pipeline for floating point addition and subtraction (R denotes the latch register between segments).



Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1:   FI  DA  FO  EX
            2:       FI  DA  FO  EX
   (Branch) 3:           FI  DA  FO  EX
            4:               FI  --  --  FI  DA  FO  EX
            5:                   --  --  --  FI  DA  FO  EX
            6:                               FI  DA  FO  EX
            7:                                   FI  DA  FO  EX

Timing of the Instruction Pipeline.

[Flowchart: four-segment CPU pipeline --
Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address; if branch, empty the pipe and update PC.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if interrupt, do interrupt handling and empty the pipe; otherwise update PC and continue.]

Four-Segment CPU Pipeline



Pipelining and Superscalar Techniques


A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function
over a stream of data flowing from one end to the other. In modern computers, linear pipelines are applied for
instruction execution, arithmetic computation, and memory-access operations.
Asynchronous and Synchronous Models :
A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k - 1. The final result emerges from the pipeline at the last stage Sk. Depending on the control of data flow along the pipeline, we model linear pipelines in two categories: asynchronous and synchronous.
Asynchronous Model: Data flow between adjacent stages in an asynchronous pipeline is controlled by a handshaking protocol. When stage Si is ready to transmit, it sends a ready signal to stage Si+1. After stage Si+1 receives the incoming data, it returns an acknowledge signal to Si. Asynchronous pipelines may have a variable throughput rate. Different amounts of delay may be experienced in different stages.
Synchronous Model: Clocked latches are used to interface between stages. The latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. The pipeline stages are combinational logic circuits. It is desirable to have approximately equal delays in all stages, since these delays determine the clock period and hence the speed of the pipeline.
Reservation table is essentially a space-time diagram
depicting the precedence relationship in using the
pipeline stages. For a k-stage linear pipeline, k clock
cycles are needed to flow through the pipeline.
Successive tasks or operations are initiated one per
cycle to enter the pipeline. Once the pipeline is filled up, one result emerges from the pipeline for each additional
cycle. This throughput is sustained only if the successive tasks are independent of each other.
Clock Cycle and Throughput: Denote the maximum stage delay by τm and the latch delay by d; the clock period is then

τ = τm + d.

At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register. The clock pulse has a width equal to d. In general τm >> d, by one to two orders of magnitude, which implies that the maximum stage delay τm dominates the clock period.

The pipeline frequency is defined as the inverse of the clock period: f = 1/τ.

The efficiency Ek of a linear k-stage pipeline executing n tasks is defined as

Ek = n / (k + n - 1).

The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time:

Hk = n * f / (k + n - 1) = Ek * f.

Principles of pipelining architecture


Arithmetic pipelining:
The arithmetic logic units of a computer can be segmented for pipeline operations in various data formats. Well-known arithmetic pipeline examples are:
4-stage pipes: Star-100
8-stage pipes: TI-ASC
14-stage pipes: CRAY-1
26-stage pipes: CYBER-205

Instruction pipeline:
The execution of a stream of
instructions can be pipelined by
overlapping the execution of the
current instruction with the fetch,
decode, and operand fetch of
subsequent instructions. This is also
known as instruction look-ahead.

Processor pipelining:
Here the same data stream is processed by a cascade of processors, each of which handles a specific task. The data stream passes the first processor, with results stored in a memory block which is also accessible by the second processor. The second processor then passes refined results to the third, and so on. Three classification schemes are:

Unifunction vs. Multifunction Pipeline:


A pipeline unit with a fixed and dedicated function is called unifunctional. The Cray-1
has 12 unifunctional pipeline units for various scalar, vector, fixed-point, and floating-
point operations.
A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline. The TI-ASC has 4 multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic logic operations at different times.

Static vs. Dynamic Pipeline:


A static pipeline may assume only one functional configuration at a time. Static pipelines
can be either unifunctional or multifunctional. Pipelining in static pipes is possible only if instructions of the same
type are to be executed continuously. The function performed by a static pipeline should not change frequently.
Otherwise, performance may be low.
A dynamic pipeline processor permits several functional configurations to exist simultaneously. The dynamic
configuration needs more elaborate control and sequencing mechanisms than those for static pipelines. A dynamic
processor must be multifunctional, whereas, a unifunctional pipe must be static.

Scalar vs. Vector Pipeline:


A scalar pipeline processes a sequence of scalar operands under the control of a DO loop. Instructions in a small DO
loop are often prefetched into the instruction buffer. The required scalar operands for repeated scalar instructions are
moved into a data cache in order to continually supply the pipeline with operands.
Vector pipelines are specially designed to handle vector instructions over vector operands. Computers with vector
instructions are often called vector processors. The handling of vector operands in vector pipelines is under firmware
and hardware controls (rather than under software control as in scalar pipelines).

What are the binary values to be given at the X, Y and Load inputs to perform the addition of 4 numbers b0, b1, b2, b3 fed through inputs I1 and I2? (Use the minimum number of clock pulses.) Draw the reservation table accordingly.

[Figure: a four-stage pipeline (stages S1 to S4); stage S1 is fed by two 2x1 multiplexers with select lines Y and X, whose '0' inputs are I1 and I2; an external register with a Load control captures the result.]

Logic operation table:

CP | Y | X | I1 | I2 | Load
1  | 1 | 1 | b0 | b1 | 0
2  | 1 | 1 | b2 | b3 | 0
3  | 1 | 1 | 0  | 0  | 0
4  | 1 | 1 | 0  | 0  | 0
5  | 1 | 1 | 0  | 0  | 1
6  | 0 | 0 | 0  | 0  | 0
7  | 1 | 1 | 0  | 0  | 0
8  | 1 | 1 | 0  | 0  | 0
9  | 1 | 1 | 0  | 0  | 0


[Figure: overlapped execution of instructions i, i-1, i-2 through the pipeline stages IF, ID, OF, OE, OS.]

IF: Instruction Fetch
ID: Instruction Decode
OF: Operand Fetch
OE: Operand Execute
OS: Operand Store

Example program: A <- B + C; B <- A + C; D <- D - B. For this example we have an 8-bit opcode, 32-bit data, 16 GPRs (4 bits to number them), and 16-bit addresses. The compiler allocates operands in registers.

Memory-to-memory (8-bit opcode and three 16-bit addresses per instruction):

Add B C A
Add A C B
Sub D B D

I = 168b, D = 288b, M = 456b

Register-to-register (load/store: 8-bit opcode, 4-bit register number, 16-bit address; arithmetic: 8-bit opcode and three 4-bit register numbers):

Load rB B
Load rC C
Add rA rB rC
Store rA A
Add rB rA rC
Store rB B
Load rD D
Sub rD rD rB
Store rD D

I = 228b, D = 192b, M = 420b

Reuse of operands (register-to-register, operands already in registers):

Add rA rB rC
Add rB rA rC
Sub rD rD rB

I = 60b, D = 0b, M = 60b

[Figure: datapath -- the PC with increment logic supplies the instruction address to the I-cache; the instruction register and general registers feed the ALU and shift logic; a data-address MUX supplies addresses to the D-cache.]
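The traffic counts above can be verified with a small Python sketch; the field widths are the ones assumed in the example (8-bit opcode, 16-bit address, 4-bit register number, 32-bit data), and the function name is ours.

```python
OP, ADDR, REG, DATA = 8, 16, 4, 32   # assumed field widths from the example

def traffic(loads_stores, mem_mem_instr, reg_reg_instr):
    I = (loads_stores * (OP + REG + ADDR)      # load/store format
         + mem_mem_instr * (OP + 3 * ADDR)     # three-address memory-to-memory
         + reg_reg_instr * (OP + 3 * REG))     # three-register arithmetic
    D = (loads_stores + 3 * mem_mem_instr) * DATA   # each memory operand moves 32 bits
    return I, D, I + D

print(traffic(0, 3, 0))   # (168, 288, 456)  memory-to-memory
print(traffic(6, 0, 3))   # (228, 192, 420)  register-to-register
print(traffic(0, 0, 3))   # (60, 0, 60)      operands reused in registers
```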

Number of global registers = G
Number of local registers in each window = L
Number of registers common to two windows = C
Number of windows = W

[Figure: overlapped register windows for procedures A, B, C, D --
R0 - R9:   global registers, common to all procedures;
R10 - R15: common to A and D (and, wrapping around, to D and A);
R16 - R25: local to A;
R26 - R31: common to A and B;
R32 - R41: local to B;
R42 - R47: common to B and C;
R48 - R57: local to C;
R58 - R63: common to C and D;
R64 - R73: local to D.]

Window size = L + 2C + G

Register file = (L + C)W + G

In the example of the figure we have G = 10, L = 10, C = 6 and W = 4. The window size is 10 + 12 + 10 = 32 registers, and the register file consists of (10 + 6) * 4 + 10 = 74 registers.

RISC CHARACTERISTICS:

1. Relatively few instructions.
2. Relatively few addressing modes.
3. Memory access limited to load and store instructions.
4. All operations done within the registers of the CPU.
5. Fixed-length, easily decoded instruction format.
6. Single-cycle instruction execution.
7. Hardwired rather than micro-programmed control.
8. Supports pipelining.

RISC Processors: In the SPARC architecture, of the 32 32-bit IU (Integer Unit) registers visible to a procedure, eight are global registers shared by all procedures, and the remaining 24 are window registers associated with each procedure. The concept of using overlapped register windows is the most important feature introduced by the Berkeley RISC architecture. This concept is illustrated in the following figure.

Each register window is divided into three eight-register sections, labelled Ins, Locals and Outs. The local registers are only locally addressable by each procedure. The Ins and Outs are shared among procedures. The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers, which are the Ins registers of the called procedure. The window of the currently running procedure is called the active window, pointed to by the current window pointer (CWP). A window invalid mask (WIM) is used to indicate which windows are invalid.
Problem: The SPARC architecture can be
implemented with two to eight register
windows, for a total of 40 to 132 GPRs in the
integer unit. Explain how the GPRs are
organized into overlapping windows in each
of the following designs:
(a) Use 40 GPRs to construct two windows.
(b) Use 72 GPRs to construct four windows.
Problem: The SPARC architecture can be
implemented with two to eight register
windows, for a total of 40 to 132 GPRs in the
integer unit. Explain how the GPRs are
organized into overlapping windows in each
of the following designs:
(a) Use 40 GPRs to construct two windows.
(b) Use 72 GPRs to construct four windows.

Above implementations/answers may have other register distributions.



CACHE MEMORY:
Elements of Cache Design:

Cache Size
Block Size
Mapping Function: Direct, Associative, Set associative
Replacement Algorithm: Least-recently used (LRU), First-in-first-out (FIFO), Least-frequently used (LFU), Random
Write Policy: Write through, Write back, Write once
Number of Caches: Single- or two-level, Unified or split
Block Size:
As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality: the high probability that data in the vicinity of a referenced word will be referenced in the near future. Two specific effects then come into play:
1. Larger blocks reduce the number of blocks that fit into the cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.
2. As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

Number of Caches:
Recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches.

Single- versus Two-Level Caches:

It has become possible to have a cache on the same chip as the processor: the on-chip cache. Using internal buses, the on-chip cache reduces the processor's external bus activity and thus speeds up execution time and increases overall system performance. Furthermore, as bus access is eliminated, the bus is free to support other transfers.
Most contemporary designs include both on-chip and external caches. The resulting organization is known as a two-level cache. The internal cache is designated as level 1 (L1) and the external cache as level 2 (L2). The reason for the L2 cache is that, if the requested information is not in the L1 cache, the processor would otherwise have to make a main memory access through the bus, which results in poor performance due to the slow bus speed and slow memory access time. The potential saving due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches.
Unified versus Split Cache:
Many earlier designs consisted of a single cache used to store references to both data and instructions. More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data.
The unified cache has several advantages:
For a given cache size, a unified cache has a higher hit rate than split caches because it balances the load between instruction and data fetches automatically; that is, the cache fills up with instructions or data depending upon which requires the greater number of fetches.
Only one cache needs to be designed and implemented.
The key advantage of the split cache design is that it eliminates contention for the cache between the instruction prefetcher and the execution unit. This is important for any design implementing pipelining of instructions. Thus superscalar machines such as the Pentium and PowerPC implement split cache designs.

The main disadvantage of the unified cache design is that when the instruction prefetcher requests an instruction, preference is given to the execution unit first. This contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty.
Time Step:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Page Address:  4  3  2  5  1  2  3  4  3  2  4  5  1  4  3  1

OPT (anticipatory swapping), Hit = 83 % ((10/12) * 100 %):
Frame 1: 4 (t1-t16)
Frame 2: 3 (t2-t16)
Frame 3: 2 (t3-t11), 5 (t12-t16)
Frame 4: 5 (t4), 1 (t5-t16)

FIFO, Hit = 33 % ((4/12) * 100 %):
Frame 1: 4 (t1-t4), 1 (t5-t11), 5 (t12-t16)
Frame 2: 3 (t2-t7), 4 (t8-t12), 1 (t13-t16)
Frame 3: 2 (t3-t8), 3 (t9-t13), 4 (t14-t16)
Frame 4: 5 (t4-t9), 2 (t10-t14), 3 (t15-t16)

LRU, Hit = 58 % ((7/12) * 100 %):
Frame 1: 4 (t1-t4), 1 (t5-t11), 5 (t12-t16)
Frame 2: 3 (t2-t12), 1 (t13-t16)
Frame 3: 2 (t3-t14), 3 (t15-t16)
Frame 4: 5 (t4-t7), 4 (t8-t16)

Results of a simulation run of three replacement algorithms on a common page trace stream.

Problem:

A virtual memory system has a 16K-word logical address space and an 8K-word physical address space with a page size of 2K words. The page address trace of a program has been found to be:

7 5 3 2 1 0 4 1 6 7 4 2 0 1 3 5

Note the four pages resident in the memory after each page reference change for each of the following replacement policies:

(a) FIFO
(b) LRU; and
(c) Anticipatory swapping

An algorithm is a stack algorithm if the following inclusion property holds:

Bt(n) ⊂ Bt(n+1) for n < Lt, and
Bt(n) = Bt(n+1) for n >= Lt,

where
L = page address stream length to be processed by the replacement algorithm,
n = page capacity of MS,
t = time step, when t pages of the address stream have been processed,
Bt(n) = set of pages in MS at time t,
Lt = number of distinct pages encountered at time t.

Bt(n) = { St(1), St(2), ..., St(n) }  for n < Lt
      = { St(1), St(2), ..., St(Lt) } for n >= Lt

Time step:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Page address:  4  3  2  5  1  2  3  4  3  2  4  5  1  4  3  1
St(1):         4  3  2  5  1  2  3  4  3  2  4  5  1  4  3  1
St(2):         -  4  3  2  5  1  2  3  4  3  2  4  5  1  4  3
St(3):         -  -  4  3  2  5  1  2  2  4  3  2  4  5  1  4
St(4):         -  -  -  4  3  3  5  1  1  1  1  3  2  2  5  5
St(5):         -  -  -  -  4  4  4  5  5  5  5  1  3  3  2  2

Hits: none for n = 1; for n = 2 at t = 9; for n = 3 at t = 6, 9, 10, 11, 14, 16; for n = 4 additionally at t = 7; for n = 5 additionally at t = 8, 12, 13, 15 (totals 0, 1, 6, 7 and 11 hits).

Stack processing of the LRU scheme on a given page address trace for different main storage capacities.
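A Python sketch of the single-pass stack processing just illustrated; one scan of the trace yields the hit count for every capacity n at once (the function name is ours).

```python
def lru_stack_hits(trace, max_n):
    stack, hits = [], [0] * (max_n + 1)      # hits[n] = hits with capacity n
    for page in trace:
        if page in stack:
            depth = stack.index(page) + 1    # stack distance of this reference
            for n in range(depth, max_n + 1):
                hits[n] += 1                 # a hit for every capacity >= depth
            stack.remove(page)
        stack.insert(0, page)                # referenced page becomes MRU, St(1)
    return hits[1:]

trace = [4, 3, 2, 5, 1, 2, 3, 4, 3, 2, 4, 5, 1, 4, 3, 1]
print(lru_stack_hits(trace, 5))              # [0, 1, 6, 7, 11], as in the table
```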
Let:
t = average time elapsed to access a word by the CPU,
t1 = MS access time,
t2 = SS access time, and
R = access time ratio = t2 / t1.

The SS access time t2 can be evaluated as t2 = tb + t1, where tb = block transfer time from SS to MS.

For a given design with a specified value of H, the access efficiency can be evaluated in terms of H and R, where

H = hit ratio = N1 / (N1 + N2),
N1 = number of accesses to MS,
N2 = number of accesses to SS,
1 - H = miss ratio.

Average access time t = H * t1 + (1 - H) * t2
                      = H * t1 + (1 - H) * (tb + t1)
                      = t1 + (1 - H) * tb

Hence Access Efficiency Ae = t1 / t = t1 / (H * t1 + (1 - H) * t2)
                           = 1 / (H + (1 - H) * R)
                           = 1 / (R + (1 - R) * H)

Problem:

A hierarchical Cache-MS memory subsystem has the following specifications:

1. Cache access time of 50 nsec;
2. Main storage access time of 500 nsec;
3. 80 % of memory requests are for read;
4. Hit ratio of 0.9 for read access, and the write-through scheme is employed.

Estimate:

a) Average access time of the system considering only the memory read cycle;
b) Average access time of the system for both read and write requests;
c) The hit ratio taking the write cycle into consideration.
Solution :

Cache access time TCA = 50 nsec
Main storage access time TMS = 500 nsec
Probability of read PR = 0.8
Hit ratio for read access HR = 0.9
Scheme: write through

(a) Considering only memory read cycle.

Average access time = HR * TCA + (1-HR) (TCA+TMS)


= 0.9*50+(1-0.9)*(50+500)
= 100 nsec

(b) For both read and write cycles,

Average access time = PR * (avg. access time for read) + (1 - PR) * TMS
                    = 0.8 * 100 ns + (1 - 0.8) * 500 ns
                    = 80 + 100 = 180 ns

(c) Hit ratio when write cycle is also considered is

H = PR*HR+(1-PR)*HW where, HW=Hit ratio for write cycle =0


= 0.8*0.9+(1-0.8)*0
= 0.72
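The three answers can be checked with a few lines of Python (a sketch; the variable names are ours):

```python
t_ca, t_ms = 50, 500          # cache and main-storage access times (ns)
pr, hr, hw = 0.8, 0.9, 0.0    # read probability, read hit ratio, write hit ratio

t_read = hr * t_ca + (1 - hr) * (t_ca + t_ms)   # (a) read-only average
t_both = pr * t_read + (1 - pr) * t_ms          # (b) read + write average
h = pr * hr + (1 - pr) * hw                     # (c) overall hit ratio

print(round(t_read), round(t_both), round(h, 2))   # 100 180 0.72
```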

MAR fields for the word address:  | page (p bits) | word-in-page (w bits) |,  where p = log2 P and w = log2 W.

Illustration of direct mapping: the MAR is split as  | tag (r bits) | block (b bits) | word (w bits) |,  and the tag register holds the r-bit tag. With P = 4096 main-storage pages and a cache of C = 128 page frames, R = 4096/128 = 32, so r = 5, b = 7, w = 4.

Organization of the CAM word in fully associative mapping: the MAR address is  | tag (page address of main storage) | W |;  each CAM word holds a tag field and an address field; the cache address is  | c | w |,  where c = log2 C and C = number of page frames in the cache (the address field gives the page address in the cache).

[Figure: block diagram of the hardware structure realizing the behavior noted in the flowchart -- the MAR's page field P is used as the key into the CAM; the matching CAM word supplies the cache page-frame address C, which together with the word field W addresses the cache memory; the CPU, cache, and main storage are connected by the address bus, and the MDR connects to the data bus.]

[Flowchart: counter-increment strategy in the LRU scheme --
Start -> search in cache.
Hit: all counters having a value less than that of the referenced page frame are incremented by 1, while counters having a value greater than that of the referenced page frame remain unchanged; the counter of the hit page is set to zero.
Miss, cache full: replace the page in the frame having counter value C - 1, set its counter to 0, and increment each other counter by 1.
Miss, cache not full: the new page is brought into a cache page frame not currently occupied; increment the counters of each occupied page frame by 1.
Stop.]

Counter increment strategy in the LRU scheme.

[Figure: CAM words -- access to the cache for the desired word in set-associative mapping.]

P = 4096, C = 128, W = 16
p = 12, c = 7, w = 4
S = number of sets = 64, s = log2 S = 6
N = number of frames per set = 2

Prob: An MM of 1M words of 16 bits each. A CM with set-associative mapping, set size = 4, 64 words per block, and 128 pages in total. Cache capacity = 128 x 16 = 2K words (11-bit address). Hints: MM address fields T | S | W = 9 | 5 | 6 bits; CM address fields S | B | W = 5 | 2 | 6 bits.

[Flowchart: CPU-cache-main storage interaction in fully associative mapping --
The CPU generates a new address in the MAR for a read/write operation on a word.
Search the CAM with the tag as the key.
Match: access the cache by the address read from the CAM word and complete the read/write operation.
No match, cache full: replace a cache page with the main storage page containing the word and proceed to complete the read/write operation.
No match, cache not full: bring the page from main storage into a vacant page frame of the cache and proceed to complete the read/write operation.
Stop.]

The flowchart shows the CPU-cache-main storage interaction in fully associative mapping.

[Flowchart: CPU-cache-main storage interaction in block-set associative mapping --
Start: the CPU generates an address in the MAR.
The set field selects the specific set, and its CAM is searched associatively with the tag field of the MAR as the key.
Available: read the cache page-frame address from the CAM word and access the cache for the read/write operation.
Not available: replace a page in the set by the desired page and proceed with the read/write operation in the cache.
Stop.]

Interaction of CPU-Cache-Main Storage in Block-Set Associative Mapping.



Sector mapping cache design: Compared with fully associative or set-associative caches, the sector mapping cache offers the advantages of being flexible in implementing various block replacement algorithms and economical in performing a fully associative search across a limited number of sector tags. The sector partitioning offers more freedom in grouping cache lines at both ends of the mapping. Making a design choice between set-associative and sector mapping caches requires more trace and simulation evidence.

The figure shows an example of sector mapping with a sector size of four blocks. Note that each sector can be mapped to any of the sector frames with full associativity at the sector level. This scheme was first implemented in the IBM System/360 Model 85. In the Model 85, there are 16 sectors, each having 16 blocks. Each block has 64 bytes, giving a total of 1024 bytes in each sector and a total cache capacity of 16 Kbytes, using an LRU block replacement policy.
Problem 1 on cache mapping: Consider a cache (M1) of 16K words with 50 ns access time and a main memory (M2) of 1M words with 400 ns access time. Assume 8-word cache blocks and a set size of 256 words with set-associative mapping. (a) Show the mapping of M1 and M2. (b) Calculate the effective access time when the cache hit ratio is 0.95.

8 words/block, so w = 3. 256 words/set = 2^8 words/set, so 2^8 / 2^3 = 2^5 = 32 pages/set.
Cache (M1): 16K words, so 16K / 2^3 = 2^14 / 2^3 = 2^11 pages.
Memory (M2): 1M words, so 1M / 2^3 = 2^20 / 2^3 = 2^17 pages. The illustration of the mapping is shown in the figure.
Now, considering 4-way set-associative mapping, the total number of sets = 2^11 / 4 = 512.
Effective access time = 0.95 x 50 ns + (1 - 0.95) x 400 ns = 67.5 ns.

Problem 2 on cache mapping: Consider an MM with 256 words/module in 4 modules, 16 words in each block, and a CM of 256 words. 4-way set-associative mapping is used. Show the mapping policy of the MM and CM.

MM capacity = 4 x 256 = 2^2 x 2^8 = 2^10 words. Number of pages in MM = 2^10 / 16 = 2^10 / 2^4 = 2^6 = 64.
Number of pages in CM = 256 / 16 = 2^8 / 2^4 = 2^4 = 16.
With a total of 4 sets, 16/4 = 4 pages per set. So it is 4-way set-associative mapping.
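A Python sketch of the Problem 1 arithmetic; the tag/set/word split printed at the end is derived here for the 4-way case and is our own working, not given in the text.

```python
from math import log2

block = 8                                   # words per block
mem_words, cache_words = 2**20, 2**14       # 1M-word M2, 16K-word M1
ways = 4

mem_pages = mem_words // block              # 2**17 pages in M2
cache_pages = cache_words // block          # 2**11 pages in M1
sets = cache_pages // ways                  # 2**11 / 4 = 512 sets

w = int(log2(block))                        # word field: 3 bits
s = int(log2(sets))                         # set field:  9 bits
tag = int(log2(mem_words)) - s - w          # tag field:  8 bits

print(mem_pages, cache_pages, sets, (tag, s, w))  # 131072 2048 512 (8, 9, 3)

h, t1, t2 = 0.95, 50, 400
print(round(h * t1 + (1 - h) * t2, 1))      # effective access time: 67.5 ns
```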

Memory Hierarchy Technology: There are three dimensions of the locality property: temporal, spatial, and sequential. During the lifetime of a software process, a number of pages are used dynamically. These memory reference patterns are caused by the following locality properties:
(1) Temporal locality Recently referenced items
(instructions or data) are likely to be referenced again
in the near future. This is often caused by special
program constructs such as iterative loops, process
stacks, temporary variables, or subroutines. Once a
loop is entered or a subroutine is called, a small code
segment will be referenced repeatedly many times.
Thus temporal locality tends to cluster the access in
the recently used areas.
(2) Spatial locality This refers to the tendency for a process to access items whose addresses are near one another.
For example, operations on tables or arrays involve accesses of a certain clustered area in the address space. Program
segments, such as routines and macros, tend to be stored in the same neighbourhood of the memory space.
(3) Sequential locality In typical programs, the execution of instructions follows a sequential order (the program order) unless branch instructions create out-of-order executions. The ratio of in-order execution to out-of-order execution is roughly 5 to 1 in ordinary programs. Besides, the access of a large data array also follows a sequential order.

Problem: Complete the following table, replacing the unknown entries of the given memory hierarchy. Achieve an effective memory-access time t = 10.04 us with a cache hit ratio h1 = 0.98 and a hit ratio h2 = 0.9 in the main memory. Also, the total cost of the memory hierarchy is upper-bounded by $15,000. The memory hierarchy cost is calculated as

Ctotal = c1*s1 + c2*s2 + c3*s3 <= $15,000

[The table of the three levels (cache, main memory, disk array) with their access times, capacities, and costs was given as a figure.]

The maximum capacity of the disk is thus s3 = 39.8 Gbytes.

Next, we want to choose the access time (t2) of the RAM to build the main memory. The effective memory-access time is calculated as

t = h1*t1 + (1 - h1)*h2*t2 + (1 - h1)*(1 - h2)*h3*t3, with h3 = 1.

Substituting all known parameters, we have

10.04 x 10^-6 = 0.98 x 25 x 10^-9 + 0.02 x 0.9 x t2 + 0.02 x 0.1 x 1 x 4 x 10^-3.

Thus t2 = 903 ns.

Cache Design Alternatives


The relative merits of physical address caches and virtual address caches have to be judged based on the access
time, the aliasing problem, the flushing problem, OS kernel overhead, special tagging at the process level, and
cost/performance considerations. Beyond the use of private caches, three design alternatives are suggested
below.
Each of the design alternatives has its own advantages and shortcomings. There exists insufficient
evidence to determine whether any of the alternatives is better or worse than the use of private caches. More
research and trace data are needed to apply these cache architectures in designing future high-performance
multiprocessors.

Shared Caches: An alternative approach to maintaining cache coherence is to completely eliminate the
problem by using shared caches attached to shared-memory modules. No private caches are allowed in this
case. This approach will reduce the main memory access time but contributes very little to reducing the overall
memory-access time and to resolving access conflicts.
Shared caches can be built as second-level caches. Sometimes, one can make the second-level caches
partially shared by different clusters of processors. Various cache architectures are possible if private and
shared caches are both used in a memory hierarchy. Use of shared caches alone may work against the scalability of the entire system.

Non-cacheable Data: Another approach is not to cache shared writable data. Shared data are non-cacheable,
and only instructions or private data are cacheable in local caches. Shared data include locks, process queues,
and any other data structures protected by critical sections.
The compiler must tag data as either cacheable or non-cacheable, and special hardware tagging must be used to distinguish them. Caches with cacheable and non-cacheable blocks demand more programmer effort, in addition to support from hardware and compilers.

Cache Flushing: A third approach is to use cache flushing every time a synchronization primitive is executed.
This may work well with transaction processing multiprocessor systems. Cache flushes are slow unless special
hardware is used. This approach does not solve I/O and process migration problems.
Flushing can be made very selective by programmers or by the compiler in order to increase efficiency. Cache flushing at synchronization, I/O, and process migration may be carried out unconditionally or selectively. Cache flushing is more often used with virtual address caches.

Hardware Interlocks: An interlock is a circuit that detects instructions whose source operands are destinations
of instructions farther up in the pipeline. Detection of this situation causes the instruction whose source is not
available to be delayed by enough clock cycles to resolve the conflict. This approach maintains the program
sequence by using hardware to insert the required delays.

Operand Forwarding: It uses special hardware to detect a conflict and then avoid it by routing the data
through special paths between pipeline segments. For example, instead of transferring an ALU result into a
destination register, the hardware checks the destination operand, and if it is needed as a source in the next
instruction, it passes the result directly into the ALU input, bypassing the register file. This method requires
additional hardware paths through multiplexers as well as the circuit that detects the conflict.

Delayed Load: The compiler is designed to detect a data conflict and reorder the instructions as necessary to delay the loading of the conflicting data by inserting no-operation instructions. This method is referred to as delayed load. A small sketch of this idea follows.
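As an illustrative sketch (assumed instruction format and a one-cycle load-use delay; not any particular compiler's algorithm), the following Python function inserts a no-operation whenever an instruction reads the register just loaded by the previous instruction:

def insert_delay_slots(instrs):
    # instrs: list of (op, dest, srcs) tuples
    out = []
    for op, dest, srcs in instrs:
        prev = out[-1] if out else None
        # data conflict: the previous LOAD's destination is a source here
        if prev and prev[0] == "LOAD" and prev[1] in srcs:
            out.append(("NOP", None, ()))   # fill the load delay slot
        out.append((op, dest, srcs))
    return out

# e.g. LOAD r1 <- M[x]; ADD r2 <- r1 + r3 needs a NOP between them
prog = [("LOAD", "r1", ("x",)), ("ADD", "r2", ("r1", "r3"))]
print(insert_delay_slots(prog))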

Handling of Branch Instructions One of the major problems in operating an instruction pipeline is the
occurrence of branch instructions. A branch instruction can be conditional or unconditional. It breaks the
normal sequence of the instruction stream, causing difficulties in the operation of the instruction pipeline.

Prefetch Target Instructions: One way of handling a conditional branch is to prefetch the target instruction in
addition to the instruction following the branch. Both are saved until the branch is executed. If the branch
condition is successful, the pipeline continues from the branch target instruction. An extension of this
procedure is to continue fetching instructions from both places until the branch decision is made. At that time
control chooses the instruction stream of the correct program flow.

Branch Target Buffer: Another possibility is the use of a branch target buffer or BTB. The BTB is an
associative memory included in the fetch segment of the pipeline. Each entry in the BTB consists of the address
of a previously executed branch instruction and the target instruction for that branch. It also stores the next
few instructions after the branch target instruction. When the pipeline decodes a branch instruction, it
searches the associative memory BTB for the address of the instruction. If it is in the BTB, the instruction is
available directly and prefetch continues from the new path. If the instruction is not in the BTB, the pipeline
shifts to a new instruction stream and stores the target instruction in the BTB. The advantage of this scheme is
that branch instructions that have occurred previously are readily available in the pipeline without
interruption.

Loop Buffer: A variation of the BTB is the loop buffer. This is a small and very high-speed register file
maintained by the instruction fetch segment of the pipeline. When a program loop is detected in the program,
it is stored in the loop buffer in its entirety, including all branches. The program loop can be executed directly
without having to access memory until the loop mode is removed by the final branching out.

Branch Prediction: Another procedure that some computers use is branch prediction. A pipeline with branch
prediction uses some additional logic to guess the outcome of a conditional branch instruction before it is
executed. The pipeline then begins prefetching the instruction stream from the predicted path. A correct
prediction eliminates the wasted time caused by branch penalties.
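One common concrete form of this "additional logic" is the 2-bit saturating counter (a sketch under that assumption; the text does not name a specific scheme): two consecutive wrong guesses are needed to flip the prediction, so a loop branch that is almost always taken is mispredicted only once, at loop exit.

class TwoBitPredictor:
    def __init__(self):
        self.counter = 2            # states 0..3; >= 2 means predict taken
    def predict(self):
        return self.counter >= 2
    def update(self, taken):        # train with the actual outcome
        if taken:
            self.counter = min(3, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

bp = TwoBitPredictor()
outcomes = [True] * 9 + [False]     # loop branch: taken 9 times, then exit
wrong = 0
for t in outcomes:
    wrong += bp.predict() != t
    bp.update(t)
print(wrong)                        # 1 misprediction out of 10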

Delayed Branch: A procedure employed in most RISC processors is the delayed branch. In this procedure, the
compiler detects the branch instructions and rearranges the machine language code sequence by inserting
useful instructions (e.g. no-operation instruction) that keep the pipeline operating without interruptions.

Parameter              Cache-Main storage               Main storage-Secondary storage

Basic motivation       Faster access to a memory word   Extension of main storage space
                       and enhancement of throughput

Implementation         Totally by hardware              Mainly by software routines,
                                                        with support of hardware
                                                        structure

Typical page size      4 to 32/64 bytes                 1K to 8K/16K bytes

Typical access ratio   1 : 5 to 8                       1 : (10^4 or more)

Typical size ratio     1 : 10^3                         1 : (10^3 to 10^4 or more)
(varies widely from
system to system)

Access by CPU          CPU usually has direct access    All access to secondary storage
                       to main storage                  is via main storage only

Cost/bit ratio         10 : 1                           10^3 : 1

A comparative study of various parameters in respect of Cache-Main storage and Main storage-Secondary storage interaction

C/S-Access Memory Organization:
A memory organization in which the C-access and S-access are combined is called C/S-access. This scheme is shown in the figure, where n access buses are used with m interleaved memory modules attached to each bus. The m modules on each bus are m-way interleaved to allow C-access. The n buses operate in parallel to allow S-access. In each memory cycle, at most m x n words are fetched if the n buses are fully used with pipelined memory accesses. The C/S-access memory is suitable for use in vector multiprocessor configurations. It provides parallel pipelined access of a vector data set with high bandwidth. A special vector cache design is needed within each processor in order to guarantee smooth data movement between the memory and multiple vector processors.

Associative Memory or Content Addressable Memory (CAM)

The time required to find an item stored in memory can be reduced considerably if stored data can be identified for access by the content of the data itself rather than by an address. A memory unit accessed by content is called an associative memory or content addressable memory (CAM). This type of memory is accessed simultaneously and in parallel on the basis of data content rather than by specific address or location. When a word is written in an associative memory, no address is given; the memory is capable of finding an empty unused location to store the word. When a word is to be read from an associative memory, the content of the word, or part of the word, is specified. The memory locates all words which match the specified content and marks them for reading. Because of its organization, the associative memory is uniquely suited to parallel searches by data association. Moreover, searches can be done on an entire word or on a specific field within a word. An associative memory is more expensive than a random access memory because each cell must have storage capability as well as logic circuits for matching its content with an external argument. For this reason, associative memories are used in applications where the search time is very critical and must be very short.
Hardware Organization:
It consists of a memory array and logic for m words with n bits per word. The argument register A and key register K each have n bits, one for each bit of a word. The match register M has m bits, one for each memory word. Each word in memory is compared in parallel with the content of the argument register. The words that match the bits of the argument register set a corresponding bit in the match register. After the matching process, those bits in the match register that have been set indicate that their corresponding words have been matched. Reading is accomplished by a sequential access to memory for those words whose corresponding bits in the match register have been set.

The key register provides a mask for choosing a particular field or key in the argument word. The entire
argument is compared with each memory word if the key register contains all 1's. Otherwise, only those bits in
the argument that have 1's in their corresponding position of the key register are compared. Thus the key
provides a mask or identifying piece of information, which specifies how the reference to memory is made. To
illustrate with a numerical example, suppose that the argument register A and the key register K have the bit
configuration shown below. Only the three leftmost bits of A are compared with memory words because K has
1's in these positions.
A 101 111100
K 111 000000
Word 1 100 111100 no match
Word 2 101 000001 match

Word 2 matches the unmasked argument field because the three leftmost bits of the argument and the word
are equal.

Each cell in the array is denoted by the letter C with two subscripts, where the first subscript gives the word number and the second specifies the bit position in the word. The internal organization of a typical cell Cij is shown in the figure. It consists of a flip-flop storage element Fij and the circuits for reading, writing, and matching the cell. The input bit is transferred into the storage cell during a write operation. The bit stored is read out during a read operation. The match logic compares the content of the storage cell with the corresponding unmasked bit of the argument and provides an output for the decision logic that sets the bit in Mi.
Match logic: The match logic for each word can be derived from the comparison algorithm for two binary numbers. First, we neglect the key bits and compare the argument in A with the bits stored in the cells of the words. Word i is equal to the argument in A if Aj = Fij for j = 1, 2, ..., n. Two bits are equal if they are both 1 or both 0. The equality of two bits can be expressed logically by the Boolean function
xj = Aj Fij + Aj' Fij'
where xj = 1 if the pair of bits in position j are equal; otherwise, xj = 0 (primes denote complements).
For word i to be equal to the argument in A we must have all xj variables equal to 1. This is the condition for setting the corresponding match bit Mi to 1. The Boolean function for this condition is
Mi = x1 x2 x3 ... xn
and constitutes the AND operation of all pairs of matched bits in a word.
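In software terms the masked comparison reduces to one bitwise test: word i matches when it agrees with the argument in every position where the key has a 1, i.e. when (Wi XOR A) AND K = 0. A Python sketch using the numerical example above:

def cam_search(words, argument, key):
    # a word matches if it equals the argument in all unmasked (key=1) bits
    return [i for i, w in enumerate(words) if (w ^ argument) & key == 0]

A = 0b101111100          # argument register
K = 0b111000000          # key register: compare only the 3 leftmost bits
words = [0b100111100,    # word 1 -> no match
         0b101000001]    # word 2 -> match
print(cam_search(words, A, K))   # [1] (word 2, zero-indexed)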

Network Partitioning
The concept of virtual networks leads to the partitioning of a given physical network into logical subnetworks
for multicast communications. The idea is depicted in the figure.

Memory Chip Organization


Most semiconductor memories are packaged in chips. These memory chips may store information ranging from 64K bits to 1M bits. There are several memory organization techniques used for a chip, and the most common of these are the 2D and 2½D organizations.

2D Memory Organization:
In 2D memory organization, memory cells are organized as an array of words. Any word can be accessed
randomly.
Consider an N-bit address; there will be M = 2^N words. Let there be B bits per word.
In the PC-286, word length = 16 bits, so a 64KB memory holds 32K words.
An address decoder is used to decode the address in the N-bit MAR. The address decoder selects one out of 2^N word lines. In semiconductor memory, there will be B memory elements connected to each word line. The address decoder thus selects the appropriate word by applying a voltage to its word line just before the read or write operation.
Each memory element of a particular column is connected to two bit wires, one to sense/write a 0 bit and the other to sense/write a 1 bit. These wires are connected to sense amplifiers that sense the voltage on the proper bit wires to read a 0 or 1 during a read operation. After sensing the voltage on the proper bit wires, the sense amplifier stores a 0 or 1 in the MBR. During a write operation, the data to be written is first stored in the MBR. The sense amplifier then forces the appropriate voltage onto the proper bit wires according to the data in the MBR. This forces the corresponding memory elements to store the required bit.

In the 2D organization the number of word lines equals the number of words in the memory, i.e., if there are M words then M word lines are needed, and the address decoder is correspondingly complex.

2½D Memory Organization:
Another organization, called the 2½D organization, uses only √M word lines per plane, allows the use of simpler decoders, and also enables memories to be constructed modularly. Consider M words with B bits/word. There will then be B bit planes, each containing M memory elements. Therefore, to implement this memory, B chips of M x 1 organization are required. Thus the bits of a word are spread over a number of chips; normally one bit of a word lies in a single chip.
The word address is split into a row address and a column address. A row line and a column line connect each memory element on each bit plane. To select a word, the most significant bits of the word address are entered in the row register and the least significant bits in the column register. The voltage of the selected row is lowered. The column decoder selects only the bit in the column indicated by the column register. Thus one bit of the word is selected from each plane, as sketched below.
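A minimal illustration of the row/column split (the bit widths below are assumed for the example, not taken from the text):

def split_address(addr, col_bits):
    # 2.5D access: high-order bits -> row register (word line),
    # low-order bits -> column register (column decoder on each bit plane)
    col = addr & ((1 << col_bits) - 1)
    row = addr >> col_bits
    return row, col

# e.g. 1K words per bit plane arranged as 32 rows x 32 columns
print(split_address(0b1010111010, 5))   # (row 21, column 26)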

Comparison of 2D and 2½D organization

The 2½D organization of chips is supposed to be more advantageous because:
1. It requires less circuitry and fewer gates.
2. The chip has only one input/output pin in 2½D, while it has 16 or 32 input/output pins in 2D. A smaller number of pins is a desirable feature of chip packages.
3. In 2D, error-correcting codes cannot be used effectively. In 2½D, one bit may be corrupted in a bit plane and can be recovered by a suitable error-correcting method, but this is not possible in 2D.
4. ROM usually uses the 2D chip organization; on the other hand, 2½D is increasingly finding application in RAM construction.

Orthogonal Memory Organization:

Orthogonal memory can be accessed either by a word or by a bit-slice. A bit-slice is defined as the set of all the bits in the same bit position of a specific set of words. The user may request a word for a read or write operation, but can equally read or write a bit-slice.

System Attributes to Performance

Cycle time = time period = τ, in nanoseconds
Frequency = clock rate = f = 1/τ, in megahertz
Size of the program = instruction count = Ic
Cycles per instruction (average) = CPI
So the time required to execute a program is T = Ic x CPI x τ
T = Ic x (p + m x k) x τ, where
p = no. of processor cycles for instruction decode and execution
m = no. of memory references required
k = the ratio between memory-cycle and processor-cycle times
C = total no. of clock cycles needed to execute a program
So T = C x τ = C/f, CPI = C/Ic, and T = Ic x CPI x τ
Million Instructions Per Second = MIPS = Ic / (T x 10^6) = f / (CPI x 10^6) = f x Ic / (C x 10^6)
Throughput rate = no. of programs / unit time = Wp = f / (Ic x CPI) = (MIPS) x 10^6 / Ic
Problem:
A 40 MHz processor was used to execute a benchmark program with the following instruction mix and clock cycle counts:

Instruction Type      Instruction Count    Cycles per Instruction
Integer arithmetic    45000                1
Data transfer         32000                2
Floating point        15000                2
Control transfer      8000                 2

Calculate the effective CPI, MIPS rate, and execution time.

Here CPI = (45000 x 1 + 32000 x 2 + 15000 x 2 + 8000 x 2) / (45000 + 32000 + 15000 + 8000)
         = 155000 / 100000 = 1.55

MIPS = f / (CPI x 10^6) = 40 x 10^6 / (1.55 x 10^6) = 25.8

Execution time T = Ic x CPI x τ = (45K + 32K + 15K + 8K) x 1.55 x (1/40) μs
= 3.875 msec.
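The three formulas can be checked mechanically; this Python fragment reproduces the numbers of the problem above:

f = 40e6                                   # 40 MHz clock
mix = [(45000, 1), (32000, 2), (15000, 2), (8000, 2)]
Ic = sum(n for n, cpi in mix)              # 100000 instructions
C  = sum(n * cpi for n, cpi in mix)        # 155000 cycles
CPI  = C / Ic                              # 1.55
MIPS = f / (CPI * 1e6)                     # ~25.8
T    = Ic * CPI / f                        # 0.003875 s = 3.875 msec
print(CPI, MIPS, T * 1e3)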

Problem 2: The execution times (in seconds) of four programs on three computers are given below.
Execution Time
Program    Computer A    Computer B    Computer C
P1         1             10            20
P2         1000          100           40
P3         500           1000          50
P4         100           800           100
Assuming that 10^8 instructions were executed in each of the four programs, calculate the MIPS rating of each program. Draw a clear conclusion regarding the relative performance of the three computers.

Answer: Here T is given in the table, Ic = 10^8, and MIPS = Ic / (T x 10^6).

In case of computer A:
For program 1, MIPS rate = 10^8 / (1 x 10^6) = 100
For program 2, MIPS rate = 10^8 / (10^3 x 10^6) = 0.1
For program 3, MIPS rate = 10^8 / (500 x 10^6) = 0.2
For program 4, MIPS rate = 10^8 / (10^2 x 10^6) = 1

In case of computer B:
For program 1, MIPS rate = 10^8 / (10 x 10^6) = 10
For program 2, MIPS rate = 10^8 / (10^2 x 10^6) = 1
For program 3, MIPS rate = 10^8 / (10^3 x 10^6) = 0.1
For program 4, MIPS rate = 10^8 / (800 x 10^6) = 0.125

In case of computer C:
For program 1, MIPS rate = 10^8 / (20 x 10^6) = 5
For program 2, MIPS rate = 10^8 / (40 x 10^6) = 2.5
For program 3, MIPS rate = 10^8 / (50 x 10^6) = 2
For program 4, MIPS rate = 10^8 / (100 x 10^6) = 1

Average MIPS rate of computer A = (100 + 0.1 + 0.2 + 1)/4 = 101.3/4 = 25.32
Average MIPS rate of computer B = (10 + 1 + 0.1 + 0.125)/4 = 11.225/4 = 2.81
Average MIPS rate of computer C = (5 + 2.5 + 2 + 1)/4 = 10.5/4 = 2.62
Among the three computers, computer A thus has the best performance from the average-MIPS point of view.

Problem 3: Consider the execution of an object code with 200,000 instructions on a 40 MHz processor. The program consists of four major types of instructions. The instruction mix and the number of cycles (CPI) needed for each instruction type are given below.
Instruction type                     CPI    Instruction Mix
Arithmetic & logic                   1      60%
Load/store with cache hit            2      18%
Branch                               4      12%
Memory reference with cache miss     8      10%
(a) Calculate the average CPI when the program is executed on a uniprocessor with the above trace results.
(b) Calculate the corresponding MIPS rate based on the CPI obtained in part (a).
Answer:
(a) Here Ic = 200000
and C = (200000 x 0.60 x 1) + (200000 x 0.18 x 2) + (200000 x 0.12 x 4) + (200000 x 0.10 x 8) = 448000
CPI = C / Ic = 448000 / 200000 = 2.24 cycles/instruction
(Equivalently, per 100 instructions: C = 60 x 1 + 18 x 2 + 12 x 4 + 10 x 8 = 224, so average CPI = 224/100 = 2.24.)
(b) MIPS = f / (CPI x 10^6), and here f = 40 x 10^6 Hz. So MIPS = (40 x 10^6) / (2.24 x 10^6) = 17.86

Problem 4: A workstation uses a 15 MHz processor with a claimed 10 MIPS rating to execute a given program mix. Assume a one-cycle delay for each memory access.
(a) What is the effective CPI of this computer?
(b) Suppose the processor is upgraded with a 30 MHz clock, but the speed of the memory subsystem remains unchanged; consequently, two clock cycles are now needed per memory access. If 30% of the instructions require one memory access and another 5% require two memory accesses per instruction, what is the performance of the upgraded processor with a compatible instruction set and equal instruction count in the given program mix?
Answer:
(a) CPI = ?, f = 15 x 10^6 Hz, MIPS = 10
MIPS = f / (CPI x 10^6), so CPI = f / (MIPS x 10^6) = (15 x 10^6) / (10 x 10^6) = 1.5
(b) Let Ic = instruction count. At 15 MHz each memory access adds one cycle of delay; at 30 MHz each adds two. The performance ratio is
T15 / T30 = [Ic x (0.30 x (CPI + 1) + 0.05 x (CPI + 2) + 0.65 x CPI) x (1/15)] / [Ic x (0.30 x (CPI + 2) + 0.05 x (CPI + 4) + 0.65 x CPI) x (1/30)]
Here CPI = 1.5, so
performance ratio = ((0.30 x 2.5 + 0.05 x 3.5 + 0.65 x 1.5) x 2) / (0.30 x 3.5 + 0.05 x 5.5 + 0.65 x 1.5)
= ((0.75 + 0.175 + 0.975) x 2) / (1.05 + 0.275 + 0.975) = 3.8 / 2.3 = 1.65
So the upgrade gives more than a 60% performance improvement.

Direct Memory Access


The transfer of data between a fast storage device (e.g. a magnetic disk) and memory is often limited by the speed of the CPU. Removing the CPU from the path and letting the peripheral device manage the memory buses directly would improve the speed of transfer. This transfer technique is called direct memory access (DMA). During DMA transfer, the CPU is idle and has no control of the memory buses. A DMA controller takes over the buses to manage the transfer directly between the I/O device and memory. For this, the CPU provides two control signals:
Bus request: The bus request (BR) input is used by the DMA controller to request the CPU to relinquish control of the buses. When this input is active, the CPU terminates the execution of the current instruction and places the address bus, the data bus, and the read and write lines into a high-impedance state.
Bus grant: The CPU activates the bus grant (BG) output to inform the external DMA that the buses are in the high-impedance state. The DMA that originated the bus request can now take control of the buses to conduct memory transfers without processor intervention. When the DMA terminates the transfer, it disables the bus request line. The CPU disables the bus grant, takes control of the buses, and returns to its normal operation.
Burst transfer: When the DMA takes control of the bus system, it communicates directly with the memory.
The transfer can be made in several ways. In DMA burst transfer, a block sequence consisting of a number of
memory words is transferred in a continuous burst while the DMA controller is master of the memory buses.
This mode of transfer is needed for fast devices such as magnetic disks, where data transmission cannot be
stopped or slowed down until an entire block is transferred.
Cycle stealing: It allows the DMA controller to transfer one data word at a time, after which it must return
control of the buses to the CPU. The CPU merely delays its operation for one memory cycle to allow the direct
memory I/O transfer to "steal" one memory cycle.
DMA Controller: The DMA controller needs the usual circuits of an interface to communicate with the CPU and I/O device. In addition, it needs an address register, a word count register, and a set of address lines. The address register and address lines are used for direct communication with the memory. The word count register specifies the number of words that must be transferred. The unit communicates with the CPU via the data bus and control lines. The CPU, through the address bus, selects the registers in the DMA by enabling the DS (DMA select) and RS (register select) inputs. The RD (read) and WR (write) inputs are bi-directional. When the BG (bus grant) input is 0, the CPU can communicate with the DMA registers through the data bus to read from or write to the DMA registers. When BG = 1, the CPU has relinquished the buses and the DMA can communicate directly with the memory by specifying an address on the address bus and activating the RD or WR control. The DMA communicates with the external peripheral through the request and acknowledge lines using a prescribed handshaking procedure.

The DMA controller has three registers: an address register, a word count register, and a control register. The
address register contains an address to specify the desired location in memory. The word count register holds
the number of words to be transferred. This register is decremented, and the address register incremented, by one after each word transfer, and the count is internally tested for zero. The control register specifies the mode of transfer.
The CPU initializes the DMA by sending the following information through the data bus:
1. The starting address of the memory block where data are available (for read) or are to be stored (for write)
2. The word count, which is the number of words in the memory block
3. Control to specify the mode of transfer such as read or write
4. A control to start the DMA transfer
The CPU communicates with the DMA through the address and data buses as with any interface unit. The DMA has its own address, which activates the DS (DMA select) and RS (register select) lines. The CPU initializes the DMA through the data bus. Once the DMA receives the start control command, it can start the transfer between the peripheral device and the memory.
When the peripheral device sends a DMA request, the DMA controller activates the BR line, informing the CPU to relinquish the buses. The CPU responds with its BG line, informing the DMA that its buses are disabled. The DMA then puts the current value of its address register onto the address bus, initiates the RD or WR signal, and sends a DMA acknowledge to the peripheral device. The RD and WR lines in the DMA controller are bi-directional. The direction of transfer depends on the status of the BG line. When BG = 0, the RD and WR are input lines allowing the CPU to communicate with the internal DMA registers. When BG = 1, the RD and WR are output lines from the DMA controller to the random-access memory to specify the read or write operation on the data.
When the peripheral device receives a DMA acknowledge, it puts a word in the data bus (for write) or receives
a word from the data bus (for read). Thus the DMA controls the read or write operations and supplies the
address for the memory. The peripheral unit can then communicate with memory through the data bus for
direct transfer between the two units while the CPU is momentarily disabled.
For each word that is transferred, the DMA increments its address register and decrements its word count
register. If the word count does not reach zero, the DMA checks the request line coming from the peripheral.
For a high-speed device, the line will be active as soon as the previous transfer is completed. A second transfer
is then initiated, and the process continues until the entire block is transferred. If the peripheral speed is
slower, the DMA disables the bus request line so that the CPU can continue to execute its program. When the
peripheral requests a transfer, the DMA requests the buses again.
If the word-count register reaches zero, the DMA stops any further transfer and removes its bus request. It also informs the CPU of the termination by means of an interrupt. When the CPU responds to the interrupt, it reads the content of the word count register. The zero value of this register indicates that all the words were
transferred successfully. The CPU can read this register at any time to check the number of words already
transferred. A DMA controller may have more than one channel. In this case, each channel has a request and
acknowledge pair of control signals which are connected to separate peripheral devices. Each channel also has
its own address register and word count register within the DMA controller. A priority among the channels
may be established so that channels with high priority are serviced before channels with lower priority.
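As a toy model (purely illustrative, not a real controller), the address-increment / word-count-decrement loop described above can be written out in Python for a write transfer (device to memory):

def dma_write_transfer(memory, start_addr, device_words):
    addr = start_addr                  # DMA address register
    wc   = len(device_words)           # DMA word-count register
    for word in device_words:
        memory[addr] = word            # DMA drives address bus, asserts WR
        addr += 1                      # address register incremented
        wc   -= 1                      # word count decremented, tested for 0
    return addr, wc                    # wc == 0 -> interrupt the CPU: done

mem = [0] * 16
print(dma_write_transfer(mem, 4, [10, 20, 30]), mem[4:7])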

[Figure: DMA and interrupt breakpoints during an instruction cycle. DMA breakpoints may occur between any two processor cycles (instruction fetch, data fetch, execution), while the interrupt breakpoint occurs only at the end of an instruction cycle.]

[Figure: Typical DMA block diagram, showing the data bus with a data register, the address register driving the address lines, and the DMA control logic with DMA REQ / DMA ACK, INTR, read, and write lines.]

[Figure: Alternative DMA configurations.
(a) Single bus, detached DMA: CPU, DMA module, I/O modules, and memory all attached to one system bus.
(b) Single bus, integrated DMA-I/O: each DMA module on the system bus directly controls one or more I/O modules.
(c) I/O bus: a single DMA module sits on the system bus with the CPU and memory, and all I/O modules hang off a separate I/O bus behind it.]

DIRECT MEMORY ACCESS:


Drawbacks of programmed and interrupt driven I/O:

Interrupt-driven I/O, though more efficient than simple programmed I/O, still requires the active
intervention of the CPU to transfer data between memory and an I/O module, and any data transfer must
traverse a path through the CPU. Thus, both these forms of I/O suffer from two inherent drawbacks:
1. The I/O transfer rate is limited by the speed with which the CPU can test and service a device.
2. The CPU is tied up in managing an I/O transfer; a number of instructions must be executed for each
I/O transfer.

DMA Function:
The DMA module is capable of mimicking the CPU and, indeed, of taking over control of the system
from the CPU. The technique works as follows:

When the CPU wishes to read or write a block of data, it issues a command to the DMA module, by
sending to the DMA module the following information:
1. Whether a read or write is requested.
2. The address of the I/O device involved.
3. The starting location in memory to read from or write to.
4. The number of words to be read or written.
Thus the CPU has delegated this I/O operation to the DMA module and continues with other work.
The DMA module transfers the entire block of data, one word at a time, directly to or from memory
without going through the CPU.
When the transfer is complete, the DMA module sends an interrupt signal to the CPU.

The DMA module needs to take control of the bus in order to transfer data to and from memory. Thus the DMA module must use the bus only when the CPU does not need it, or it must force the CPU to temporarily suspend operation. The latter technique is more common and is referred to as cycle stealing, since the DMA module in effect steals a bus cycle.

The figure shows where in the instruction cycle the CPU may be suspended. In each case, the CPU is suspended just before it needs to use the bus. The DMA module then transfers one word and returns control to the CPU. This is not an interrupt; the CPU does not save a context and do something else. Rather, the CPU pauses for one bus cycle. The overall effect is to cause the CPU to execute more slowly. Still, for a multiple-word I/O transfer, DMA is far more efficient than interrupt-driven or programmed I/O.

The DMA mechanism can be configured in a variety of ways:

Single-bus, detached DMA: Here all modules share the same system bus. The DMA module, acting as a surrogate CPU, uses programmed I/O to exchange data between memory and an I/O module through the DMA module. This configuration, while it may be inexpensive, is clearly inefficient: as with CPU-controlled programmed I/O, each transfer of a word consumes two bus cycles.
Single-bus, integrated DMA-I/O: The number of required bus cycles can be cut substantially by integrating the DMA and I/O functions. This means that there is a path between the DMA module and one or more I/O modules that does not include the system bus. The DMA logic may actually be a part of an I/O module, or it may be a separate module that controls one or more I/O modules.
I/O bus: The above configuration can be improved further by connecting the I/O modules to the DMA module using an I/O bus. This reduces the number of I/O interfaces in the DMA module to one and provides for an easily expandable configuration.

In the last two configurations, the system bus that the DMA module shares with CPU and memory is used by
the DMA module only to exchange data with memory. The exchange of data between the DMA and I/O
modules takes place off the system bus.

Characteristics of Vector Processing


A vector operand contains an ordered set of n elements, where n is the length of the vector. Each element in a vector is a scalar quantity, which may be a floating-point number, an integer, a logical value, or a character (byte). Vector instructions can be classified into four primitive types:
f1: V → V,  f2: V → S,  f3: V x V → V,  f4: V x S → V
where V and S denote a vector operand and a scalar operand, respectively. The mappings f1 and f2 are unary operations and f3 and f4 are binary operations.
Some special instructions may be used to facilitate the manipulation of vector data:
A Boolean vector can be generated as a result of comparing two vectors, and can be used as a masking vector
for enabling or disabling component operations in a vector instruction.
A compress instruction will shorten a vector under the control of a masking vector.
A merge instruction combines two vectors under the control of a masking vector. The resulting operands in compress and merge may have a different length from that of the input operands.
The machine operations suitable for pipelining should have the following properties:
Identical processes (functions) are repeatedly invoked many times, each of which can be subdivided into
subprocesses (subfunctions).
Successive operands are fed through the pipeline segments and require as few buffers and local controls as possible.
Operations executed by distinct pipelines should be able to share expensive resources, such as memories and
buses, in the system.
Most vector processors have a pipelined structure because, unlike a scalar processor, a vector processor needs to perform the same operation on different data sets repeatedly. The overhead caused by the loop-control mechanism in a scalar processor is eliminated in a vector processor. Because of the startup delay in a pipeline, a vector processor performs better with longer vectors.
The following fields usually specify vector instructions:
1. The operational code must be specified in order to select the functional unit or to reconfigure a
multifunctional unit to perform the specified operation.
2. For a memory-reference instruction, the base addresses are needed for both source operands and the result
vectors. If the operands and results are located in the vector register file, the designated vector registers must
be specified.
3. The address increment between the elements must be specified. Some computers restrict the elements to be consecutively stored in main memory, i.e., the increment is always 1. In some others, a variable increment is possible, thus offering higher flexibility in application.
4. The address offset relative to the base address should be specified. Using the base address and the offset, the
effective memory address can be calculated. The offset, either positive or negative, offers the use of skewed
vectors to achieve parallel accesses.
5. The vector length is needed to determine the termination of a vector instruction. A mask vector may be used to
mask off some of the elements without changing the contents of the original vectors.
Pipeline vector computers can be classified into two architectural configurations according to where the operands are retrieved in a vector processor.
In the memory-to-memory architecture, the source operands, intermediate and final results are retrieved from the main memory. Here, the information about the base address, offset, increment, and vector length must be specified in order to enable streams of data transfers between the memory and the pipelines. Examples: TI-ASC, CDC STAR-100, CYBER-205. In the register-to-register architecture, the operands and the results are indirectly retrieved from main memory through the use of a large number of vector or scalar registers. Example: CRAY-1. The overhead of pipeline processing is mainly the setup time, which is needed to route the operands among functional units. Another overhead is the flushing time between the decoding of a vector instruction and the exit of the first result from the pipeline. The vector length affects the processing efficiency because of the additional overhead caused by subdividing a long vector. In order to enhance the vector processing capability, an optimized object code must be produced to maximize the utilization of the pipeline resources. Vector computations often involve processing large arrays of data. By ordering successive computations in the array, vector (array) processing methods can be classified into three types:
1. Horizontal processing, in which vector computations are performed horizontally, left to right, in row fashion.
2. Vertical processing, in which vector computations are carried out vertically, top to bottom, in column fashion.
3. Vector looping, in which segmented vector loop computations are performed from left to right and top to bottom in a combined horizontal and vertical method.

Some representative vector instructions

Type  Mnemonic  Description (I = 1 through N)
f1    VSQR      Vector square root:  B(I) ← √A(I)
      VSIN      Vector sine:         B(I) ← sin(A(I))
      VCOM      Vector complement:   A(I) ← ¬A(I)
f2    VSUM      Vector summation:    S = Σ(I=1..N) A(I)
      VMAX      Vector maximum:      S = max(I=1..N) A(I)
f3    VADD      Vector add:          C(I) = A(I) + B(I)
      VMPY      Vector multiply:     C(I) = A(I) * B(I)
      VAND      Vector AND:          C(I) = A(I) and B(I)
      VLAR      Vector larger:       C(I) = max(A(I), B(I))
      VTGE      Vector test >:       C(I) = 0 if A(I) < B(I); C(I) = 1 if A(I) > B(I)
f4    SADD      Vector-scalar add:    B(I) = S + A(I)
      SDIV      Vector-scalar divide: B(I) = A(I) / S

f1: V → V,  f2: V → S,  f3: V x V → V,  f4: V x S → V

Example:
(1) X=(2,5,8,7) and Y=(9,3,6,4); after the operation B = X > Y is executed, the Boolean vector B=(0,1,1,1) is generated.
(2) X=(1,2,3,4,5,6,7,8) and B=(1,0,1,0,1,0,1,0); after the compress operation Y = X(B), vector Y=(1,3,5,7).
(3) X=(1,2,4,8), Y=(3,5,6,7) and B=(1,1,0,1,0,0,0,1); after the merge operation Z = X,Y(B), vector Z=(1,2,3,4,5,6,7,8).
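The compress and merge operations of examples (2) and (3) can be expressed directly; a Python sketch (illustrative, reproducing the examples above):

def compress(x, b):
    # keep X(i) wherever mask bit B(i) = 1
    return [xi for xi, bi in zip(x, b) if bi]

def merge(x, y, b):
    # where B(i)=1 take the next element of X, otherwise the next of Y
    ix, iy = iter(x), iter(y)
    return [next(ix) if bi else next(iy) for bi in b]

print(compress([1,2,3,4,5,6,7,8], [1,0,1,0,1,0,1,0]))   # [1, 3, 5, 7]
print(merge([1,2,4,8], [3,5,6,7], [1,1,0,1,0,0,0,1]))   # [1, 2, ..., 8]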

[Figure: the four vector instruction types: (a) f1: V1 → V2, (b) f2: V1 → S, (c) f3: V1 x V2 → V3, (d) f4: S x V1 → V2.]

Structures and Algorithms for Array (SIMD) Processors


A synchronous array of parallel processors is called an array processor, which consists of multiple processing elements (PEs) under the supervision of one control unit (CU). An array processor can handle single instruction and multiple data (SIMD) streams. In this sense, array processors are also known as SIMD computers. SIMD machines are especially designed to perform vector computations over matrices or arrays of data. SIMD computers appear in two basic architectural organizations:
Array processors, using random-access memory;
Associative processors, using content-addressable (associative) memory.

SIMD COMPUTER ORGANIZATIONS


One of the configurations is structured with N synchronized PEs, all of which are under the control of one CU. Each PEi is essentially an arithmetic logic unit (ALU) with attached working registers and a local memory PEMi for the storage of distributed data. The CU also has its own main memory for the storage of programs. The system and the user programs are executed under the control of the CU. Scalar or control-type instructions are directly executed inside the CU. Vector instructions are broadcast to the PEs for distributed execution, achieving spatial parallelism through duplicate arithmetic units (PEs).
A masking vector is used to control the status of the PEs; in other words, not all the PEs need to participate in the execution of a vector instruction.
Configuration II differs from configuration I in two aspects:
1. The local memories attached to the PEs are replaced by parallel memory modules shared by all the PEs through an alignment network.
2. The inter-PE permutation network is replaced by the inter-PE memory alignment network, which is controlled by the CU.
The alignment network is a path-switching network between the PEs and the parallel memories. Such an alignment network is desired to allow conflict-free access to the shared memories by as many PEs as possible.

An SIMD computer C is characterized by the set of parameters:

C = <N, F, I, M>
where
N = the number of PEs in the system. For example, the Illiac-IV has N = 64, the BSP has N = 16, and the MPP has N = 16,384.
F = a set of data-routing functions provided by the interconnection network or by the alignment network.
I = the set of machine instructions for scalar-vector, data-routing, and network-manipulation operations.
M = the set of masking schemes, where each mask partitions the set of PEs into the two disjoint subsets of enabled PEs and disabled PEs.

Inter-PE Communications:
There are four fundamental design decisions in determining the appropriate architecture of an interconnection network for an SIMD machine. The decisions are made between operation modes, control strategies, switching methodologies, and network topologies.

Operation mode: two types of communication can be identified: synchronous and asynchronous. Synchronous communication is needed for establishing communication paths synchronously, for either a data manipulating function or a data/instruction broadcast. Asynchronous communication is needed for multiprocessing, in which connection requests are issued dynamically.
Control strategy: a typical interconnection network consists of a number of switching elements and
interconnecting links. Interconnection functions are realized by properly setting control of the switching elements.
The control-setting function can be managed by a centralized controller or by the individual switching elements. The
latter strategy is called distributed control and the first strategy corresponds to centralized control.

Switching methodology: the two major switching methodologies are circuit switching and packet switching. In circuit switching, a physical path is actually established between a source and a destination. In packet switching, data are put in a packet and routed through the interconnection network without establishing a physical connection path. In general, circuit switching is much more suitable for bulk data transmission, while packet switching is more efficient for many short data messages.
Network topology: a network can be depicted by a graph in which nodes represent switching points and edges represent communication links. The topologies tend to be regular and can be grouped into two categories: static and dynamic. In a static topology, the links between two processors are passive, and the dedicated buses cannot be reconfigured for direct connections to other processors. In the dynamic category, on the other hand, links can be reconfigured by setting the network's active switching elements.
The space of interconnection networks can be represented by the Cartesian product of the above four sets of design features: {operation mode} x {control strategy} x {switching methodology} x {network topology}.
Static Vs. Dynamic Networks:
The topological structure of an SIMD array processor is mainly characterized by the data-routing network
used in interconnecting the processing elements. Formally, such an inter-PE communication network can be
specified by a set of data-routing functions.
The SIMD interconnection networks are classified into the following two categories based on network
topologies: static and dynamic networks.
In static networks, topologies can be classified according to the dimensions required for layout: e.g., the linear array is 1D; the star, ring, tree, mesh, and systolic array are 2D; and the completely connected network, chordal ring, 3-cube, and 3-cube-connected-cycle are 3D.
In dynamic networks, two classes can be described: single-stage versus multistage.
Single-stage networks: a single-stage network is a switching network with N input selectors (IS) and N output selectors (OS). Each IS is essentially a 1-to-D demultiplexer and each OS is an M-to-1 multiplexer, where 1 ≤ D ≤ N and 1 ≤ M ≤ N. The crossbar switching network is a single-stage network with D = M = N. To establish a desired connecting path, different path-control signals are applied to the IS and OS selectors. The single-stage network is also called a recirculating network.
Multistage networks: many stages of interconnected switches form a multistage SIMD network. Multistage networks are described by three characterizing features: the switch box, the network topology, and the control structure. There are four states of a switch box: straight, exchange, upper broadcast, and lower broadcast. A multistage network is capable of connecting an arbitrary input terminal to an arbitrary output terminal. Multistage networks may be one-sided (called full switches, having input/output ports on the same side) or two-sided (having two sides for the input and output sections). Two-sided multistage networks can be divided into three classes:
Blocking networks, in which simultaneous connections of more than one terminal pair may result in conflicts. Examples: the data manipulator, baseline, omega, and n-cube networks.
Rearrangeable networks, which can perform all possible connections between inputs and outputs by rearranging their existing connections. Example: the Benes network.
Nonblocking networks, which can perform all possible connections between inputs and outputs without blocking. Example: the Clos network, which can perform one-to-one and one-to-many connections.
Cube Interconnection Networks:
The cube network can be implemented either as a recirculating network or as a multistage network for SIMD machines. A three-dimensional cube is shown. Vertical lines connect vertices (PEs) whose addresses differ in the most significant bit position. Vertices at both ends of the diagonal lines differ in the middle bit position. Horizontal lines differ in the least significant bit position. This unit-cube concept can be extended to an n-dimensional unit space, called an n-cube, with n bits per vertex address.
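The cube routing functions are just bit complements of the PE address; a short sketch for a 3-cube (illustrative only):

def cube(i, x):
    # C_i(x): complement bit i of PE address x
    return x ^ (1 << i)

n = 3
for x in range(2 ** n):
    # each PE is linked to n neighbours, one per address bit:
    # C_0 = horizontal, C_1 = diagonal, C_2 = vertical links in the figure
    print(format(x, "03b"), [format(cube(i, x), "03b") for i in range(n)])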

An O(n^3) algorithm for SISD matrix multiplication:
For i = 1 to n Do
  For j = 1 to n Do
    Cij = 0                         (initialization)
    For k = 1 to n Do
      Cij = Cij + aik * bkj         (scalar additive multiply)
    End of k loop
  End of j loop
End of i loop

An O(n^2) algorithm for SIMD matrix multiplication:
For i = 1 to n Do
  Par For k = 1 to n Do
    Cik = 0                         (vector load)
  For j = 1 to n Do
    Par For k = 1 to n Do
      Cik = Cik + aij * bjk         (vector multiply)
  End of j loop
End of i loop
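To see why the SIMD version takes O(n^2) steps, note that each "Par For k" line is a single vector operation executed by all PEs at once. A Python sketch (illustrative) that models each vector operation as one list assignment:

def simd_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]    # Par For k: Cik = 0 (vector load)
    for i in range(n):
        for j in range(n):
            # one vector step: C[i][k] += A[i][j] * B[j][k] for all k at once
            C[i] = [C[i][k] + A[i][j] * B[j][k] for k in range(n)]
    return C                            # n*n vector steps in total

print(simd_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]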


To illustrate the necessity of data routing in an array processor, we consider an array processor of N PEs. The sum S(K) of the first K components in a vector A is desired for each K from 0 to n-1. Let A = (A0, A1, ..., A(n-1)). We need to compute the following n summations:

S(K) = Σ(i=0..K) Ai,  for K = 0, 1, ..., n-1

These n vector summations can be computed recursively by going through the following n-1 iterations, defined by

S(0) = A0
S(K) = S(K-1) + AK,  where K = 1, 2, ..., n-1

[Figure: calculation of the summations S(K) = Σ(i=0..K) Ai, K = 0, 1, ..., 7, in an SIMD machine with eight PEs (PE0-PE7). Each PEk starts with Ak; after step 1 it holds the partial sum over indices k-1..k, after step 2 the partial sum over k-3..k, and after step 3 the full sum S(K) over 0..k, using route distances of 1, 2, and 4.]
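The routing pattern in the figure is recursive doubling: in step s, each PE k adds in the partial sum routed from PE k - 2^(s-1), so all n prefix sums are ready after log2(n) steps. A sequential Python sketch of the parallel algorithm (illustrative only):

def simd_prefix_sums(A):
    n = len(A)                     # n PEs, PE k holds A[k]; n a power of two
    S = list(A)
    step = 1
    while step < n:                # log2(n) routing steps
        # all PEs update "simultaneously": read old values, then write
        S = [S[k] + (S[k - step] if k >= step else 0) for k in range(n)]
        step *= 2
    return S                       # S[k] = A[0] + ... + A[k]

print(simd_prefix_sums([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]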



[Figure: Architectural configurations of SIMD array processors.
(a) Configuration I (Illiac IV): the CU, with its own memory for data and instructions (loaded from I/O over the data bus), issues instructions over a control bus to PE0, PE1, ..., PE(N-1); each PEi has a local memory PEMi, and the PEs exchange data through an interconnection network under CU control.
(b) Configuration II (BSP, i.e., the Burroughs Scientific Processor): the CU and its memory drive PE0, PE1, ..., PE(N-1), which share the parallel memory modules M0, M1, ..., M(P-1) through an alignment network under CU control.]

Hamming Code (SECDED Code):

SECDED: single error correction, double error detection.
Parity bit:
Even parity (the total no. of 1s in the message plus the parity bit is even)
Odd parity (the total no. of 1s in the message plus the parity bit is odd)
Let us use even-parity logic in the following example:

Case 1: The sender station transmits the data 1011 with (even) parity bit 1. The receiver station receives 1111 with parity bit 1, but recomputes the parity of 1111 as 0. As the received and recomputed parity bits are different, the error is detected.
Case 2: The sender station transmits the data 1011 with parity bit 1. The receiver station receives 1101 with parity bit 1 and recomputes the parity of 1101 as 1. Here the error could not be detected.

So the main limitation of the parity bit is that it cannot detect an error when errors occur in an even number of places. To solve this problem we introduce the SECDED code.

In this method more than one parity bit is generated, each from a distinct group of message/data bits. A few points to settle are:
1. How many parity bits are to be introduced?
2. How are the message bits involved?
3. How will the bits be arranged?
4. How does SEC work?
5. How does DED work?

If there are N data bits and P parity bits, then the governing relation is

N + P + 1 ≤ 2^P, i.e., N ≤ 2^P - P - 1

So for N = 8, P = 4;
for N = 14, P = 5;
and for N = 20, P = 5.

Message bits: M12 M11 M10 M9  M8  M7  M6  M5  M4  M3  M2  M1
Data bits:    D8  D7  D6  D5      D4  D3  D2      D1
Parity bits:                  P4              P3      P2  P1

Erroneous place no. (decimal)   P4 P3 P2 P1   Inference / Comment
1                               _  _  _  1    Parity bit 1 (P1) in error
2                               _  _  1  _    Parity bit 2 (P2) in error
3                               _  _  1  1    Data bit 1 (D1) in error
4                               _  1  _  _    Parity bit 3 (P3) in error
5                               _  1  _  1    Data bit 2 (D2) in error
6                               _  1  1  _    Data bit 3 (D3) in error
7                               _  1  1  1    Data bit 4 (D4) in error
8                               1  _  _  _    Parity bit 4 (P4) in error
9                               1  _  _  1    Data bit 5 (D5) in error
10                              1  _  1  _    Data bit 6 (D6) in error
11                              1  _  1  1    Data bit 7 (D7) in error
12                              1  1  _  _    Data bit 8 (D8) in error

Example:

P4 = message positions {9,10,11,12} => data bits {5,6,7,8}: P4 = XOR{1,1,0,1} = 1
P3 = message positions {5,6,7,12} => data bits {2,3,4,8}: P3 = XOR{0,1,0,1} = 0
P2 = message positions {3,6,7,10,11} => data bits {1,3,4,6,7}: P2 = XOR{1,1,0,1,0} = 1
P1 = message positions {3,5,7,9,11} => data bits {1,2,4,5,7}: P1 = XOR{1,0,0,1,0} = 0

The data bits are:
8 7 6 5 4 3 2 1
1 0 1 1 0 1 0 1

After transmission the data bits are:
8 7 6 5 4 3 2 1
1 0 0 1 0 1 0 1

The receiver recomputes:
P4' = XOR{1,0,0,1} = 0
P3' = XOR{0,1,0,1} = 0
P2' = XOR{1,1,0,0,0} = 0
P1' = XOR{1,0,0,1,0} = 0

    P4  P3  P2  P1     => 1 0 1 0
XOR P4' P3' P2' P1'    => 0 0 0 0
Syndrome               => 1 0 1 0 => (1010)2 = 10, so message bit place no. 10 (data bit D6) has become corrupted.
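The whole scheme fits in a few lines once the bit positions are written down: parity bit Pj sits at position 2^(j-1) and covers every position whose binary representation has that bit set. The Python sketch below (illustrative) reproduces the example, with data bits D8..D1 = 10110101 and the error injected at message position 10:

DATA_POS = [3, 5, 6, 7, 9, 10, 11, 12]       # message positions of D1..D8
PARITY_POS = [1, 2, 4, 8]                    # message positions of P1..P4

def encode(data):                            # data = [D1, ..., D8]
    word = [0] * 13                          # 1-indexed positions 1..12
    for d, pos in zip(data, DATA_POS):
        word[pos] = d
    for p in PARITY_POS:                     # Pj = XOR of covered data bits
        for pos in range(1, 13):
            if pos & p and pos not in PARITY_POS:
                word[p] ^= word[pos]
    return word

def syndrome(word):                          # 0 = clean, else error position
    s = 0
    for p in PARITY_POS:
        parity = 0
        for pos in range(1, 13):
            if pos & p:
                parity ^= word[pos]
        if parity:
            s |= p
    return s

w = encode([1, 0, 1, 0, 1, 1, 0, 1])         # D1..D8
w[10] ^= 1                                   # corrupt message bit 10 (D6)
print(syndrome(w))                           # 10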

Class Work :

Original message : 11 11 001111001111


Received message : 1 1 0 0 1 0 1 1

Use SEC to detect the erroneous bit place number.



In the case of DED operation, double-bit errors are detected but cannot be corrected. In this DED scheme an extra general parity bit (GP), computed over the whole codeword, is attached to the old scheme, and the inference is drawn as shown below.

General parity bit detects    Syndrome is    Inference / Comment
1. No error                   Non-zero       Double error detected (DED)
2. Error                      Non-zero       Single error corrected (SEC)
3. No error                   Zero           No bit corruption has taken place

Example: <N + P + GP>

Original message:
8 7 6 5 4 3 2 1  GP          P4 = XOR{1,0,1,1} = 1
1 1 0 1 1 0 1 1   0          P3 = XOR{1,0,1,1} = 1
                             P2 = XOR{1,0,1,0,1} = 1
                             P1 = XOR{1,1,1,1,1} = 1

Received message:
8 7 6 5 4 3 2 1  GP          P4' = XOR{0,1,1,1} = 1
1 1 1 0 1 0 1 1   0          P3' = XOR{1,0,1,1} = 1
                             P2' = XOR{1,0,1,1,1} = 0
                             P1' = XOR{1,1,1,0,1} = 0

    P4  P3  P2  P1       1 1 1 1
XOR P4' P3' P2' P1'      1 1 0 0
Syndrome                 0 0 1 1 = (3)10

So the syndrome points to location 3, but the errors have actually occurred at data bits 5 and 6. The general parity bit (GP) detects no error, because the errors have occurred in an even number of places. This proves that a double-bit error has occurred, and so DED commences.

Huffman code: Frequency dependent operation code


Expanding operation code
From the execution trace of a program, it can be observed that certain operation codes, like Move, Load, Store, etc., are executed more frequently than others. If such operation codes are encoded in a smaller number of bits while other, less frequently used codes utilize a larger number of bits, the result may be a reduction in overall program size. Based on this observation, fewer bits can be assigned to the set of operation codes that appear more frequently in average programs. For example, the fifteen most frequently used operation codes may be encoded in the first 4 bits of the instruction (shown in the figure). If the first 4 bits have the value 1111, then the next 4 bits may be used to encode the next set of fifteen frequently used operation codes.

Frequency dependent operation code

The concept of expanding operation codes can be further extended by utilizing information about the probability of occurrence of different operation codes in average programs. From the study of the execution traces of a large number of programs, the probability of occurrence of each instruction group may be identified. This method encodes the groups of maximum probability in the minimum number of bits.

Example: A probability distribution of occurrence of different operation code groups

Operation code group               Group no.   Probability of occurrence
Data movement (without indexing)   1           0.29
Data movement (with indexing)      2           0.19
Branch (conditional)               3           0.13
Branch (unconditional)             4           0.05
Floating point (add/sub)           5           0.07
Floating point (multiply/divide)   6           0.05
Fixed point arithmetic             7           0.07
Shift                              8           0.04
Miscellaneous (simple)             9           0.06
Miscellaneous (complex)            10          0.05

However, to keep decoding simple, the operation codes may be chosen as in the right-hand side of the diagram. Such a scheme is useful only if the overhead associated with the comparatively complex decoding structure is outweighed by the saving in the memory space needed to store the instructions of average programs. A small Huffman-coding sketch follows.
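The following Python sketch (standard Huffman construction; the resulting codes need not match the figure's exact assignment) builds a code for the probability table above. It yields an average length of about 3.0 bits per opcode, against 4 bits for a fixed-length code over ten groups:

import heapq, itertools

probs = {1: 0.29, 2: 0.19, 3: 0.13, 4: 0.05, 5: 0.07,
         6: 0.05, 7: 0.07, 8: 0.04, 9: 0.06, 10: 0.05}

tie = itertools.count()                      # tie-breaker for equal weights
heap = [(p, next(tie), {g: ""}) for g, p in probs.items()]
heapq.heapify(heap)
while len(heap) > 1:
    p1, _, c1 = heapq.heappop(heap)          # two least probable subtrees
    p2, _, c2 = heapq.heappop(heap)
    merged = {g: "0" + code for g, code in c1.items()}
    merged.update({g: "1" + code for g, code in c2.items()})
    heapq.heappush(heap, (p1 + p2, next(tie), merged))

codes = heap[0][2]
avg = sum(probs[g] * len(codes[g]) for g in probs)
print(codes, avg)                            # average ~3.0 bits vs 4 fixed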

Arithmetic processors : Peripheral processors & Coprocessors


A typical CPU needs most of its control and data processing hardware for implementing non-arithmetic functions. Consequently, the hardware costs, in terms of IC count or chip area, associated with implementing the more complex arithmetic operations, like floating-point instructions, often prevent their inclusion in the CPU's instruction set. Such CPUs must rely on much slower software routines to provide the missing arithmetic operations. If, however, a processor is devoted exclusively to arithmetic functions, then a full range of numerical operations can be implemented in hardware at relatively low cost, e.g., in a single IC. This concept can be applied to CPU design by providing auxiliary special-purpose arithmetic processors that are physically separate from the CPU but are used by it to execute a class of arithmetic instructions not executable by the CPU itself. This approach speeds up program execution and reduces programming complexity. The instructions assigned to arithmetic processors include the basic add, subtract, multiply, and divide operations on fixed-point and floating-point operands of various lengths, as well as exponentiation, logarithms, and trigonometric functions. There are two general ways of introducing arithmetic processors into a computer.
In the first approach, the arithmetic processor is simply treated as a peripheral or I/O device to which the CPU sends (outputs) data and processing instructions, and from which it receives (inputs) results; this is termed a peripheral processor. In the second approach, the arithmetic processor is closely (tightly) coupled to the CPU, so that its instruction and register sets are extensions of those of the CPU. The CPU instruction set contains a special subset of opcodes reserved for the auxiliary processor. These instructions are fetched by the CPU, jointly decoded by the CPU and the auxiliary processor, and finally executed directly by the auxiliary processor in a manner that is transparent to the programmer. Arithmetic processors of this type form a logical extension of the CPU and are referred to as coprocessors.
Each CPU is designed to have
a coprocessor interface. This includes
special control circuits linking the
CPU with the coprocessor, and
special instructions designated for
execution by the coprocessor. There
are special instructions (as
mentioned) for execution by
coprocessor. Coprocessor
instructions can be included in CPU
programs. A software routine
implementing the desired operation
(of the coprocessor instruction) can
be stored in a predetermined
memory location. If no coprocessor is
present, the CPU issues a software
(coprocessor) trap to transfer the
program control to the desired
location to execute the desired
operation already stored. Thus, all this can be done without changing the source or object code.
The general structure of the hardware side of a CPU-coprocessor interface is shown. The coprocessor
stays idle until a coprocessor instruction is encountered. It is directly linked to the CPU by a small number of
control lines that allow the activities of the two processors to be rapidly synchronized. The data transfer
between them takes place through the system bus. The coprocessor may be passive or slave device whose
registers can be written into and read by the CPU in the same manner as main memory. It is also useful to
permit the coprocessor to control the system bus, so that it can initiate data transfer to and from the CPU and,
for that matter, to and from main memory.
Coprocessor instructions typically contain the following three fields: a unique opcode F0 that identifies coprocessor instructions, the address F1 of the particular coprocessor to be used (if several coprocessors can be attached), and finally the type F2 of the particular instruction or command to be executed by the coprocessor. The F2 field may also include (partial) operand-addressing information.
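To make the three-field format concrete, here is a minimal Python sketch that packs and unpacks such an instruction. The field widths (8-bit F0, 3-bit F1, 5-bit F2) and the escape-opcode value are hypothetical, chosen only for illustration; actual coprocessor interfaces differ in detail.

```python
# A hypothetical 16-bit encoding: F0 = 8-bit escape opcode, F1 = 3-bit
# coprocessor address (up to 8 attached), F2 = 5-bit command code.
COP_ESCAPE = 0xDB  # assumed value reserved for coprocessor instructions

def encode(f1_cop_addr: int, f2_command: int) -> int:
    """Pack the three fields into one instruction word."""
    assert 0 <= f1_cop_addr < 8 and 0 <= f2_command < 32
    return (COP_ESCAPE << 8) | (f1_cop_addr << 5) | f2_command

def decode(word: int):
    """Return (F1, F2) for a coprocessor instruction, else None."""
    if (word >> 8) != COP_ESCAPE:        # F0 mismatch: an ordinary CPU opcode
        return None
    return (word >> 5) & 0x7, word & 0x1F

w = encode(f1_cop_addr=1, f2_command=9)
print(hex(w), decode(w))                 # 0xdb29 (1, 9)
```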
COA : 74
[Figure: Overall structure of a carry-lookahead adder — a carry-lookahead generator takes the propagate/generate signals pn-1, gn-1, ..., p0, g0 and cin and produces all carries cn-1, ..., c0 in parallel for n 1-bit adders with inputs xi, yi and sum outputs zi.]
[Figure: A 16-bit adder composed of four 4-bit adders linked by a 4-bit carry-lookahead generator, which produces the inter-stage carries c1-c3 and group p, g signals from each stage's p, g outputs and cin.]
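A minimal Python sketch of the logic in these figures, assuming the standard carry-lookahead definitions pi = xi XOR yi (propagate), gi = xi AND yi (generate), and ci+1 = gi OR (pi AND ci):

```python
def cla_add(x: int, y: int, cin: int = 0, n: int = 16):
    """n-bit addition with carries derived from propagate/generate signals."""
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # p_i = x_i XOR y_i
    g = [((x >> i) & (y >> i)) & 1 for i in range(n)]  # g_i = x_i AND y_i
    c = [cin]
    for i in range(n):
        # c_{i+1} = g_i + p_i.c_i; the lookahead generator expands this
        # recurrence into two-level logic so all carries appear in parallel
        c.append(g[i] | (p[i] & c[i]))
    z = sum((p[i] ^ c[i]) << i for i in range(n))      # sum bits z_i = p_i XOR c_i
    return z, c[n]                                     # (sum, carry-out)

z, cout = cla_add(0x1234, 0x0FFF)
print(hex(z), cout)   # 0x2233 0
```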
COA : 75
COA : 76
Solved problems on disk storage scheduling
Problem 1: A disk subsystem employed for a virtual memory organization has the following specifications: 32 sectors per track, 512 bytes/sector, 3600 RPM, average seek (head-positioning) latency of 30 msec, average instruction execution time of 1.5 microsec, and a page size of 4 Kbytes. In the event of a page fault, compute the wastage of instruction execution capability of the CPU before execution can continue on that program. Compute the time to transfer a page of data in DMA mode. What will be the impact if the average seek time is reduced to 25 msec by employing an improved disk drive?
Answer:
32 sectors/track, 512 bytes/sector, 3600 RPM, avg. seek latency = 30 msec
average instruction execution time = 1.5 microsec, page size = 4 Kbytes
Data transfer rate = (No. of bytes/track) / (time for 1 revolution)
= (512 x 32) / (1/60) bytes/sec = 983,040 bytes/sec = 960 Kbytes/sec (1 Kbyte = 1024 bytes)
In case of a page fault, the processor has to transfer the page from secondary storage to main storage. Positioning the head takes 30 msec.
After that, 4 Kbytes have to be transferred.
So, time required = 30 + (4 Kbytes / 960 Kbytes-per-sec) = 30 + 4.17 = 34.17 msec
So, no. of instruction executions wasted = (34.17 x 10^-3)/(1.5 x 10^-6) = 22,778 (approx.)
If the average seek time is reduced to 25 msec, the instruction execution capability lost in transferring 4 Kbytes = (29.17 x 10^-3)/(1.5 x 10^-6) = 19,444 (approx.)
% improvement over the previous scheme with a seek time of 30 msec
= (22,778 - 19,444)/22,778 = 14.6% (approx.)
In DMA block-transfer mode, data bytes read from the disk are assembled into words that are transferred to main storage. Once the disk drive is ready to transfer the data bytes, a DMA request is initiated by the DMA controller. On getting back the DMA acknowledge, the DMA controller starts transferring a word to main storage. Assuming buffer storage is available in the DMA controller, a page can be transferred in block mode. To transfer a page (4 Kbytes = 1K four-byte words), the time taken
= 1024 x T_MS, where T_MS = main storage cycle time.
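As a quick cross-check, the instruction-wastage arithmetic can be reproduced in Python (a sketch that, like the answer, takes 1 Kbyte = 1024 bytes):

```python
rate = 512 * 32 * 60                 # bytes/sec: bytes per track x 60 rev/sec
xfer_ms = 4 * 1024 / rate * 1e3      # time to move one 4 Kbyte page, in msec
for seek_ms in (30, 25):
    lost = (seek_ms + xfer_ms) * 1e-3 / 1.5e-6   # instructions lost per fault
    print(f"seek {seek_ms} ms: ~{lost:.0f} instruction executions lost")
# prints ~22778 and ~19444; improvement = (22778 - 19444)/22778 = ~14.6%
```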
**********************************************************************************************
Problem 2: A tape drive has the following parameters:
Bit density = 1600 bits/inches, Tape speed = 200 inches/sec, Time spent at IRG (Inter Record Gap) = 3 ms,
Average record length = 1000 characters.
How many bytes can be stored on a tape reel of length 1200 ft. written on such a drive?
Answer:
For a tape drive:
Bit density = 1600 bpi
Tape speed = 200 inch/sec
Record length = 1000 bytes; length on tape = 1000/1600 in (data is stored in parallel, a byte across the nine tracks)
= 0.625 inch
IRG time = 3 msec; the gap is crossed at an average speed of half the full tape speed during both the deceleration and acceleration halves, so
IRG length = 3 msec x (0.5 x 200 inch/sec) x 2
= 0.6 in
No. of records on a 1200 ft tape = (1200 ft x 12)/(0.625 + 0.6) in = 11,755 records = 11.755 Mbytes
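The same computation as a short Python sketch (assuming, as above, that a byte spans the nine tracks, so 1600 bpi is effectively 1600 bytes per inch):

```python
record_in = 1000 / 1600               # length of a 1000-byte record, inches
gap_in = 3e-3 * (0.5 * 200) * 2       # IRG crossed at half speed, both halves
tape_in = 1200 * 12                   # 1200 ft reel, in inches
records = int(tape_in / (record_in + gap_in))
print(records, "records =", records * 1000 / 1e6, "Mbytes")  # 11755, ~11.76
```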
**********************************************************************************************
Problem 3: A high-speed tape system accommodates a 1200 ft reel of standard nine-track tape. The tape is moved past the recording head at a rate of 150 inches/sec. What must the linear tape recording density be in order to achieve a data transfer rate of 10^6 bits/sec?
Answer:
Tape length = 1200 ft
Tape speed = 150 in/sec
Data transfer rate = 10^6 bits/sec
Recording density = 10^6 / 150 bpi = 6667 bpi (approx.)
COA : 77
Problem 4: Suppose that data on the tape is organized into blocks each containing 32 Kbytes. A gap of 0.4 inch
separates the blocks from each other. The density of recording is 6250 bits/inch. How many bytes may be
stored on the tape reel of 2400 ft.?
Answer:
Block size = 32 Kbytes Recording density = 6250 bpi Block length = (32 x 10^3)/6250 = 5.12 in
Block separation = 0.4 in Tape length = 2400 ft
Hence no. of blocks on tape = (2400 ft x 12)/(5.12 + 0.4) in = 5217 blocks = 166.944 Mbytes
**********************************************************************************************
Problem 5: A disk pack has 19 surfaces. Storage area on each surface has an inner diameter of 22 cm and
outer diameter of 33 cm. Maximum storage density on any track is 2000 bit/cm and minimum spacing between
tracks is 0.25 mm. (a) What is the storage capacity of the pack?
(b) What is the data transfer rate in bytes per sec at a rotational speed of 3600 RPM?
(c) Using two 16-bit words, suggest a suitable scheme for specifying a disk address.
(d) The main memory of a computer has a 32-bit word length and 500 nsec cycle time. Assuming that the disk
transfers data to/from the main memory on a cycle-stealing basis, evaluate the percentage of memory cycles
stolen during the data transfer period.
Answer:
No. of surfaces = 19 Inner track diameter = 22 cm Outer track diameter = 33 cm
Track band width (total) = (33 - 22)/2 = 5.5 cm Track separation = 0.25 mm
No. of tracks/surface = (5.5 x 10) / 0.25 = 220
Minimum track circumference = pi x 22 cm (innermost track)
Maximum track storage density = 2000 bits/cm (on the innermost track)
Data storage capacity/track = pi x 22 x 2000 = 138.23 Kbits
Disk speed = 3600 rpm Rotation time = 1/3600 minute = 16.67 msec
(a) Storage capacity = 19 x 220 x 138.23 Kbits = 577.8 Mbits = 72.225 Mbytes (with 8 bit/byte)
(b) Data transfer rate = 138.23 Kbits / 16.67 msec = 8.2938 Mbits/sec = 1.037 Mbytes/sec (approx.)
This is the peak data transfer rate, excluding seek time and rotational latency.
(c) A possible disk addressing scheme could use the following fields:
Surface (head) no. = 5-bit [0 to 18]
Track (cylinder) no. = 8-bit [0 to 219]
(assuming 128 sectors per track, each sector storing about 1 Kbit)
Sector no. = 7-bit [0 to 127]
Thus the two 16-bit words can be used as follows to store the sector address in a track on a surface.
[Figure: layout of the two 16-bit disk-address words, with the 5-bit surface, 8-bit track, and 7-bit sector fields; the remaining bits are marked x.]
The field marked x may be used by the designer to store some useful information.
(d) Main memory cycle time = 500 nsec
Data transfer rate = 8.2938 Mbits/sec
Time to assemble a 32-bit word = 32 / (8.2938 x 10^6) sec = 3858.3 nsec = 8 memory cycles (approx.)
Ideally, 1 out of every 8 memory cycles is stolen; percentage of cycles stolen = 100/8 = 12.5%
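All four parts can be checked with a short Python sketch (it follows the answer in rounding the word-assembly time to 8 memory cycles):

```python
import math

tracks = (33 - 22) / 2 * 10 / 0.25       # 5.5 cm band / 0.25 mm pitch = 220
bits_per_track = math.pi * 22 * 2000     # innermost circumference x 2000 bit/cm
capacity_MB = 19 * tracks * bits_per_track / 8e6
rate_bps = bits_per_track * 60           # one track per revolution, 60 rev/sec
word_ns = 32 / rate_bps * 1e9            # time to assemble one 32-bit word
cycles = round(word_ns / 500)            # ~8 memory cycles per word
print(f"{capacity_MB:.1f} MB, {rate_bps/1e6:.3f} Mbit/s, {100/cycles}% stolen")
# -> 72.2 MB, 8.294 Mbit/s, 12.5% stolen
```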
**********************************************************************************************
Problem 6: A particular disk drive having a rotational speed of 3600 RPM has the capability for rotational position sensing. It employs a hard-sectored disk pack with 128 sector marks on a track. At a particular instant, a record on sector 1 is sought, and on interrogation it is observed that the head is just entering sector 2. Determine how long the control unit and the IOP can be released prior to reading the record. Assume a delay of 0.3 msec for sensing the rotational position.
Answer:
Rotation speed = 3600 rpm Rotation time = 16.67 msec
Position sense time = 0.3 msec
Sequence of actions after detecting the head entering sector 2:
(i) Time to traverse the 127 intervening sectors = (16.67 x 127)/128 msec = 16.54 msec
(ii) Time to confirm the position as sector 1 = 0.3 msec
Total delay = 16.54 + 0.3 = 16.84 msec
So control unit and IOP can be released for 16.84 msec.
COA : 78
Problem 7: A disk drive rotating at 3600 RPM employs 128 sectors per track. What is the average rotational latency of the drive? The IOP to which the drive is attached has requested a record in, say, sector x. On interrogating the disk system, it is found that the disk head is on sector (x + 1). For how long can the IOP and the disk controller be released before the R/W head starts reading sector x?
Answer:
Drive speed = 3600 rpm
Rotation time = 1 / 3600 min = 16.67 msec
Average rotational latency = 8.33 msec
Rotation needed from sector (x + 1) round to sector x
= 1 rotation - 1 sector traversal time
= 16.67 msec x (1 - 1/128)
= 16.54 msec (approx.)
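As a quick Python check:

```python
rot_ms = 60e3 / 3600                       # one revolution at 3600 RPM
print(rot_ms / 2, rot_ms * (1 - 1 / 128))  # 8.33 ms latency, ~16.54 ms release
```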
**********************************************************************************************
Problem 8: A disk drive has the following specifications: 4040 bpi recording density, 6448 x 10^3 bits/sec data transfer rate, 3600 RPM, innermost track diameter of 6.5 inch, and a total of 400 tracks with a track density of 200 tpi. Determine whether the recording density mentioned refers to a track in the innermost, outermost, or central region of the recording surface.
Answer:
Rotation speed = 3600 rpm
Rotation time = 1/3600 min = 16.67 msec
Data transfer rate = 6448 x 10^3 bits/sec
Data/track = 6448 x 10^3 bits/sec x 16.67 msec = 107.466 Kbits
For a storage density of 4040 bits/in, track circumference = 107.466 Kbits / 4040 = 26.6 in
Track diameter = 26.6 in / pi = 8.5 in (approx.)
Innermost track diameter = 6.5 in
Radial width of the 400 tracks = 400 tracks / 200 tpi = 2 in
Outermost track diameter = 6.5 + 2 x 2 = 10.5 in
The central track diameter is (6.5 + 10.5)/2 = 8.5 in; hence the quoted recording density refers to the central region of the disk recording surface.
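The inference can be reproduced in Python:

```python
import math

bits_per_track = 6448e3 * (60 / 3600)  # transfer rate x rotation time
circ_in = bits_per_track / 4040        # track length at the quoted density
d_quoted = circ_in / math.pi           # diameter of that track
d_outer = 6.5 + 2 * (400 / 200)        # inner diameter + 2 x radial band width
print(d_quoted, (6.5 + d_outer) / 2)   # ~8.5 in both -> the central track
```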
**********************************************************************************************
Problem 9: A floppy disk drive has the following specifications: 77 tracks, 26 sectors per track, 188 bytes/sector,
320 bytes each of preamble and postamble data per track, usable data storage per sector of 128 bytes, 360 RPM,
96 tracks per inch, 3200 bpi average recording density. Compute the unformatted capacity, formatted usable data
storage, data transfer rate, radial distance between innermost and outermost track, average diameter of a
track.
Answer:
For a floppy disk drive
No. of tracks = 77
Sector per track = 26
Sector size = 188 bytes
Usable sector capacity = 128 bytes
Track overhead = 320 bytes (preamble) + 320 bytes (postamble) = 640 bytes
Unformatted capacity = 77 tracks x [640 + (26 sectors x 188 bytes)] = 425.656 Kbytes
Formatted capacity = 77 tracks x 26 sectors x 128 bytes = 256.256 Kbytes
Drive speed = 360 rpm
Rotation time = 60/360 sec = 166.7 msec
Peak data transfer rate = (640 + 26 x 188) bytes / 166.7 msec = 33.168 Kbytes/sec = 265.344 Kbits/sec
Storage density = 3200 bpi
Average track circumference = ((640 + 26 x 188) x 8 bits)/3200 bpi = 13.82 inch
Average track diameter = 13.82/pi = 4.399 inch
At 96 tracks/in, radial width for 77 tracks = 77/96 in = 0.802 in
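The floppy-drive figures, as a Python check:

```python
import math

track_b = 640 + 26 * 188                  # overhead + sector bytes per track
print(77 * track_b / 1e3, "KB unformatted;",
      77 * 26 * 128 / 1e3, "KB formatted")
rev_ms = 60e3 / 360                       # 166.7 ms per revolution at 360 RPM
print(track_b / rev_ms, "KB/s peak;",     # bytes per msec = Kbytes per sec
      track_b * 8 / 3200 / math.pi, "in avg diameter;",
      77 / 96, "in radial width")
```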
COA : 79
Solved problems on printer, display and I/O scheduling
Printing sequence: The printing sequence consists of the following steps:
1. Line information to be printed is transferred from main storage to the buffer of the printer by the Input-
Output Processor (IOP) or CPU.
2. Print command signal is sent to printer controller logic.
3. The timing pulse generated from the print disc initiates print scan.
4. The character counter generates the code of the specific character that is aligned with the first print position in the first print cycle of the first sub-scan of the first print scan. Subsequently, the character counter generates binary codes corresponding to the specific characters positioned for printing at successive print positions.
5. In a print cycle the character code generated by the character counter is compared with the character
stored in the print buffer to be printed in the print position.
6. If the character code generated is equal to the character to be printed in that position, the hammer for that
position fires and the character gets printed on that print position.
7. Step 6 is repeated for each print cycle of each of the three sub-scans of a print scan.
8. Steps 3 to 7 are repeated for each of the n number of print scans corresponding to the n number of
characters in the character set.
9. Auto spacing after print is initiated to advance the paper to the next print line.
Printing speed: The printing speed in terms of lines per minute of a chain/band printer can be computed as
follows for a 132 print position printer. Let
ta = time for auto spacing of paper in millisecond,
tp = time for a print cycle in micro second,
ts = synchronization time in between a pair of sub-scans in micro second,
C = number of characters in the set.
Time for one sub-scan including synchronization delay = 44tp + ts,
since there are 132/3 = 44 print cycles in a sub-scan.
Time for one print scan = 3(44tp + ts)
T = Time to print a line = {3C x (44tp + ts)} x 10^-3 + ta msec (tp, ts in microsec; ta in msec)
Speed in LPM (Lines Per Minute) = (60 x 10^3) / T
For ta = 10 msec, tp = 4.5 microsec, ts = 10 microsec, C = 64:
T = 3 x 64 x (44 x 4.5 + 10) x 10^-3 + 10 = 50 msec (approx.)
LPM = 60,000/50 = 1200.
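The formula is easy to wrap in a small Python function; the second call below previews Problem 1, which follows:

```python
def chain_printer_lpm(ta_ms, tp_us, ts_us, C, positions=132, subscans=3):
    """Lines/minute from the formula above: T = 3C(44tp + ts)x1e-3 + ta msec."""
    cycles = positions // subscans                  # print cycles per sub-scan
    T_ms = C * subscans * (cycles * tp_us + ts_us) * 1e-3 + ta_ms
    return 60e3 / T_ms

print(chain_printer_lpm(ta_ms=10, tp_us=4.5, ts_us=10, C=64))  # ~1200 LPM
print(chain_printer_lpm(ta_ms=15, tp_us=5, ts_us=35, C=50))    # ~1126 LPM (Problem 1)
```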
**********************************************************************************************
Problem 1: A chain printer has the following specifications:
Time to space after printing a line : 15 msec
Number of characters in the character set : 50
Number of print positions : 132
Number of sub-scans :3
Print cycle time : 5 micro-sec
Synchronization time after each sub-scan : 35 micro-sec
Find the printing speed in lines per minute.
Answer
In chain printer, number of print positions = 132
No. of sub-scans = 3 Print cycle time = 5 micro-sec
Sub-scan time = (5 micro-sec x 132)/3 = 220 micro-sec Synchronization time after sub-scan = 35 micro-sec
Time for one print scan = (220 + 35) x 3 micro-sec = 765 micro-sec
No. of characters in the set = 50. So line print time = 765 micro-sec x 50 = 38.25 msec
Time to space after printing a line = 15 msec
Printing speed = 60 sec/(38.25 + 15) msec = 1126 lines/min, i.e., LPM
COA : 80

Problem 2: Determine the printing speed in characters per second of a dot matrix printer having the following
specifications:
Time to print a character : 3 msec
Time to space in between characters : 1 msec
Number of characters in a line : 100
Specify the time to print a character line.
Answer
Time to print a character = 3 msec
Time to space between characters = 1 msec
Printing speed = 1/(4 x 10^-3) = 250 cps (characters per sec)
No. of characters/line = 100
Time to print a line = 100 x (3 + 1) msec = 0.4 sec
**********************************************************************************************
Problem 3: Determine the bit rate in MHz of a VDU terminal having the following specifications:
Number of characters/line : 80
Number of bits/character :7
Horizontal sweep time : 63.5 micro-sec
Retrace time : 20% of horizontal sweep time.
Answer
For a VDU terminal
No. of characters/line = 80
No. of bits per dot row of a character = 7
So no. of bits in a dot row per line = 560 bits
Horizontal sweep time = 63.5 micro-sec (assumed to include the retrace time)
Retrace time = 20% x 63.5 micro-sec = 12.7 micro-sec
Time for movement of the beam across a dot row of a character line = 63.5 - 12.7 = 50.8 micro-sec
So bit rate = 560 bits/(50.8 x 10^-6 sec) = 11.02 Mbits/sec (approx.)
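The corrected arithmetic, as a Python check:

```python
active_us = 63.5 * (1 - 0.20)      # sweep time minus 20% retrace = 50.8 us
bits_per_row = 80 * 7              # 80 characters x 7 dots per row
print(bits_per_row / (active_us * 1e-6) / 1e6, "Mbit/s")  # ~11.02
```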
**********************************************************************************************
Problem 4: Determine the cycle time of the refresh RAM used in a graphics system having the following
specifications:
Frame size : 1024 x 1024 pixels
Horizontal sweep time : 63.5 micro-sec
Retrace time : 30% of sweep time
One word of RAM stores : 4 pixels.
Answer
For a graphics terminal refresh RAM:
Frame size = 1024 x 1024 pixels
1 RAM word stores = 4 pixels
No. of words to store one pixel row = 1024/4 = 256 words
Time for one pixel row = 63.5 - 0.3 x 63.5 = 44.45 micro-sec
RAM cycle time = (44.45 x 10^3)/256 nsec = 174 nsec (approx.)
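And the refresh-RAM timing, as a Python check:

```python
words = 1024 // 4                     # RAM words per pixel row (4 pixels/word)
active_us = 63.5 * (1 - 0.30)         # sweep minus 30% retrace = 44.45 us
print(active_us * 1e3 / words, "ns")  # ~173.6 -> ~174 ns cycle time
```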
**********************************************************************************************