
Examples:

SISD (one functional unit): IBM 701, IBM 1620
SISD (multiple functional units): IBM 360/91, CDC 6600
SIMD (word-slice processing): Illiac-IV
SIMD (bit-slice processing): STARAN
MISD: no real embodiment of this class exists
MIMD (loosely coupled): UNIVAC 1100
MIMD (tightly coupled): Burroughs D-825

(a) SISD computer: a single control unit (CU) issues the instruction stream (IS) to one processing unit (PU), which exchanges a data stream (DS) with the memory module (MM).

(b) SIMD computer: one CU broadcasts the instruction stream to processing units PU1, PU2, ..., PUn; each PUi exchanges its own data stream DSi with memory modules MM1, MM2, ..., MMm of the shared memory (SM).

(c) MISD computer: control units CU1, CU2, ..., CUn issue distinct instruction streams IS1, IS2, ..., ISn to processing units PU1, PU2, ..., PUn, all of which operate on the same data stream drawn from the shared memory.

(d) MIMD computer: independent pairs CU1/PU1, CU2/PU2, ..., CUn/PUn each process their own instruction stream ISi and data stream DSi against memory modules MM1, MM2, ..., MMm of the shared memory (SM).


Flynn's Architectural Classification Schemes:

This classification was introduced by M. J. Flynn. It is based on the multiplicity of instruction and data streams in a computer system. The essential computing process is the execution of instructions on a set of data. The term stream denotes a sequence of items (instructions or data) as executed or operated upon by a single processor. Instructions and data are defined with respect to the referenced machine. An instruction stream is a sequence of instructions as executed by the machine; a data stream is a sequence of data (including input and partial or temporary results) called for by the instruction stream.

The four machine organizations are:

Single instruction stream - single data stream (SISD)
Single instruction stream - multiple data stream (SIMD)
Multiple instruction stream - single data stream (MISD)
Multiple instruction stream - multiple data stream (MIMD)

The categorization depends on the multiplicity of simultaneous events in the system components. Conceptually, only three types of components are needed: both instructions and data are fetched from memory modules (MM); instructions are decoded by the control unit (CU), which sends the decoded instruction stream to the processor unit (PU) for execution; data streams flow between the processors and the memory bidirectionally. Multiple memory modules may be used in the shared memory subsystem. Each instruction stream is generated by an independent control unit, and multiple data streams originate from the subsystem of shared memory modules.

SISD Computer Organization:

This organization represents most serial computers available today. Instructions are executed sequentially but may be overlapped in their execution stages (pipelining); most SISD uniprocessor systems are pipelined. An SISD computer may have more than one functional unit, all of them under the supervision of one control unit.

SIMD Computer Organization:

This class corresponds to array processors. There are multiple processing elements (PEs) supervised by the same control unit. All PEs receive the same instruction broadcast from the control unit but operate on different data sets from distinct data streams. The shared memory subsystem may contain multiple modules. SIMD machines are further divided into word-slice and bit-slice modes.

MISD Computer Organization:

There are n processor units, each receiving distinct instructions but operating over the same data stream and its derivatives. The results (output) of one processor become the input (operands) of the next processor in the macro-pipe. This structure has received much less attention and has been challenged as impractical by some computer architects. No real embodiment of this class exists.

MIMD Computer Organization:

Most multiprocessor systems and multiple computer systems can be classified in this category. An intrinsic MIMD computer implies interactions among the n processors, because all memory streams are derived from the same data space shared by all processors. If the n data streams were derived from disjoint subspaces of the shared memories, then we would have the so-called multiple SISD (MSISD) operation, which is nothing but a set of n independent SISD uniprocessor systems. An intrinsic MIMD computer is tightly coupled if the degree of interaction among the processors is high; otherwise we consider it loosely coupled. Most commercial MIMD computers are loosely coupled.


Serial Versus Parallel Processing:

Feng has suggested the use of the degree of parallelism to classify various computer architectures. The maximum number of binary digits (bits) that can be processed within a unit time by a computer system is called the maximum parallelism degree P. Let Pi be the number of bits that can be processed within the i-th processor cycle (or the i-th clock period), and consider T processor cycles indexed by i = 1, 2, ..., T. The average parallelism degree Pa is defined by

Pa = (P1 + P2 + ... + PT) / T

In general, Pi <= P. Thus we define the utilization rate μ of a computer system within T cycles by

μ = Pa / P

If the computing power of the processor is fully utilized (or the parallelism is fully exploited), then we have Pi = P for all i and μ = 1 for 100% utilization. The utilization rate depends on the application program being executed.

A bit slice is a string of bits, one from each of the words at the same vertical bit position. E.g. the TI-ASC has a word length of 64 and four arithmetic pipelines; each pipe has 8 pipeline stages, so there are 8 * 4 = 32 bits per bit slice across the 4 pipes.

Figure: four arithmetic pipelines of 8 stages each (Stage-1 ... Stage-8), each pipe 8 bits wide (Pipeline no. 1 to Pipeline no. 4), against a word length of 64 bits.

8 bits x 4 = 32 bits = bit-slice length

The maximum parallelism degree P(C) of a given computer system C is represented by the product of the word length n and the bit-slice length m, that is,

P(C) = n * m

It is equal to the area of the rectangle defined by the integers n and m.
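These measures are easy to compute mechanically. A minimal Python sketch (function names and the per-cycle bit counts are ours, for illustration), using the TI-ASC figures from above:

```python
def average_parallelism(p: list[int]) -> float:
    """Pa = (P1 + ... + PT) / T for the bits processed in each of T cycles."""
    return sum(p) / len(p)

def utilization(p: list[int], P: int) -> float:
    """mu = Pa / P, where P is the maximum parallelism degree."""
    return average_parallelism(p) / P

P = 64 * 32                         # TI-ASC: P(C) = n * m = 2048
cycles = [2048, 2048, 1024, 512]    # hypothetical per-cycle bit counts
print(utilization(cycles, P))       # 0.6875, i.e. 68.75% utilization
```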


NUMBER REPRESENTATION TECHNIQUES:

Number representation techniques fall into two classes:

Unsigned representation
Signed representation

Unsigned Representation:

N bits are available. Let N = 4; then the different combinations are:

0000 => +0 => 0
0001 => +1
.
.
.
1110 => +14
1111 => +15 => 2^N - 1

In this representation all N bits contribute to the positive magnitude of the number. So the corresponding range-specification graph is:

0 ... +(2^N - 1), with +ve overflow beyond +(2^N - 1).

Signed Representation:

N bits: 1 sign bit followed by N - 1 magnitude bits; 0 => +ve, 1 => -ve.

If N = 4 then the different combinations are:

0111 => +7 => +(2^(N-1) - 1)
0110 => +6
.
.
0001 => +1
0000 => +0
1000 => -0
1001 => -1
.
.
1110 => -6
1111 => -7 => -(2^(N-1) - 1)

In this representation technique the first bit represents the sign and the remaining (N - 1) bits represent the magnitude of the number.

The main disadvantage of this technique is that, although +0 and -0 are mathematically the same, they have different representations here. The range-specification graph is:

-(2^(N-1) - 1) ... 0 ... +(2^(N-1) - 1), with -ve overflow and +ve overflow beyond the two ends.

This technique is also called signed-magnitude representation. To solve the above-mentioned problem we introduce a different representation technique: 2's complement representation.


2's Complement Representation:

N bits: 1 sign bit followed by N - 1 magnitude bits; 0 => +ve, 1 => -ve.

If N = 4 then the combinations are:

0111 => +7 => +(2^(N-1) - 1)
0110 => +6
:
:
0001 => +1
0000 => 0
1111 => -1
1110 => -2
:
:
1000 => -8 => -(2^(N-1))

Tricks: 1

A run of consecutive 1s from bit position a down to bit position b has the value 2^(a+1) - 2^b. For example:

Bit position:  6 5 4 3 2 1 0
               1 1 1 0 1 1 0  = 64 + 32 + 16 + 4 + 2 = 118
                              = (2^7 - 2^4) + (2^3 - 2^1) = (128 - 16) + (8 - 2) = 118

Bit position:  6 5 4 3 2 1 0
               1 1 0 1 1 1 1  = 64 + 32 + 8 + 4 + 2 + 1 = 111
                              = (2^7 - 2^5) + (2^4 - 2^0) = (128 - 32) + (16 - 1) = 111

Class Work : Evaluate 1111011 and 1010101 as above.

Q. How can you represent -12 in an 8-bit architecture?

12 = 0000 1100
1's complement: 1111 0011
Add 1:          1111 0100  (2's complement)

Equivalently:

  1111 1111
- 0000 1100
  1111 0011
+         1
  1111 0100

It is the same as (1111 1111 + 1) - (0000 1100) = 2^8 - 0000 1100.

So the 2's complement of a number in an N-bit architecture = 2^N - number.

So the range-specification graph is:

-(2^(N-1)) ... 0 ... +(2^(N-1) - 1), with the -ve numbers on the left and the +ve numbers on the right.

Tricks: 2

0.111₂ = 1*2^-1 + 1*2^-2 + 1*2^-3
       = 0.5 + 0.25 + 0.125
       = 0.875

1 - 2^-3 = 1 - (1/8) = 7/8 = 0.875

So we have: 1 - 2^-n = 0.111...1 (n ones after the point).

The number -12 in an 8-bit architecture = 2^8 - 12 = 256 - 12 = 244 = 1111 0100.

If we add this to +12 then we should get 0 as the result:

  1111 0100
+ 0000 1100
1 0000 0000

The carry out of the 8th bit is discarded, leaving 0000 0000. (Proved)
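A minimal Python sketch of this rule (the function name is ours): the modulo operation implements 2^N - number directly.

```python
def twos_complement(value: int, bits: int) -> int:
    """Return the 2's-complement encoding of `value` in a `bits`-bit architecture."""
    return value % (1 << bits)          # 2^N - |value| for negative values

assert twos_complement(-12, 8) == 0b1111_0100          # 244, as derived above
assert (twos_complement(-12, 8) + 12) % (1 << 8) == 0  # -12 + 12 = 0 mod 2^8
```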


Floating Point Number Representation:

The general format of a floating point number is

X = S * B^E

where S is the significand (mantissa), B the base, and E the exponent.

The main advantage of using floating point numbers is that a huge number or a very small number can be represented with only a few bits, e.g.

0.00000000005 = 0.5 * 10^-10
50000000000 = 0.5 * 10^11

A floating point number may have several representations:

0.123 * 10^4 = 0.0123 * 10^5 = 1.23 * 10^3

So, to have a fixed form before representing a number in the computer's memory, we should normalize the number. The normalization rules are:

1. The integer part should be zero.
2. If the number is 0.d1 d2 ... dn * B^E, then d1 > 0 (the first digit after the point must be nonzero).

Using this rule, 0.123 * 10^4 is the only normalized form among the three representations above.

There are two representations of a floating point number:

1. Single precision (32 bit), and
2. Double precision (64 bit).

The 32-bit format:

1 bit: sign of the significand (0 => +ve, 1 => -ve)
8 bits: biased exponent, i.e. 128 is added to the exponent, so the range -128 to +127 is shifted to 0 to 255 (the +ve half only)
23 bits: truncated significand

Range-specification graph: the negative numbers run from -(1 - 2^-24) * 2^+127 up to -0.5 * 2^-128, then comes 0, and the positive numbers run from +0.5 * 2^-128 up to +(1 - 2^-24) * 2^+127. Magnitudes smaller than 0.5 * 2^-128 cause -ve/+ve underflow, and magnitudes larger than (1 - 2^-24) * 2^+127 cause -ve/+ve overflow.


Example of floating point number representation (8 bits for the exponent, 23 bits for the significand):

+0.110111 * 2^+100101 : sign bit = 0
+0.110111 * 2^-100101 : sign bit = 0
-0.110111 * 2^+100101 : sign bit = 1
-0.110111 * 2^-100101 : sign bit = 1

IEEE 754 Floating Point Format:

There are a single precision (SP, 32 bit) and a double precision (DP, 64 bit) format. The 32-bit word is divided into:

1 bit: sign (s)
8 bits: exponent (e)
23 bits: mantissa (m)

SP: x = (-1)^s * 2^(e - 127) * (1.m)
DP: x = (-1)^s * 2^(e - 1023) * (1.m)

Here the base (r) = 2, the exponent e is in excess-127 (excess-1023 for DP), and the mantissa has a hidden 1, thus denoting 1.m.

e   | m    | Inference
255 | != 0 | NaN (not a number: divide by 0, square root of a -ve number)
255 | 0    | infinity: x = (-1)^s * ∞, so +∞ and -∞ are represented differently
0   | != 0 | denormalized: x = (-1)^s * 2^-126 * (0.m)
0   | 0    | zero: x = (-1)^s * 0, so again +0 and -0 are possible

Problem:

40400000H = 0100 0000 0100 0000 0000 0000 0000 0000

s = 0
e = 1000 0000 = 128
1.m = 1.100 0000 ... = 1.5

So x = (-1)^0 * 2^(128 - 127) * (1.5) = 3₁₀

Class Work:

A) Evaluate 40A00000H.
B) Express 10₁₀ in IEEE 754 floating point format.

40A00000H = 0100 0000 1010 0000 0000 0000 0000 0000
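A quick way to check such conversions is to let the machine decode the bit pattern. A small Python sketch using the standard struct module (the helper name is ours):

```python
import struct

def decode_single(hex_word: str) -> float:
    """Interpret a 32-bit hex word as an IEEE 754 single-precision value."""
    return struct.unpack('>f', bytes.fromhex(hex_word))[0]

print(decode_single('40400000'))              # 3.0, as derived above
print(decode_single('40A00000'))              # 5.0 (class work A)
print(struct.pack('>f', 10.0).hex().upper())  # 41200000 (class work B)
```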


BOOTH'S ALGORITHM:

START: A ← 0, Q ← multiplier, Q0 ← 0, n ← number of bits, M ← multiplicand.

In each step, examine the bit pair q1 q0 (the least significant bit of Q together with the appended bit Q0):

01: A = A + M
10: A = A - M
00 or 11: no operation

Then perform ASR on the combined register A Q Q0 and set n = n - 1. If n = 0, the result is in AQ; STOP. Otherwise repeat.

ASR Means Arithmetic Shift Right.
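A minimal Python sketch of this flowchart (the bit-mask register handling is our own implementation detail; the register names follow the notes):

```python
def booth_multiply(multiplicand: int, multiplier: int, bits: int) -> int:
    """Booth's algorithm with `bits`-wide registers; the result is collected in AQ."""
    mask = (1 << bits) - 1
    sign = 1 << (bits - 1)
    A, Q, Q0 = 0, multiplier & mask, 0
    M = multiplicand & mask
    for _ in range(bits):                  # n = n - 1 on each pass
        pair = ((Q & 1) << 1) | Q0         # the bit pair q1 q0
        if pair == 0b01:
            A = (A + M) & mask             # A = A + M
        elif pair == 0b10:
            A = (A - M) & mask             # A = A - M
        # ASR on the combined A Q Q0 register (the sign bit of A is replicated)
        Q0 = Q & 1
        Q = (Q >> 1) | ((A & 1) << (bits - 1))
        A = (A >> 1) | (A & sign)
    result = (A << bits) | Q               # result in AQ
    return result - (1 << 2 * bits) if result & (1 << (2 * bits - 1)) else result

assert booth_multiply(-7, 3, 4) == -21
assert booth_multiply(-7, -3, 4) == 21
assert booth_multiply(7, 3, 4) == 21       # the class-work case below
```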


Example: (-7) * 3 = (-21)

M = 1001, -M = 0111

n | A    | Q    | Q0 | Action
4 | 0000 | 0011 | 0  | Initialize
  | 0111 | 0011 | 0  | A = A - M
3 | 0011 | 1001 | 1  | ASR
2 | 0001 | 1100 | 1  | ASR
  | 1010 | 1100 | 1  | A = A + M
1 | 1101 | 0110 | 0  | ASR
0 | 1110 | 1011 | 0  | ASR

Result in AQ: 1110 1011 = -21.
 

Example: (-7) * (-3) = (21)

M = 1001, -M = 0111

n | A    | Q    | Q0 | Action
4 | 0000 | 1101 | 0  | Initialize
  | 0111 | 1101 | 0  | A = A - M
3 | 0011 | 1110 | 1  | ASR
  | 1100 | 1110 | 1  | A = A + M
2 | 1110 | 0111 | 0  | ASR
  | 0101 | 0111 | 0  | A = A - M
1 | 0010 | 1011 | 1  | ASR
0 | 0001 | 0101 | 1  | ASR

Result in AQ: 0001 0101 = +21.


Class Work: (7) * (3) = (21)

M = 0111, -M = 1001

n | A | Q | Q0 | Action
(work out the steps as above)


(f) Displacement: EA = A + (R)
(g) Stack Addressing

Post-indexing: DA = (A); EA = DA + (R); D = (EA)  (DA is the direct address)
Pre-indexing: IA = A + (R); EA = (IA); D = (EA)  (IA is the indexed address)


Instruction Formats:

X = (A+B) * (C+D)

Three-Address Instructions:

ADD R1, A, B     R1 ← M[A] + M[B]
ADD R2, C, D     R2 ← M[C] + M[D]
MUL X, R1, R2    M[X] ← R1 * R2

Two-Address Instructions:

MOV R1, A        R1 ← M[A]
ADD R1, B        R1 ← R1 + M[B]
MOV R2, C        R2 ← M[C]
ADD R2, D        R2 ← R2 + M[D]
MUL R1, R2       R1 ← R1 * R2
MOV X, R1        M[X] ← R1

One-Address Instructions:

LOAD A           AC ← M[A]
ADD B            AC ← AC + M[B]
STORE T          M[T] ← AC
LOAD C           AC ← M[C]
ADD D            AC ← AC + M[D]
MUL T            AC ← AC * M[T]
STORE X          M[X] ← AC

Zero-Address Instructions:

PUSH A           TOS ← A
PUSH B           TOS ← B
ADD              TOS ← (A + B)
PUSH C           TOS ← C
PUSH D           TOS ← D
ADD              TOS ← (C + D)
MUL              TOS ← (A + B) * (C + D)
POP X            M[X] ← TOS
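The zero-address sequence above is exactly a stack-machine program. A minimal Python sketch of such an evaluator (the instruction encoding and memory model are our own illustration, not from the notes):

```python
def run_stack_machine(program, memory):
    """Execute a zero-address program; PUSH/POP move data, ADD/MUL act on the stack top."""
    stack = []
    for op, *arg in program:
        if op == 'PUSH':
            stack.append(memory[arg[0]])
        elif op == 'POP':
            memory[arg[0]] = stack.pop()
        elif op == 'ADD':
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == 'MUL':
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return memory

mem = {'A': 2, 'B': 3, 'C': 4, 'D': 5}
prog = [('PUSH', 'A'), ('PUSH', 'B'), ('ADD',),
        ('PUSH', 'C'), ('PUSH', 'D'), ('ADD',),
        ('MUL',), ('POP', 'X')]
print(run_stack_machine(prog, mem)['X'])  # (2+3) * (4+5) = 45
```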

Computer System Architecture, M. Morris Mano (Exercise 8.12, Page 293):

X = (A + B*C) / (D - E*F + G*H)

X = {A - B + C*(D*E - F)} / (G + H*K)


X = (A + B*C) / (D - E*F + G*H)

Three-address instructions (one possible sequence):

MUL R1, B, C     R1 ← M[B] * M[C]
ADD R1, A, R1    R1 ← M[A] + R1
MUL R2, E, F     R2 ← M[E] * M[F]
SUB R2, D, R2    R2 ← M[D] - R2
MUL R3, G, H     R3 ← M[G] * M[H]
ADD R2, R2, R3   R2 ← R2 + R3
DIV X, R1, R2    M[X] ← R1 / R2

Two-address instructions (one possible sequence):

MOV R1, B        R1 ← M[B]
MUL R1, C        R1 ← R1 * M[C]
ADD R1, A        R1 ← R1 + M[A]
MOV R2, E        R2 ← M[E]
MUL R2, F        R2 ← R2 * M[F]
MOV R3, D        R3 ← M[D]
SUB R3, R2       R3 ← R3 - R2
MOV R2, G        R2 ← M[G]
MUL R2, H        R2 ← R2 * M[H]
ADD R3, R2       R3 ← R3 + R2
DIV R1, R3       R1 ← R1 / R3
MOV X, R1        M[X] ← R1

One-address instructions (one possible sequence):

LOAD B           AC ← M[B]
MUL C            AC ← AC * M[C]
ADD A            AC ← AC + M[A]
STORE T1         M[T1] ← AC
LOAD E           AC ← M[E]
MUL F            AC ← AC * M[F]
STORE T2         M[T2] ← AC
LOAD D           AC ← M[D]
SUB T2           AC ← AC - M[T2]
STORE T2         M[T2] ← AC
LOAD G           AC ← M[G]
MUL H            AC ← AC * M[H]
ADD T2           AC ← AC + M[T2]
STORE T2         M[T2] ← AC
LOAD T1          AC ← M[T1]
DIV T2           AC ← AC / M[T2]
STORE X          M[X] ← AC


Zero-address instructions (one possible sequence):

PUSH A           TOS ← A
PUSH B           TOS ← B
PUSH C           TOS ← C
MUL              TOS ← (B * C)
ADD              TOS ← (A + B*C)
PUSH D           TOS ← D
PUSH E           TOS ← E
PUSH F           TOS ← F
MUL              TOS ← (E * F)
SUB              TOS ← (D - E*F)
PUSH G           TOS ← G
PUSH H           TOS ← H
MUL              TOS ← (G * H)
ADD              TOS ← (D - E*F + G*H)
DIV              TOS ← (A + B*C) / (D - E*F + G*H)
POP X            M[X] ← TOS

Post-fix form of the expression:

A B C * + D E F * - G H * + /


X = {A - B + C*(D*E - F)} / (G + H*K)

Three-address instructions (one possible sequence):

SUB R1, A, B     R1 ← M[A] - M[B]
MUL R2, D, E     R2 ← M[D] * M[E]
SUB R2, R2, F    R2 ← R2 - M[F]
MUL R2, C, R2    R2 ← M[C] * R2
ADD R1, R1, R2   R1 ← R1 + R2
MUL R2, H, K     R2 ← M[H] * M[K]
ADD R2, G, R2    R2 ← M[G] + R2
DIV X, R1, R2    M[X] ← R1 / R2

Two-address instructions (one possible sequence):

MOV R1, A        R1 ← M[A]
SUB R1, B        R1 ← R1 - M[B]
MOV R2, D        R2 ← M[D]
MUL R2, E        R2 ← R2 * M[E]
SUB R2, F        R2 ← R2 - M[F]
MUL R2, C        R2 ← R2 * M[C]
ADD R1, R2       R1 ← R1 + R2
MOV R3, H        R3 ← M[H]
MUL R3, K        R3 ← R3 * M[K]
ADD R3, G        R3 ← R3 + M[G]
DIV R1, R3       R1 ← R1 / R3
MOV X, R1        M[X] ← R1


X = {A - B + C*(D*E - F)} / (G + H*K)

One-address instructions (one possible sequence):

LOAD D           AC ← M[D]
MUL E            AC ← AC * M[E]
SUB F            AC ← AC - M[F]
MUL C            AC ← AC * M[C]
STORE T          M[T] ← AC
LOAD A           AC ← M[A]
SUB B            AC ← AC - M[B]
ADD T            AC ← AC + M[T]
STORE T          M[T] ← AC
LOAD H           AC ← M[H]
MUL K            AC ← AC * M[K]
ADD G            AC ← AC + M[G]
STORE Y          M[Y] ← AC
LOAD T           AC ← M[T]
DIV Y            AC ← AC / M[Y]
STORE X          M[X] ← AC

Zero-address instructions (one possible sequence):

PUSH A, PUSH B, SUB, PUSH C, PUSH D, PUSH E, MUL, PUSH F, SUB, MUL, ADD, PUSH G, PUSH H, PUSH K, MUL, ADD, DIV, POP X

The post-fix expression for the above equation is:

A B - C D E * F - * + G H K * + /
(A) Horizontal microinstruction: individual control signals for internal CPU control, individual control signals for system bus control, a microinstruction branch address, and a jump condition (unconditional, zero, overflow, indirect).

(B) Vertical microinstruction: function codes, passed through decoders to produce the individual control signals, plus a microinstruction branch address and a jump condition.

(C) Hybrid microinstruction (having dual-use bits): decoded fields for the individual control signals, with a demultiplexer that routes the dual-use bits either to further control signals or to the microinstruction branch address, steered by the jump condition.


Microinstruction Formats:

The two widely used formats for microinstructions are horizontal and vertical. In a horizontal microinstruction, each bit of the microinstruction represents a micro-order or a control signal which directly controls a single bus line, or sometimes a gate, in the machine. However, the length of such a microinstruction may be hundreds of bits.

In vertical microinstructions, many similar control signals can be encoded into a few microinstruction bits. For 16 ALU operations, which would require 16 individual micro-orders in a horizontal microinstruction, only 4 encoded bits are needed in a vertical microinstruction. Similarly, in a vertical microinstruction only 3 bits are needed to select one of 8 registers. However, these encoded bits need to be passed through the respective decoders to obtain the individual control signals.

Some of the microinstruction bits may be passed through a demultiplexer, allowing selected bits to be used for a few different purposes in the CPU. For example, a 16-bit field in a microinstruction can be used as a branch address in a branching microinstruction, while the same bits may be utilized for other control signals in a non-branching microinstruction. In such a case a demultiplexer is used.

Vertical microinstructions are normally of the order of 32 bits. In certain control units, several levels of control are used. For example, a field in a microinstruction or in the machine instruction may hold the address of a read-only memory which holds the control signals. This secondary ROM can hold large address constants, such as interrupt service routine addresses.

In general, horizontal control units are faster but require wide instruction words, whereas vertical control units require decoders but are shorter in length. Most systems use neither purely horizontal nor purely vertical microinstructions.

Example:

Consider a hypothetical architecture with 16 general-purpose registers (GPRs) (R0, R1, R2, ..., R15) and 4 operation codes/instructions (addition, subtraction, multiplication and division). Then the instruction

MUL R3, R4

as a horizontal microinstruction will be

0010 0001000000000000 0000100000000000

and as a vertical microinstruction will be

10 0011 0100
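The contrast is easy to see in a sketch. Assuming one-hot fields for the horizontal form and binary-encoded fields for the vertical form, matching the bit strings above (names are ours):

```python
OPCODES = {'ADD': 0, 'SUB': 1, 'MUL': 2, 'DIV': 3}

def one_hot(index: int, width: int) -> str:
    """Horizontal style: one bit per control line, exactly one asserted."""
    return ''.join('1' if i == index else '0' for i in range(width))

def horizontal(op: str, rd: int, rs: int) -> str:
    return f"{one_hot(OPCODES[op], 4)} {one_hot(rd, 16)} {one_hot(rs, 16)}"

def vertical(op: str, rd: int, rs: int) -> str:
    """Vertical style: encoded fields that must pass through decoders."""
    return f"{OPCODES[op]:02b} {rd:04b} {rs:04b}"

print(horizontal('MUL', 3, 4))  # 36 bits: 0010 0001000000000000 0000100000000000
print(vertical('MUL', 3, 4))    # 10 bits: 10 0011 0100
```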


Control Store (example):

The decode line is activated in the sequence 000, 001, 010, 011, and then either 110, 111 or 100, 101, 111, depending on a condition tested at address 011. Each control-store word holds the control signals to be generated and the address of the next microinstruction:

Address | Control signals generated | Next microinstruction address
000     | C1 C3 C5 C7               | 001
001     | C2 C4 C5                  | 010
010     | C1 C3                     | 011
011     | C2 C5                     | 110 if the external condition is true, else 100
110     | C2 C4 C7                  | 111
100     | C1 C2 C3 C5               | 101
101     | C1 C6 C7                  | 111
111     | C0 C1 C2 C3 C4 C5 C7 (load next instruction in IR) | 000
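The table behaves like a tiny program. A Python sketch of the sequencing (the dictionary mirrors the example table above; the branch encoding and function names are ours):

```python
CONTROL_STORE = {
    0b000: ({'C1', 'C3', 'C5', 'C7'}, 0b001),
    0b001: ({'C2', 'C4', 'C5'}, 0b010),
    0b010: ({'C1', 'C3'}, 0b011),
    0b011: ({'C2', 'C5'}, ('branch', 0b110, 0b100)),  # external condition test
    0b110: ({'C2', 'C4', 'C7'}, 0b111),
    0b100: ({'C1', 'C2', 'C3', 'C5'}, 0b101),
    0b101: ({'C1', 'C6', 'C7'}, 0b111),
    0b111: ({'C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C7'}, 0b000),  # load IR
}

def instruction_cycle(condition: bool):
    """Yield (address, control signals) for one pass through the micro-program."""
    addr = 0b000
    while True:
        signals, nxt = CONTROL_STORE[addr]
        yield addr, signals
        if isinstance(nxt, tuple):              # conditional microbranch
            _, if_true, if_false = nxt
            addr = if_true if condition else if_false
        else:
            addr = nxt
        if addr == 0b000:                       # back to the fetch microinstruction
            return

for addr, sigs in instruction_cycle(condition=False):
    print(f"{addr:03b}: {sorted(sigs)}")        # 000, 001, 010, 011, 100, 101, 111
```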


HARDWIRED CONTROL:

Control units use fixed logic circuits to interpret instructions and generate control signals from them.

DESIGN METHODS:

The design of a hardwired control involves various complex tradeoffs between the amount of hardware used, its speed of operation, and the cost of the design process itself. Because of the large number of control signals used in a typical CPU and their dependence on the particular instruction set being implemented, the design methods employed in practice are often ad hoc and heuristic in nature, and therefore cannot easily be formalized. Three simplified and systematic approaches are:

• Method 1: the standard algorithmic approach to sequential circuit design, called the state-table method, since it begins with the construction of a state table for the control unit.

• Method 2: a heuristic method based on the use of clocked delay elements for control-signal timing.

• Method 3: a related method that uses counters, which we call sequence counters, for timing purposes.

METHOD 1: State Table Method

Let Cin and Cout denote the input and output variables of the control unit. The rows and columns of the state table correspond to the set of internal states {Si} of the machine and the set of external input signals {Ij} to the control unit. The entry in row Si and column Ij has the form Sij, Zij, where Sij denotes the next state of the control unit and Zij denotes the set of output signals from Cout that are activated by the application of Ij to the control unit when it is in state Si. There are practical disadvantages to using state tables:

The number of state and input combinations may be so great that the state table size and the amount of computation needed become excessive.

State tables tend to conceal useful information about a circuit's behaviour, e.g. the existence of repeated patterns or loops. Control circuits designed from state tables also tend to have a random structure, which makes design debugging and subsequent maintenance of the circuit difficult.

METHOD 2: Delay Element Method

Consider the problem of generating the following sequence of control signals at times t1, t2, ..., tn using a hardwired control unit:

t1: activate {C1,j}
t2: activate {C2,j}
...
tn: activate {Cn,j}

Suppose that an initiation signal called START(t1) is available at t1. START(t1) may be fanned out to {C1,j} to perform the first micro-operation. If START(t1) is also entered into a time delay element of delay t2 - t1, the output of that circuit, START(t2), can be used to activate {C2,j}. Similarly, another delay element of delay t3 - t2 with input START(t2) can be used to activate {C3,j}, and so on. Thus a cascade of delay elements can be used to generate control signals in a very straightforward manner.

METHOD 3 : Sequence Counter Method

Consider a circuit consisting of a modulo-k counter whose output is connected to a 1-out-of-k clocked decoder. If the count enable input is connected to a clock source, the counter cycles continually through its k states. The decoder generates k pulse signals {φi} on its output lines. The {φi} effectively divide the time required for one complete cycle of the counter into k equal parts, so the {φi} may be called phase signals. The circuit so constructed may be called a sequence counter. The figure below shows a one-loop flowchart containing six steps that describes the behaviour of a typical CPU; each pass through the loop constitutes an instruction cycle. Assuming that each step can be performed in an appropriately chosen clock period, one may build a control unit for this CPU around a single modulo-6 sequence counter. Each signal φi activates some set of control lines in step i of every instruction cycle. It is usually necessary to be able to vary the operations performed in step i depending on certain control signals or condition variables applied to the control unit; these are represented by the signals Cin = {C'in, C''in}. A logic circuit N is therefore needed which combines Cin with the timing signals {φi} generated by the sequence counter.


Figure: rules for transforming a flowchart into a control circuit using delay elements; each sequential step {C1,j} → {C2,j} becomes a delay element between the corresponding control-signal lines, and a decision box (Is X = 1?) becomes a pair of gates steered by X and its complement.

Figure: a modulo-k sequence counter: (a) logic diagram, (b) symbol, with Begin, End, Clock and Reset inputs and phase outputs φ1, φ2, ..., φk.

Figure: (a) a delay-element cascade and (b) the equivalent sequence-counter circuit; a delay-element circuit (ring counter) behaves like a sequence counter.

CPU behaviour represented as a single closed loop:

Step 1: Transfer the program counter to the memory address register.
Step 2: Fetch the instruction from main memory.
Step 3: Increment the program counter and decode the instruction.
Step 4: Transfer the operand address to the memory address register.
Step 5: Fetch the operand(s) from main memory.
Step 6: Perform the operation specified by the instruction.


Problems on design of CPU control circuits

Problem 1: A CPU, in addition to the registers ACC, MAR, PC and IR, contains the registers B, C, D, E, each having the same length as the former ones. Additional instructions provided on these new registers are:

A) ACC ← ACC + X
B) ACC ← X
C) X ← MDR
D) MAR ← Y

where X may be any of B, C, D, E and Y may be either D or E. Give the schematic diagram of the CPU with the control lines shown properly.

Data path                              | Control signal
B → ACC, C → ACC, D → ACC, E → ACC     | C0, C1 select B, C, D, E; C2 enables the data path to ACC
B → ALU, C → ALU, D → ALU, E → ALU     | C0, C1 select B, C, D, E; C3 enables the data path to ALU
ACC → ALU                              | C4 enables the data path
ALU → ACC                              | C5 enables the data path
MDR → B, MDR → C, MDR → D, MDR → E     | C6, C7 select B, C, D, E; C8 enables the data path from MDR
D → MAR, E → MAR                       | C10 selects D or E; C9 enables the data path

Problem 2: Design a control circuit according to the operations and the control signals shown.

Data path   | Control signal
AC → ALU    | C0
AC → DR     | C1
DR → AC     | C2
ALU → AC    | C3
IR → CU     | C4
DR → IR     | C5
PC → MAR    | C6
PC → DR     | C7
DR → PC     | C8
MAR → BUS   | C9
DR → MAR    | C10
DR → BUS    | C11
DR → ALU    | C12
BUS → DR    | C13


BUS SYSTEM:

INTRODUCTION:

A computer system contains a number of buses which provide pathways among several devices. A shared bus that connects the CPU, memory and I/O is called the system bus.

A system bus may consist of 50 to 100 separate lines, which can be categorized into three functional groups:

The data bus provides a path for moving data between the system modules. The width of the data bus limits the maximum number of bits which can be transferred simultaneously between two modules, e.g. CPU and memory.

The address bus is used to designate the source or destination of the data on the data bus. The width of the address bus determines the maximum memory capacity the system can support.

The control bus is used to control access to the data and address buses and to carry commands and timing signals between the system modules.

Physically, a bus is a number of parallel electrical conductors, normally imprinted on printed circuit boards.

SOME OF THE ASPECTS RELATED TO THE BUS:

DEDICATED OR MULTIPLEXED BUSES:

A dedicated bus line is permanently assigned either to one function or to a physical subset of the components of the computer. Functional dedication is exemplified by the dedicated address bus and data bus. Physical dedication increases the throughput of the bus, as only a few modules contend for it, but it increases the overall size and cost of the system.

In certain computer buses some or all of the address lines are also used for data transfer, i.e. the same lines serve as address lines and data lines at different times. This is known as time multiplexing, and such buses are called multiplexed buses.

SYNCHRONOUS OR ASYNCHRONOUS TIMING:

This concerns the timing of data transfers over the bus. On synchronous buses, data is transferred during specific times which are known to both source and destination; this is achieved by using clock pulses. The alternative approach is the asynchronous bus, where each item to be transferred has a separate control signal which indicates to the destination the presence of the item on the bus.

BUS ARBITRATION:

An important aspect of the bus system is control of the bus. More than one module connected to the bus may want to use it for data transfer at the same time, so there must be some method of resolving simultaneous data transfer requests. The process of selecting one unit from the various bus-requesting units is called bus arbitration. There are two broad categories of bus arbitration: centralized and distributed.

In the centralized scheme a hardware device, the bus controller or bus arbiter, processes requests to use the bus; the bus controller may be a separate module or part of the CPU. The distributed scheme has the access-control logic shared among the various modules, all of which work together to share access.



SOME OF THE ARBITRATION SCHEMES:

DAISY CHAINING:

In daisy chaining the control of the bus is granted to any module by a Bus Grant signal which is chained through all the contending masters. Two other control signals are the Bus Request and Bus Busy signals.

The Bus Request line, if activated, only indicates that one or more modules require the bus. The Bus Request is responded to by the bus controller only if the Bus Busy line is inactive, that is, the bus is free. The bus controller responds by placing a signal on the Bus Grant line, which passes through the modules one by one. On receiving the Bus Grant, a module which was requesting bus access blocks further propagation of the Bus Grant signal, issues a Bus Busy signal, and starts using the bus. If the Bus Grant signal reaches a module which had not issued a bus request, it is forwarded to the next module.

In this scheme the priority is wired in and cannot be changed. Suppose the assumed priority is (highest to lowest) Module 1, Module 2, ..., Module N. If two modules, say 1 and N, request the bus at the same time, the bus will be granted to Module 1 first, since the grant signal has to pass through Module 1 on its way to Module N. The basic drawback of this simple scheme is that if the bus requests of Module 1 occur at a high rate, the rest of the modules may not get the bus for quite some time. Another problem occurs if, say, the Bus Grant line between Module 4 and Module 5 fails, or Module 4 is unable to pass on the Bus Grant signal: in either case no bus access is possible beyond Module 4.
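The wired-in priority is easy to model. A small Python sketch (index 0 sits electrically closest to the bus controller; the function name is ours):

```python
def daisy_chain_grant(bus_busy: bool, requests: list[bool]):
    """Propagate the Bus Grant along the chain; the first requester keeps it."""
    if bus_busy or not any(requests):
        return None                  # controller does not respond / nobody asked
    for module, requesting in enumerate(requests):
        if requesting:
            return module            # blocks further propagation, asserts Bus Busy

# Modules 0 and 4 request simultaneously: module 0 always wins.
print(daisy_chain_grant(False, [True, False, False, False, True]))  # 0
# If module 2's grant path were broken, modules beyond it could never be served.
```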

POLLING:

In polling, instead of the single Bus Grant line of daisy chaining, there are poll count lines connected to all the modules on the bus; the Bus Request and Bus Busy lines are the other two control lines. A request to use the bus is made on the Bus Request line, and it is not responded to while the Bus Busy line is active. The bus controller responds to a signal on the Bus Request line by generating a sequence of numbers on the poll count lines. These numbers are normally unique addresses assigned to the connected modules. When the poll count matches the address of the module which is requesting the bus, that module activates the Bus Busy signal and starts using the bus. Polling is basically asking each module, one by one, whether it needs the bus.

Advantages over daisy chaining:

• The priority of contending modules can be altered by changing the sequence in which numbers are generated on the poll count lines.
• The failure of one module does not affect any other module as far as bus granting is concerned.

Disadvantages over daisy chaining:

• Polling requires more control lines, which adds cost.
• The maximum number of modules which can share the bus is restricted by the number of poll count lines (e.g. with 3 poll count lines, a maximum of 2^3 = 8 modules can share the bus).

INDEPENDENT REQUESTING:

In this scheme, each module has its own independent Bus Request and Bus Grant lines. The identification of the requesting unit is almost immediate, and requests can be responded to quickly. Priority in such systems can be built into the bus controller and changed under program control. In certain systems a combination of these arbitration schemes is used: the PDP-11 UNIBUS uses daisy chaining together with independent requesting. It has five independent Bus Request lines, each with a distinct Bus Grant line; several modules of the same priority may be connected to the same Bus Request line, and the Bus Grant line to these same-priority modules is daisy chained.



(a) Interfacing device with handshake signals for data input: the MPU connects over the system data bus to a programmable interfacing device (control signals RD and INTR); a peripheral such as a keyboard drives the data lines together with the handshake signals STB (strobe) and IBF (input buffer full), and a pin is provided for status check.

1. The peripheral places a data byte in the input port and informs the interfacing device by sending the handshake signal STB (strobe).

2. The device informs the peripheral that its input port is full, and that the next byte must not be sent until this one has been read. This message is conveyed to the peripheral by the handshake signal IBF (input buffer full).

3. The MPU keeps checking the status until a byte is available, or the interfacing device informs the MPU, by sending an interrupt, that it has a byte to be read.

4. The MPU reads the byte by sending the control signal RD.

Timing waveforms of the 8155 I/O ports with handshake: input mode.


Interfacing device with handshake signals for data output: the MPU writes over the system data bus to the programmable interfacing device (control signals WR and INTR); the device raises OBF (output buffer full) on the data lines to a peripheral such as a printer, which replies with ACK (acknowledge), and a pin is provided for status check.

1. The MPU writes a byte into the output port of the programmable device by sending control signal WR.

2. The device informs the peripheral, by sending handshake signal OBF (Output Buffer Full), that a byte is on the way.

3. The peripheral acknowledges the byte by sending back the ACK (acknowledge) signal to the device.

4. The device interrupts the MPU to ask for the next byte, or the MPU finds out that the byte has been acknowledged through the status check.

Timing waveforms of the 8155 I/O ports with handshake: output mode.


Fig 9.3: Four-segment pipeline. Fig 9.4: Space-time diagram for the pipeline.

Consider a k-segment pipeline with a clock cycle time tp executing n tasks, and a non-pipelined unit that performs the same operation in a time tn per task. The speedup ratio is

S = n * tn / ((k + n - 1) * tp)

When n >> k - 1, k + n - 1 ≈ n, so S = tn / tp.

If the pipelined and non-pipelined processes take the same time per task, then tn = k * tp,

so S = k * tp / tp = k.
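In code, the speedup formula reads (a sketch; the symbols follow the text: k stages, n tasks, clock tp, non-pipelined task time tn):

```python
def speedup(n: int, k: int, t_p: float, t_n: float) -> float:
    """S = n * tn / ((k + n - 1) * tp)."""
    return (n * t_n) / ((k + n - 1) * t_p)

# With tn = k * tp (equal total work), S approaches k for large n:
print(speedup(10, 4, 20e-9, 80e-9))    # 3.08
print(speedup(1000, 4, 20e-9, 80e-9))  # 3.99, close to k = 4
```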


ARITHMETIC PIPELINE

Consider the floating-point addition

X = A * 10^a = 0.9504 * 10^3
Y = B * 10^b = 0.8200 * 10^2

Segment 1: compare exponents by subtraction: 3 - 2 = 1; choose the exponent 3.

Segment 2: align the mantissas: 0.8200 * 10^2 becomes 0.0820 * 10^3.

Segment 3: add (or subtract) the mantissas:

Z = X + Y = (0.9504 + 0.0820) * 10^3 = 1.0324 * 10^3

Segment 4: normalize the result and adjust the exponent:

Z = 0.10324 * 10^4

Fig 9-6: Pipeline for floating-point addition and subtraction.
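A Python sketch of the four segments, operating on (mantissa, exponent) pairs in base 10 as in the example (the function name is ours; results are subject to binary floating-point rounding):

```python
def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # Segment 1: compare exponents by subtraction; choose the larger exponent.
    e = max(ea, eb)
    # Segment 2: align the mantissa belonging to the smaller exponent.
    ma, mb = ma / 10 ** (e - ea), mb / 10 ** (e - eb)
    # Segment 3: add (or subtract) the mantissas.
    m = ma + mb
    # Segment 4: normalize the result (0.1 <= |m| < 1) and adjust the exponent.
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    while m != 0 and abs(m) < 0.1:
        m, e = m * 10, e - 1
    return m, e

print(fp_add((0.9504, 3), (0.8200, 2)))  # approximately (0.10324, 4)
```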


Step:            1   2   3   4   5   6   7   8   9   10  11  12  13
Instruction 1:   FI  DA  FO  EX
            2:       FI  DA  FO  EX
   3 (Branch):           FI  DA  FO  EX
            4:               FI  -   -   FI  DA  FO  EX
            5:                   -   -   -   FI  DA  FO  EX
            6:                               FI  DA  FO  EX
            7:                                   FI  DA  FO  EX

Timing of the instruction pipeline.

Four-segment CPU pipeline:

Segment 1: fetch instruction from memory.
Segment 2: decode instruction and calculate effective address; if the instruction is a branch, empty the pipe and update the PC.
Segment 3: fetch operand from memory.
Segment 4: execute instruction; if an interrupt is pending, empty the pipe, do interrupt handling, and update the PC.


Pipelining and Superscalar Techniques

A linear pipeline processor is a cascade of processing stages which are linearly connected to perform a fixed function over a stream of data flowing from one end to the other. In modern computers, linear pipelines are applied for instruction execution, arithmetic computation, and memory-access operations.

Asynchronous and Synchronous Models:

A linear pipeline processor is constructed with k processing stages. External inputs (operands) are fed into the pipeline at the first stage S1. The processed results are passed from stage Si to stage Si+1, for all i = 1, 2, ..., k - 1, and the final result emerges from the pipeline at the last stage Sk. Depending on the control of data flow along the pipeline, we model linear pipelines in two categories: asynchronous and synchronous.

In the asynchronous model, data flow between adjacent stages is controlled by a handshaking protocol. When stage Si is ready to transmit, it sends a ready signal to stage Si+1; after stage Si+1 receives the incoming data, it returns an acknowledge signal to Si. Asynchronous pipelines may have a variable throughput rate, since different amounts of delay may be experienced in different stages.

In the synchronous model, clocked latches are used to interface between stages. The latches are made with master-slave flip-flops, which can isolate inputs from outputs. Upon the arrival of a clock pulse, all latches transfer data to the next stage simultaneously. The pipeline stages are combinational logic circuits, and it is desirable to have approximately equal delays in all stages, since these delays determine the clock period and speed of the pipeline.

A reservation table is essentially a space-time diagram depicting the precedence relationship in using the pipeline stages. For a k-stage linear pipeline, k clock cycles are needed for a task to flow through the pipeline. Successive tasks or operations are initiated one per cycle to enter the pipeline. Once the pipeline is filled up, one result emerges for each additional cycle. This throughput is sustained only if the successive tasks are independent of each other.

Clock Cycle and Throughput:

Denote the maximum stage delay as τm and the latch delay as d. The clock period is then

τ = τm + d

At the rising edge of the clock pulse, the data is latched into the master flip-flops of each latch register. The clock pulse has a width equal to d. In general, τm >> d by one to two orders of magnitude, which implies that the maximum stage delay τm dominates the clock period. The pipeline frequency is defined as the inverse of the clock period: f = 1/τ.

The efficiency of a linear k-stage pipeline processing n tasks is

Ek = n / (k + n - 1)

The pipeline throughput Hk is defined as the number of tasks (operations) performed per unit time:

Hk = n / ((k + n - 1) * τ) = Ek * f
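Numerically (a sketch; symbols as defined above):

```python
def efficiency(n: int, k: int) -> float:
    """Ek = n / (k + n - 1)."""
    return n / (k + n - 1)

def throughput(n: int, k: int, tau: float) -> float:
    """Hk = n / ((k + n - 1) * tau), i.e. Ek times the frequency f = 1/tau."""
    return n / ((k + n - 1) * tau)

# A 4-stage pipeline, 10 ns clock, 100 independent tasks:
print(efficiency(100, 4))          # ~0.97
print(throughput(100, 4, 10e-9))   # ~9.7e7 tasks per second
```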


Principles of pipelining architecture

Arithmetic pipelining:

The arithmetic logic units of a computer can be segmentized for pipeline operations in various data formats. Well-known arithmetic pipeline examples are:

Star-100: 4-stage pipes
TI-ASC: 8-stage pipes
CRAY-1: 14-stage pipes
CYBER-205: 26-stage pipes

Instruction pipelining:

The execution of a stream of instructions can be pipelined by overlapping the execution of the current instruction with the fetch, decode, and operand fetch of subsequent instructions. This is also known as instruction look-ahead.

Processor pipelining:

Here the same data stream is processed in pipelined fashion by a cascade of processors, each of which handles a specific task. The data stream passes through the first processor, with results stored in a memory block which is also accessible by the second processor. The second processor then passes refined results to the third, and so on.

Three classification schemes are:

Unifunction vs. Multifunction Pipeline:

A pipeline unit with a fixed and dedicated function is called unifunctional. The Cray-1 has 12 unifunctional pipeline units for various scalar, vector, fixed-point, and floating-point operations. A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline. The TI-ASC has 4 multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic logic operations at different times.

Static vs. Dynamic Pipeline:

A static pipeline may assume only one functional configuration at a time. Static pipelines can be either unifunctional or multifunctional. Pipelining in static pipes is effective only if instructions of the same type are executed continuously; the function performed by a static pipeline should not change frequently, otherwise performance may be low. A dynamic pipeline processor permits several functional configurations to exist simultaneously, and needs more elaborate control and sequencing mechanisms than a static pipeline. A dynamic processor must be multifunctional, whereas a unifunctional pipe must be static.

Scalar vs. Vector Pipeline:

A scalar pipeline processes a sequence of scalar operands under the control of a DO loop. Instructions in a small DO loop are often prefetched into the instruction buffer, and the required scalar operands for repeated scalar instructions are moved into a data cache in order to continually supply the pipeline with operands. Vector pipelines are specially designed to handle vector instructions over vector operands; computers with vector instructions are often called vector processors. The handling of vector operands in vector pipelines is under firmware and hardware control (rather than under software control as in scalar pipelines).


Problem: What binary values are to be given at the X, Y and Load inputs to perform the addition of the 4 numbers b0, b1, b2, b3, fed through inputs I1 and I2 accordingly? (Use the minimum number of CLK pulses.) Draw the reservation table.

Figure: inputs I1 and I2 pass through two 2x1 muxes (select inputs x and y) into pipeline stage S1, followed by stages S2, S3 and S4, with an external register controlled by the Load signal.

Logic operation table:

CP | Y | X | I1 | I2 | Load
1  | 1 | 1 | b0 | b1 | 0
2  | 1 | 1 | b2 | b3 | 0
3  | 1 | 1 | 0  | 0  | 0
4  | 1 | 1 | 0  | 0  | 0
5  | 1 | 1 | 0  | 0  | 1
6  | 0 | 0 | 0  | 0  | 0
7  | 1 | 1 | 0  | 0  | 0
8  | 1 | 1 | 0  | 0  | 0
9  | 1 | 1 | 0  | 0  | 0


Instruction format for this example: 8-bit opcode, 4-bit register fields, 16-bit address field; 32-bit data; 16 GPRs (4 bits to number them).

Register-to-register program for A = B + C; B = A + C; D = D - B:

Load  rB, B
Load  rC, C
Add   rA, rB, rC
Store rA, A
Add   rB, rA, rC
Store rB, B
Load  rD, D
Sub   rD, rD, rB
Store rD, D

Pipeline stages: IF (Instruction Fetch), ID (Instruction Decode), OF (Operand Fetch), OE (Operand Execute), OS (Operand Store). Successive instructions i, i-1, i-2 overlap their stages, computing B+C, A+C and D-B in turn.

Memory-to-memory program for the same sequence:

Add B, C, A
Add A, C, B
Sub D, B, D

Program sizes:

Memory-to-memory: I = 168b, D = 288b, M = 456b.
Register-to-register, compiler allocates operands in registers: I = 60b, D = 0b, M = 60b.
Register-to-register, with reuse of operands: I = 228b, D = 192b, M = 420b.

Figure: a register-file datapath with PC and incrementer, instruction register, general registers (A, B, D), ALU and shift logic, and multiplexed data/address paths to the I-cache and D-cache.


Overlapped register windows (example with four procedures A, B, C, D):

R0 - R9: global registers, common to all procedures
R10 - R15: common to A and D
R16 - R25: local to A
R26 - R31: common to A and B
R32 - R41: local to B
R42 - R47: common to B and C
R48 - R57: local to C
R58 - R63: common to C and D
R64 - R73: local to D

Let
G = number of global registers,
L = number of local registers in each window,
C = number of registers common to two windows,
W = number of windows.

Window size = L + 2C + G
Register file = (L + C) * W + G

In the example of the figure we have G = 10, L = 10, C = 6 and W = 4. The window size is 10 + 12 + 10 = 32 registers, and the register file consists of (10 + 6) * 4 + 10 = 74 registers.

RISC CHARACTERISTICS:

1. Relatively few instructions.
2. Relatively few addressing modes.
3. Memory access limited to load and store instructions.
4. All operations done within the registers of the CPU.
5. Fixed-length, easily decoded instruction format.
6. Single-cycle instruction execution.
7. Hardwired rather than micro-programmed control.
8. Supports pipelining.


RISC Processors: In the SPARC architecture, of the 32 32-bit IU (integer unit) registers visible to a procedure, eight are global registers shared by all procedures, and the remaining 24 are window registers associated with each procedure. The concept of using overlapped register windows is the most important feature introduced by the Berkeley RISC architecture.

Each register window is divided into three eight-register sections, labelled Ins, Locals and Outs. The local registers are only locally addressable by each procedure. The Ins and Outs are shared among procedures: the calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers, which are the Ins registers of the called procedure. The window of the currently running procedure is called the active window, pointed to by the current window pointer (CWP). A window invalid mask (WIM) is used to indicate which windows are invalid.

Problem: The SPARC architecture can be implemented with two to eight register windows, for a total of 40 to 132 GPRs in the integer unit. Explain how the GPRs are organized into overlapping windows in each of the following designs:

(a) Use 40 GPRs to construct two windows.
(b) Use 72 GPRs to construct four windows.

Above implementations/answers may have other register distributions.
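One consistent way to satisfy both parts, using the window formulas above (the chosen split is ours; other distributions are possible, as the notes say):

```python
def register_file(G: int, L: int, C: int, W: int) -> int:
    """Total registers needed for W overlapping windows: (L + C) * W + G."""
    return (L + C) * W + G

def window_size(G: int, L: int, C: int) -> int:
    """Registers visible to one procedure: L + 2C + G."""
    return L + 2 * C + G

assert register_file(10, 10, 6, 4) == 74   # the figure's example
assert register_file(8, 8, 8, 2) == 40     # (a) two windows, 40 GPRs
assert register_file(8, 8, 8, 4) == 72     # (b) four windows, 72 GPRs
print(window_size(8, 8, 8))                # 32 registers visible per procedure
```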


CACHE MEMORY:

Elements of Cache Design:

Cache Size
Mapping Function: direct, associative, set-associative
Write Policy: write through, write back, write once
Replacement Algorithm: least-recently used (LRU), first-in-first-out (FIFO), least-frequently used (LFU), random
Block Size
Number of Caches: single- or two-level, unified or split

Block Size:

As the block size increases from very small to larger sizes, the hit ratio will at first increase because of the principle of locality: there is a high probability that data in the vicinity of a referenced word will be referenced in the near future. As the block becomes larger still, two specific effects come into play:

1. Larger blocks reduce the number of blocks that fit into the cache. Because each block fetch overwrites older cache contents, a small number of blocks results in data being overwritten shortly after it is fetched.

2. As a block becomes larger, each additional word is farther from the requested word, and therefore less likely to be needed in the near future.

Number Of Caches :

Recently, the use of multiple caches has become the norm. Two aspects of this design issue concern the number of levels of caches and the use of unified versus split caches.

Single - versus Two-Level Caches:

It has become possible to have a cache on the same chip as the processor: the on-chip cache. Using internal buses, the on-chip cache reduces the processor's external bus activity, speeds up execution time, and increases overall system performance; furthermore, when a bus access is eliminated, the bus is free to support other transfers. Most contemporary designs include both on-chip and external caches, in an organization known as a two-level cache: the internal cache is designated level 1 (L1) and the external cache level 2 (L2). The reason for the L2 cache is that, if requested information is not in the L1 cache, the processor would otherwise have to access main memory through the bus, giving poor performance due to the slow bus speed and slow memory access time. The potential saving due to the use of an L2 cache depends on the hit rates in both the L1 and L2 caches.

Unified versus split Cache :

Many earlier designs used a single cache to store references to both data and instructions. More recently, it has become common to split the cache into two: one dedicated to instructions and one dedicated to data. The unified cache has several advantages:

• For a given cache size, a unified cache has a higher hit rate than split caches, because it balances the load between instruction and data fetches automatically; that is, the cache fills up with instructions or data depending on which kind of fetch dominates.
• Only one cache needs to be designed and implemented.

The key advantage of the split cache design is that it eliminates contention for the cache between the instruction-fetch unit and the execution unit. This is important for any design that implements pipelining of instructions; thus superscalar machines such as the Pentium and PowerPC implement a split cache design. The main disadvantage of the unified cache is that preference is given to the execution unit when the instruction pre-fetcher requests an instruction; this contention can degrade performance by interfering with efficient use of the instruction pipeline. The split cache structure overcomes this difficulty.


Time step:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Page address:  4  3  2  5  1  2  3  4  3  2  4  5  1  4  3  1

OPT (anticipatory swapping), Hit = (10/12) * 100% = 83%:

4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4
-  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3
-  -  2  2  2  2  2  2  2  2  2  5  5  5  5  5
-  -  -  5  1  1  1  1  1  1  1  1  1  1  1  1

FIFO, Hit = (4/12) * 100% = 33%:

4  4  4  4  1  1  1  1  1  1  1  5  5  5  5  5
-  3  3  3  3  3  3  4  4  4  4  4  1  1  1  1
-  -  2  2  2  2  2  2  3  3  3  3  3  4  4  4
-  -  -  5  5  5  5  5  5  2  2  2  2  2  3  3

LRU, Hit = (7/12) * 100% = 58%:

4  4  4  4  1  1  1  1  1  1  1  5  5  5  5  5
-  3  3  3  3  3  3  3  3  3  3  3  1  1  1  1
-  -  2  2  2  2  2  2  2  2  2  2  2  2  3  3
-  -  -  5  5  5  5  4  4  4  4  4  4  4  4  4

Results of a simulation run of three replacement algorithms on a common page trace stream.

Problem:

A virtual memory system has a 16k-word logical address space and an 8k-word physical address space with a page size of 2k words (i.e. four page frames). The page address trace of a program has been found to be:

7 5 3 2 1 0 4 1 6 7 4 2 0 1 3 5

Note the four pages resident in memory after each page reference change for each of the following replacement policies: (a) FIFO and (b) LRU.
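A small simulator makes it easy to check such traces. A Python sketch of FIFO and LRU using an OrderedDict (OPT would additionally need to look ahead in the trace; the function name is ours):

```python
from collections import OrderedDict

def page_hits(trace, frames, policy):
    """Count hits for 'FIFO' or 'LRU' replacement with the given frame count."""
    memory = OrderedDict()               # ordered by load time (FIFO) or recency (LRU)
    hits = 0
    for page in trace:
        if page in memory:
            hits += 1
            if policy == 'LRU':
                memory.move_to_end(page)     # refresh recency on a hit
        else:
            if len(memory) == frames:
                memory.popitem(last=False)   # evict oldest / least recently used
            memory[page] = None
    return hits

trace = [4, 3, 2, 5, 1, 2, 3, 4, 3, 2, 4, 5, 1, 4, 3, 1]
print(page_hits(trace, 4, 'FIFO'))  # 4  -> (4/12) * 100% = 33%
print(page_hits(trace, 4, 'LRU'))   # 7  -> (7/12) * 100% = 58%
```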


An algorithm is a stack algorithm if the following inclusion property holds:

B_t(n) ⊆ B_t(n+1)   for n < L_t, and
B_t(n) = B_t(n+1)   for n ≥ L_t,

where

L = page address stream length to be processed by the replacement algorithm,
n = page capacity of MS,
t = time step, when t pages of the address stream have been processed,
B_t(n) = set of pages in MS at time t,
L_t = number of distinct pages encountered at time t.

B_t(n) = {S_t(1), S_t(2), ..., S_t(n)}    for n < L_t
       = {S_t(1), S_t(2), ..., S_t(L_t)}  for n ≥ L_t

Time step:     1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16
Page address:  4  3

S_t(1), S_t(2), S_t(3), S_t(4), S_t(5): stack contents at each time step.