Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Processing Unit
Overview
Instruction
Some Fundamental
Concepts
Fundamental Concepts
Executing an Instruction
Processor Organization
Internalprocessor
bus
Controlsignals
PC
MDR HAS
TWO INPUTS
AND TWO
OUTPUTS
Instruction
Address
lines
decoderand
MAR
controllogic
Memory
bus
MDR
Data
lines
IR
Datapath
Y
R0
Constant4
Select
MUX
Add
ALU
control
lines
Sub
Carryin
XOR
R n 1
ALU
TEMP
Z
Figure7.1.Singlebusorganizationofthedatapathinsideaprocessor.
Executing an Instruction
Transfer
Register Transfers
Internalprocessor
bus
Riin
Ri
Riout
Yin
Y
Constant4
Select
MUX
A
B
ALU
Zin
Z
Zout
Figure7.2.InputandoutputgatingfortheregistersinFigure7.1.
Register Transfers
All operations and data transfers are controlled by the processor clock.
Bus
0
D
1
Q
Ri in
Riout
Clock
Figure7.3.
Figure 7.3.Inputandoutputgatingforoneregisterbit.
Input and output gating for one register bit.
Performing an Arithmetic or
Logic Operation
R1out, Yin
R2out, SelectY, Add, Zin
Zout, R3in
MDRoutE
MDRout
Internalprocessor
bus
MDR
MDRinE
MDRin
Figure7.4.
Figure 7.4. ConnectionandcontrolsignalsforregisterMDR.
Connection and control signals for register MDR.
Timing
Assume MAR
is always available
on the address lines
of the memory bus.
Step
Clock
MARin
MAR [R1]
Address
Read
MR
MDRinE
Data
MDRout
Figure7.5. TimingofamemoryReadoperation.
Execution of a Complete
Instruction
Add
(R3), R1
Fetch the instruction
Fetch the first operand (the contents of the
memory location pointed to by R3)
Perform the addition
Load the result into R1
Architecture
Internalprocessor
bus
Riin
Ri
Riout
Yin
Y
Constant4
Select
MUX
A
B
ALU
Zin
Z
Zout
Figure7.2.InputandoutputgatingfortheregistersinFigure7.1.
Execution of a Complete
Instruction
Internalprocessor
bus
Add (R3), R1
Controlsignals
PC
Step A ction
1
MDRout , IR in
R1out , Y in , WMF C
Instruction
Address
lines
decoderand
MAR
controllogic
Memory
bus
MDR
Data
lines
IR
Y
R0
Constant4
Select
MUX
Add
ALU
control
lines
Sub
R n 1
ALU
Carryin
XOR
TEMP
Figure7.1.Singlebusorganizationofthedatapathinsideaprocessor.
Execution of Branch
Instructions
A
Execution of Branch
Instructions
StepAction
1
MDR out , IR in
Offset-field-of-IR
out, Add, Zin
Multiple-Bus Organization
BusA
BusB
BusC
Incrementer
PC
Register
file
MUX
Constant4
A
ALU R
B
Instruction
decoder
IR
MDR
MAR
Memorybus
datalines
Address
lines
Multiple-Bus Organization
StepAction
1
WMFC
MDRoutB, R=B,IR in
Figure 7.9.
Control sequence for the instruction. Add R4,R5,R
for the three-bus organization in Figure 7.8.
Quiz
Internalprocessor
bus
Controlsignals
What
is the control
sequence for
execution of the
instruction
Add R1, R2
including the
instruction fetch
phase? (Assume
single bus
architecture)
PC
Instruction
Address
lines
decoderand
MAR
controllogic
Memory
bus
MDR
Data
lines
IR
Y
R0
Constant4
Select
MUX
Add
ALU
control
lines
Sub
R n 1
ALU
Carryin
XOR
TEMP
Z
Figure7.1.Singlebusorganizationofthedatapathinsideaprocessor.
Hardwired Control
Overview
To
CLK
Controlstep
counter
External
inputs
IR
Decoder/
encoder
Condition
codes
Controlsignals
Figure7.10.Controlunitorganization.
CLK
Controlstep
counter
Reset
Stepdecoder
T 1 T2
Tn
INS1
External
inputs
INS2
IR
Instruction
decoder
Encoder
Condition
codes
INSm
Run
End
Controlsignals
Figure7.11. Separationofthedecodingandencodingfunctions.
Generating Zin
Zin = T1 + T6 ADD + T4 BR +
Branch
T4
Add
T6
T1
Figure7.12.GenerationoftheZincontrolsignalfortheprocessorinFigure7.1.
Generating End
Add
T7
Branch
T5
T4
End
Figure7.13. GenerationoftheEndcontrolsignal.
T5
A Complete Processor
Instruction
unit
Integer
unit
Instruction
cache
Floatingpoint
unit
Data
cache
Businterface
Processor
Systembus
Main
memory
Input/
Output
Figure7.14. Blockdiagramofacompleteprocessor.
Microprogrammed
Control
Overview
Micro
instruction
PCout
MARin
Read
MDRout
IRin
Yin
Select
Add
Zin
Z out
R1out
R1 in
R3out
WMFC
End
Figure7.15 AnexampleofmicroinstructionsforFigure7.6.
Overview
Step A ction
1
MDRout , IR in
R1out , Y in , WMF C
Overview
Control store
IR
Starting
address
generator
Clock
P C
Control
store
One function
cannot be carried
out by this simple
organization.
CW
Figure7.16. Basicorganizationofamicroprogrammedcontrolunit.
Overview
The previous organization cannot handle the situation when the control
unit is required to check the status of the condition codes or external
inputs to choose between alternative courses of action.
Use conditional branch microinstruction.
Address
Microinstruction
0
Zout , PC in , Y in , WMFC
MDRout , IR in
3
Branchto startingaddress
of appropriate
microroutine
. ... .. ... ... .. ... .. ... ... .. ... ... .. ... .. ... ... .. ... .. ... ... .. ... ..
25
26
Offset-field-of-IR
out , SelectY,Add, Zin
27
Zout , PC in , End
Figure 7.17. Microroutine for the instruction Branch<0.
Overview
External
inputs
IR
Startingand
branchaddress
generator
Clock
PC
Control
store
Figure7.18.
Condition
codes
CW
Organizationofthecontrolunittoallow
conditionalbranchinginthemicroprogram.
Microinstructions
A
F2
F3
F4
F5
F1(4bits)
F2(3bits)
F3(3bits)
F4(4bits)
F5(2bits)
0000:Notransfer
0001:PCout
0010:MDRout
0011:Zout
0100:R0out
0101:R1out
0110:R2out
0111:R3out
1010:TEMPout
1011:Offsetout
000:Notransfer
001:PCin
010:IRin
011:Zin
100:R0in
101:R1in
110:R2in
111:R3in
000:Notransfer
001:MARin
010:MDRin
011:TEMPin
100:Yin
0000:Add
0001:Sub
00:Noaction
01:Read
10:Write
F6
F7
1111:XOR
16ALU
functions
F8
F6(1bit)
F7(1bit)
F8(1bit)
0:SelectY
1:Select4
0:Noaction
1:WMFC
0:Continue
1:End
Figure7.19. Anexampleofapartialformatforfieldencodedmicroinstructions.
Further Improvement
Enumerate
Microprogram Sequencing
- Bit-ORing
- Wide-Branch Addressing
- WMFC
Mode
ContentsofIR
OPcode
11 10
Rsrc
8 7
Rdst
4 3
Address
(octal)
Microinstruction
000
PCout,MARin,Read,Select 4 ,Add,Zin
001
Zout,PCin,Yin,WMFC
002
MDRout,IRin
003
Branch{ PC 101(fromInstructiondecoder);
121
122
Zout,Rsrcin
123
170
MDRout,MARin,Read,WMFC
171
MDRout,Yin
172
Rdstout ,SelectY,Add,Z in
173
Zout,Rdstin,End
Figure7.21. MicroinstructionforAdd(Rsrc)+,Rdst.
Note:Microinstructionatlocation170isnotexecutedforthisaddressingmode.
Condition
codes
Decodingcircuits
AR
Controlstore
I R
Nextaddress
Microinstructiondecoder
Controlsignals
Figure7.22.Microinstructionsequencingorganization.
Microinstruction
F0
F0(8bits)
F1
F1(3bits)
F2
F2(3bits)
F4
F5
F6
F3
F3(3bits)
000:Notransfer
001:MARin
010:MDRin
011:TEMPin
100:Yin
F7
F4(4bits)
F5(2bits)
F6(1bit)
F7(1bit)
0000:Add
0001:Sub
00:Noaction
01:Read
10:Write
0:SelectY
1:Select4
0:Noaction
1:WMFC
F9
F10
1111:XOR
F8
F8(1bit)
F9(1bit)
F10(1bit)
0:NextAdrs
1:InstDec
0:Noaction
1:ORmode
0:Noaction
1:ORindsrc
Figure7.23. FormatformicroinstructionsintheexampleofSection7.5.3.
Implementation of the
Microroutine
Octal
address
0
0
0
0
0
0
0
0
0
0
0
0
121
122
0 1 0 1 0 0 1 0 1 0 0 01 1 0 0 1 0 0 0 0 01
0 1 1 1 1 0 0 0 0 1 1 10 0 0 0 0 0 0 0 0 00
1
0
0
1
0
0
0
0
0
1
1
1
1
1
0
0
0
0
0
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
1
1
0
0
1
1
0
1
0
1
0
1
0
1
0
0
0
0
0
0
0
1
0
0
1
1
0
1
1
0
1
1
1
0
0
0
0
1
1
01
00
01
00
00
00
01
10
1
1
0
0
0
0
1
1
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
1
0
0
0
1
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
F6 F7 F8 F9 F10
0
0
0
0
1
1
1
0
0
0
0
0
F5
0
0
0
1
1
1
1
0
0
0
0
0
F4
0
0
0
1
1
1
1
0
0
0
0
0
F3
0
1
0
0
1
1
1
0
0
0
0
0
F2
1
0
0
0
0
1
2
3
0
0
0
0
F1
01
00
00
00
7
7
7
7
0
1
2
3
F0
0
0
0
0
01
00
00
00
Figure7.24.ImplementationofthemicroroutineofFigure7.21usinga
nextmicroinstructionaddressfield. (SeeFigure7.23forencodedsignals.)
R15in
R15out
R0 in
R0out
Decoder
Decoder
IR
Rsrc
Rdst
InstDecout
External
inputs
Decoding
circuits
Condition
codes
ORmode
ORindsrc
AR
Controlstore
Nextaddress
F1
F2
F8 F9 F10
Rdstout
Rdstin
Rsrcout
Microinstruction
decoder
Rsrcin
Othercontrolsignals
Figure7.25. Somedetailsofthecontrolsignalgeneratingcircuitry.
bit-ORing
Prefetching
One
Prefetching Disadvantages
Sometimes
Emulation
Emulation
Pipelining
Overview
Pipelining
Basic Concepts
Example
Ann, Brian, Cathy, Dave
each have one load of clothes
to wash, dry, and fold
Washer takes 30 minutes
Dryer
takes 40 minutes
Folder
takes 20 minutes
10
Midnight
11
Time
30
40
20 30
40
20 30
40
20 30
Sequential
A
B
C
D
40
20
for 4 loads
If they learned pipelining, how
long would laundry take?
10
11
Midnight
Time
30
40
40
40
40 20
A
Pipelined
B
C
D
laundry takes
3.5 hours for 4 loads
Pipelining
9
Time
T
a
s
k
O
r
d
e
r
30
A
B
C
D
40
40
40
40
20
I2
Time
I3
Clockcycle
F
I2
Interstagebuffer
B1
Instruction
fetch
unit
I3
Execution
unit
(b)Hardwareorganization
F1
E1
Instruction
I1
(a)Sequentialexecution
Time
F2
E2
F3
E3
(c)Pipelinedexecution
Figure8.1.Basicideaofinstructionpipelining.
Clockcycle
F1
D1
E1
W1
F2
D2
E2
W2
F3
D3
E3
W3
F4
D4
E4
Instruction
Fetch + Decode
+ Execution + Write
I1
I2
I3
I4
W4
(a)Instructionexecutiondividedintofoursteps
Interstagebuffers
D:Decode
instruction
andfetch
operands
F:Fetch
instruction
B1
E:Execute
operation
B2
(b)Hardwareorganization
W:Write
results
B3
Pipeline Performance
The
Pipeline Performance
Time
Clockcycle
F1
D1
E1
W1
F2
D2
Instruction
I1
I2
I3
I4
I5
F3
E2
W2
D3
E3
W3
F4
D4
E4
W4
F5
D5
E5
Figure8.3. Effectofanexecutionoperationtakingmorethanoneclockcycle.
Pipeline Performance
The previous pipeline is said to have been stalled for two clock
cycles.
Any condition that causes a pipeline to stall is called a hazard.
Data hazard any condition in which either the source or the
destination operands of an instruction are not available at the
time expected in the pipeline. So some operation has to be
delayed, and the pipeline stalls.
Instruction (control) hazard a delay in the availability of an
instruction causes the pipeline to stall.
Structural hazard the situation when two instructions require
the use of a given hardware resource at the same time.
Pipeline Performance
Time
Instruction
hazard
Clockcycle
F1
D1
E1
W1
D2
E2
W2
F3
D3
E3
Instruction
I1
I2
F2
I3
W3
(a)Instructionexecutionstepsinsuccessiveclockcycles
Time
Clockcycle
F1
F2
F2
F2
F2
F3
D1
idle
idle
idle
D2
D3
E1
idle
idle
idle
E2
E3
W1
idle
idle
idle
W2
Stage
F:Fetch
D:Decode
E:Execute
W:Write
Idle periods
stalls (bubbles)
W3
(b)Functionperformedbyeachprocessorstageinsuccessiveclockcycles
Figure8.4. PipelinestallcausedbyacachemissinF2.
Pipeline Performance
Structural
hazard
Load X(R1), R2
Time
Clockcycle
F1
D1
E1
W1
F2
D2
F3
E2
M2
W2
D3
E3
W3
F4
D4
E4
Instruction
I1
I2 (Load)
I3
I4
I5
F5
D5
Figure8.5. EffectofaLoadinstructiononpipelinetiming.
Pipeline Performance
Data Hazards
Data Hazards
Data Hazards
Time
Clockcycle
F1
D1
E1
W1
F2
D2
D2A
E2
W2
D3
E3
W3
F4
D4
E4
Instruction
I1 (Mul)
I2 (Add)
I3
I4
F3
W4
Figure8.6. PipelinestalledbydatadependencybetweenD2andW1.
Operand Forwarding
Instead
Source1
Source2
SRC1
SRC2
Register
file
ALU
RSLT
Destination
(a)Datapath
SRC1,SRC2
RSLT
E:Execute
(ALU)
W:Write
(Registerfile)
Forwardingpath
(b)Positionofthesourceandresultregistersintheprocessorpipeline
Side Effects
Instruction Hazards
Overview
Whenever
Unconditional Branches
Time
Clockcycle
F1
E1
Instruction
I1
I2 (Branch)
I3
Ik
Ik+1
F2
Executionunitidle
E2
F3
Fk
Ek
Fk+1
Ek+1
Figure8.8. Anidlecyclecausedbyabranchinstruction.
Time
Branch Timing
Clockcycle
I1
F1
D1
E1
W1
F2
D2
E2
F3
D3
F4
I2 (Branch)
I3
- Branch penalty
- Reducing the penalty
I4
Fk
Ik
Ik+1
Dk
Ek
Wk
Fk+1
Dk+1
E k+1
(a)BranchaddresscomputedinEx ecutestage
Time
Clockcycle
I1
F1
D1
E1
W1
F2
D2
I2 (Branch)
I3
Ik
Ik+1
F3
Dk
Ek
Wk
X
Fk
Fk+1
D k+1 E k+1
(b)BranchaddresscomputedinDecodestage
Figure8.9. Branchtiming.
D:Dispatch/
Decode
unit
E:Execute
instruction
W:Write
results
Figure8.10.UseofaninstructionqueueinthehardwareorganizationofFigure8.2b.
Conditional Braches
A
Delayed Branch
The
Delayed Branch
LOOP
NEXT
Shift_left
Decrement
Branch=0
Add
R1
R2
LOOP
R1,R3
(a)Originalprogramloop
LOOP
NEXT
Decrement
Branch=0
Shift_left
Add
R2
LOOP
R1
R1,R3
(b)Reorderedinstructions
Figure8.12.Reorderingofinstructionsforadelayedbranch.
Delayed Branch
Clockcycle
Instruction
Decrement
Branch
Shift(delayslot)
Decrement(Branchtaken)
Branch
Shift(delayslot)
Add(Branchnottaken)
Figure8.13. Executiontimingshowingthedelayslotbeingfilled
duringthelasttwopassesthroughtheloopinFigure8.12.
Time
Branch Prediction
F1
D1
E1
W1
F2
D2 /P2
E2
F3
D3
F4
Instruction
I1(Compare)
I2(Branch>0)
I3
I4
Ik
Fk
Dk
Figure8.14. Timingwhenabranchdecisionhasbeenincorrectlypredicted
asnottaken.
Branch Prediction
Influence on
Instruction Sets
Overview
Some
Addressing Modes
Addressing
Side effects
The extent to which complex addressing modes cause
the pipeline to stall
Whether a given mode is likely to be used by compilers
Recall
Load X(R1), R2
Time
Clockcycle
F1
D1
E1
W1
F2
D2
F3
E2
M2
W2
D3
E3
W3
F4
D4
E4
Instruction
I1
I2 (Load)
I3
I4
I5
Load (R1), R2
F5
D5
Figure8.5. EffectofaLoadinstructiononpipelinetiming.
Clockcycle 1
Load
X+ [R1]
[X+[R1]] [[X+[R1]]]
Time
Forward
Nextinstruction
(a)Complexaddressingmode
Add
Load
Load
Nextinstruction
X+ [R1]
[X+[R1]]
[[X+[R1]]]
(b)Simpleaddressingmode
Addressing Modes
Addressing Modes
Good
Register,
Conditional Codes
If
Conditional Codes
Add
Compare
Branch=0
R1,R2
R3,R4
...
(a)Aprogramfragment
Compare
Add
Branch=0
R3,R4
R1,R2
...
(b)Instructionsreordered
Figure8.17.Instructionreordering.
Conditional Codes
Two
conclusion:
BusA
BusB
BusC
Incrementer
Original Design
PC
Register
file
MUX
Constant4
A
ALU R
B
Instruction
decoder
IR
MDR
MAR
Memorybus
datalines
Address
lines
Register
file
BusA
Pipelined Design
BusB
Instruction
decoder
B
BusC
ALU R
PC
Incrementer
IMAR
Memoryaddress
(Instructionfetches)
Instruction
queue
MDR/Write
DMAR
Instructioncache
Memoryaddress
(Dataaccess)
- Reading an instruction from the instruction cache
- Incrementing the PC
- Decoding an instruction
- Reading from or writing into the data cache
Datacache
- Reading the contents of up to two regs
Figure8.18. Datapathmodifiedforpipelinedexecution,with
- Writing into one register in the reg file
interstagebuffersattheinputandoutputoftheALU.
- Performing an ALU operation
MDR/Read
Superscalar Operation
Overview
Superscalar
F:Instruction
fetchunit
Instructionqueue
Floating
point
unit
Dispatch
unit
W:Write
results
Integer
unit
Figure8.19. Aprocessorwithtwoexecutionunits.
Timing
Clockcycle
I1 (Fadd)
F1
D1
E1A
E 1B
E1C
W1
I2 (Add)
F2
D2
E2
W2
I3 (Fsub)
F3
D3
E3
E3
E3
I4 (Sub)
F4
D4
E4
W4
Time
W3
Figure8.20. AnexampleofinstructionexecutionflowintheprocessorofFigure8.19,
assumingnohazardsareencountered.
Out-of-Order Execution
Hazards
Exceptions
Imprecise exceptions
Precise exceptions
Clockcycle
I1 (Fadd)
F1
D1
E1A
E 1B
E 1C
W1
I2 (Add)
F2
D2
E2
I3(Fsub)
F3
D3
I4(Sub)
F4
D4
W2
E3A
(a)Delayedwrite
E 3B
E 3C
W3
E4
W4
Time
Execution Completion
I1 (Fadd)
F1
D1
E1A
E 1B
E 1C
W1
I2 (Add)
F2
D2
E2
TW2
I3 (Fsub)
F3
D3
E3A
E 3B
I4 (Sub)
F4
D4
E4
TW4
(b)Usingtemporaryregisters
Time
7
W2
E 3C
W3
W4