1. CONTROL DESIGN
2. INTRODUCTION TO PIPELINING
Two main operations of the Control Unit can be identified. Instruction sequencing - the methods by which instructions are selected for execution, or the manner in which control of the processor is transferred from one instruction to another. Instruction interpretation - the methods used for activating the control signals that cause the data processing unit to execute the instruction. A digital system consists of a data path and a control path. The data processing path is a network of functional units capable of performing certain operations on data. The control unit issues control signals to the data processing part. These control signals select the functions to be performed at specific times and route the data through the appropriate functional units. The data processing path is logically reconfigured by the control unit to perform certain sets of (micro) operations.
Instruction Sequencing
The simplest method of controlling the sequence in which instructions are executed is to have each instruction explicitly specify the address of its successor (or successors, if more than one possibility exists). Explicit inclusion of instruction addresses in all instructions has the disadvantage of substantially increasing instruction length. Most instructions in a typical program have a unique successor. If an instruction Ii is stored in memory location A, and Ii has a unique successor Ii+1, then it is natural to store Ii+1 in the location that immediately follows A (A+1, or A+k, where k is the length in words of Ii).
Use of a Program Counter (PC), or instruction address register: the address of instruction Ii+1 can then be determined by incrementing PC thus: PC ← PC + k, where k is the length in words of Ii. If Ii must be fetched from main memory one word at a time, then PC is automatically incremented by one before each new instruction word is read. Consequently, after all k words of Ii have been fetched, PC points to Ii+1. A program counter makes it unnecessary for an instruction to specify the address of its successor, if the program is a linear sequence of instructions.
Branch instructions explicitly specify a target instruction address X. An unconditional branch always alters the flow of program control by causing the operation PC ← X. A conditional branch instruction first tests some condition C within the processor; C is typically a property of a result generated by an earlier instruction and stored in a status register. If C is true, then PC ← X; otherwise PC is incremented to point to the next consecutive instruction.
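The sequencing rules above (PC increment, unconditional branch, conditional branch on a status flag) can be sketched as a tiny interpreter. The instruction encoding and mnemonics here are hypothetical, chosen only for illustration:

```python
# Minimal sketch of instruction sequencing with a program counter (PC).
# Each instruction is a hypothetical (opcode, argument) tuple.
def run(program, steps=10):
    pc, acc, status_c = 0, 0, False   # PC, an accumulator, and condition flag C
    for _ in range(steps):
        if pc >= len(program):
            break
        op, arg = program[pc]
        if op == "ADD":               # result sets condition C ("result is zero")
            acc += arg
            status_c = (acc == 0)
            pc += 1                   # unique successor: PC <- PC + 1
        elif op == "JMP":             # unconditional branch: PC <- X
            pc = arg
        elif op == "BC":              # conditional branch: PC <- X only if C holds
            pc = arg if status_c else pc + 1
        elif op == "HALT":
            break
    return acc

# ADD 5, ADD -5 (acc becomes 0, setting C), BC 4 (taken), skipped ADD, HALT
prog = [("ADD", 5), ("ADD", -5), ("BC", 4), ("ADD", 100), ("HALT", 0)]
print(run(prog))  # the taken branch skips the ADD 100, so the result is 0
```

The taken branch overwrites PC with the target, while the not-taken case falls through to the next consecutive instruction, exactly as described above.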
Instruction interpretation
The Control Unit (CU) decodes an instruction and generates control signals. Types of control signals (outputs from the CU):
Controls that select the functions to be performed by the data-path at specific times - the main job of the CU. Controls exchanged with other processors (state controls, synchronization, requests for control of buses, etc.)
Cout: directly controls the data-path; the main purpose of the CU. Cin: feedback from the data-path; can modify the sequence of Cout (conditions, interrupts, and state). Csout: to other control units (state, synchronization, busy, bus requests, confirmations). Csin: from other controllers (state, start, stop; similar to Csout).
[Figure: the control unit exchanges Cout/Cin with the data-path and Csout/Csin with other control units.]
Control Unit
The control unit tells the data-path what to do every clock cycle during the execution of instructions
This is typically specified by a finite-state diagram
[Figure: a fragment of the finite-state diagram (MAR ← PC, then IR ← M[MAR]) driving the data-path, which has control outputs Cout, condition inputs Cin, a data input, and a data output.]
Every state corresponds to one clock cycle, and the operations to be performed during the clock cycle are written within the state. Each instruction takes several clock cycles to complete
Hardwired control
Turning a state diagram into hardware is the next step. The alternatives for doing this depend on the implementation technology. One way to bound the complexity of control is by the product States * Control inputs * Control outputs where:
States = the number of states in the finite-state machine controller; Control inputs = the number of signals examined by the control unit; Control outputs = the number of control outputs generated for the hardware, including bits to specify the next state.
[Figure: hardwired control - the 12-bit op-code from the instruction register, 3 condition bits from the data-path, and the 6-bit state together address a block of hardwired logic with 2^(12+3+6) = 2^21 entries; each entry drives 40 control lines to the data-path, including the bits that specify the next state.]
Example: the finite-state diagram contains 50 states (requiring 6 bits to represent the state), 3 bits to select conditions from the data-path and memory interface unit, and 12 bits for the op-code of an instruction.
Control can be specified as a big table.
Each row of the table contains the values of the control lines needed to perform the operations required by the state, and supplies the next state number. In the figure we assume that there are 40 control lines.
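Such a table-driven controller can be pictured with a small simulation. The states and signal names below are illustrative toy values, not the 40-bit format of the figure:

```python
# Toy hardwired-control table: each state maps to (control_signals, next_state).
# The signal names and the three-state fetch sequence are illustrative only.
CONTROL_TABLE = {
    0: ({"mar_load": 1, "pc_to_mar": 1}, 1),   # MAR <- PC
    1: ({"ir_load": 1, "mem_read": 1},  2),    # IR <- M[MAR]
    2: ({"pc_inc": 1},                  0),    # PC <- PC + 1, back to fetch
}

def step(state):
    """One clock cycle: emit this state's control lines, return the next state."""
    signals, next_state = CONTROL_TABLE[state]
    return signals, next_state

state = 0
trace = []
for _ in range(3):                  # one full fetch sequence
    signals, state = step(state)
    trace.append(sorted(signals))
print(trace)
print(state)                        # back at the fetch state
```

Each dictionary row plays the role of one row of the control table: the asserted control lines plus the next-state number.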
Microprogrammed control
The solution, introduced by Maurice Wilkes, was to turn the control unit into a miniature computer by having two tables:
a table to specify control of the data-path and a table to determine control flow at the micro level
The straightforward implementation of a table is with a read-only memory (ROM). The ROM would require 2^21 words, each 40 bits wide (10 MB of ROM!). But little of this table has unique information, so its size can be reduced by keeping only the rows with unique information, at the cost of more complicated address decoding. Such a hardware construct is called a programmed logic array (PLA).
Wilkes called his invention microprogramming and attached the prefix "micro" to traditional terms used at the control level: microinstruction, microcode, microprogram. Microinstructions specify all the control signals for the data-path, plus the ability to conditionally decide which microinstruction should be executed next. Once the data-path and memory for the microinstructions are designed, control becomes essentially a programming task; that is, the task of writing an interpreter for the instruction set. The invention of microprogramming enabled the instruction set to be changed by altering the contents of control store without touching the hardware.
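The "interpreter" flavor of microprogrammed control can be sketched as a micro-PC walking a control store whose words carry control signals plus a condition selector and a jump address. All names and the two-entry dispatch table are assumptions for illustration:

```python
# Sketch of microprogrammed control: a micro-PC walks a control store whose
# words hold control signals plus a condition selector and a jump address.
CONTROL_STORE = [
    # (signals,        cond,      jump_address)
    ({"fetch": 1},     None,      None),   # 0: fall through to 1
    ({"decode": 1},    "opcode",  None),   # 1: dispatch on the opcode
    ({"alu_add": 1},   None,      0),      # 2: ADD routine, jump back to fetch
    ({"mem_read": 1},  None,      0),      # 3: LOAD routine, jump back to fetch
]
DISPATCH = {"ADD": 2, "LOAD": 3}           # opcode -> microroutine entry point

def interpret(opcode, max_cycles=4):
    upc, executed = 0, []
    for _ in range(max_cycles):
        signals, cond, jump = CONTROL_STORE[upc]
        executed.extend(signals)
        if cond == "opcode":
            upc = DISPATCH[opcode]         # conditional micro-jump
        elif jump is not None:
            upc = jump                     # unconditional micro-jump
        else:
            upc += 1                       # default: next microinstruction
        if upc == 0:
            break                          # back at fetch: instruction done
    return executed

print(interpret("ADD"))   # ['fetch', 'decode', 'alu_add']
```

Changing the instruction set amounts to changing CONTROL_STORE and DISPATCH, which mirrors Wilkes's point that control becomes a programming task.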
[Figure: microprogrammed control - a control store drives the control lines to the data-path.]
The structure of a microprogram is very similar to the state diagram, with a microinstruction for each state in the diagram. While it doesn't matter to the hardware how the control lines are grouped within a microinstruction, control lines performing related functions are traditionally placed next to each other for ease of understanding. Groups of related control lines are called fields and are given names in a microinstruction format:
Destination | ALU operation | Source 1 | Source 2 | Misc | Cond | Jump address
[Figure: the Cond and Jump address fields, together with the op-code in the instruction register, select the next microinstruction.]
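The grouping of control lines into named fields amounts to bit-packing within the microinstruction word. The field widths below are assumptions chosen for illustration, not an actual format:

```python
# Sketch: packing the named fields of a microinstruction into one word.
# Field widths here are assumptions for illustration, not a real format.
FIELDS = [("dest", 4), ("alu_op", 4), ("src1", 4), ("src2", 4),
          ("misc", 4), ("cond", 3), ("jump_addr", 6)]

def pack(values):
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
        shift += width
    return word

def unpack(word):
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out

mi = {"dest": 3, "alu_op": 1, "src1": 3, "src2": 5, "cond": 2, "jump_addr": 17}
assert unpack(pack(mi)) == {**{n: 0 for n, _ in FIELDS}, **mi}
print(f"{pack(mi):08x}")
```

The hardware never sees the field names; they exist only so humans can read and write microcode, which is exactly the point made above.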
Introduction to pipelining
Pipeline processing is an implementation technique in which arithmetic sub-operations or the phases of a computer instruction cycle overlap in execution. Pipelining exploits parallelism among instructions by overlapping them; this is called Instruction-Level Parallelism (ILP). Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-process being executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from the computation in each segment is transferred to the next segment in the pipeline; the final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time.
The overlapping of computation in distinct segments is made possible by associating a register with each segment in the pipeline The registers provide isolation between each segment so that each can operate on distinct data simultaneously. If there are k stages each of equal delay, then the processing throughput (# of data/instruction processed per second) increases roughly by a factor of k. There are two areas of computer design where the pipeline organization is applicable
An arithmetic pipeline divides an arithmetic operation into suboperations for execution in the pipeline segments An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle
Four-segment pipeline
The simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit The register holds the data and the combinational circuit performs the suboperation in the particular segment A clock is applied to all registers after enough time has elapsed to perform all segment activity. In this way the information flows through the pipeline one step at a time
[Figure: four-segment pipeline - Input → R1 → S1 → R2 → S2 → R3 → S3 → R4 → S4, with a common clock applied to all registers R1..R4.]
[Figure: arithmetic pipeline computing Ai * Bi + Ci - Ai and Bi are latched in R1 and R2; a multiplier produces R3 ← R1 * R2 while Ci is latched in R4; an adder produces R5 ← R3 + R4.]
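The register-transfer behavior of this multiplier/adder pipeline can be simulated cycle by cycle. This is a sketch: the register roles follow the Ai * Bi + Ci structure above, and the one-cycle delay on the Ci input (so it arrives in R4 together with the product in R3) is an assumption about the figure's timing:

```python
# Cycle-by-cycle sketch of the pipeline computing Ai * Bi + Ci.
# R1 <- Ai, R2 <- Bi; R3 <- R1 * R2, R4 <- Ci; R5 <- R3 + R4.
def pipeline_mul_add(a, b, c):
    r1 = r2 = r3 = r4 = r5 = None
    out = []
    for cycle in range(len(a) + 2):                  # 2 extra cycles to drain
        # Combinational logic computes from the *old* register contents;
        # then every register is updated on the same clock edge.
        nxt_r5 = (r3 + r4) if r3 is not None else None
        nxt_r3 = (r1 * r2) if r1 is not None else None
        nxt_r4 = c[cycle - 1] if 1 <= cycle <= len(c) else None  # Ci, delayed
        nxt_r1 = a[cycle] if cycle < len(a) else None
        nxt_r2 = b[cycle] if cycle < len(b) else None
        r1, r2, r3, r4, r5 = nxt_r1, nxt_r2, nxt_r3, nxt_r4, nxt_r5  # clock
        if r5 is not None:
            out.append(r5)
    return out

print(pipeline_mul_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

Note how a new (Ai, Bi) pair enters R1/R2 every cycle while earlier pairs are still in the multiplier and adder stages: three computations are in flight at once.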
Arithmetic Pipeline
Pipeline arithmetic units are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. A pipeline multiplier is essentially an array multiplier with special adders designed to minimize the carry propagation time through the partial products. Floating-point operations are easily decomposed into suboperations, as in the following example of a pipeline unit for floating-point addition and subtraction.
X = A × 2^a    Y = B × 2^b
A and B are two fractions that represent the mantissas, and a and b are the exponents. The suboperations performed in the four segments of the pipelined unit are: 1. Compare the exponents. 2. Align the mantissas. 3. Add or subtract the mantissas. 4. Normalize the result.
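The four suboperations can be sketched directly on (mantissa, exponent) pairs. This is a minimal illustration of the algorithm, not a hardware model; it assumes normalized fractional mantissas with 0.5 <= |m| < 1:

```python
# Sketch of the four suboperations of pipelined floating-point addition,
# on numbers represented as (mantissa, exponent) with value m * 2**e.
def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # 1. Compare the exponents.
    diff = ea - eb
    exp = max(ea, eb)
    # 2. Align the mantissas (shift the smaller-exponent operand right).
    if diff > 0:
        mb /= 2 ** diff
    else:
        ma /= 2 ** (-diff)
    # 3. Add the mantissas.
    m = ma + mb
    # 4. Normalize the result so that 0.5 <= |m| < 1 (or m == 0).
    while abs(m) >= 1:
        m /= 2
        exp += 1
    while m != 0 and abs(m) < 0.5:
        m *= 2
        exp -= 1
    return m, exp

# 0.75 * 2**3 (= 6) plus 0.5 * 2**2 (= 2) gives 8 = 0.5 * 2**4
print(fp_add((0.75, 3), (0.5, 2)))  # (0.5, 4)
```

In the pipelined unit each numbered step runs in its own segment, so four different additions can occupy the four segments simultaneously.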
[Figure: inputs to the floating-point pipeline - exponents a, b and mantissas A, B.]
Instruction Pipeline
Instruction execution involves several operations which are executed successively. This implies a large amount of hardware, but only one part of this hardware works at a given moment. Pipelining is an implementation technique whereby multiple instructions are overlapped in execution - different parts of the hardware work for different instructions at the same time. The pipeline organization of a CPU is similar to an assembly line: the work to be done in an instruction is broken into smaller steps (pieces), each of which takes a fraction of the time needed to complete the entire instruction.
[Figure: pipeline for floating-point add and subtract - Segment 1: compare exponents (form the difference); Segment 2: choose the exponent, align the mantissas; Segment 3: add or subtract the mantissas; Segment 4: adjust the exponent, normalize the result.]
[Flowchart: four-segment instruction pipeline - Segment 1: fetch instruction; Segment 2: decode and test for a branch; Segment 4: execute instruction, then test for an interrupt and handle it if present.]
The behavior of a pipeline can be illustrated with a space-time diagram. This is a diagram that shows the segment utilization as a function of time. In the space-time diagram of a four-segment pipeline, the horizontal axis displays the time in clock cycles and the vertical axis gives the segment number; the diagram shows six tasks T1 through T6 executed in four segments. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
Pipeline Speedup
Now consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks. The first task T1 requires a time equal to k·tp to complete its operation, since there are k segments in the pipe. The remaining n − 1 tasks emerge from the pipe at the rate of one task per clock cycle, so they are completed after a time equal to (n − 1)·tp. Therefore, to complete n tasks using a k-segment pipeline requires k + (n − 1) clock cycles. For example, with four segments and six tasks, the time required to complete all the operations is 4 + (6 − 1) = 9 clock cycles.

Next consider a non-pipeline unit that performs the same operation and takes a time equal to tn to complete each task. The total time required for n tasks is n·tn. The speedup of pipeline processing over an equivalent non-pipeline processing is defined by the ratio:

    S = n·tn / [(k + n − 1)·tp]

As the number of tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value of n. Under this condition, the speedup becomes:

    S = tn / tp

If we assume that the time it takes to process a task is the same in the pipeline and non-pipeline circuits, we have tn = k·tp. With this assumption, the speedup reduces to:

    S = k·tp / tp = k

This equation shows that the theoretical maximum speedup that a pipeline can provide is k, where k is the number of segments in the pipeline. However:

A greater number of stages increases the overhead of moving information between stages and of synchronization between stages. With the number of stages, the complexity of the CPU grows. It is difficult to keep a large pipeline at its maximum rate because of pipeline hazards.

Instruction Pipelining

An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. In the most general case, the computer needs to process each instruction with the following sequence of steps:

Fetch the instruction from memory. Decode the instruction. Calculate the effective address. Fetch the operands from memory. Execute the instruction. Store the result in the proper place.
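The cycle count k + (n − 1) and the resulting overlap can be checked with a short script that prints a space-time diagram. The stage mnemonics (FI, DI, CO, FO, EI, WO) abbreviate the six steps of instruction processing; the numeric values are the worked example's:

```python
# Space-time diagram and speedup for an ideal k-segment pipeline with n tasks.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]   # fetch, decode, calc addr,
                                                # fetch operand, execute, write
def total_cycles(k, n):
    return k + (n - 1)            # k cycles to fill, then one result per cycle

def space_time(k, n):
    """One row per task; each entry is the stage occupied at that clock cycle."""
    rows = []
    for task in range(n):
        row = ["--"] * total_cycles(k, n)
        for stage in range(k):
            row[task + stage] = STAGES[stage]   # task enters stage s at cycle task+s
        rows.append(" ".join(row))
    return rows

for line in space_time(4, 6):
    print(line)

k, n = 4, 6
speedup = n * k / total_cycles(k, n)   # assumes tn = k * tp, so tn/tp = k
print(total_cycles(k, n))              # 9, matching the 4 + (6 - 1) example
print(round(speedup, 2))
```

For six tasks on four segments the speedup is 24/9 ≈ 2.67, well below the theoretical maximum of k = 4, illustrating why the limit is only approached as n grows large.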
In general, there are three major difficulties (conflicts, or hazards) that cause the instruction pipeline to deviate from its normal operation. Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle; the instruction is said to be stalled. When an instruction is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. Typical conflicts:
Resource conflicts (structural hazards) occur when a certain resource (memory, functional unit) is requested by more than one instruction at the same time. Data dependency conflicts (data hazards) arise when an instruction depends on the result of a previous instruction, but this result is not yet available. Branch difficulties (control hazards) arise from branch and other instructions that change the value of the Program Counter.
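A data hazard of the read-after-write kind can be illustrated with a toy dependency check. The three-register instruction format below is hypothetical, used only to show the test a pipeline must perform:

```python
# Sketch: detecting a read-after-write (RAW) data hazard between nearby
# instructions. Instructions are hypothetical (dest, src1, src2) tuples.
def raw_hazard(older, newer):
    """True if `newer` reads a register that `older` has not yet written back."""
    dest, _, _ = older
    _, s1, s2 = newer
    return dest in (s1, s2)

add = ("r1", "r2", "r3")     # ADD r1, r2, r3
sub = ("r4", "r1", "r5")     # SUB r4, r1, r5  -- reads r1 before ADD writes it
mul = ("r6", "r7", "r8")     # MUL r6, r7, r8  -- independent

print(raw_hazard(add, sub))  # True: the pipeline must stall or forward
print(raw_hazard(add, mul))  # False: no dependency, no stall needed
```

When the check fires, the pipeline either stalls the dependent instruction or forwards the result directly between stages.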
DATA DEPENDENCY (read after write)
[Space-time diagram: a data dependency stalls the pipeline - the dependent instruction's operand fetch (FO) is delayed until the preceding instruction has written its result (WO).]
The instruction following the branch is fetched; before the branch's DI stage finishes, it is not known that a branch is being executed. The fetched instruction is later discarded.
[Diagram: if the branch is taken, fetching continues from TARGET; at that moment both the condition (set by the ADD) and the target address are known.]
[Space-time diagram: a conditional branch BEZ TARGET flows through the stages FI, DI, CO, FO, EI, WO; instruction i+1 is fetched behind it but stalled, and once the branch resolves, fetching restarts at TARGET, which then proceeds through FI, DI, CO, FO, EI, WO.]