1. CONTROL DESIGN
2. INTRODUCTION TO PIPELINING
Two main operations of the Control Unit can be identified. Instruction sequencing - the methods by which instructions are selected for execution, or the manner in which control of the processor is transferred from one instruction to another. Instruction interpretation - the methods used for activating the control signals that cause the data processing unit to execute the instruction. A digital system consists of a data path and a control path. The data processing path is a network of functional units capable of performing certain operations on data. The control unit issues control signals to the data processing part. These control signals select the functions to be performed at specific times and route the data through the appropriate functional units. The data processing path is logically reconfigured by the control unit to perform certain sets of (micro) operations.
Instruction Sequencing
The simplest method of controlling the sequence in which instructions are executed is to have each instruction explicitly specify the address of its successor (or successors, if more than one possibility exists). Explicit inclusion of instruction addresses in all instructions has the disadvantage of substantially increasing instruction length. Most instructions in a typical program have a unique successor. If an instruction Ii is stored in memory location A, and Ii has a unique successor Ii+1, then it is natural to store Ii+1 in the location that immediately follows A (A+1, or A+k, where k is the length in words of Ii).
Use of a Program Counter (PC), or instruction address register: the address of instruction Ii+1 can then be determined by incrementing PC thus: PC ← PC + k, where k is the length in words of Ii. If Ii must be fetched from main memory one word at a time, then PC is automatically incremented by one before each new instruction word is read. Consequently, after all k words of Ii have been fetched, PC points to Ii+1. A program counter makes it unnecessary for an instruction to specify the address of its successor, if the program is a linear sequence of instructions.
Branch instructions explicitly specify a target instruction address X. An unconditional branch always alters the flow of program control by causing the operation PC ← X. A conditional branch instruction first tests some condition C within the processor; C is typically a property of a result generated by an earlier instruction and stored in a status register. If C is true, then PC ← X; otherwise PC is incremented to point to the next consecutive instruction.
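The sequencing rules above (PC increment, unconditional branch, conditional branch on a status flag) can be sketched as a tiny interpreter. The instruction encoding and mnemonics here are hypothetical, chosen only for illustration:

```python
# Minimal sketch of instruction sequencing with a program counter (PC).
# Each instruction is a hypothetical (opcode, argument) tuple.
def run(program, steps=10):
    pc, acc, status_c = 0, 0, False   # PC, an accumulator, and condition flag C
    for _ in range(steps):
        if pc >= len(program):
            break
        op, arg = program[pc]
        if op == "ADD":               # result sets condition C ("result is zero")
            acc += arg
            status_c = (acc == 0)
            pc += 1                   # unique successor: PC <- PC + 1
        elif op == "JMP":             # unconditional branch: PC <- X
            pc = arg
        elif op == "BC":              # conditional branch: PC <- X only if C holds
            pc = arg if status_c else pc + 1
        elif op == "HALT":
            break
    return acc

# ADD 5, ADD -5 (acc becomes 0, setting C), BC 4 (taken), skipped ADD, HALT
prog = [("ADD", 5), ("ADD", -5), ("BC", 4), ("ADD", 100), ("HALT", 0)]
print(run(prog))  # the taken branch skips the ADD 100, so the result is 0
```

The taken branch overwrites PC with the target, while the not-taken case falls through to the next consecutive instruction, exactly as described above.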
Instruction interpretation
The Control Unit (CU) decodes an instruction and generates control signals. Types of control signals (outputs from the CU):
Controls that select the functions to be performed by the data-path at specific times - the main job of the CU. Controls exchanged with other processors (state controls, synchronization, requests for control of buses, etc.)
Cout: directly controls the data-path; the main purpose of the CU. Cin: feedback from the data-path; can modify the sequence of Cout (conditions, interrupts, and state). Csout: to other control units (state, synchronization, busy, bus requests, confirmations). Csin: from other controllers (state, start, stop; similar to Csout).
[Figure: the control unit exchanges Cout/Cin with the data-path and Csout/Csin with other control units.]
Control Unit
The control unit tells the data-path what to do every clock cycle during the execution of instructions
This is typically specified by a finite-state diagram
[Figure: a fragment of the finite-state diagram (MAR ← PC, then IR ← M[MAR]) driving the data-path, which has control outputs Cout, condition inputs Cin, a data input, and a data output.]
Every state corresponds to one clock cycle, and the operations to be performed during the clock cycle are written within the state. Each instruction takes several clock cycles to complete
Hardwired control
Turning a state diagram into hardware is the next step. The alternatives for doing this depend on the implementation technology. One way to bound the complexity of control is by the product States * Control inputs * Control outputs where:
States = the number of states in the finite-state machine controller; Control inputs = the number of signals examined by the control unit; Control outputs = the number of control outputs generated for the hardware, including bits to specify the next state.
[Figure: hardwired control - the 12-bit op-code from the instruction register, 3 condition bits from the data-path, and the 6-bit state together address a block of hardwired logic with 2^(12+3+6) = 2^21 entries; each entry drives 40 control lines to the data-path, including the bits that specify the next state.]
Example: the finite-state diagram contains 50 states (requiring 6 bits to represent the state), 3 bits to select conditions from the data-path and memory interface unit, and 12 bits for the op-code of an instruction.
Control can be specified as a big table.
Each row of the table contains the values of the control lines needed to perform the operations required by the state, and supplies the next state number. In the figure we assume that there are 40 control lines.
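Such a table-driven controller can be pictured with a small simulation. The states and signal names below are illustrative toy values, not the 40-bit format of the figure:

```python
# Toy hardwired-control table: each state maps to (control_signals, next_state).
# The signal names and the three-state fetch sequence are illustrative only.
CONTROL_TABLE = {
    0: ({"mar_load": 1, "pc_to_mar": 1}, 1),   # MAR <- PC
    1: ({"ir_load": 1, "mem_read": 1},  2),    # IR <- M[MAR]
    2: ({"pc_inc": 1},                  0),    # PC <- PC + 1, back to fetch
}

def step(state):
    """One clock cycle: emit this state's control lines, return the next state."""
    signals, next_state = CONTROL_TABLE[state]
    return signals, next_state

state = 0
trace = []
for _ in range(3):                  # one full fetch sequence
    signals, state = step(state)
    trace.append(sorted(signals))
print(trace)
print(state)                        # back at the fetch state
```

Each dictionary row plays the role of one row of the control table: the asserted control lines plus the next-state number.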
Microprogrammed control
The solution, introduced by Maurice Wilkes, was to turn the control unit into a miniature computer by having two tables:
a table to specify control of the data-path and a table to determine control flow at the micro level
The straightforward implementation of a table is with a read-only memory (ROM). The ROM would require 2^21 words, each 40 bits wide (10 MB of ROM!). But little of this table has unique information, so its size can be reduced by keeping only the rows with unique information, at the cost of more complicated address decoding. Such a hardware construct is called a programmed logic array (PLA).
Wilkes called his invention microprogramming and attached the prefix "micro" to traditional terms used at the control level: microinstruction, microcode, microprogram. Microinstructions specify all the control signals for the data-path, plus the ability to conditionally decide which microinstruction should be executed next. Once the data-path and memory for the microinstructions are designed, control becomes essentially a programming task; that is, the task of writing an interpreter for the instruction set. The invention of microprogramming enabled the instruction set to be changed by altering the contents of control store without touching the hardware.
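The "interpreter" flavor of microprogrammed control can be sketched as a micro-PC walking a control store whose words carry control signals plus a condition selector and a jump address. All names and the two-entry dispatch table are assumptions for illustration:

```python
# Sketch of microprogrammed control: a micro-PC walks a control store whose
# words hold control signals plus a condition selector and a jump address.
CONTROL_STORE = [
    # (signals,        cond,      jump_address)
    ({"fetch": 1},     None,      None),   # 0: fall through to 1
    ({"decode": 1},    "opcode",  None),   # 1: dispatch on the opcode
    ({"alu_add": 1},   None,      0),      # 2: ADD routine, jump back to fetch
    ({"mem_read": 1},  None,      0),      # 3: LOAD routine, jump back to fetch
]
DISPATCH = {"ADD": 2, "LOAD": 3}           # opcode -> microroutine entry point

def interpret(opcode, max_cycles=4):
    upc, executed = 0, []
    for _ in range(max_cycles):
        signals, cond, jump = CONTROL_STORE[upc]
        executed.extend(signals)
        if cond == "opcode":
            upc = DISPATCH[opcode]         # conditional micro-jump
        elif jump is not None:
            upc = jump                     # unconditional micro-jump
        else:
            upc += 1                       # default: next microinstruction
        if upc == 0:
            break                          # back at fetch: instruction done
    return executed

print(interpret("ADD"))   # ['fetch', 'decode', 'alu_add']
```

Changing the instruction set amounts to changing CONTROL_STORE and DISPATCH, which mirrors Wilkes's point that control becomes a programming task.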
[Figure: microprogrammed control - a control store drives the control lines to the data-path.]
The structure of a microprogram is very similar to the state diagram, with a microinstruction for each state in the diagram. While it doesn't matter to the hardware how the control lines are grouped within a microinstruction, control lines performing related functions are traditionally placed next to each other for ease of understanding. Groups of related control lines are called fields and are given names in a microinstruction format:
Destination | ALU operation | Source 1 | Source 2 | Misc | Cond | Jump address
[Figure: the Cond and Jump address fields, together with the op-code in the instruction register, select the next microinstruction.]
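The grouping of control lines into named fields amounts to bit-packing within the microinstruction word. The field widths below are assumptions chosen for illustration, not an actual format:

```python
# Sketch: packing the named fields of a microinstruction into one word.
# Field widths here are assumptions for illustration, not a real format.
FIELDS = [("dest", 4), ("alu_op", 4), ("src1", 4), ("src2", 4),
          ("misc", 4), ("cond", 3), ("jump_addr", 6)]

def pack(values):
    word, shift = 0, 0
    for name, width in FIELDS:
        v = values.get(name, 0)
        assert 0 <= v < (1 << width), f"{name} out of range"
        word |= v << shift
        shift += width
    return word

def unpack(word):
    out, shift = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return out

mi = {"dest": 3, "alu_op": 1, "src1": 3, "src2": 5, "cond": 2, "jump_addr": 17}
assert unpack(pack(mi)) == {**{n: 0 for n, _ in FIELDS}, **mi}
print(f"{pack(mi):08x}")
```

The hardware never sees the field names; they exist only so humans can read and write microcode, which is exactly the point made above.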
Introduction to pipelining
Pipeline processing is an implementation technique in which arithmetic sub-operations or the phases of a computer instruction cycle overlap in execution. Pipelining exploits parallelism among instructions by overlapping them; this is called Instruction-Level Parallelism (ILP). Pipelining is a technique of decomposing a sequential process into sub-operations, with each sub-process being executed in a special dedicated segment that operates concurrently with all other segments. The result obtained from the computation in each segment is transferred to the next segment in the pipeline; the final result is obtained after the data have passed through all segments. The name "pipeline" implies a flow of information analogous to an industrial assembly line. It is characteristic of pipelines that several computations can be in progress in distinct segments at the same time.
The overlapping of computation in distinct segments is made possible by associating a register with each segment in the pipeline The registers provide isolation between each segment so that each can operate on distinct data simultaneously. If there are k stages each of equal delay, then the processing throughput (# of data/instruction processed per second) increases roughly by a factor of k. There are two areas of computer design where the pipeline organization is applicable
An arithmetic pipeline divides an arithmetic operation into suboperations for execution in the pipeline segments An instruction pipeline operates on a stream of instructions by overlapping the fetch, decode, and execute phases of the instruction cycle
Four-segment pipeline
The simplest way of viewing the pipeline structure is to imagine that each segment consists of an input register followed by a combinational circuit The register holds the data and the combinational circuit performs the suboperation in the particular segment A clock is applied to all registers after enough time has elapsed to perform all segment activity. In this way the information flows through the pipeline one step at a time
[Figure: four-segment pipeline - Input → R1 → S1 → R2 → S2 → R3 → S3 → R4 → S4, with a common clock applied to all registers R1..R4.]
[Figure: arithmetic pipeline computing Ai * Bi + Ci - Ai and Bi are latched in R1 and R2; a multiplier produces R3 ← R1 * R2 while Ci is latched in R4; an adder produces R5 ← R3 + R4.]
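The register-transfer behavior of this multiplier/adder pipeline can be simulated cycle by cycle. This is a sketch: the register roles follow the Ai * Bi + Ci structure above, and the one-cycle delay on the Ci input (so it arrives in R4 together with the product in R3) is an assumption about the figure's timing:

```python
# Cycle-by-cycle sketch of the pipeline computing Ai * Bi + Ci.
# R1 <- Ai, R2 <- Bi; R3 <- R1 * R2, R4 <- Ci; R5 <- R3 + R4.
def pipeline_mul_add(a, b, c):
    r1 = r2 = r3 = r4 = r5 = None
    out = []
    for cycle in range(len(a) + 2):                  # 2 extra cycles to drain
        # Combinational logic computes from the *old* register contents;
        # then every register is updated on the same clock edge.
        nxt_r5 = (r3 + r4) if r3 is not None else None
        nxt_r3 = (r1 * r2) if r1 is not None else None
        nxt_r4 = c[cycle - 1] if 1 <= cycle <= len(c) else None  # Ci, delayed
        nxt_r1 = a[cycle] if cycle < len(a) else None
        nxt_r2 = b[cycle] if cycle < len(b) else None
        r1, r2, r3, r4, r5 = nxt_r1, nxt_r2, nxt_r3, nxt_r4, nxt_r5  # clock
        if r5 is not None:
            out.append(r5)
    return out

print(pipeline_mul_add([1, 2, 3], [4, 5, 6], [7, 8, 9]))  # [11, 18, 27]
```

Note how a new (Ai, Bi) pair enters R1/R2 every cycle while earlier pairs are still in the multiplier and adder stages: three computations are in flight at once.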
Arithmetic Pipeline
Pipeline arithmetic units are usually found in very high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems. A pipeline multiplier is essentially an array multiplier with special adders designed to minimize the carry propagation time through the partial products. Floating-point operations are easily decomposed into suboperations, as in the following example of a pipeline unit for floating-point addition and subtraction.
X = A × 2^a    Y = B × 2^b
A and B are two fractions that represent the mantissas, and a and b are the exponents. The suboperations performed in the four segments of the pipelined unit are: 1. Compare the exponents. 2. Align the mantissas. 3. Add or subtract the mantissas. 4. Normalize the result.
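The four suboperations can be sketched directly on (mantissa, exponent) pairs. This is a minimal illustration of the algorithm, not a hardware model; it assumes normalized fractional mantissas with 0.5 <= |m| < 1:

```python
# Sketch of the four suboperations of pipelined floating-point addition,
# on numbers represented as (mantissa, exponent) with value m * 2**e.
def fp_add(x, y):
    (ma, ea), (mb, eb) = x, y
    # 1. Compare the exponents.
    diff = ea - eb
    exp = max(ea, eb)
    # 2. Align the mantissas (shift the smaller-exponent operand right).
    if diff > 0:
        mb /= 2 ** diff
    else:
        ma /= 2 ** (-diff)
    # 3. Add the mantissas.
    m = ma + mb
    # 4. Normalize the result so that 0.5 <= |m| < 1 (or m == 0).
    while abs(m) >= 1:
        m /= 2
        exp += 1
    while m != 0 and abs(m) < 0.5:
        m *= 2
        exp -= 1
    return m, exp

# 0.75 * 2**3 (= 6) plus 0.5 * 2**2 (= 2) gives 8 = 0.5 * 2**4
print(fp_add((0.75, 3), (0.5, 2)))  # (0.5, 4)
```

In the pipelined unit each numbered step runs in its own segment, so four different additions can occupy the four segments simultaneously.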
[Figure: inputs to the floating-point pipeline - exponents a, b and mantissas A, B.]
Instruction Pipeline
Instruction execution involves several operations which are executed successively. This implies a large amount of hardware, but only one part of this hardware works at a given moment. Pipelining is an implementation technique whereby multiple instructions are overlapped in execution - different parts of the hardware work for different instructions at the same time. The pipeline organization of a CPU is similar to an assembly line: the work to be done in an instruction is broken into smaller steps (pieces), each of which takes a fraction of the time needed to complete the entire instruction.
[Figure: pipeline for floating-point add and subtract - Segment 1: compare exponents (form the difference); Segment 2: choose the exponent, align the mantissas; Segment 3: add or subtract the mantissas; Segment 4: adjust the exponent, normalize the result.]
[Flowchart: four-segment instruction pipeline - Segment 1: fetch instruction; Segment 2: decode and test for a branch; Segment 4: execute instruction, then test for an interrupt and handle it if present.]
The behavior of a pipeline can be illustrated with a space-time diagram. This is a diagram that shows the segment utilization as a function of time. In the space-time diagram of a four-segment pipeline, the horizontal axis displays the time in clock cycles and the vertical axis gives the segment number; the diagram shows six tasks T1 through T6 executed in four segments. No matter how many segments there are in the system, once the pipeline is full, it takes only one clock period to obtain an output.
Pipeline Speedup
Now consider the case where a k-segment pipeline with a clock cycle time tp is used to execute n tasks. The first task T1 requires a time equal to k·tp to complete its operation, since there are k segments in the pipe. The remaining n − 1 tasks emerge from the pipe at the rate of one task per clock cycle, so they are completed after a time equal to (n − 1)·tp. Therefore, to complete n tasks using a k-segment pipeline requires k + (n − 1) clock cycles. For example, with four segments and six tasks, the time required to complete all the operations is 4 + (6 − 1) = 9 clock cycles.

Next consider a non-pipeline unit that performs the same operation and takes a time equal to tn to complete each task. The total time required for n tasks is n·tn. The speedup of pipeline processing over an equivalent non-pipeline processing is defined by the ratio:

    S = n·tn / [(k + n − 1)·tp]

As the number of tasks increases, n becomes much larger than k − 1, and k + n − 1 approaches the value of n. Under this condition, the speedup becomes:

    S = tn / tp

If we assume that the time it takes to process a task is the same in the pipeline and non-pipeline circuits, we have tn = k·tp. With this assumption, the speedup reduces to:

    S = k·tp / tp = k

This equation shows that the theoretical maximum speedup that a pipeline can provide is k, where k is the number of segments in the pipeline. However:

A greater number of stages increases the overhead of moving information between stages and of synchronization between stages. With the number of stages, the complexity of the CPU grows. It is difficult to keep a large pipeline at its maximum rate because of pipeline hazards.

Instruction Pipelining

An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments. This causes the instruction fetch and execute phases to overlap and perform simultaneous operations. In the most general case, the computer needs to process each instruction with the following sequence of steps:

Fetch the instruction from memory. Decode the instruction. Calculate the effective address. Fetch the operands from memory. Execute the instruction. Store the result in the proper place.
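The cycle count k + (n − 1) and the resulting overlap can be checked with a short script that prints a space-time diagram. The stage mnemonics (FI, DI, CO, FO, EI, WO) abbreviate the six steps of instruction processing; the numeric values are the worked example's:

```python
# Space-time diagram and speedup for an ideal k-segment pipeline with n tasks.
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]   # fetch, decode, calc addr,
                                                # fetch operand, execute, write
def total_cycles(k, n):
    return k + (n - 1)            # k cycles to fill, then one result per cycle

def space_time(k, n):
    """One row per task; each entry is the stage occupied at that clock cycle."""
    rows = []
    for task in range(n):
        row = ["--"] * total_cycles(k, n)
        for stage in range(k):
            row[task + stage] = STAGES[stage]   # task enters stage s at cycle task+s
        rows.append(" ".join(row))
    return rows

for line in space_time(4, 6):
    print(line)

k, n = 4, 6
speedup = n * k / total_cycles(k, n)   # assumes tn = k * tp, so tn/tp = k
print(total_cycles(k, n))              # 9, matching the 4 + (6 - 1) example
print(round(speedup, 2))
```

For six tasks on four segments the speedup is 24/9 ≈ 2.67, well below the theoretical maximum of k = 4, illustrating why the limit is only approached as n grows large.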
In general, there are three major difficulties (conflicts, or hazards) that cause the instruction pipeline to deviate from its normal operation. Pipeline hazards are situations that prevent the next instruction in the instruction stream from executing during its designated clock cycle; the instruction is said to be stalled. When an instruction is stalled, all instructions later in the pipeline than the stalled instruction are also stalled. Typical conflicts:
Resource conflicts (structural hazards) occur when a certain resource (memory, functional unit) is requested by more than one instruction at the same time. Data dependency conflicts (data hazards) arise when an instruction depends on the result of a previous instruction, but this result is not yet available. Branch difficulties (control hazards) arise from branch and other instructions that change the value of the Program Counter.
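A data hazard of the read-after-write kind can be illustrated with a toy dependency check. The three-register instruction format below is hypothetical, used only to show the test a pipeline must perform:

```python
# Sketch: detecting a read-after-write (RAW) data hazard between nearby
# instructions. Instructions are hypothetical (dest, src1, src2) tuples.
def raw_hazard(older, newer):
    """True if `newer` reads a register that `older` has not yet written back."""
    dest, _, _ = older
    _, s1, s2 = newer
    return dest in (s1, s2)

add = ("r1", "r2", "r3")     # ADD r1, r2, r3
sub = ("r4", "r1", "r5")     # SUB r4, r1, r5  -- reads r1 before ADD writes it
mul = ("r6", "r7", "r8")     # MUL r6, r7, r8  -- independent

print(raw_hazard(add, sub))  # True: the pipeline must stall or forward
print(raw_hazard(add, mul))  # False: no dependency, no stall needed
```

When the check fires, the pipeline either stalls the dependent instruction or forwards the result directly between stages.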
DATA DEPENDENCY (read after write)
[Space-time diagram: a data dependency stalls the pipeline - the dependent instruction's operand fetch (FO) is delayed until the preceding instruction has written its result (WO).]
The instruction following the branch is fetched; before the branch's DI stage finishes, it is not known that a branch is being executed. The fetched instruction is later discarded.
[Diagram: if the branch is taken, fetching continues from TARGET; at that moment both the condition (set by the ADD) and the target address are known.]
[Space-time diagram: a conditional branch BEZ TARGET flows through the stages FI, DI, CO, FO, EI, WO; instruction i+1 is fetched behind it but stalled, and once the branch resolves, fetching restarts at TARGET, which then proceeds through FI, DI, CO, FO, EI, WO.]