
William Stallings Computer Organization and Architecture 8th Edition

CHAPTER 1 INTRODUCTION

Architecture and Organization


Architecture is those attributes visible to the programmer
Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques. e.g. Is there a multiply instruction?

Organization is how features are implemented


Control signals, interfaces, memory technology. e.g. Is there a hardware multiply unit or is it done by repeated addition?
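
To make the distinction concrete, here is a minimal sketch (purely illustrative, not from the slides): two organizations that satisfy the same architectural contract of "there is a multiply operation", one modelling a hardware multiply unit and one using repeated addition.

```python
# Same architecture (a multiply operation exists), two organizations.

def multiply_hw(a: int, b: int) -> int:
    """Organization A: a hardware multiply unit (modelled by Python's *)."""
    return a * b

def multiply_repeated_add(a: int, b: int) -> int:
    """Organization B: the same operation done by repeated addition."""
    result = 0
    for _ in range(abs(b)):
        result += a
    return result if b >= 0 else -result

# The programmer sees no difference; only speed differs.
assert multiply_hw(7, -6) == multiply_repeated_add(7, -6) == -42
```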

Architecture and Organization


Family Concept
All members of the Intel x86 family share the same basic architecture
The IBM System/370 family shares the same basic architecture
This gives code compatibility (at least backwards)

Organization differs between different versions

Structure and Function


Computer
Complex system: How can we design/describe it?

Hierarchical system:
A set of interrelated subsystems, each subsystem hierarchic in structure until some lowest level of elementary subsystems is reached

At each level of the system, the designer is concerned with structure and function.

Structure and Function


Structure is the way in which components relate to each other
Function is the operation of individual components as part of the structure

Function
General computer functions:
Data processing
Data storage
Data movement
Control

Operations
Data movement
Ex., keyboard to screen

Functional View of the Computer

Operations
Storage
Ex., Internet download to disk; playing an mp3 file stored in memory to earphones attached to the same PC.

Operations
Processing from/to storage
Any number-crunching application that takes data from memory and stores the result back in memory. Ex., updating a bank statement

Operations
Processing from storage to I/O
Receiving packets over a network interface, verifying their CRC, then storing them in memory.
Ex., printing a bank statement

Structure
Four main structural components
CPU
Main Memory
I/O Devices
System Interconnection

Structure
Four main structural components
1. Central Processing Unit (CPU)
Controls the operation of the computer and performs its data processing functions; often simply referred to as the processor.
2. Main Memory
Stores data

Structure
Four main structural components
3. I/O
Moves data between the computer and its external environment.
4. System Interconnection
Some mechanism that provides for communication among CPU, main memory, and I/O. A common example of system interconnection is a system bus, consisting of a number of conducting wires to which all other components attach.

Structure Top Level

[Figure: top-level structure - the computer (Central Processing Unit, Main Memory, Systems Interconnection, Input/Output) with peripherals and communication lines attached]

Structure The CPU

[Figure: the CPU within the computer - Registers, Arithmetic and Logic Unit, Control Unit, and Internal CPU Interconnection, attached to the System Bus alongside Memory and I/O]

Structure The Control Unit

[Figure: the Control Unit within the CPU - Sequencing Logic, Control Unit Registers and Decoders, and Control Memory, attached to the Internal Bus alongside the ALU and Registers]

CHAPTER 2 COMPUTER EVOLUTION AND PERFORMANCE

Brief History of Computers


The First Generation: Vacuum Tubes
ENIAC
o Electronic Numerical Integrator And Computer
o World's first general-purpose electronic digital computer
o Designed and built by John Mauchly and John Eckert
o Weighed 30 tons, occupied 1,500 square feet of floor space, and contained more than 18,000 vacuum tubes

Brief History of Computers


The First Generation: Vacuum Tubes
Von Neumann/Turing
o Stored-program concept: main memory stores both programs and data
o Attributed to John von Neumann, who was an ENIAC designer; Alan Turing developed the idea at around the same time
o Input and output equipment operated by the control unit
o In 1946, von Neumann and his colleagues began the design of a new stored-program computer, referred to as the IAS computer.
o The IAS computer, although not completed until 1952, is the prototype of all subsequent general-purpose computers.

Brief History of Computers


The IAS computer consists of:
o A main memory, which stores both data and instructions o An arithmetic and logic unit (ALU) capable of operating on binary data o A control unit, which interprets the instructions in memory and causes them to be executed o Input and output (I/O) equipment operated by the control unit

Structure of the IAS computer

John von Neumann and the IAS machine, 1952

UNIVAC
o UNIVAC I (Universal Automatic Computer)
o 1947 - Eckert-Mauchly Computer Corporation
o First successful commercial computer
o Intended for both scientific and commercial applications
o Used for the 1950 US Bureau of the Census calculations
o The company became part of the Sperry-Rand Corporation
o Late 1950s - UNIVAC II: faster, with more memory

IBM
o Punched-card processing equipment
o 1953 - the 701: IBM's first stored-program computer, for scientific calculations
o 1955 - the 702: for business applications
o Led to the 700/7000 series

Brief History of Computers


The Second Generation: Transistors
o A transistor is smaller, cheaper, and dissipates less heat than a vacuum tube, but can be used in the same way as a vacuum tube to construct computers
o Invented at Bell Labs in 1947 by William Shockley and colleagues
o IBM 7000 series
o DEC (Digital Equipment Corporation) was founded in 1957
o Produced the PDP-1 in the same year

Brief History of Computers


The Third Generation: Integrated Circuits
o A computer is made up of gates, memory cells, and interconnections
o A single, self-contained transistor is called a discrete component
o All of these can be manufactured either separately (as discrete components) or on the same piece of semiconductor

Brief History of Computers


Generations of Computers
o Vacuum tube - 1946-1957
o Transistor - 1958-1964
o Small scale integration - 1965 on (up to 100 devices on a chip)
o Medium scale integration - to 1971 (100-3,000 devices on a chip)
o Large scale integration - 1971-1977 (3,000-100,000 devices on a chip)
o Very large scale integration - 1978-1991 (100,000-100,000,000 devices on a chip)
o Ultra large scale integration - 1991 on (over 100,000,000 devices on a chip)

Moore's Law
Increased density of components on a chip
Gordon Moore - co-founder of Intel
Observed that the number of transistors on a chip doubled every year
Since the 1970s development has slowed a little: the number of transistors now doubles every 18 months

Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability

Growth in CPU Transistor Count

IBM 360 Series
First planned family of computers:
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e., more terminals)
Increased memory size
Increased cost
Multiplexed switch structure

DEC PDP-8

1964 - first minicomputer (after the miniskirt!)
Did not need an air-conditioned room
Small enough to sit on a lab bench
$16,000, versus $100k+ for an IBM 360
Embedded applications & OEM
Bus structure

DEC-PDP 8 Bus Structure

Semiconductor Memory
1970 - Fairchild
The size of a single core (i.e., 1 bit of magnetic core storage)
Holds 256 bits
Non-destructive read
Much faster than core
Capacity approximately doubled each year

Microprocessors - Intel
1971 - 4004: first microprocessor; all CPU components on a single chip; 4-bit; multiplication by repeated addition, no hardware multiplier!
Followed in 1972 by the 8008: 8-bit; both designed for specific applications
1974 - 8080: Intel's first general-purpose microprocessor

1970s Processors

1980s Processors

1990s Processors

Recent Processors

Designing for Performance


Year by year, the cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically. The basic building blocks for today's computer miracles are virtually the same as those of the IAS computer from over 50 years ago, while on the other hand, the techniques for squeezing the last iota of performance out of the materials at hand have become increasingly sophisticated.

Designing for Performance


Many techniques have been invented to improve performance. Some of the main techniques are the following:
Pipelining
On-board cache; on-board L1 and L2 cache
Branch prediction - The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next. If the processor guesses right most of the time, it can pre-fetch the correct instructions and buffer them so that the processor is kept busy.

Designing for Performance


Data flow analysis - The processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions.
Speculative execution - Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.

Performance Balance
While processor power has raced ahead at breakneck speed, other critical components of the computer have not kept up. The result is a need to look for performance balance: an adjusting of the organization and architecture to compensate for the mismatch among the capabilities of the various components.
Processor speed has increased
Memory capacity has increased

Logic and Memory Performance Gap

While processor speed has grown rapidly, the speed with which data can be transferred between main memory and the processor has lagged badly. The interface between processor and main memory is the most crucial pathway in the entire computer because it is responsible for carrying a constant flow of program instructions and data between memory chips and the processor. If memory or the pathway fails to keep pace with the processor's insistent demands, the processor stalls in a wait state, and valuable processing time is lost.

Solutions
Increase the number of bits retrieved at one time (make DRAM "wider" rather than "deeper")
Change the DRAM interface (cache)
Reduce the frequency of memory access (more complex cache, and cache on chip)
Increase interconnection bandwidth (high-speed buses, hierarchy of buses)

I/O Devices
As computers become faster and more capable, more sophisticated applications are developed that support the use of peripherals with intensive I/O demands.
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations

Typical I/O Device Data Rates

The key is balance among:


Processor components
Main memory
I/O devices
Interconnection structures

The evolution of the Intel X86 Architecture


8080: The world's first general-purpose microprocessor. This was an 8-bit machine, with an 8-bit data path to memory. The 8080 was used in the first personal computer, the Altair.
8086: A far more powerful, 16-bit machine. In addition to a wider data path and larger registers, the 8086 sported an instruction cache, or queue, that pre-fetches a few instructions before they are executed. A variant of this processor, the 8088, was used in IBM's first personal computer, securing the success of Intel. The 8086 is the first appearance of the x86 architecture.

The evolution of the Intel X86 Architecture


80286: This extension of the 8086 enabled addressing a 16-MByte memory instead of just 1 MByte.
80386: Intel's first 32-bit machine, and a major overhaul of the product. With a 32-bit architecture, the 80386 rivaled the complexity and power of minicomputers and mainframes introduced just a few years earlier. This was the first Intel processor to support multitasking, meaning it could run multiple programs at the same time.

The evolution of the Intel X86 Architecture


80486: The 80486 introduced the use of much more sophisticated and powerful cache technology and sophisticated instruction pipelining. The 80486 also offered a built-in math coprocessor, offloading complex math operations from the main CPU.
Pentium: With the Pentium, Intel introduced the use of superscalar techniques, which allow multiple instructions to execute in parallel.

The evolution of the Intel X86 Architecture


Pentium Pro: The Pentium Pro continued the move into superscalar organization begun with the Pentium, with aggressive use of register renaming, branch prediction, data flow analysis, and speculative execution.
Pentium II: The Pentium II incorporated Intel MMX technology, which is designed specifically to process video, audio, and graphics data efficiently.
Pentium III: The Pentium III incorporates additional floating-point instructions to support 3D graphics software.

The evolution of the Intel X86 Architecture


Pentium 4: The Pentium 4 includes additional floating-point and other enhancements for multimedia.
Core: This is the first Intel x86 microprocessor with a dual core, referring to the implementation of two processors on a single chip.
Core 2: The Core 2 extends the architecture to 64 bits. The Core 2 Quad provides four processors on a single chip.

Embedded Systems and ARM


The ARM architecture refers to a processor architecture that has evolved from RISC design principles and is used in embedded systems. The term embedded system refers to the use of electronics and software within a product, as opposed to a general-purpose computer, such as a laptop or desktop system.
Embedded system: a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In many cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car.

Embedded Systems and ARM


Embedded Systems Requirements:
Small to large systems, implying very different cost constraints, and thus different needs for optimization and reuse
Relaxed to very strict requirements, and combinations of different quality requirements, for example with respect to safety, reliability, real-time behavior, flexibility, and legislation
Short to long life times
Different environmental conditions in terms of, for example, radiation, vibration, and humidity

Possible Organization of an Embedded System

ARM
Designed by ARM Inc., Cambridge, England
It's not a processor, but an architecture: ARM licenses it to manufacturers
As of 2007, about 98 percent of the more than one billion mobile phones sold each year used at least one ARM processor
ARM chips are the processors in Apple's popular iPod and iPhone devices
ARM is probably the most widely used embedded processor architecture and indeed the most widely used processor architecture of any kind in the world

ARM Evolution

ARM processors are designed to meet the needs of three system categories:
Embedded real-time systems: Systems for storage, automotive body and power-train, industrial, and networking applications
Application platforms: Devices running open operating systems, including Linux, Palm OS, Symbian OS, and Windows CE, in wireless, consumer entertainment, and digital imaging applications
Secure applications: Smart cards, SIM cards, and payment terminals

Performance Assessment
In evaluating processor hardware and setting requirements for new systems, performance is one of the key parameters to consider, along with cost, size, security, reliability, and in some cases power consumption.
System clock speed: Operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on, are governed by a system clock. The speed of a processor is dictated by the pulse frequency produced by the clock, measured in cycles per second, or Hertz (Hz).

Performance Assessment
Clock signals are generated by a quartz crystal, which generates a constant signal wave while power is applied. This wave is converted into a digital voltage pulse stream that is provided in a constant flow to the processor circuitry. The rate of pulses is known as the clock rate, or clock speed. One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick. The time between pulses is the cycle time.
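
As a quick worked relation (an illustrative addition, not from the slides): the cycle time is the reciprocal of the clock rate,

```latex
\tau = \frac{1}{f}, \qquad f = 1\,\text{GHz} \;\Rightarrow\; \tau = \frac{1}{10^{9}\,\text{Hz}} = 1\,\text{ns}
```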

System Clock

Instruction execution takes place in discrete steps:
Fetch, decode, load and store, arithmetic or logical
These usually require multiple clock cycles per instruction
Pipelining allows simultaneous execution of instructions
Conclusion: clock speed is not the whole story about performance

Instruction execution rate


Define the instruction count, Ic, for a program as the number of machine instructions executed for that program until it runs to completion or for some defined time interval. An important parameter is the average cycles per instruction, CPI, for a program. If all instructions required the same number of clock cycles, then CPI would be a constant value for a processor. However, on any given processor, the number of clock cycles required varies for different types of instructions, such as load, store, branch, and so on.

Instruction execution rate


Let CPIi be the number of cycles required for instruction type i, and Ii the number of executed instructions of type i for a given program. Then we can calculate an overall CPI as follows:

CPI = ( Σi CPIi × Ii ) / Ic

We can refine this formulation by recognizing that during the execution of an instruction, part of the work is done by the processor, and part of the time a word is being transferred to or from memory. In this latter case, the time to transfer depends on the memory cycle time, which may be greater than the processor cycle time. We can rewrite the preceding equation for the total execution time T (with τ the processor cycle time) as

T = Ic × [ p + (m × k) ] × τ

where:
p - number of processor cycles needed to decode and execute the instruction
m - number of memory references needed
k - ratio between memory cycle time and processor cycle time

Instruction execution rate


Millions of instructions per second (MIPS)
Millions of floating-point operations per second (MFLOPS)
Heavily dependent on: instruction set, compiler design, processor implementation, cache & memory hierarchy
We can express the MIPS rate in terms of the clock rate f and CPI as follows:

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

For example, consider the execution of a program which results in the execution of 2 million instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the CPI for each instruction type are given below, based on the result of a program trace experiment:

Instruction type                        CPI    Instruction mix (%)
Arithmetic and logic                     1          60
Load/store with cache hit                2          18
Branch                                   4          12
Memory reference with cache miss         8          10

The average CPI when the program is executed on a uniprocessor with the above trace results is CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24. The corresponding MIPS rate is (400 × 10^6) / (2.24 × 10^6) ≈ 178. Floating-point performance is expressed as millions of floating-point operations per second (MFLOPS), defined as follows:

MFLOPS rate = (number of executed floating-point operations in a program) / (execution time × 10^6)
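
As a cross-check of the arithmetic above, a minimal Python sketch (variable names are illustrative):

```python
# Recompute CPI and MIPS from the program-trace example above.
mix = [(1, 0.60),   # arithmetic and logic
       (2, 0.18),   # load/store with cache hit
       (4, 0.12),   # branch
       (8, 0.10)]   # memory reference with cache miss

clock_hz = 400e6                      # 400-MHz processor
cpi = sum(c * frac for c, frac in mix)
mips = clock_hz / (cpi * 1e6)         # MIPS = f / (CPI * 10^6)

print(f"CPI  = {cpi:.2f}")            # CPI  = 2.24
print(f"MIPS = {mips:.1f}")           # MIPS = 178.6 (the slides round to 178)
```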

Benchmarks
Programs designed to test performance
A benchmark suite is a collection of programs, defined in a high-level language, that together attempt to provide a representative test of a computer in a particular application or system programming area.
The System Performance Evaluation Corporation (SPEC) defines and maintains the best-known collection of benchmark suites.

Averaging Results
To obtain a reliable comparison of the performance of various computers, it is preferable to run a number of different benchmark programs on each machine and then average the results. For example, for m different benchmark programs, a simple arithmetic mean can be calculated as follows:

RA = (1/m) × Σi Ri

where Ri is the high-level language instruction execution rate for the ith benchmark program.
An alternative is the harmonic mean:

RH = m / ( Σi (1/Ri) )
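
A small Python sketch comparing the two means on hypothetical rates (the numbers are made up for illustration):

```python
# Hypothetical benchmark rates R_i (e.g., MIPS) for m = 4 programs.
rates = [100.0, 150.0, 250.0, 400.0]
m = len(rates)

arithmetic_mean = sum(rates) / m                  # R_A = (1/m) * sum(R_i)
harmonic_mean = m / sum(1.0 / r for r in rates)   # R_H = m / sum(1/R_i)

# The harmonic mean weights the slower programs more heavily, so R_H <= R_A.
print(arithmetic_mean, harmonic_mean)             # 225.0 vs ~172.7
```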

Amdahl's Law
Gene Amdahl
Potential speed-up of a program using multiple processors
Concluded that:
Code needs to be parallelizable
Speedup is bound, giving diminishing returns for more processors
Task dependent:
Servers gain by maintaining multiple connections on multiple processors
Databases can be split into parallel tasks

Let f be the fraction of the program that can be parallelized, and let T be the total execution time of the program using a single processor. Then the speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is as follows:

Speedup = T / ( T(1 − f) + T × f/N ) = 1 / ( (1 − f) + f/N )

Two important conclusions can be drawn:
1. When f is small, the use of parallel processors has little effect.
2. As N approaches infinity, speedup is bound by 1/(1 − f), so that there are diminishing returns for using more processors.

Speedup

Suppose that a feature of the system is used during execution a fraction of the time f before enhancement, and that the speedup of that feature after enhancement is SUf. Then the overall speedup of the system is

Speedup = 1 / ( (1 − f) + f/SUf )

For example, suppose that a task makes extensive use of floating-point operations, with 40% of the time consumed by floating-point operations. With a new hardware design, the floating-point module is sped up by a factor of K. Then the overall speedup is

Speedup = 1 / ( 0.6 + 0.4/K )

Thus, independent of K, the maximum speedup is 1/0.6 ≈ 1.67.
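
A short Python sketch of both uses of the formula (the function name is illustrative):

```python
# Amdahl's Law: overall speedup when a fraction f of the work is
# accelerated by a factor su (for N parallel processors, su = N).
def amdahl_speedup(f: float, su: float) -> float:
    return 1.0 / ((1.0 - f) + f / su)

# Floating-point example from the slides: f = 0.4, speedup factor K.
for k in (2, 4, 16, 1_000_000):
    print(k, round(amdahl_speedup(0.4, k), 3))
# Prints 1.25, 1.429, 1.6, 1.667 - approaching 1/0.6 ≈ 1.67 as K grows.
```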

CHAPTER 3 TOP-LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION

Computer Components
The Control Unit and the Arithmetic and Logic Unit constitute the Central Processing Unit: an instruction interpreter and a module of general-purpose arithmetic and logic functions.
Data and instructions must be put into the system; taken together, the components that do this are referred to as I/O components.
Memory (main memory) is a place to store temporarily both instructions and data.

Top-Level View Components

Top-Level View
The CPU exchanges data with memory. For this purpose, it typically makes use of two internal (to the CPU) registers: a memory address register (MAR), which specifies the address in memory for the next read or write, and a memory buffer register (MBR), which contains the data to be written into memory or receives the data read from memory. Similarly, an I/O address register (I/OAR) specifies a particular I/O device, and an I/O buffer register (I/OBR) is used for the exchange of data between an I/O module and the CPU.
An I/O module transfers data from external devices to CPU and memory, and vice versa. It contains internal buffers for temporarily holding these data until they can be sent on.

Computer Function
The basic function performed by a computer is execution of a program. The processor does the actual work by executing instructions specified in the program. Instruction processing consists of two steps:
The processor reads (fetches) instructions from memory one at a time and executes each instruction.
Program execution consists of repeating the process of instruction fetch and instruction execution.

Instruction Fetch and Execute


Fetch Cycle
Program Counter (PC) holds the address of the next instruction to fetch
Processor fetches the instruction from the memory location pointed to by the PC
Increment PC (unless told otherwise)
Instruction loaded into Instruction Register (IR)
Processor interprets the instruction and performs required actions
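
A minimal Python sketch of the fetch cycle (register names follow the text; the memory contents are illustrative values only):

```python
# PC -> fetch from memory -> IR, then increment PC.
memory = {0x300: 0x1940, 0x301: 0x5941}   # tiny illustrative program
PC = 0x300

def fetch():
    global PC
    IR = memory[PC]    # instruction at the address held in the PC
    PC += 1            # increment PC (unless told otherwise)
    return IR

IR = fetch()           # IR == 0x1940, PC == 0x301
```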

Computer Function
Instruction Cycle
The processing required for a single instruction is called an instruction cycle. Its two steps are referred to as the fetch cycle and the execute cycle.
Program execution halts only if the machine is turned off, some sort of unrecoverable error occurs, or a program instruction that halts the computer is encountered.

Instruction Fetch and Execute


Execute Cycle
Processor-memory
Data transfer between CPU and main memory
Processor-I/O
Data transfer between CPU and I/O module
Data processing
Some arithmetic or logical operation on data
Control
Alteration of sequence of operations, e.g., jump
Combination of the above

Example of a Program Execution

Instruction Fetch and Execute


In this example, three instruction cycles, each consisting of a fetch cycle and an execute cycle, are needed to add the contents of location 940 to the contents of location 941. With a more complex set of instructions, fewer cycles would be needed. Some older processors, for example, included instructions that contain more than one memory address. Thus the execution cycle for a particular instruction on such a processor could involve more than one reference to memory. Also, instead of memory references, an instruction may specify an I/O operation.
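
To make the three cycles concrete, here is a hedged Python simulation of the example machine, assuming the textbook's encoding (16-bit words with the opcode in the top 4 bits; 1 = load AC from memory, 5 = add memory to AC, 2 = store AC to memory; addresses in hex):

```python
# Three fetch/execute cycles adding M[940] to M[941].
memory = {0x300: 0x1940, 0x301: 0x5941, 0x302: 0x2941,  # the program
          0x940: 0x0003, 0x941: 0x0002}                 # the operands
PC, AC = 0x300, 0

for _ in range(3):                       # three instruction cycles
    IR = memory[PC]; PC += 1             # fetch cycle
    opcode, addr = IR >> 12, IR & 0xFFF  # decode
    if opcode == 0x1:   AC = memory[addr]       # load AC from memory
    elif opcode == 0x5: AC += memory[addr]      # add memory to AC
    elif opcode == 0x2: memory[addr] = AC       # store AC to memory

assert memory[0x941] == 5                # 3 + 2, as in the example
```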

Instruction Cycle State Diagram

Instruction Cycle State Diagram


States in the upper part of the diagram involve an exchange between the processor and either memory or an I/O module. States in the lower part of the diagram involve only internal processor operations. The OAC state appears twice, because an instruction may involve a read, a write, or both. However, the action performed during that state is fundamentally the same in both cases, and so only a single state identifier is needed.

Instruction Cycle State


The states can be described as follows:
Instruction address calculation (IAC): Determine the address of the next instruction to be executed.
Instruction fetch (IF): Read the instruction from its memory location into the processor.
Instruction operation decoding (IOD): Analyze the instruction to determine the type of operation to be performed and the operand(s) to be used.
Operand address calculation (OAC): If the operation involves reference to an operand in memory or available via I/O, determine the address of the operand.

Instruction Cycle State


The states can be described as follows:
Operand fetch (OF): Fetch the operand from memory or read it in from I/O.
Data operation (DO): Perform the operation indicated in the instruction.
Operand store (OS): Write the result into memory or out to I/O.

Interrupts
Mechanism by which other modules (e.g., I/O) may interrupt the normal sequence of processing
Program
e.g., overflow, division by zero

Timer
Generated by an internal processor timer

I/O
from I/O controller

Hardware failure
e.g. memory parity error

Program Flow Control

Interrupt Cycle
Added to the instruction cycle
Processor checks for an interrupt
Indicated by an interrupt signal
If no interrupt, fetch the next instruction
If an interrupt is pending:
Suspend execution of the current program
Save context
Set PC to the start address of the interrupt handler routine
Process the interrupt
Restore context and continue the interrupted program

Transfer of Control via Interrupts

Transfer of Control via Interrupts


From the point of view of the user program, an interrupt is just that: an interruption of the normal sequence of execution. When the interrupt processing is completed, execution resumes. Thus, the user program does not have to contain any special code to accommodate interrupts; the processor and the operating system are responsible for suspending the user program and then resuming it at the same point.


Instruction Cycle with Interrupts

Instruction Cycle with Interrupts


The processor now proceeds to the fetch cycle and fetches the first instruction in the interrupt handler program, which will service the interrupt. The interrupt handler program is generally part of the operating system. Typically, this program determines the nature of the interrupt and performs whatever actions are needed. In the example we have been using, the handler determines which I/O module generated the interrupt and may branch to a program that will write more data out to that I/O module. When the interrupt handler routine is completed, the processor can resume execution of the user program at the point of interruption.

Instruction Cycle with Interrupts


In the interrupt cycle, the processor checks to see if any interrupts have occurred, indicated by the presence of an interrupt signal. If no interrupts are pending, the processor proceeds to the fetch cycle and fetches the next instruction of the current program. If an interrupt is pending, the processor does the following:
It suspends execution of the current program being executed and saves its context. This means saving the address of the next instruction to be executed (the current contents of the program counter) and any other data relevant to the processor's current activity.
It sets the program counter to the starting address of an interrupt handler routine.
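
A hedged Python sketch of this check at the end of each cycle (the handler address and the shape of the saved context are illustrative assumptions):

```python
from collections import deque

HANDLER = 0x0010       # assumed start address of the interrupt handler
pending = deque()      # interrupt signals raised by I/O modules, timers, etc.
saved_context = []     # here the saved "context" is just the return PC
PC = 0x0300

def check_interrupts():
    global PC
    if pending:                      # interrupt pending?
        pending.popleft()
        saved_context.append(PC)     # save address of next instruction
        PC = HANDLER                 # set PC to start of handler routine

def return_from_interrupt():
    global PC
    PC = saved_context.pop()         # restore context, resume user program

pending.append("io_done")
check_interrupts()                   # PC is now 0x0010 (the handler)
return_from_interrupt()              # PC is back at 0x0300
```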

Program Timing Short I/O Wait

Program Timing Long I/O Wait

Instruction Cycle State Diagram w/ Interrupts

Multiple Interrupts
Disable Interrupts
Processor will ignore further interrupts whilst processing one interrupt
Interrupts remain pending and are checked after the first interrupt has been processed
Interrupts handled in sequence as they occur

Define Priorities
Low priority interrupts can be interrupted by higher priority interrupts

When higher priority interrupt has been processed, processor returns to previous interrupt

Multiple Interrupts - Nested

Multiple Interrupts - Sequential

Interconnection Structures
The collection of paths connecting the various modules is called the interconnection structure. The design of this structure will depend on the exchanges that must be made among modules.

Interconnection Structures
Types of exchanges that are needed, indicated by the major forms of input and output for each module type:
Memory: Typically, a memory module will consist of N words of equal length. Each word is assigned a unique numerical address (0, 1, ..., N − 1). A word of data can be read from or written into the memory.
I/O module: From an internal (to the computer system) point of view, I/O is functionally similar to memory. There are two operations, read and write. Further, an I/O module may control more than one external device. We can refer to each of the interfaces to an external device as a port and give each a unique address (e.g., 0, 1, ..., M − 1).

Interconnection Structures
- Processor: The processor reads in instructions and data, writes out data after processing, and uses control signals to control the overall operation of the system. It also receives interrupt signals.

Computer Module

Memory Connection
Receives and sends data
Receives addresses (of locations)
Receives control signals:
Read
Write
Timing

Input / Output Connection


Similar to memory from the computer's viewpoint
Output
Receive data from computer
Send data to peripheral
Input
Receive data from peripheral
Send data to computer

Input / Output Connection


Receive control signals from computer
Send control signals to peripherals
Ex. Spin disk

Receive addresses from computer

Send interrupt signals (control)

CPU Connection
Reads instructions and data
Writes out data (after processing)

Sends control signals to other units


Receives (& acts on) interrupts

Bus Interconnection
A bus is a communication pathway connecting two or more devices Multiple devices connect to the bus, and a signal transmitted by any one device is available for reception by all other devices attached to the bus. A bus that connects major computer components (processor, memory, I/O) is called a system bus.

Bus Structure
On any bus the lines can be classified into three functional groups:
The data lines provide a path for moving data among system modules. These lines, collectively, are called the data bus. The address lines are used to designate the source or destination of the data on the data bus. The control lines are used to control the access to and the use of the data and address lines.

Bus Structure
The operation of the bus is as follows. If one module wishes to send data to another, it must do two things: (1) obtain the use of the bus, and (2) transfer data via the bus. If one module wishes to request data from another module, it must (1) obtain the use of the bus, and (2) transfer a request to the other module over the appropriate control and address lines. It must then wait for that second module to send the data.

Bus Structure

Typical Physical Realization of a Bus Architecture

Traditional ISA with Cache

High Performance Bus

Bus Types
Dedicated
Separate data & address lines

Multiplexed
Shared lines
Address valid or data valid control line
Advantage: fewer lines
Disadvantages: more complex control; reduced ultimate performance

Bus Arbitration
More than one module controlling the bus
Ex. CPU and DMA controller

Only one module may control the bus at one time
Arbitration may be centralised or distributed

Centralized and Distributed Arbitration


Centralised
Single hardware device controlling bus access: bus controller or arbiter
May be part of the CPU or separate
Distributed
Each module may claim the bus
Control logic on all modules

Timing
Co-ordination of events on the bus
Synchronous
Events determined by clock signals
Control bus includes a clock line
A single 1-0 transition is a bus cycle
All devices can read the clock line
Usually sync on the leading edge
Usually a single cycle for an event

Synchronous Timing Diagram

Asynchronous Timing Read Diagram

Asynchronous Timing Write Diagram

PCI Bus
Peripheral Component Interconnect
Intel released it to the public domain
32 or 64 bit

PCI Bus Lines (required)


System lines
Including clock and reset
Address and data
32 time-multiplexed lines for address/data
Interrupt & validate lines
Interface control
Arbitration
Not shared; direct connection to PCI bus arbiter
Error lines

PCI Bus Lines (optional)


Interrupt lines
Not shared
Cache support
64-bit bus extension
Additional 32 lines, time multiplexed
2 lines to enable devices to agree to use 64-bit transfer
JTAG/boundary scan
For testing procedures

PCI Commands
Transaction between initiator (master) and target
Master claims bus
Determine type of transaction
Ex., I/O read/write
Address phase
One or more data phases

PCI Read Timing Diagram

PCI Bus Arbiter

PCI Bus Arbitration

CHAPTER 4 CACHE MEMORY

Terminology
Capacity: the amount of information that can be contained in a memory unit - usually in terms of words or bytes
Word: the natural unit of organization in the memory, typically the number of bits used to represent a number
Addressable unit: the fundamental data element size that can be addressed in the memory - typically either the word size or individual bytes
Unit of transfer: the number of data elements transferred at a time - usually bits in main memory and blocks in secondary memory
Transfer rate: the rate at which data is transferred to/from the memory device

Terminology
Access time: for RAM, the time to address the unit and perform the transfer; for non-random-access memory, the time to position the R/W head over the desired location
Memory cycle time: access time plus any other time required before a second access can be started
Access technique: how memory contents are accessed

Memory Hierarchy
Major design objective of any memory system:
To provide adequate storage capacity, at an acceptable level of performance, at a reasonable cost
Four interrelated ways to meet this goal:
Use a hierarchy of storage devices
Develop automatic space-allocation methods for efficient use of the memory
Through the use of virtual memory techniques, free the user from memory management tasks
Design the memory and its related interconnection structure so that the processor can operate at or near its maximum rate

Memory Hierarchy
Basis of the memory hierarchy
Registers internal to the CPU for temporary data storage (small in number but very fast)
External storage for data and programs (relatively large and fast)
External permanent storage (much larger and much slower)

Characteristics of the memory hierarchy
Consists of distinct levels of memory components
Each level characterized by its size, access time, and cost per bit
Each increasing level in the hierarchy consists of modules of larger capacity, slower access time, and lower cost/bit

Goal of the memory hierarchy
Try to match the processor speed with the rate of information transfer from the lowest element in the hierarchy

Memory Hierarchy Diagram

Hierarchy List
Registers
L1 cache
L2 cache
Main memory
Disk cache
Disk
Optical
Tape

Cache Memory
Cache memory is a critical component of the memory hierarchy
Compared to the size of main memory, cache is relatively small
Operates at or near the speed of the processor
Very expensive compared to main memory
Cache contains copies of sections of main memory

Cache Memory
Small amount of fast memory Sits between normal main memory and CPU May be located on CPU chip or module

Cache and Main Memory

Cache/Main Memory Structure

Cache Operation - Overview


CPU requests contents of a memory location
Check cache for this data
If present, get from cache (fast)
If not present, read the required block from main memory into cache
Then deliver from cache to CPU
Cache includes tags to identify which block of main memory is in each cache slot

Locality of Reference
The cache memory works because of locality of reference
Memory references made by the processor, for both instructions and data, tend to cluster together
Instruction loops, subroutines Data arrays, tables

Keep these clusters in high speed memory to reduce the average delay in accessing data Over time, the clusters being referenced will change -- memory management must deal with this

Typical Cache Organization

Cache Design
Addressing
Size
Mapping function
Replacement algorithm
Write policy
Block size
Number of caches

Cache Addressing
Where does the cache sit?
Between processor and virtual memory management unit (MMU)
Between MMU and main memory

Logical cache (virtual cache) stores data using virtual addresses
Processor accesses the cache directly, not through the MMU
Cache access is faster, before MMU address translation
Virtual addresses use the same address space for different applications
Must flush the cache on each context switch

Physical cache stores data using main memory physical addresses

Mapping Function
Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. The choice of the mapping function dictates how the cache is organized. Three techniques are used: direct, associative, and set-associative.

Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line.
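
A minimal Python sketch of the direct-mapped address breakdown (the block and cache sizes are illustrative, not from the slides): cache line i = block number j modulo the number of lines m.

```python
BLOCK_SIZE = 4        # words per block
NUM_LINES  = 16384    # cache lines m; the remaining high bits form the tag

def direct_map(address: int):
    word = address % BLOCK_SIZE          # offset within the block
    block = address // BLOCK_SIZE        # main memory block number j
    line = block % NUM_LINES             # i = j mod m: the only possible line
    tag = block // NUM_LINES             # distinguishes blocks sharing a line
    return tag, line, word

# Two blocks whose numbers differ by NUM_LINES collide on the same line:
a = direct_map(0x0000)
b = direct_map(BLOCK_SIZE * NUM_LINES)
assert a[1] == b[1] and a[0] != b[0]
```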

Direct Mapping

Direct Mapping

Set Associative Mapping


Set-associative mapping is a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.

Set Associative Mapping

Set Associative Mapping

Fully Associative Mapping


Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache

Fully Associative Mapping

Write Policy
Must not overwrite a cache block unless main memory is up to date

Multiple CPUs may have individual caches


I/O may address main memory directly

Write Through
All writes go to main memory as well as cache
Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
Lots of traffic; slows down writes
Remember bogus write-through caches!

Write Back
Updates are initially made in cache only
The update bit for the cache slot is set when an update occurs
If a block is to be replaced, write it to main memory only if the update bit is set
Other caches can get out of sync
I/O must access main memory through the cache
N.B. 15% of memory references are writes
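
A toy Python sketch contrasting the two policies with a single cache line and an update (dirty) bit (all structures are illustrative):

```python
main_memory = {0x10: 0}
line = {"addr": 0x10, "data": 0, "dirty": False}

def write(addr, value, policy):
    line["addr"], line["data"] = addr, value
    if policy == "write-through":
        main_memory[addr] = value    # every write also goes to memory
    else:                            # write-back
        line["dirty"] = True         # only set the update (dirty) bit

def evict():
    if line["dirty"]:                # write back only if the bit is set
        main_memory[line["addr"]] = line["data"]
        line["dirty"] = False

write(0x10, 42, "write-back")
assert main_memory[0x10] == 0        # memory is stale until replacement
evict()
assert main_memory[0x10] == 42
```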
