
William Stallings Computer Organization and Architecture 8th Edition

CHAPTER 1 INTRODUCTION

Architecture and Organization


Architecture is those attributes visible to the programmer
Instruction set, number of bits used for data representation, I/O mechanisms, addressing techniques. e.g. Is there a multiply instruction?

Organization is how features are implemented


Control signals, interfaces, memory technology. e.g. Is there a hardware multiply unit or is it done by repeated addition?
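
To make the distinction concrete, here is a minimal sketch (purely illustrative, not from the slides): two organizations that satisfy the same architectural contract of "there is a multiply operation", one modelling a hardware multiply unit and one using repeated addition.

```python
# Same architecture (a multiply operation exists), two organizations.

def multiply_hw(a: int, b: int) -> int:
    """Organization A: a hardware multiply unit (modelled by Python's *)."""
    return a * b

def multiply_repeated_add(a: int, b: int) -> int:
    """Organization B: the same operation done by repeated addition."""
    result = 0
    for _ in range(abs(b)):
        result += a
    return result if b >= 0 else -result

# The programmer sees no difference; only speed differs.
assert multiply_hw(7, -6) == multiply_repeated_add(7, -6) == -42
```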

Architecture and Organization


Family Concept
All members of the Intel x86 family share the same basic architecture
The IBM System/370 family shares the same basic architecture
This gives code compatibility (at least backwards)

Organization differs between different versions

Structure and Function


Computer
Complex system: How can we design/describe it?

Hierarchical system:
A set of interrelated subsystems, each subsystem hierarchic in structure until some lowest level of elementary subsystems is reached

At each level of the system, the designer is concerned with structure and function.

Structure and Function


Structure is the way in which components relate to each other
Function is the operation of individual components as part of the structure

Function
General computer functions:
Data processing
Data storage
Data movement
Control

Operations
Data movement
Ex., keyboard to screen

Functional View of the Computer

Operations
Storage
Ex., Internet download to disk; playing an mp3 file stored in memory to earphones attached to the same PC.

Operations
Processing from/to storage
Any number-crunching application that takes data from memory and stores the result back in memory. Ex., updating a bank statement

Operations
Processing from storage to I/O
Receiving packets over a network interface, verifying their CRC, then storing them in memory.
Ex., printing a bank statement

Structure
Four main structural components
CPU
Main Memory
I/O Devices
System Interconnection

Structure
Four main structural components
1. Central Processing Unit (CPU)
Controls the operation of the computer and performs its data processing functions; often simply referred to as the processor.
2. Main Memory
Stores data

Structure
Four main structural components
3. I/O
Moves data between the computer and its external environment.
4. System Interconnection
Some mechanism that provides for communication among CPU, main memory, and I/O. A common example of system interconnection is a system bus, consisting of a number of conducting wires to which all other components attach.

Structure Top Level

[Figure: top-level structure - the computer (Central Processing Unit, Main Memory, Systems Interconnection, Input/Output) with peripherals and communication lines attached]

Structure The CPU

[Figure: the CPU within the computer - Registers, Arithmetic and Logic Unit, Control Unit, and Internal CPU Interconnection, attached to the System Bus alongside Memory and I/O]

Structure The Control Unit

[Figure: the Control Unit within the CPU - Sequencing Logic, Control Unit Registers and Decoders, and Control Memory, attached to the Internal Bus alongside the ALU and Registers]

CHAPTER 2 COMPUTER EVOLUTION AND PERFORMANCE

Brief History of Computers


The First Generation: Vacuum Tubes
ENIAC
o Electronic Numerical Integrator And Computer
o World's first general-purpose electronic digital computer
o Designed and built by John Mauchly and John Eckert
o Weighed 30 tons, occupied 1,500 square feet of floor space, and contained more than 18,000 vacuum tubes

Brief History of Computers


The First Generation: Vacuum Tubes
Von Neumann/Turing
o Stored-program concept: main memory stores both programs and data
o Attributed to John von Neumann, who was an ENIAC designer; Alan Turing developed the idea at around the same time
o Input and output equipment operated by the control unit
o In 1946, von Neumann and his colleagues began the design of a new stored-program computer, referred to as the IAS computer.
o The IAS computer, although not completed until 1952, is the prototype of all subsequent general-purpose computers.

Brief History of Computers


The IAS computer consists of:
o A main memory, which stores both data and instructions o An arithmetic and logic unit (ALU) capable of operating on binary data o A control unit, which interprets the instructions in memory and causes them to be executed o Input and output (I/O) equipment operated by the control unit

Structure of the IAS computer

John von Neumann and the IAS machine, 1952

UNIVAC
o UNIVAC I (Universal Automatic Computer)
o 1947 - Eckert-Mauchly Computer Corporation
o First successful commercial computer
o Intended for both scientific and commercial applications
o Used for the 1950 US Bureau of the Census calculations
o The company became part of the Sperry-Rand Corporation
o Late 1950s - UNIVAC II: faster, with more memory

IBM
o Punched-card processing equipment
o 1953 - the 701: IBM's first stored-program computer, for scientific calculations
o 1955 - the 702: for business applications
o Led to the 700/7000 series

Brief History of Computers


The Second Generation: Transistors
o A transistor is smaller, cheaper, and dissipates less heat than a vacuum tube, but can be used in the same way as a vacuum tube to construct computers
o Invented at Bell Labs in 1947 by William Shockley and colleagues
o IBM 7000 series
o DEC (Digital Equipment Corporation) was founded in 1957
o Produced the PDP-1 in the same year

Brief History of Computers


The Third Generation: Integrated Circuits
o A computer is made up of gates, memory cells, and interconnections
o A single, self-contained transistor is called a discrete component
o All of these can be manufactured either separately (as discrete components) or on the same piece of semiconductor

Brief History of Computers


Generations of Computers
o Vacuum tube - 1946-1957
o Transistor - 1958-1964
o Small scale integration - 1965 on (up to 100 devices on a chip)
o Medium scale integration - to 1971 (100-3,000 devices on a chip)
o Large scale integration - 1971-1977 (3,000-100,000 devices on a chip)
o Very large scale integration - 1978-1991 (100,000-100,000,000 devices on a chip)
o Ultra large scale integration - 1991 on (over 100,000,000 devices on a chip)

Moore's Law
Increased density of components on a chip
Gordon Moore - co-founder of Intel
Observed that the number of transistors on a chip doubled every year
Since the 1970s development has slowed a little: the number of transistors now doubles every 18 months

Cost of a chip has remained almost unchanged
Higher packing density means shorter electrical paths, giving higher performance
Smaller size gives increased flexibility
Reduced power and cooling requirements
Fewer interconnections increases reliability

Growth in CPU Transistor Count

IBM 360 Series
First planned family of computers:
Similar or identical O/S
Increasing speed
Increasing number of I/O ports (i.e., more terminals)
Increased memory size
Increased cost
Multiplexed switch structure

DEC PDP-8

1964 - first minicomputer (after the miniskirt!)
Did not need an air-conditioned room
Small enough to sit on a lab bench
$16,000, versus $100k+ for an IBM 360
Embedded applications & OEM
Bus structure

DEC-PDP 8 Bus Structure

Semiconductor Memory
1970 - Fairchild
The size of a single core (i.e., 1 bit of magnetic core storage)
Holds 256 bits
Non-destructive read
Much faster than core
Capacity approximately doubled each year

Microprocessors - Intel
1971 - 4004: first microprocessor; all CPU components on a single chip; 4-bit; multiplication by repeated addition, no hardware multiplier!
Followed in 1972 by the 8008: 8-bit; both designed for specific applications
1974 - 8080: Intel's first general-purpose microprocessor

1970s Processors

1980s Processors

1990s Processors

Recent Processors

Designing for Performance


Year by year, the cost of computer systems continues to drop dramatically, while the performance and capacity of those systems continue to rise equally dramatically. The basic building blocks for today's computer miracles are virtually the same as those of the IAS computer from over 50 years ago, while on the other hand, the techniques for squeezing the last iota of performance out of the materials at hand have become increasingly sophisticated.

Designing for Performance


Many techniques have been invented to improve performance. Some of the main techniques are the following:
Pipelining
On-board cache; on-board L1 and L2 cache
Branch prediction - The processor looks ahead in the instruction code fetched from memory and predicts which branches, or groups of instructions, are likely to be processed next. If the processor guesses right most of the time, it can pre-fetch the correct instructions and buffer them so that the processor is kept busy.

Designing for Performance


Data flow analysis - The processor analyzes which instructions are dependent on each other's results, or data, to create an optimized schedule of instructions.
Speculative execution - Using branch prediction and data flow analysis, some processors speculatively execute instructions ahead of their actual appearance in the program execution, holding the results in temporary locations. This enables the processor to keep its execution engines as busy as possible by executing instructions that are likely to be needed.

Performance Balance
While processor power has raced ahead at breakneck speed, other critical components of the computer have not kept up. The result is a need to look for performance balance: an adjusting of the organization and architecture to compensate for the mismatch among the capabilities of the various components.
Processor speed has increased
Memory capacity has increased

Logic and Memory Performance Gap

While processor speed has grown rapidly, the speed with which data can be transferred between main memory and the processor has lagged badly. The interface between processor and main memory is the most crucial pathway in the entire computer because it is responsible for carrying a constant flow of program instructions and data between memory chips and the processor. If memory or the pathway fails to keep pace with the processor's insistent demands, the processor stalls in a wait state, and valuable processing time is lost.

Solutions
Increase the number of bits retrieved at one time (make DRAM "wider" rather than "deeper")
Change the DRAM interface (cache)
Reduce the frequency of memory access (more complex cache, and cache on chip)
Increase interconnection bandwidth (high-speed buses, hierarchy of buses)

I/O Devices
As computers become faster and more capable, more sophisticated applications are developed that support the use of peripherals with intensive I/O demands.
Solutions:
Caching
Buffering
Higher-speed interconnection buses
More elaborate bus structures
Multiple-processor configurations

Typical I/O Device Data Rates

The key is balance among:


Processor components
Main memory
I/O devices
Interconnection structures

The evolution of the Intel X86 Architecture


8080: The world's first general-purpose microprocessor. This was an 8-bit machine, with an 8-bit data path to memory. The 8080 was used in the first personal computer, the Altair.
8086: A far more powerful, 16-bit machine. In addition to a wider data path and larger registers, the 8086 sported an instruction cache, or queue, that pre-fetches a few instructions before they are executed. A variant of this processor, the 8088, was used in IBM's first personal computer, securing the success of Intel. The 8086 is the first appearance of the x86 architecture.

The evolution of the Intel X86 Architecture


80286: This extension of the 8086 enabled addressing a 16-MByte memory instead of just 1 MByte.
80386: Intel's first 32-bit machine, and a major overhaul of the product. With a 32-bit architecture, the 80386 rivaled the complexity and power of minicomputers and mainframes introduced just a few years earlier. This was the first Intel processor to support multitasking, meaning it could run multiple programs at the same time.

The evolution of the Intel X86 Architecture


80486: The 80486 introduced the use of much more sophisticated and powerful cache technology and sophisticated instruction pipelining. The 80486 also offered a built-in math coprocessor, offloading complex math operations from the main CPU.
Pentium: With the Pentium, Intel introduced the use of superscalar techniques, which allow multiple instructions to execute in parallel.

The evolution of the Intel X86 Architecture


Pentium Pro: The Pentium Pro continued the move into superscalar organization begun with the Pentium, with aggressive use of register renaming, branch prediction, data flow analysis, and speculative execution.
Pentium II: The Pentium II incorporated Intel MMX technology, which is designed specifically to process video, audio, and graphics data efficiently.
Pentium III: The Pentium III incorporates additional floating-point instructions to support 3D graphics software.

The evolution of the Intel X86 Architecture


Pentium 4: The Pentium 4 includes additional floating-point and other enhancements for multimedia.
Core: This is the first Intel x86 microprocessor with a dual core, referring to the implementation of two processors on a single chip.
Core 2: The Core 2 extends the architecture to 64 bits. The Core 2 Quad provides four processors on a single chip.

Embedded Systems and ARM


The ARM architecture refers to a processor architecture that has evolved from RISC design principles and is used in embedded systems. The term embedded system refers to the use of electronics and software within a product, as opposed to a general-purpose computer, such as a laptop or desktop system.
Embedded system: a combination of computer hardware and software, and perhaps additional mechanical or other parts, designed to perform a dedicated function. In many cases, embedded systems are part of a larger system or product, as in the case of an antilock braking system in a car.

Embedded Systems and ARM


Embedded Systems Requirements:
Small to large systems, implying very different cost constraints, and thus different needs for optimization and reuse
Relaxed to very strict requirements, and combinations of different quality requirements, for example with respect to safety, reliability, real-time behavior, flexibility, and legislation
Short to long life times
Different environmental conditions in terms of, for example, radiation, vibration, and humidity

Possible Organization of an Embedded System

ARM
Designed by ARM Inc., Cambridge, England
It's not a processor, but an architecture: ARM licenses it to manufacturers
As of 2007, about 98 percent of the more than one billion mobile phones sold each year used at least one ARM processor
ARM chips are the processors in Apple's popular iPod and iPhone devices
ARM is probably the most widely used embedded processor architecture and indeed the most widely used processor architecture of any kind in the world

ARM Evolution

ARM processors are designed to meet the needs of three system categories:
Embedded real-time systems: Systems for storage, automotive body and power-train, industrial, and networking applications
Application platforms: Devices running open operating systems, including Linux, Palm OS, Symbian OS, and Windows CE, in wireless, consumer entertainment, and digital imaging applications
Secure applications: Smart cards, SIM cards, and payment terminals

Performance Assessment
In evaluating processor hardware and setting requirements for new systems, performance is one of the key parameters to consider, along with cost, size, security, reliability, and in some cases power consumption.
System clock speed: Operations performed by a processor, such as fetching an instruction, decoding the instruction, performing an arithmetic operation, and so on, are governed by a system clock. The speed of a processor is dictated by the pulse frequency produced by the clock, measured in cycles per second, or Hertz (Hz).

Performance Assessment
Clock signals are generated by a quartz crystal, which generates a constant signal wave while power is applied. This wave is converted into a digital voltage pulse stream that is provided in a constant flow to the processor circuitry. The rate of pulses is known as the clock rate, or clock speed. One increment, or pulse, of the clock is referred to as a clock cycle, or a clock tick. The time between pulses is the cycle time.
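
As a quick worked relation (an illustrative addition, not from the slides): the cycle time is the reciprocal of the clock rate,

```latex
\tau = \frac{1}{f}, \qquad f = 1\,\text{GHz} \;\Rightarrow\; \tau = \frac{1}{10^{9}\,\text{Hz}} = 1\,\text{ns}
```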

System Clock

Instruction execution takes place in discrete steps:
Fetch, decode, load and store, arithmetic or logical
These usually require multiple clock cycles per instruction
Pipelining allows simultaneous execution of instructions
Conclusion: clock speed is not the whole story about performance

Instruction execution rate


Define the instruction count, Ic, for a program as the number of machine instructions executed for that program until it runs to completion or for some defined time interval. An important parameter is the average cycles per instruction, CPI, for a program. If all instructions required the same number of clock cycles, then CPI would be a constant value for a processor. However, on any given processor, the number of clock cycles required varies for different types of instructions, such as load, store, branch, and so on.

Instruction execution rate


Let CPIi be the number of cycles required for instruction type i, and Ii the number of executed instructions of type i for a given program. Then we can calculate an overall CPI as follows:

CPI = ( Σi CPIi × Ii ) / Ic

We can refine this formulation by recognizing that during the execution of an instruction, part of the work is done by the processor, and part of the time a word is being transferred to or from memory. In this latter case, the time to transfer depends on the memory cycle time, which may be greater than the processor cycle time. We can rewrite the preceding equation for the total execution time T (with τ the processor cycle time) as

T = Ic × [ p + (m × k) ] × τ

where:
p - number of processor cycles needed to decode and execute the instruction
m - number of memory references needed
k - ratio between memory cycle time and processor cycle time

Instruction execution rate


Millions of instructions per second (MIPS)
Millions of floating-point operations per second (MFLOPS)
Heavily dependent on: instruction set, compiler design, processor implementation, cache & memory hierarchy
We can express the MIPS rate in terms of the clock rate f and CPI as follows:

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6)

For example, consider the execution of a program which results in the execution of 2 million instructions on a 400-MHz processor. The program consists of four major types of instructions. The instruction mix and the CPI for each instruction type are given below, based on the result of a program trace experiment:

Instruction type                        CPI    Instruction mix (%)
Arithmetic and logic                     1          60
Load/store with cache hit                2          18
Branch                                   4          12
Memory reference with cache miss         8          10

The average CPI when the program is executed on a uniprocessor with the above trace results is CPI = (1 × 0.6) + (2 × 0.18) + (4 × 0.12) + (8 × 0.1) = 2.24. The corresponding MIPS rate is (400 × 10^6) / (2.24 × 10^6) ≈ 178. Floating-point performance is expressed as millions of floating-point operations per second (MFLOPS), defined as follows:

MFLOPS rate = (number of executed floating-point operations in a program) / (execution time × 10^6)
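
As a cross-check of the arithmetic above, a minimal Python sketch (variable names are illustrative):

```python
# Recompute CPI and MIPS from the program-trace example above.
mix = [(1, 0.60),   # arithmetic and logic
       (2, 0.18),   # load/store with cache hit
       (4, 0.12),   # branch
       (8, 0.10)]   # memory reference with cache miss

clock_hz = 400e6                      # 400-MHz processor
cpi = sum(c * frac for c, frac in mix)
mips = clock_hz / (cpi * 1e6)         # MIPS = f / (CPI * 10^6)

print(f"CPI  = {cpi:.2f}")            # CPI  = 2.24
print(f"MIPS = {mips:.1f}")           # MIPS = 178.6 (the slides round to 178)
```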

Benchmarks
Programs designed to test performance
A benchmark suite is a collection of programs, defined in a high-level language, that together attempt to provide a representative test of a computer in a particular application or system programming area.
The System Performance Evaluation Corporation (SPEC) defines and maintains the best-known collection of benchmark suites.

Averaging Results
To obtain a reliable comparison of the performance of various computers, it is preferable to run a number of different benchmark programs on each machine and then average the results. For example, for m different benchmark programs, a simple arithmetic mean can be calculated as follows:

RA = (1/m) × Σi Ri

where Ri is the high-level language instruction execution rate for the ith benchmark program.
An alternative is the harmonic mean:

RH = m / ( Σi (1/Ri) )
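
A small Python sketch comparing the two means on hypothetical rates (the numbers are made up for illustration):

```python
# Hypothetical benchmark rates R_i (e.g., MIPS) for m = 4 programs.
rates = [100.0, 150.0, 250.0, 400.0]
m = len(rates)

arithmetic_mean = sum(rates) / m                  # R_A = (1/m) * sum(R_i)
harmonic_mean = m / sum(1.0 / r for r in rates)   # R_H = m / sum(1/R_i)

# The harmonic mean weights the slower programs more heavily, so R_H <= R_A.
print(arithmetic_mean, harmonic_mean)             # 225.0 vs ~172.7
```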

Amdahl's Law
Gene Amdahl
Potential speed-up of a program using multiple processors
Concluded that:
Code needs to be parallelizable
Speedup is bound, giving diminishing returns for more processors
Task dependent:
Servers gain by maintaining multiple connections on multiple processors
Databases can be split into parallel tasks

Let f be the fraction of the program that can be parallelized, and let T be the total execution time of the program using a single processor. Then the speedup using a parallel processor with N processors that fully exploits the parallel portion of the program is as follows:

Speedup = T / ( T(1 − f) + T × f/N ) = 1 / ( (1 − f) + f/N )

Two important conclusions can be drawn:
1. When f is small, the use of parallel processors has little effect.
2. As N approaches infinity, speedup is bound by 1/(1 − f), so that there are diminishing returns for using more processors.

Speedup

Suppose that a feature of the system is used during execution a fraction of the time f before enhancement, and that the speedup of that feature after enhancement is SUf. Then the overall speedup of the system is

Speedup = 1 / ( (1 − f) + f/SUf )

For example, suppose that a task makes extensive use of floating-point operations, with 40% of the time consumed by floating-point operations. With a new hardware design, the floating-point module is sped up by a factor of K. Then the overall speedup is

Speedup = 1 / ( 0.6 + 0.4/K )

Thus, independent of K, the maximum speedup is 1/0.6 ≈ 1.67.
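
A short Python sketch of both uses of the formula (the function name is illustrative):

```python
# Amdahl's Law: overall speedup when a fraction f of the work is
# accelerated by a factor su (for N parallel processors, su = N).
def amdahl_speedup(f: float, su: float) -> float:
    return 1.0 / ((1.0 - f) + f / su)

# Floating-point example from the slides: f = 0.4, speedup factor K.
for k in (2, 4, 16, 1_000_000):
    print(k, round(amdahl_speedup(0.4, k), 3))
# Prints 1.25, 1.429, 1.6, 1.667 - approaching 1/0.6 ≈ 1.67 as K grows.
```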

CHAPTER 3 TOP-LEVEL VIEW OF COMPUTER FUNCTION AND INTERCONNECTION

Computer Components
The Control Unit and the Arithmetic and Logic Unit constitute the Central Processing Unit: an instruction interpreter and a module of general-purpose arithmetic and logic functions.
Data and instructions must be put into the system; taken together, the components that do this are referred to as I/O components.
Memory (main memory) is a place to store temporarily both instructions and data.

Top-Level View Components

Top-Level View
The CPU exchanges data with memory. For this purpose, it typically makes use of two internal (to the CPU) registers: a memory address register (MAR), which specifies the address in memory for the next read or write, and a memory buffer register (MBR), which contains the data to be written into memory or receives the data read from memory. Similarly, an I/O address register (I/OAR) specifies a particular I/O device, and an I/O buffer register (I/OBR) is used for the exchange of data between an I/O module and the CPU.
An I/O module transfers data from external devices to CPU and memory, and vice versa. It contains internal buffers for temporarily holding these data until they can be sent on.

Computer Function
The basic function performed by a computer is execution of a program. The processor does the actual work by executing instructions specified in the program. Instruction processing consists of two steps:
The processor reads (fetches) instructions from memory one at a time and executes each instruction.
Program execution consists of repeating the process of instruction fetch and instruction execution.

Instruction Fetch and Execute


Fetch Cycle
Program Counter (PC) holds the address of the next instruction to fetch
Processor fetches the instruction from the memory location pointed to by the PC
Increment PC (unless told otherwise)
Instruction loaded into Instruction Register (IR)
Processor interprets the instruction and performs required actions
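
A minimal Python sketch of the fetch cycle (register names follow the text; the memory contents are illustrative values only):

```python
# PC -> fetch from memory -> IR, then increment PC.
memory = {0x300: 0x1940, 0x301: 0x5941}   # tiny illustrative program
PC = 0x300

def fetch():
    global PC
    IR = memory[PC]    # instruction at the address held in the PC
    PC += 1            # increment PC (unless told otherwise)
    return IR

IR = fetch()           # IR == 0x1940, PC == 0x301
```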

Computer Function
Instruction Cycle
The processing required for a single instruction is called an instruction cycle. Its two steps are referred to as the fetch cycle and the execute cycle.
Program execution halts only if the machine is turned off, some sort of unrecoverable error occurs, or a program instruction that halts the computer is encountered.

Instruction Fetch and Execute


Execute Cycle
Processor-memory
Data transfer between CPU and main memory
Processor-I/O
Data transfer between CPU and I/O module
Data processing
Some arithmetic or logical operation on data
Control
Alteration of sequence of operations, e.g., jump
Combination of the above

Example of a Program Execution

Instruction Fetch and Execute


In this example, three instruction cycles, each consisting of a fetch cycle and an execute cycle, are needed to add the contents of location 940 to the contents of location 941. With a more complex set of instructions, fewer cycles would be needed. Some older processors, for example, included instructions that contain more than one memory address. Thus the execution cycle for a particular instruction on such a processor could involve more than one reference to memory. Also, instead of memory references, an instruction may specify an I/O operation.
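
To make the three cycles concrete, here is a hedged Python simulation of the example machine, assuming the textbook's encoding (16-bit words with the opcode in the top 4 bits; 1 = load AC from memory, 5 = add memory to AC, 2 = store AC to memory; addresses in hex):

```python
# Three fetch/execute cycles adding M[940] to M[941].
memory = {0x300: 0x1940, 0x301: 0x5941, 0x302: 0x2941,  # the program
          0x940: 0x0003, 0x941: 0x0002}                 # the operands
PC, AC = 0x300, 0

for _ in range(3):                       # three instruction cycles
    IR = memory[PC]; PC += 1             # fetch cycle
    opcode, addr = IR >> 12, IR & 0xFFF  # decode
    if opcode == 0x1:   AC = memory[addr]       # load AC from memory
    elif opcode == 0x5: AC += memory[addr]      # add memory to AC
    elif opcode == 0x2: memory[addr] = AC       # store AC to memory

assert memory[0x941] == 5                # 3 + 2, as in the example
```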

Instruction Cycle State Diagram

Instruction Cycle State Diagram


States in the upper part of the diagram involve an exchange between the processor and either memory or an I/O module. States in the lower part of the diagram involve only internal processor operations. The OAC state appears twice, because an instruction may involve a read, a write, or both. However, the action performed during that state is fundamentally the same in both cases, and so only a single state identifier is needed.

Instruction Cycle State


The states can be described as follows:
Instruction address calculation (IAC): Determine the address of the next instruction to be executed.
Instruction fetch (IF): Read the instruction from its memory location into the processor.
Instruction operation decoding (IOD): Analyze the instruction to determine the type of operation to be performed and the operand(s) to be used.
Operand address calculation (OAC): If the operation involves reference to an operand in memory or available via I/O, determine the address of the operand.

Instruction Cycle State


The states can be described as follows:
Operand fetch (OF): Fetch the operand from memory or read it in from I/O.
Data operation (DO): Perform the operation indicated in the instruction.
Operand store (OS): Write the result into memory or out to I/O.

Interrupts
Mechanism by which other modules (e.g., I/O) may interrupt the normal sequence of processing
Program
e.g., overflow, division by zero

Timer
Generated by an internal processor timer

I/O
from I/O controller

Hardware failure
e.g. memory parity error

Program Flow Control

Interrupt Cycle
Added to the instruction cycle
Processor checks for an interrupt
Indicated by an interrupt signal
If no interrupt, fetch the next instruction
If an interrupt is pending:
Suspend execution of the current program
Save context
Set PC to the start address of the interrupt handler routine
Process the interrupt
Restore context and continue the interrupted program

Transfer of Control via Interrupts

Transfer of Control via Interrupts


From the point of view of the user program, an interrupt is just that: an interruption of the normal sequence of execution. When the interrupt processing is completed, execution resumes. Thus, the user program does not have to contain any special code to accommodate interrupts; the processor and the operating system are responsible for suspending the user program and then resuming it at the same point.


Instruction Cycle with Interrupts

Instruction Cycle with Interrupts


The processor now proceeds to the fetch cycle and fetches the first instruction in the interrupt handler program, which will service the interrupt. The interrupt handler program is generally part of the operating system. Typically, this program determines the nature of the interrupt and performs whatever actions are needed. In the example we have been using, the handler determines which I/O module generated the interrupt and may branch to a program that will write more data out to that I/O module. When the interrupt handler routine is completed, the processor can resume execution of the user program at the point of interruption.

Instruction Cycle with Interrupts


In the interrupt cycle, the processor checks to see if any interrupts have occurred, indicated by the presence of an interrupt signal. If no interrupts are pending, the processor proceeds to the fetch cycle and fetches the next instruction of the current program. If an interrupt is pending, the processor does the following:
It suspends execution of the current program being executed and saves its context. This means saving the address of the next instruction to be executed (the current contents of the program counter) and any other data relevant to the processor's current activity.
It sets the program counter to the starting address of an interrupt handler routine.
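
A hedged Python sketch of this check at the end of each cycle (the handler address and the shape of the saved context are illustrative assumptions):

```python
from collections import deque

HANDLER = 0x0010       # assumed start address of the interrupt handler
pending = deque()      # interrupt signals raised by I/O modules, timers, etc.
saved_context = []     # here the saved "context" is just the return PC
PC = 0x0300

def check_interrupts():
    global PC
    if pending:                      # interrupt pending?
        pending.popleft()
        saved_context.append(PC)     # save address of next instruction
        PC = HANDLER                 # set PC to start of handler routine

def return_from_interrupt():
    global PC
    PC = saved_context.pop()         # restore context, resume user program

pending.append("io_done")
check_interrupts()                   # PC is now 0x0010 (the handler)
return_from_interrupt()              # PC is back at 0x0300
```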

Program Timing Short I/O Wait

Program Timing Long I/O Wait

Instruction Cycle State Diagram w/ Interrupts

Multiple Interrupts
Disable Interrupts
Processor will ignore further interrupts whilst processing one interrupt
Interrupts remain pending and are checked after the first interrupt has been processed
Interrupts handled in sequence as they occur

Define Priorities
Low priority interrupts can be interrupted by higher priority interrupts

When higher priority interrupt has been processed, processor returns to previous interrupt

Multiple Interrupts - Nested

Multiple Interrupts - Sequential

Interconnection Structures
The collection of paths connecting the various modules is called the interconnection structure. The design of this structure will depend on the exchanges that must be made among modules.

Interconnection Structures
Types of exchanges that are needed, indicated by the major forms of input and output for each module type:
Memory: Typically, a memory module will consist of N words of equal length. Each word is assigned a unique numerical address (0, 1, ..., N − 1). A word of data can be read from or written into the memory.
I/O module: From an internal (to the computer system) point of view, I/O is functionally similar to memory. There are two operations, read and write. Further, an I/O module may control more than one external device. We can refer to each of the interfaces to an external device as a port and give each a unique address (e.g., 0, 1, ..., M − 1).

Interconnection Structures
- Processor: The processor reads in instructions and data, writes out data after processing, and uses control signals to control the overall operation of the system. It also receives interrupt signals.

Computer Module

Memory Connection
Receives and sends data
Receives addresses (of locations)
Receives control signals:
Read
Write
Timing

Input / Output Connection


Similar to memory from the computer's viewpoint
Output
Receive data from computer
Send data to peripheral
Input
Receive data from peripheral
Send data to computer

Input / Output Connection


Receive control signals from computer
Send control signals to peripherals
Ex. Spin disk

Receive addresses from computer

Send interrupt signals (control)

CPU Connection
Reads instructions and data
Writes out data (after processing)

Sends control signals to other units


Receives (& acts on) interrupts

Bus Interconnection
A bus is a communication pathway connecting two or more devices Multiple devices connect to the bus, and a signal transmitted by any one device is available for reception by all other devices attached to the bus. A bus that connects major computer components (processor, memory, I/O) is called a system bus.

Bus Structure
On any bus the lines can be classified into three functional groups:
The data lines provide a path for moving data among system modules. These lines, collectively, are called the data bus. The address lines are used to designate the source or destination of the data on the data bus. The control lines are used to control the access to and the use of the data and address lines.

Bus Structure
The operation of the bus is as follows. If one module wishes to send data to another, it must do two things: (1) obtain the use of the bus, and (2) transfer data via the bus. If one module wishes to request data from another module, it must (1) obtain the use of the bus, and (2) transfer a request to the other module over the appropriate control and address lines. It must then wait for that second module to send the data.

Bus Structure

Typical Physical Realization of a Bus Architecture

Traditional ISA with Cache

High Performance Bus

Bus Types
Dedicated
Separate data & address lines

Multiplexed
Shared lines
Address valid or data valid control line
Advantage: fewer lines
Disadvantages: more complex control; reduced ultimate performance

Bus Arbitration
More than one module controlling the bus
Ex. CPU and DMA controller

Only one module may control the bus at one time
Arbitration may be centralised or distributed

Centralized and Distributed Arbitration


Centralised
Single hardware device controlling bus access: bus controller or arbiter
May be part of the CPU or separate
Distributed
Each module may claim the bus
Control logic on all modules

Timing
Co-ordination of events on the bus
Synchronous
Events determined by clock signals
Control bus includes a clock line
A single 1-0 transition is a bus cycle
All devices can read the clock line
Usually sync on the leading edge
Usually a single cycle for an event

Synchronous Timing Diagram

Asynchronous Timing Read Diagram

Asynchronous Timing Write Diagram

PCI Bus
Peripheral Component Interconnect
Intel released it to the public domain
32 or 64 bit

PCI Bus Lines (required)


System lines
Including clock and reset
Address and data
32 time-multiplexed lines for address/data
Interrupt & validate lines
Interface control
Arbitration
Not shared; direct connection to PCI bus arbiter
Error lines

PCI Bus Lines (optional)


Interrupt lines
Not shared
Cache support
64-bit bus extension
Additional 32 lines, time multiplexed
2 lines to enable devices to agree to use 64-bit transfer
JTAG/boundary scan
For testing procedures

PCI Commands
Transaction between initiator (master) and target
Master claims bus
Determine type of transaction
Ex., I/O read/write
Address phase
One or more data phases

PCI Read Timing Diagram

PCI Bus Arbiter

PCI Bus Arbitration

CHAPTER 4 CACHE MEMORY

Terminology
Capacity: the amount of information that can be contained in a memory unit - usually in terms of words or bytes
Word: the natural unit of organization in the memory, typically the number of bits used to represent a number
Addressable unit: the fundamental data element size that can be addressed in the memory - typically either the word size or individual bytes
Unit of transfer: the number of data elements transferred at a time - usually bits in main memory and blocks in secondary memory
Transfer rate: the rate at which data is transferred to/from the memory device

Terminology
Access time: for RAM, the time to address the unit and perform the transfer; for non-random-access memory, the time to position the R/W head over the desired location
Memory cycle time: access time plus any other time required before a second access can be started
Access technique: how memory contents are accessed

Memory Hierarchy
Major design objective of any memory system:
To provide adequate storage capacity, at an acceptable level of performance, at a reasonable cost
Four interrelated ways to meet this goal:
Use a hierarchy of storage devices
Develop automatic space-allocation methods for efficient use of the memory
Through the use of virtual memory techniques, free the user from memory management tasks
Design the memory and its related interconnection structure so that the processor can operate at or near its maximum rate

Memory Hierarchy
Basis of the memory hierarchy
Registers internal to the CPU for temporary data storage (small in number but very fast)
External storage for data and programs (relatively large and fast)
External permanent storage (much larger and much slower)

Characteristics of the memory hierarchy
Consists of distinct levels of memory components
Each level characterized by its size, access time, and cost per bit
Each increasing level in the hierarchy consists of modules of larger capacity, slower access time, and lower cost/bit

Goal of the memory hierarchy
Try to match the processor speed with the rate of information transfer from the lowest element in the hierarchy

Memory Hierarchy Diagram

Hierarchy List
Registers
L1 cache
L2 cache
Main memory
Disk cache
Disk
Optical
Tape

Cache Memory
Cache memory is a critical component of the memory hierarchy
Compared to the size of main memory, cache is relatively small
Operates at or near the speed of the processor
Very expensive compared to main memory
Cache contains copies of sections of main memory

Cache Memory
Small amount of fast memory Sits between normal main memory and CPU May be located on CPU chip or module

Cache and Main Memory

Cache/Main Memory Structure

Cache Operation - Overview


CPU requests contents of a memory location
Check cache for this data
If present, get from cache (fast)
If not present, read the required block from main memory into cache
Then deliver from cache to CPU
Cache includes tags to identify which block of main memory is in each cache slot

Locality of Reference
The cache memory works because of locality of reference
Memory references made by the processor, for both instructions and data, tend to cluster together
Instruction loops, subroutines Data arrays, tables

Keep these clusters in high speed memory to reduce the average delay in accessing data Over time, the clusters being referenced will change -- memory management must deal with this

Typical Cache Organization

Cache Design
Addressing
Size
Mapping function
Replacement algorithm
Write policy
Block size
Number of caches

Cache Addressing
Where does the cache sit?
Between processor and virtual memory management unit (MMU)
Between MMU and main memory

Logical cache (virtual cache) stores data using virtual addresses
Processor accesses the cache directly, not through the MMU
Cache access is faster, before MMU address translation
Virtual addresses use the same address space for different applications
Must flush the cache on each context switch

Physical cache stores data using main memory physical addresses

Mapping Function
Because there are fewer cache lines than main memory blocks, an algorithm is needed for mapping main memory blocks into cache lines. The choice of the mapping function dictates how the cache is organized. Three techniques are used: direct, associative, and set-associative.

Direct Mapping
The simplest technique, known as direct mapping, maps each block of main memory into only one possible cache line.
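
A minimal Python sketch of the direct-mapped address breakdown (the block and cache sizes are illustrative, not from the slides): cache line i = block number j modulo the number of lines m.

```python
BLOCK_SIZE = 4        # words per block
NUM_LINES  = 16384    # cache lines m; the remaining high bits form the tag

def direct_map(address: int):
    word = address % BLOCK_SIZE          # offset within the block
    block = address // BLOCK_SIZE        # main memory block number j
    line = block % NUM_LINES             # i = j mod m: the only possible line
    tag = block // NUM_LINES             # distinguishes blocks sharing a line
    return tag, line, word

# Two blocks whose numbers differ by NUM_LINES collide on the same line:
a = direct_map(0x0000)
b = direct_map(BLOCK_SIZE * NUM_LINES)
assert a[1] == b[1] and a[0] != b[0]
```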

Direct Mapping

Direct Mapping

Set Associative Mapping


Set-associative mapping is a compromise that exhibits the strengths of both the direct and associative approaches while reducing their disadvantages.

Set Associative Mapping

Set Associative Mapping

Fully Associative Mapping


Associative mapping overcomes the disadvantage of direct mapping by permitting each main memory block to be loaded into any line of the cache

Fully Associative Mapping

Write Policy
Must not overwrite a cache block unless main memory is up to date

Multiple CPUs may have individual caches


I/O may address main memory directly

Write Through
All writes go to main memory as well as cache
Multiple CPUs can monitor main memory traffic to keep local (to CPU) cache up to date
Lots of traffic; slows down writes
Remember bogus write-through caches!

Write Back
Updates are initially made in cache only
The update bit for the cache slot is set when an update occurs
If a block is to be replaced, write it to main memory only if the update bit is set
Other caches can get out of sync
I/O must access main memory through the cache
N.B. 15% of memory references are writes
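
A toy Python sketch contrasting the two policies with a single cache line and an update (dirty) bit (all structures are illustrative):

```python
main_memory = {0x10: 0}
line = {"addr": 0x10, "data": 0, "dirty": False}

def write(addr, value, policy):
    line["addr"], line["data"] = addr, value
    if policy == "write-through":
        main_memory[addr] = value    # every write also goes to memory
    else:                            # write-back
        line["dirty"] = True         # only set the update (dirty) bit

def evict():
    if line["dirty"]:                # write back only if the bit is set
        main_memory[line["addr"]] = line["data"]
        line["dirty"] = False

write(0x10, 42, "write-back")
assert main_memory[0x10] == 0        # memory is stale until replacement
evict()
assert main_memory[0x10] == 42
```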
