Parallel Architectures
Contents at a Glance
Review of week 10: Cache Coherence Protocols (Snoopy Protocol, MESI Protocol); Execution Time (MIPS rate); Latency vs. throughput; Limitations
[Figure: two processors P1 and P2, each with its own memory and I/O modules, communicating over a link]
March 20, 2012
Richard Salomon, Sudipto Mitra Copyright Box Hill Institute
Message-Passing Architecture
[Figure: each processor has its own cache and private memory; processors exchange messages over an interconnection network]
Shared-Memory Architecture
[Figure: processors 1..N, each with a cache, access shared memories 1..M through an interconnection network]
(a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
(Tanenbaum, Structured Computer Organization, Fifth Edition, © 2006 Pearson Education)
Problem: multiple copies of the same data may reside in several caches and in main memory. If processors are allowed to update their own cached copies freely, the result can be an inconsistent view of memory. This is the cache coherence problem: the multiple copies of the data in the caches have to be kept identical.
Write back
Write operations are made only to the cache; main memory is updated when the corresponding cache line is flushed. This policy can lead to inconsistency between cache and memory.
Write through
All write operations are made to main memory as well as to the cache. This can still cause problems unless the other caches monitor the memory traffic.
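The two policies can be contrasted in a toy sketch (a single cache modelled as a dictionary; the class and method names here are illustrative, not a real simulator):

```python
# Minimal sketch of write-through vs write-back, assuming one cache
# and one memory, both modelled as dictionaries.
class Cache:
    def __init__(self, memory, write_through):
        self.memory = memory          # backing store (dict: addr -> value)
        self.lines = {}               # cached copies
        self.dirty = set()            # write-back: lines not yet in memory
        self.write_through = write_through

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:
            self.memory[addr] = value  # memory updated on every write
        else:
            self.dirty.add(addr)       # memory updated only on flush

    def flush(self, addr):
        if addr in self.dirty:
            self.memory[addr] = self.lines[addr]
            self.dirty.discard(addr)

mem = {0x10: 1}
wb = Cache(mem, write_through=False)
wb.write(0x10, 2)
print(mem[0x10])   # 1: memory is stale until the line is flushed
wb.flush(0x10)
print(mem[0x10])   # 2
```

The window between `write` and `flush` in the write-back case is exactly where a second processor reading main memory would see stale data.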
The objective is to update recently used local variables in the cache and let them reside there through numerous reads and writes, while the protocol maintains consistency of shared variables held in multiple caches at the same time.
Cache Coherence Approaches
Software Solution
The compiler and operating system deal with the problem. The overhead of detecting potential problems is transferred from run time to compile time, and design complexity is transferred from hardware to software. Compiler-based mechanisms determine which data items may become unsafe for caching and mark them; the operating system or the hardware then prevents them from being cached.
However, software tends to make conservative decisions, resulting in inefficient cache utilization (e.g. preventing any shared data variables from being cached). A more efficient approach is to analyze the code to determine safe periods for caching shared variables; the compiler then inserts instructions into the generated code to enforce cache coherence during the critical periods.
Hardware Solution
Generally referred to as cache coherence protocols. These recognise potential problems dynamically at run time, make more efficient use of the cache (the problem is dealt with only when it actually arises), and are transparent to the programmer and the compiler. There are two categories: directory protocols and snoopy protocols.
Directory Protocols
Collect and maintain information about copies of data in the caches. A centralized controller, part of the main memory controller, maintains a directory stored in main memory. When a request is made, the centralized controller checks it and issues the necessary commands for data transfer between memory and cache or between caches. The central controller keeps the state information up to date.
Snoopy Protocols
Write invalidate: multiple readers, but only one writer at a time. Write update: multiple writers and multiple readers.
Write Invalidate
Multiple readers, one writer. When a write is required, all other cached copies of the line are invalidated. The writing processor then has exclusive (cheap) access until the line is required by another processor. Used in Pentium II and PowerPC systems. The state of every line is marked as modified, exclusive, shared or invalid (MESI).
Write Update
Multiple readers and writers. The updated word is distributed to all other processors. Some systems use an adaptive mixture of both approaches.
MESI Protocol
Modified: the line in the cache has been modified (it differs from main memory) and is available only in this cache. Exclusive: the line in the cache is the same as that in main memory and is not present in any other cache. Shared: the line in the cache is the same as that in main memory and may be present in another cache. Invalid: the line in the cache does not contain valid data.
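The four states can be sketched as a minimal state machine for a single cache line, as seen by one cache. This is a simplification for illustration only: bus arbitration and data transfer are omitted, and the function names are invented.

```python
# Illustrative MESI line-state transitions for one cache (write-invalidate).
MODIFIED, EXCLUSIVE, SHARED, INVALID = "M", "E", "S", "I"

def on_local_read(state, others_have_copy):
    if state == INVALID:  # read miss: fetch the line
        return SHARED if others_have_copy else EXCLUSIVE
    return state          # read hit: state unchanged

def on_local_write(state):
    # A local write leaves the line Modified; from Shared the cache must
    # first invalidate the other copies (write invalidate).
    return MODIFIED

def on_bus_write_observed(state):
    # Another processor wrote this line: our copy becomes invalid.
    return INVALID

s = on_local_read(INVALID, others_have_copy=False)
print(s)   # E: we hold the only copy
s = on_local_write(s)
print(s)   # M: modified, memory now stale
s = on_bus_write_observed(s)
print(s)   # I: another writer invalidated us
```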
Read Miss
The processor initiates a memory read to fetch the line containing the missing address, and generates a signal to alert all other units to snoop the transaction. The possible outcomes depend on the state of the line in the other caches (modified, exclusive, shared or invalid).
Read Hit
The processor reads the required item from its local cache; the state remains modified, shared or exclusive.
Write Miss
The processor initiates a memory read to fetch the line containing the missing address, issuing a read with intent to modify (RWITM); once loaded, the line is marked modified.
Write Hit
The effect depends on the current state of the line in the local cache (modified, exclusive or shared). For a shared line, the other copies must first be invalidated before the write proceeds.
Example: a Greyhound bus takes 7.7 hours for a trip at 65 mph carrying 60 passengers. Its latency is 7.7 hours, but its throughput is 65 × 60 = 3900 passenger-miles per hour.
Latency: the time to do a task from start to finish (also called execution time or response time). Throughput: tasks per unit time (also called bandwidth), mostly used for data movement.
It takes 4 months to grow a tomato. Does that mean you can only grow 3 tomatoes a year? No: latency and throughput are independent. Only if you run one job at a time is time = 1/throughput.
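The tomato point can be made with a line of arithmetic; the plant count below is a hypothetical number, not from the slide:

```python
# Latency vs throughput are independent: long latency does not cap
# throughput if tasks overlap.
latency_months = 4        # one tomato takes 4 months to grow
plants = 12               # assumed: grow this many in parallel
# Once overlapping, each plant completes 12/latency crops per year:
throughput_per_year = plants * (12 / latency_months)
print(throughput_per_year)  # 36.0 tomatoes/year, not 3
```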
Which time should we measure? User CPU time (the time the CPU spends running your code)? Total CPU time, user + kernel (includes operating-system code)? Wallclock time (total elapsed time)? The answer depends on the purpose.
For measuring processor speed we can use total CPU time; if there is no I/O and no interruption by other jobs, wallclock time may be better. Fine-grained timers are more precise (microseconds rather than 1/100 second) and can measure individual sections of code.
CSE 141 - Performance I and II Copyright Box Hill Institute
Performance
CPU performance = 1 / total CPU time. System performance = 1 / wallclock time. These terms only make sense if you know what program is being measured. We can then answer "What was the performance?" with "It took 15 seconds."
Every conventional processor has a clock with a fixed cycle time or clock rate
The rate is often measured in MHz (millions of cycles per second); the cycle time is often measured in ns (nanoseconds). X MHz corresponds to 1000/X ns (e.g. a 500 MHz clock has a 2 ns cycle time).
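The MHz-to-nanoseconds conversion is just arithmetic; a one-line helper (the function name is assumed) makes the slide's example concrete:

```python
def cycle_time_ns(rate_mhz):
    """Clock cycle time in nanoseconds for a clock rate given in MHz."""
    return 1000 / rate_mhz

print(cycle_time_ns(500))   # 2.0 ns, matching the 500 MHz example
```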
CPU time = (instructions/program) × (cycles/instruction) × (seconds/cycle)
CPI (cycles per instruction) is an average over the program's execution, but it's an intuitive and useful concept. Note: use the dynamic instruction count (instructions executed), not the static count (instructions in the compiled code).
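A worked instance of the equation, using hypothetical counts chosen for illustration:

```python
# CPU time = instructions x CPI x cycle time (all numbers assumed).
instructions = 10_000_000            # dynamic instruction count
cpi = 1.5                            # average cycles per instruction
clock_hz = 500e6                     # 500 MHz clock: 2 ns cycle time
cpu_time = instructions * cpi * (1 / clock_hz)
print(cpu_time)                      # ~0.03 seconds
```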
Three comparison scenarios: same machine, different programs; same program, different machines with the same ISA; same program, different ISAs.
Will a 1.7 GHz PC be faster than an 867 MHz Mac? Not necessarily: CPI or instruction count may differ.
see http://www.apple.com/g4/myth (Photoshop benchmark)
(MIPS = Millions of Instructions / sec)
The PowerPC G4 can execute 4 instructions per cycle (CPI = 1/4), so an 867 MHz clock gives 3468 MIPS peak. But it doesn't necessarily execute that quickly in practice.
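The peak figure is simple arithmetic (the helper name is invented; this is the peak rate, not sustained throughput):

```python
def peak_mips(clock_mhz, instructions_per_cycle):
    # Peak MIPS = clock rate in MHz x instructions issued per cycle.
    return clock_mhz * instructions_per_cycle

print(peak_mips(867, 4))   # 3468, the G4 figure from the slide
```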
Peak MFLOP/S (often written MFLOPS) = maximum float ops per cycle / cycle time (in microseconds). Normalized MFLOP/S uses conventions (e.g. a divide counts as three float ops) so the flop count of a program is machine-independent.
MFLOP/S is OK for floating-point-intensive programs, but it depends on the program: a better MFLOP/S rate on program P doesn't guarantee better performance on program Q.
Relative Performance
Relative performance is a ratio: performance(X)/performance(Y) = time(Y)/time(X). Note that which quantity goes on top swaps when you use times instead of performance.
"Times faster than" (or "times as fast as") means there's a multiplicative factor relating the quantities:
X was 3 times faster than Y: speed(X) = 3 × speed(Y)
X was 25% faster than Y: speed(X) = (1 + 25/100) × speed(Y)
X was 5% slower than Y: speed(X) = (1 − 5/100) × speed(Y); 100% slower means it doesn't move at all!
X was 3 times slower than Y: speed(X) = (1/3) × speed(Y). This hints at a measure of slowness; I'll mostly avoid using it.
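These conventions can be checked numerically; `speed_ratio` is an invented helper that computes speed(X)/speed(Y) from measured times:

```python
def speed_ratio(time_y, time_x):
    # speed(X)/speed(Y) = time(Y)/time(X): the times swap places.
    return time_y / time_x

assert speed_ratio(30.0, 10.0) == 3.0    # X is 3 times faster than Y
assert speed_ratio(12.5, 10.0) == 1.25   # X is 25% faster than Y
print(speed_ratio(9.5, 10.0))            # ~0.95: X is 5% slower than Y
```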
How do we compare machines? Clock speed? No, not unless the ISA (and everything else that determines CPI) is the same. In practice, we use benchmarks.
Benchmarks
It's hard to convince manufacturers to run your program (unless you're a BIG customer). A benchmark is a set of programs that are representative of a class of problems.
Full applications: the SPEC (System Performance Evaluation Cooperative) suites for integer and floating point, originally for Unix workstations. Other suites cover databases, web servers, graphics, ...
Improving Latency
Smaller transistors shorten distances. To reduce disk access time, make disks rotate faster.
Replace the stagecoach by the pony express or the telegraph. Replace DRAM by SRAM. Once upon a time, bipolar or GaAs logic was much faster than CMOS, but incremental improvements to CMOS have triumphed.
Improving Bandwidth
Use wider buses, more disks, multiple processors, more functional units ...
Parallelism: run multiple tasks on separate hardware. Pipelining: build separate resources for each stage, which reduces the time needed for a single stage, and start a new task down the pipe every (shorter) timestep.
Pipelining
Without pipelining: washing/rinsing and spinning are done in the same tub, taking 15 (wash/rinse) + 5 (spin) minutes. Time for 1 load: 20 minutes; time for 10 loads: 200 minutes.
With pipelining: a tub washes and rinses (15 minutes) while a separate spinner spins (10 minutes). Time for 1 load: 25 minutes; time for 10 loads: 160 minutes.
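The laundry arithmetic generalises: in a simple in-order pipeline, the slowest stage sets the steady-state rate. A small sketch (the function name is assumed):

```python
def pipelined_total(n_loads, stage_times):
    # First task takes the sum of all stage times to drain through;
    # after that, one task completes every max(stage_times) timesteps,
    # because the slowest stage is the bottleneck.
    return sum(stage_times) + (n_loads - 1) * max(stage_times)

# The slide's numbers: wash/rinse 15 minutes, spin 10 minutes.
print(pipelined_total(1, [15, 10]))    # 25 minutes for one load
print(pipelined_total(10, [15, 10]))   # 160 minutes for ten loads
```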
Parallelism vs pipelining
Both improve throughput or bandwidth Automobiles: More plants vs. assembly line I/O bandwidth: Wider buses (e.g. parallel port) vs. pushing bits onto bus faster (serial port). Memory-to-processor: wider buses vs. faster rate CPU speed:
Superscalar: the processor has multiple functional units, so it can execute more than one instruction per cycle. Superpipelining: use more stages than the classical 5-stage pipeline. Recent microprocessors use both techniques.
Physics: speed of light, size of atoms, heat generated (speed requires energy loss), capacity of the electromagnetic spectrum (for wireless), ...
Limits with current technology: size of magnetic domains, chip size (due to defects), lithography, pin count.
New technologies on the horizon: quantum computers, molecular computers, superconductors, optical computers, holographic storage, ...
Fallacy: improvements will stop. Pitfall: trying to predict more than 5 years into the future.
Summary
The objective of a cache coherence protocol is to update recently used local variables in the cache and let them reside there through numerous reads and writes. Processor performance can be measured by the rate at which it executes instructions: execution time = instructions × CPI × cycle time. A benchmark is a set of programs that are representative of a class of problems.
Reference
Stallings, William (2003). Computer Organization & Architecture: Designing for Performance, Sixth Edition. Pearson Education. ISBN 0-13-049307-4.
Mano, M. Morris. Computer System Architecture, Third Edition. Prentice Hall.
Carter, Larry (2002). Measuring Performance, CSE 141, UCSD, Winter 2002.
Tanenbaum, Andrew S. (2006). Structured Computer Organization, Fifth Edition. Pearson Education. ISBN 0-13-148521-0.
Thornley, John (1997). CS 284a Lecture, Tuesday, 7 October 1997.
Further Reading
Manufacturers' websites. Relevant Special Interest Groups (SIGs). Articles in magazines. The IEEE Computer Society Task Force on Cluster Computing website.