Memory hierarchies and the basics of cache design; main memory architectures; advanced topics in computer architecture
- Reading assignment PH: 5.1-5.2, C.9
Cumulative review
CSE331 W14&15.2
[Figure: the five classic components of a computer: processor (control and datapath), memory, and input/output devices]
DRAM: higher density (one-transistor cells), lower power, cheaper, but slower (access times of 50 to 70 nsec). Dynamic, so it needs to be refreshed regularly (~ every 8 ms)
Take advantage of the principle of locality to present the user with as much memory as is available in the cheapest technology at the speed offered by the fastest technology
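The locality idea can be made concrete with a toy model (mine, not from the slides): with an assumed block size of 8 words, a sequential scan misses only once per block, because spatial locality turns the other seven accesses into hits.

```python
# Toy model (not from the slides): count cache-block misses for a
# sequential scan, assuming 8-word blocks and a cache large enough
# to hold every block it touches.
WORDS_PER_BLOCK = 8  # assumed block size

def misses_for_sequential_scan(n_words):
    """Each newly touched block misses once; the rest of its words hit."""
    blocks_touched = set()
    misses = 0
    for addr in range(n_words):
        block = addr // WORDS_PER_BLOCK
        if block not in blocks_touched:
            misses += 1
            blocks_touched.add(block)
    return misses

print(misses_for_sequential_scan(64))  # 64 words / 8 per block = 8 misses
```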
[Figure: memory hierarchy. The processor exchanges 4-8 bytes (a word) with the L1$ (SRAM); the L1$ exchanges 8-32 bytes (a block) with the L2$ (SRAM); the L2$ exchanges 1 to 4 blocks with main memory]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in main memory (MM), which in turn is a subset of what is in secondary memory (SM)
Access Time: time between request and when word is read or written
- read access and write access times can be different
[Figure: system organization. The processor and cache connect over the bus to main memory and to I/O controllers for the disks, graphics, and network]
Remember, caches want information provided to them one block at a time (and a block is usually more than one word)
- use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
- make sure the memory bus can support the DRAM access rates and patterns, with the goal of increasing the memory-bus-to-cache bandwidth
Small printed circuit board that holds DRAMs and presents a 64-bit datapath
Each contains eight x8 ("by 8") DRAM parts, or sixteen x4 parts, to make up the 64-bit width
DRAM Organization:
An M-bit wide DRAM part is built from M bit planes, each an array of N rows by N cols of cells. The row address (strobed by RAS) is fed through a row decoder to select one row in every plane; the column address (strobed by CAS) then selects the requested bit from that row in each plane.
[Figure: DRAM access timing: Row Address on RAS, then Col Address on CAS, repeated for the 1st and 2nd M-bit accesses]
Irwin Fall 09 PSU
Fast Page Mode DRAM Operation:
Pulse CAS to access other M-bit blocks on the open row. Successive reads or writes within the row are faster, since the row doesn't have to be precharged and (re)accessed each time.
[Figure: page-mode timing: one Row Address on RAS followed by a series of Col Addresses on CAS, cutting the cycle time for the 2nd and later M-bit accesses]
Like page-mode DRAMs, synchronous DRAMs (SDRAMs) have the ability to transfer a burst of data from a series of sequential addresses that are in the same row
For words in the same burst, the complete (row and column) address doesn't have to be provided for each word
Specify the starting (row + column) address and the burst length (the burst must all be in the same DRAM row). The row is accessed from the DRAM and loaded into a row cache (SRAM). Data words in the burst are then accessed from that SRAM under control of a clock signal.
Double data rate (DDR) SDRAMs transfer burst data on both the rising and falling edges of the clock (so twice as fast)
Now have DDR2 and DDR3 with even higher clock rates
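As a quick sanity check of the "twice as fast" claim (the clock rate below is an illustrative number, not from the slides): an 8-byte-wide DDR DIMM clocked at 200 MHz transfers on both clock edges.

```python
# Peak-bandwidth arithmetic for a DDR module (illustrative numbers).
bus_clock_mhz = 200        # assumed I/O bus clock
transfers_per_cycle = 2    # DDR: rising and falling edge
bus_bytes = 8              # 64-bit DIMM datapath

peak_mb_per_s = bus_clock_mhz * transfers_per_cycle * bus_bytes
print(peak_mb_per_s)  # 3200 MB/s, i.e. the module sold as DDR-400 / PC-3200
```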
Input CAS as the starting burst address, along with a burst length, to read a burst of data from a series of sequential addresses within that row on successive clock edges
[Figure: SDRAM burst timing: Row Address on RAS, then a starting Col Address on CAS with a burst length; the 1st, 2nd, 3rd, and 4th M-bits are read from that row on successive clock edges, the column address incrementing by 1 each time across the M bit planes]
http://en.wikipedia.org/wiki/DDR_SDRAM
Year  Module Width  Mb/chip  Die size (mm2)  Pins/chip  BWidth (MB/s)  Latency (nsec)
1980  16b           0.06     35              16         13             225
1983  16b           0.25     45              16         40             170
1986  32b           1        70              18         160            125
1993  64b           16       130             20         267            75
1997  64b           64       170             54         640            62
2000  64b           256      204             66         1600           52
In the time that the memory-to-processor bandwidth has more than doubled, memory latency has improved by a factor of only 1.2 to 1.4. To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks.
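The bandwidth-versus-latency claim can be checked directly against the table, by computing the generation-over-generation ratios:

```python
# Generation-over-generation improvement, computed from the table above
# (1980 through 2000).
bandwidth_mb_s = [13, 40, 160, 267, 640, 1600]
latency_ns     = [225, 170, 125, 75, 62, 52]

bw_ratios  = [b2 / b1 for b1, b2 in zip(bandwidth_mb_s, bandwidth_mb_s[1:])]
lat_ratios = [l1 / l2 for l1, l2 in zip(latency_ns, latency_ns[1:])]

print([round(r, 2) for r in bw_ratios])   # bandwidth: roughly 2x-4x per generation
print([round(r, 2) for r in lat_ratios])  # latency: mostly ~1.2x-1.7x per generation
```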
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways
[Figure: on-chip CPU and cache connected by a one-word-wide bus to the DRAM memory]
One-word-wide organization (one-word-wide bus and one-word-wide memory). Assume:
1. 1 memory bus clock cycle to send the address
2. 15 memory bus clock cycles to get the 1st word in the block from DRAM (row cycle time); 5 memory bus clock cycles for the 2nd, 3rd, and 4th words (column access time)
3. 1 memory bus clock cycle to return a word of data
Memory-bus-to-cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per memory bus clock cycle
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- 1 memory bus clock cycle to send the address
- 15 memory bus clock cycles to read the DRAM
- 1 memory bus clock cycle to return the data word
- 17 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is 4/17 = 0.235 bytes per memory bus clock cycle
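A quick check of the arithmetic, under the timing assumptions above (1 cycle to send the address, 15 cycles for the DRAM row access, 1 cycle to return the 4-byte word):

```python
# Miss penalty and bandwidth for the one-word-wide organization,
# using the slide's timing assumptions.
SEND_ADDR = 1       # bus cycles to send the address
ROW_ACCESS = 15     # bus cycles for the DRAM row access
RETURN_WORD = 1     # bus cycles to return one word
BYTES_PER_WORD = 4

miss_penalty = SEND_ADDR + ROW_ACCESS + RETURN_WORD   # 17 bus cycles
bandwidth = BYTES_PER_WORD / miss_penalty             # bytes per bus cycle
print(miss_penalty, round(bandwidth, 3))
```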
What if the block size is four words and each word is in a different DRAM row?
- 1 memory bus clock cycle to send the 1st address
- 4 x 15 = 60 memory bus clock cycles to read the DRAM
- 4 x 1 = 4 memory bus clock cycles to return the last data word
- 65 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/65 = 0.246 bytes per memory bus clock cycle
What if the block size is four words and all words are in the same DRAM row?
- 1 memory bus clock cycle to send the 1st address
- 15 + 3 x 5 = 30 memory bus clock cycles to read the DRAM
- 4 x 1 = 4 memory bus clock cycles to return the last data word
- 35 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/35 = 0.457 bytes per memory bus clock cycle
What about a four-bank interleaved memory organization?
[Figure: on-chip CPU and cache connected over the bus to four DRAM memory banks, bank 0 through bank 3]
- 1 memory bus clock cycle to send the 1st address
- 15 memory bus clock cycles to read the DRAM banks (the four row accesses overlap)
- 4 x 1 = 4 memory bus clock cycles to return the last data word
- 20 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss is (4 x 4)/20 = 0.8 bytes per memory bus clock cycle
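The three four-word cases can be compared with the same back-of-the-envelope arithmetic (a sketch using the slide's timing assumptions: 1 cycle to send an address, 15 cycles per row access, 5 cycles per column access within an open row, 1 cycle per word returned):

```python
# Miss penalty and bandwidth for a 4-word block under the three
# memory organizations, using the slide's timing assumptions.
WORDS, BYTES_PER_WORD = 4, 4

# (a) each word in a different DRAM row: a full row access per word
different_rows = 1 + WORDS * 15 + WORDS * 1         # 65 cycles
# (b) all words in the same row (fast page mode): one row access,
#     then column accesses for the remaining words
same_row = 1 + 15 + (WORDS - 1) * 5 + WORDS * 1     # 35 cycles
# (c) four interleaved banks: row accesses overlap, words return
#     back-to-back on the bus
interleaved = 1 + 15 + WORDS * 1                    # 20 cycles

for penalty in (different_rows, same_row, interleaved):
    print(penalty, round(WORDS * BYTES_PER_WORD / penalty, 2))
```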
Superpipelining
Superscalar: execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made dynamically by the hardware (e.g., Pentium 4)
VLIW: execute multiple instructions at one time, with the decisions on which instructions to execute simultaneously being made statically by the compiler
Hyperthreading
Multicore
Increasing the depth of the pipeline leads to very short clock cycles, so very high clock rates (and more instructions in flight at one time). But the deeper the pipeline:
- the more the pipeline latch overhead matters (i.e., the pipeline latch accounts for a larger and larger percentage of the clock cycle time)
- the bigger the clock skew issues (i.e., because of faster and faster clocks)
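The latch-overhead point can be illustrated with a small model (the delay numbers below are assumed, not from the slides): cycle time = logic delay / stages + latch overhead, so the overhead's share of the cycle grows with pipeline depth.

```python
# Illustrative model of pipeline depth vs. latch overhead.
LOGIC_DELAY_NS = 10.0    # assumed total combinational logic delay
LATCH_OVERHEAD_NS = 0.5  # assumed latch setup + clock-skew overhead per stage

def cycle_time(stages):
    """Logic is split evenly across stages; the latch overhead is not."""
    return LOGIC_DELAY_NS / stages + LATCH_OVERHEAD_NS

for n in (5, 10, 20, 40):
    t = cycle_time(n)
    # second number: fraction of the cycle eaten by latch overhead
    print(n, round(t, 3), round(LATCH_OVERHEAD_NS / t, 2))
```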
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made dynamically by the hardware
Register renaming structures (e.g., the register update unit, RUU) are used to resolve these storage (name) dependencies in superscalar processors
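A minimal renaming sketch (illustrative only, not the actual RUU implementation) shows how giving every destination a fresh physical register removes the WAW and WAR storage dependences while keeping the true RAW dependences:

```python
# Illustrative register-renaming sketch (not the real RUU hardware).
def rename(instructions):
    """instructions: list of (dest, src1, src2) architectural registers."""
    mapping = {}    # architectural register -> current physical register
    next_phys = 0
    renamed = []
    for dest, src1, src2 in instructions:
        # sources read the latest mapping: true (RAW) dependences are kept
        p1 = mapping.get(src1, src1)
        p2 = mapping.get(src2, src2)
        # destination gets a fresh physical register: WAW/WAR removed
        pd = f"p{next_phys}"
        next_phys += 1
        mapping[dest] = pd
        renamed.append((pd, p1, p2))
    return renamed

# r1 is written twice (WAW) and read in between (RAW):
prog = [("r1", "r2", "r3"), ("r4", "r1", "r5"), ("r1", "r6", "r7")]
print(rename(prog))
# -> [('p0', 'r2', 'r3'), ('p1', 'p0', 'r5'), ('p2', 'r6', 'r7')]
```

The second instruction still reads p0 (the true dependence), but the third instruction's write of r1 lands in p2, so it no longer conflicts with either of the first two.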
[Figure: the RUU queue feeding the functional units, with completed instructions retired in order at Commit]
7 functional units: 2 integer ALUs, 1 FP ALU, 1 FP move, 1 load, 1 store, 1 complex. Up to 126 instructions in flight, including 48 loads and 24 stores.
VLIW Processors
Execute multiple instructions at one time with the decisions on which instructions to execute simultaneously being made statically at compile time by the compiler
Issue packet: the set of instructions that are bundled together and issued in one clock cycle; think of it as one large instruction with multiple operations
- The mix of instructions in the packet (bundle) is usually restricted, like a single instruction with several predefined fields
The compiler does static branch prediction and code scheduling to reduce (control) or eliminate (data) hazards
VLIWs have simpler hardware for multiple-instruction issue, since dependence checking and scheduling are done by the compiler
Processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread
The caches, TLBs, BHT, BTB, and RUU can be shared (although the miss rates may increase if they are not sized accordingly). The memory can be shared through virtual memory mechanisms.
Hardware must support efficient thread context switching
Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
[Figure: the 6-stage pipeline: Fetch, Thread Select, Decode, Execute, Memory, WB. Per-thread state is replicated (PC logic x8, instruction buffers x8, register files x8, store buffers x8); the I$/ITLB and D$/DTLB are shared and connect to the crossbar interface]