EE482: Processor Architecture, Lecture #7
Stanford University, Tuesday, June 6, 2000
The research papers discussed prior to this session dealt with problems related
to instruction fetch. Starting from this session, the focus of discussion shifted
to another very important part of computer organization: memory access and the
latency of memory components.
1 Introduction
There has been a remarkable increase in the speed at which processors execute
instructions. But this speed depends greatly on how quickly memory components
can supply instructions and data to the processor. Unfortunately, the speed of
memory components, which has been improving at a rate of around 25% per year,
has not kept pace with processor performance, whose rate of improvement has
been 60% per year; hence the memory access penalty grows considerably with each
enhancement in processor architecture.
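Taking the quoted rates at face value, the processor/memory speed gap grows by
a factor of 1.60/1.25 = 1.28 per year, i.e., it roughly doubles every three
years (1.28^3 ≈ 2.1).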
In this session we discussed the following three papers which address the issue of
memory latency:
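1. N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a
Small Fully-Associative Cache and Prefetch Buffers" [1]
2. A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case
for Processor/Memory Integration" [2]
3. S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, "Memory Access
Scheduling" [3]

2 Improving Direct-Mapped Cache Performance [1]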
The baseline system assumed in the paper has a 1000 MIPS peak instruction issue
rate, separate 4KB L1 instruction and data caches with 16B lines, and a unified
1MB L2 cache with 128B lines. The miss penalties are assumed to be 24
instruction times for the L1 cache and 320 instruction times for the L2 cache.
The author also mentions that the off-chip cycle time is 3 to 8 times longer
than the instruction cycle time.
As the title suggests, all of the proposed enhancements are meant to improve
the performance of direct-mapped caches. Direct-mapped caches are the focus
because of their faster access compared to set-associative caches, in which, in
addition to the line access, time is spent selecting the correct way within the
set.
During the discussion, a reference was also made to an important paper by
Harold Stone [4]. This paper presents a model for estimating the influence of
cache size on reload transients. According to the authors, a cache can be
visualized as an array of sets, with each set holding a number of elements
equal to the number of lines that can be placed at that particular location in
the cache.
During a memory reference, a conflict miss occurs when the reference maps to a
cache line already occupied by a different memory line. This results in
replacement of the resident line. If a reference is then made to the line that
has just been evicted, the line must be brought back from the L2 cache or main
memory, incurring a penalty. According to the analysis of the data presented by
the author, on average 39% of L1 D-cache misses and 29% of L1 I-cache misses
are conflict misses.
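To make the mapping concrete, the following is a minimal C++ sketch of a
direct-mapped lookup (parameters chosen to match the 4KB, 16B-line baseline;
all names are ours, not the paper's). Two addresses 4KB apart share the same
index bits but have different tags, so they evict each other:

    #include <cstdint>
    #include <cstdio>

    // Illustrative parameters matching the baseline: 4KB cache, 16B lines.
    constexpr uint32_t kLineBytes = 16;                 // 4 offset bits
    constexpr uint32_t kNumLines  = 4096 / kLineBytes;  // 256 lines, 8 index bits

    struct Line { bool valid = false; uint32_t tag = 0; };
    Line cache[kNumLines];

    // Direct-mapped lookup: the index bits pick exactly one line, so two
    // addresses with equal index but different tags replace each other.
    bool access(uint32_t addr) {
        uint32_t index = (addr / kLineBytes) % kNumLines;
        uint32_t tag   = (addr / kLineBytes) / kNumLines;
        Line& l = cache[index];
        if (l.valid && l.tag == tag) return true;   // hit
        l.valid = true;                             // miss: evict resident line
        l.tag = tag;
        return false;
    }

    int main() {
        // 0x0000 and 0x1000 are 4KB apart: same index, different tags.
        access(0x0000);                             // compulsory miss
        access(0x1000);                             // conflict miss, evicts 0x0000
        printf("%s\n", access(0x0000) ? "hit" : "conflict miss again");
    }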
Jouppi proposed three different enhancements to direct-mapped caches to reduce
these conflict misses. The first is the Miss Cache: a small, fully-associative
cache placed between the L1 cache and the next level of the memory hierarchy;
on an L1 miss, the fetched line is placed in both the L1 cache and the Miss
Cache. Two limitations were noted:
1. The Miss Cache is not very effective if the data brought in after a miss is
larger than its capacity.
2. The technique introduces redundancy, because the same data is placed in both
the L1 cache and the Miss Cache. (The Victim Cache variant proposed in the same
paper, which stores the evicted line instead of the fetched line, removes this
redundancy.)
Analyzing the performance of the system with a miss cache added to the baseline
configuration, it is clear that the miss cache removes a large number of
conflict misses even with only 2 entries. The major source of conflict misses
in the I-cache is procedure calls, which are not very frequent. For the 4KB
D-cache, a miss cache of 2 entries can remove 25% of conflict misses on
average, or 13% of all misses. If the number of entries is increased to 4, 36%
of conflict misses can be removed, or 18% of all misses. Increasing the number
of entries further does not provide any significant improvement.
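To make the mechanism concrete, here is a minimal C++ sketch of a 2-entry,
fully-associative miss cache (structure and names are illustrative assumptions,
not the paper's simulator). On an L1 miss the miss cache is probed first; only
if it also misses is the L2 penalty paid, and the fetched line then fills both
structures, which is exactly the redundancy noted above:

    #include <cstdint>
    #include <deque>

    // Illustrative 2-entry, fully-associative miss cache with LRU replacement.
    // Tags identify whole lines; data storage is omitted for brevity.
    struct MissCache {
        std::deque<uint32_t> lines;              // front = most recently used
        static constexpr size_t kEntries = 2;

        bool probe(uint32_t lineAddr) {
            for (size_t i = 0; i < lines.size(); ++i) {
                if (lines[i] == lineAddr) {      // hit: refresh LRU order
                    lines.erase(lines.begin() + i);
                    lines.push_front(lineAddr);
                    return true;
                }
            }
            return false;
        }
        void insert(uint32_t lineAddr) {         // filled on every L1 miss
            if (lines.size() == kEntries) lines.pop_back();
            lines.push_front(lineAddr);
        }
    };

    // On an L1 miss, a miss-cache hit services the conflict miss without an
    // L2 access; otherwise the line is fetched from L2 and placed in both
    // L1 and the miss cache (duplicating data, drawback 2 above).
    bool serviceL1Miss(MissCache& mc, uint32_t lineAddr) {
        if (mc.probe(lineAddr)) return true;     // caught, small penalty
        mc.insert(lineAddr);                     // pay the L2 penalty, fill both
        return false;
    }

To reduce the misses that small associative buffers cannot catch, the paper
also evaluates hardware prefetching. Three policies were discussed: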
• Prefetch always - brings in the next line whenever a memory access is made.
This is not practical due to bandwidth limitations.
• Prefetch on miss - counts on the sequential nature of accesses and brings in
2 lines instead of 1 on a miss. This technique performs quite well on
instructions and reasonably well on data.
• Tagged prefetch - brings in the next line whenever a line in the L1 cache is
accessed for the first time (a sketch follows this list). This technique is not
very effective here because the assumed memory access latency is quite large.
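The tagged policy can be sketched as follows (C++; the direct-mapped structures
and parameters are illustrative assumptions, not the paper's implementation).
Each line carries a bit that arrives clear when the line is prefetched and is
set on its first demand access, which is also the event that triggers
prefetching the next sequential line:

    #include <cstdint>

    // Tagged prefetch sketch: each line carries a "referenced" bit. A
    // prefetched line arrives with the bit clear; the first demand hit on it
    // sets the bit and triggers a prefetch of the next sequential line.
    struct TaggedLine { bool valid = false; bool referenced = false; uint32_t tag = 0; };

    constexpr uint32_t kLineBytes = 16, kNumLines = 256;
    TaggedLine cache[kNumLines];

    void fill(uint32_t lineAddr, bool demand);   // fetch a line (defined below)

    void access(uint32_t addr) {
        uint32_t lineAddr = addr / kLineBytes;
        TaggedLine& l = cache[lineAddr % kNumLines];
        bool hit = l.valid && l.tag == lineAddr / kNumLines;
        if (!hit) fill(lineAddr, /*demand=*/true);
        else if (!l.referenced) {                // first touch of a prefetched line
            l.referenced = true;
            fill(lineAddr + 1, /*demand=*/false);// prefetch next sequential line
        }
    }

    void fill(uint32_t lineAddr, bool demand) {
        TaggedLine& l = cache[lineAddr % kNumLines];
        l.valid = true;
        l.tag = lineAddr / kNumLines;
        l.referenced = demand;                   // demand fetches start "referenced"
        if (demand) fill(lineAddr + 1, false);   // a demand miss also prefetches next
    }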
The third enhancement is the Stream Buffer, a small FIFO of prefetched lines
placed between the L1 cache and the next level of memory. When a miss occurs,
the Stream Buffer begins prefetching successive lines starting at the miss
target. The missed line is placed in the L1 cache, while the lines following it
are placed in the Stream Buffer; by doing this, cache pollution is avoided.
Stream Buffers are also used to fetch from the L2 cache, in order to hide the
L2 access latency.
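A minimal sketch of a single stream buffer (C++; the FIFO depth of 4 and the
interfaces are assumptions for illustration): on an L1 miss the buffer is
checked at its head; a head hit supplies the line without an L2 access and
streams one more line in behind it, while a complete miss causes the caller to
flush and re-allocate the buffer at the new miss address.

    #include <cstdint>
    #include <deque>

    // A single stream buffer: a small FIFO of prefetched line addresses.
    // Only the head entry is compared, as in the basic design.
    struct StreamBuffer {
        std::deque<uint32_t> fifo;        // line addresses, head = oldest
        uint32_t next = 0;                // next line to prefetch
        static constexpr size_t kDepth = 4;

        void allocate(uint32_t missLine) {       // (re)start on an L1 miss
            fifo.clear();
            next = missLine + 1;                 // the missed line goes to L1 itself
            refill();
        }
        void refill() {                          // keep the FIFO full
            while (fifo.size() < kDepth) fifo.push_back(next++);
        }
        // A head hit supplies the line without an L2 access.
        bool headHit(uint32_t line) {
            if (!fifo.empty() && fifo.front() == line) {
                fifo.pop_front();
                refill();                        // stream one more line behind it
                return true;
            }
            return false;
        }
    };

Because only the head is compared, a reference that is not the next line of the
current stream forces a flush; this is what makes interleaved streams
problematic, as described next.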
A limitation of stream buffers is encountered when accesses interleave multiple
streams, e.g., adding the elements of one array to those of another, element by
element. In this case the stream buffer is flushed frequently, so the
performance improvement is small. Multi-way stream buffers have been proposed
as a solution to this problem: they store cache lines belonging to several
different streams and can greatly reduce it. Their usefulness can be seen from
the fact that a one-way stream buffer removed 25% of the data cache misses on
the simulated benchmarks, while multi-way stream buffers removed 43% of the
misses, nearly a twofold improvement.
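A multi-way version can reuse the StreamBuffer sketch above: keep several
FIFOs, check all heads in parallel, and on a complete miss re-allocate only one
of them (an LRU choice here, consistent with the paper's four-way design), so
interleaved streams can coexist. A sketch:

    #include <array>
    #include <cstdint>

    // Four-way stream buffer built from the StreamBuffer sketch above.
    // A complete miss re-allocates one buffer instead of destroying the
    // only stream, so several interleaved streams survive.
    struct MultiWayStreamBuffer {
        std::array<StreamBuffer, 4> ways;
        std::array<int, 4> age{};           // larger = allocated longer ago

        bool lookup(uint32_t line) {
            for (auto& w : ways)
                if (w.headHit(line)) return true;
            int victim = 0;                 // replace the oldest way
            for (int i = 1; i < 4; ++i)
                if (age[i] > age[victim]) victim = i;
            for (auto& a : age) ++a;
            age[victim] = 0;
            ways[victim].allocate(line);
            return false;
        }
    };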
This paper proposed very practical techniques for improving memory system
performance, and they are used successfully in current microprocessors.
In the memory access scheduling scheme of Rixner et al. [3], the memory
controller issues the oldest pending reference that does not violate the timing
and resource limitations of the DRAM. The benefit of this policy is that
accesses targeting other banks can be made while a precharge or activate
operation completes for the oldest pending reference. The sustained bandwidth
increases by 79% for the microbenchmarks, 17% for the applications, and 40% for
the application traces, over the in-order scheme.
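To illustrate the policy, here is a simplified C++ sketch of oldest-ready-first
selection (our simplification: a reference is considered ready only when its
bank's row is already open and its timing constraint is met; the real scheduler
also schedules the precharge and activate operations themselves):

    #include <cstdint>
    #include <vector>

    // Sketch of "oldest ready first" DRAM scheduling: issue the oldest pending
    // reference that does not violate bank timing constraints.
    struct Request { uint32_t bank; uint32_t row; uint64_t arrival; };
    struct Bank { bool rowOpen = false; uint32_t openRow = 0; uint64_t readyAt = 0; };

    // Returns the index of the request to issue this cycle, or -1 if none is ready.
    int schedule(const std::vector<Request>& pending,
                 const std::vector<Bank>& banks, uint64_t now) {
        int best = -1;
        for (int i = 0; i < (int)pending.size(); ++i) {
            const Request& r = pending[i];
            const Bank& b = banks[r.bank];
            bool ready = now >= b.readyAt &&             // timing constraint met
                         b.rowOpen && b.openRow == r.row; // row already activated
            if (!ready) continue;                        // needs precharge/activate
            if (best < 0 || r.arrival < pending[best].arrival) best = i;
        }
        return best;  // younger refs to other banks bypass a waiting older one
    }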
References
[1] N. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a
Small Fully-Associative Cache and Prefetch Buffers", in Proceedings of the 17th
International Symposium on Computer Architecture, June 1990.
[2] A. Saulsbury, F. Pong, and A. Nowatzyk, "Missing the Memory Wall: The Case
for Processor/Memory Integration", in Proceedings of the 23rd International
Symposium on Computer Architecture, May 1996.
[3] S. Rixner, W. Dally, U. Kapasi, P. Mattson, and J. Owens, "Memory Access
Scheduling", in Proceedings of the 27th International Symposium on Computer
Architecture, June 2000.
[4] D. Thiebaut and H. S. Stone, "Footprints in the Cache", in Proceedings of
the Joint Conference on Computer Performance Modelling, Measurement and
Evaluation, 1986.