
www.whizchip.com

Whiz, Wizard - a person with extraordinary skill or accomplishment

Design And Evaluation Of Cache Performance Evaluation System


(To Implement Search Algorithms and Evaluate Simulation Models)

Tanmay Rao M, Reg No 101002014, MS VLSI CAD, tanmayr@whizchip.com

Confidential information of WhizChip Design Technologies (www.whizchip.com). Contains WhizChip's or its customers' proprietary and business-sensitive data.

Contents

- Introduction
- Terms Related to Cache
- Cache States
- Project Specifications
- Need For Search Algorithm
- Proposed Algorithm 1: BST
- Proposed Algorithm 2: Splay Tree
- Class Architecture
- Conclusion and Scope for Future Work
- References

Introduction

- Cache plays a vital role in improving performance by providing data to the requesting master in an SoC within a few clock cycles. Cache reduces the need for frequent access to the main memory, which typically takes 50 to 100 clock cycles.

- The importance of cache in a typical SoC containing several masters is determined by adding cache at different levels.

- Four simulation models are designed to determine the importance of cache in an SoC. The local caching of data introduces the cache coherence problem.

- The cache coherence problem is solved by implementing a cache coherency protocol.

- Search algorithms are used to implement the cache controllers. Two search algorithms are implemented and evaluated on the basis of their performance.

Terms Related to Cache

- Cache Entries
- Cache Performance
- Replacement Policies
- Locality
- Write Policies
- Master Stalls
- Flag Bits
- Cache Miss
- Cache Hierarchy
- Victim Cache

Cache States

- Valid, Invalid: When valid, the cache line is present in the cache; when invalid, the cache line is not present in the cache.
- Unique, Shared: When unique, the cache line exists in only one cache; when shared, the cache line exists in more than one cache.
- Clean, Dirty: When clean, the cache line is unchanged, so there is no need to update the main memory when the line is replaced. When dirty, the cache line has been changed, and the main memory must be updated when the line is replaced.
        | Unique             | Shared
Dirty   | UD (Unique Dirty)  | SD (Shared Dirty)
Clean   | UC (Unique Clean)  | SC (Shared Clean)
Invalid | I (Invalid)        |
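As a rough illustration of these five states, here is a minimal Python sketch (the project itself is written in SystemVerilog; the enum and helper names below are invented for illustration, not taken from the project):

```python
from enum import Enum

class CacheState(Enum):
    """The five stable states from the table above (hypothetical encoding)."""
    UD = "Unique Dirty"    # only copy, modified: must write back on eviction
    SD = "Shared Dirty"    # shared; this cache owns the dirty data
    UC = "Unique Clean"    # only copy, matches main memory
    SC = "Shared Clean"    # shared, matches main memory
    I  = "Invalid"         # line not present in the cache

def is_valid(s):  return s is not CacheState.I
def is_dirty(s):  return s in (CacheState.UD, CacheState.SD)
def is_unique(s): return s in (CacheState.UD, CacheState.UC)

def needs_writeback(s):
    # A dirty line needs a write-back (or a move to L2) when evicted.
    return is_dirty(s)
```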

Project Specifications
Main Memory Controller
- This controller accepts an address on the shared bus.
- It also has a delay to mimic the real-time latency seen in a typical main memory access.
- Typical delays are in the range of 50-100 clock cycles.


L1 Cache Controller
- The cache controller serves the master's read and write requests.
- On a cache miss it fetches the data from the main memory or gets it from another snooped cache.
- The additional responsibility of the cache controller is supporting snooping, which can have conflicting effects on its normal operation.

Project Specifications
L2 Cache Controller
- When a dirty cache line from any of the L1 caches is to be replaced, it is moved to the L2 cache.
- This saves the clock cycles needed to write it back to the main memory.

Search Algorithm
- Two search algorithms have been implemented in this project.

Cache Simulation Models
- Four simulation models were developed. I have evaluated the models using test cases.

Model 1

(Block diagram: four masters M1-M4, each with a private L1 cache, connected by a snoop channel and a main memory channel to the main memory.)

Model 2

(Block diagram: four masters M1-M4, each with a private L1 cache, plus a shared L2 cache; connected by a snoop channel, an L2 cache channel, and a main memory channel to the main memory.)

Model 3

(Block diagram: four masters M1-M4, each with a private L1 cache, connected by a main memory channel to the main memory; no snoop channel.)

Model 4

(Block diagram: four masters M1-M4 with no caches, connected by a main memory channel to the main memory.)

Constraints and Assumptions

Constraints
- The whole model is written in SystemVerilog.
- The memory leaks associated with C++ do not exist in SystemVerilog.

Assumptions
- The delay associated with the main memory is modeled as 100 cycles; it can be varied by the user.
- The replacement algorithm for the cache is not a standard policy.
- The read and write channels for the main memory are separate.
- No particular snooping protocol is used.

Need For Search Algorithm

- The basic requirement is to search for the requested address in the cache. Many algorithms have been devised for efficient search.
- Along with search, we need to add and delete addresses, and to update and replace cache lines with new lines.
- The algorithm should add and delete addresses in a manner that does not drastically affect the search.
- We need a suitable data structure that can store the addresses in an effective way.
- The memory footprint of the data structure should also be optimal, so that it does not consume too many memory resources.

Background Study

Hash Coding
- Hash coding is a process in which a search key is transformed, through the use of a hash function, into an actual address for the associated data. A very simple hash function is the modulus function.

Pseudo CAMs
- Since fully associative memories are difficult and expensive to build relative to normal main memory, a method of building a large random access memory with associative access would be advantageous.
- The pseudo CAM uses a multiple-memory-bank architecture in which a key is hashed to an address that is valid within every bank.

Pre-Computation Technique
- Here extra information is stored along with the tag, derived from the stored bits. For an input tag we first compute the number of ones/zeros and compare it with the stored count.
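The hash-coding idea can be sketched in a few lines of Python (illustrative only; the class and method names are invented, and the modulus hash is the "very simple" function mentioned above):

```python
class HashBank:
    """Minimal sketch of hash coding: a search key (the address) is
    transformed by a hash function into an index for the bucket
    holding the associated data."""

    def __init__(self, num_buckets=8):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The "very simple" modulus hash function.
        return key % self.num_buckets

    def add(self, key, data):
        self.buckets[self._index(key)].append((key, data))

    def search(self, key):
        # Only one bucket is probed, not the whole store.
        for k, d in self.buckets[self._index(key)]:
            if k == key:
                return d
        return None
```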

Proposed Algorithm 1: BST

- The data structure used for storing the addresses is a binary search tree.
- Properties of a binary search tree:
- The left subtree of a node contains only values less than the node's value.
- The right subtree of a node contains only values greater than the node's value.
- Both the left and right subtrees must themselves be binary search trees.
- The capacity of the data structure depends on the number of levels in the binary search tree: as the number of levels increases, the number of elements increases. A binary search tree with n levels holds at most 2^n - 1 elements.

Search Operation

(Flowchart)
1. Is the root node equal to the address? If yes, the search is successful.
2. If not, is the address equal to the left node? If yes, the search is successful.
3. If not, is the address equal to the right node? If yes, the search is successful; otherwise descend and repeat.

Add Operation

(Flowchart)
1. Is the root empty? If yes, add the address to the tree as the root.
2. If the address is less than the current node: is the left child empty? If yes, add the address there; otherwise descend to the left child and repeat.
3. If the address is greater than the current node: is the right child empty? If yes, add the address there; otherwise descend to the right child and repeat.

Delete Operation

(Flowchart)
- Is the address to be deleted the root node? If yes, find the in-order successor of the root node, replace the root node with it, and delete the duplicate entry.
- Is the address to be deleted an intermediate node? If yes, find the in-order successor of the intermediate node and replace the node with it.
- Is the address to be deleted a leaf node? If yes, delete the leaf node.
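The in-order-successor deletion described above can be sketched as follows (Python for illustration; the project's version is in SystemVerilog, and the `Node` class here is a minimal stand-in):

```python
class Node:
    """Minimal BST node for the sketch."""
    def __init__(self, addr):
        self.addr = addr
        self.left = None
        self.right = None

def delete(root, addr):
    """Remove `addr` from the BST, using the in-order successor
    for nodes that have two children; returns the (new) root."""
    if root is None:
        return None
    if addr < root.addr:
        root.left = delete(root.left, addr)
    elif addr > root.addr:
        root.right = delete(root.right, addr)
    else:
        # Leaf or single-child node: splice it out.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Two children: copy in the in-order successor (leftmost node
        # of the right subtree), then delete the duplicate entry below.
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.addr = succ.addr
        root.right = delete(root.right, succ.addr)
    return root
```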

Update Operation

(Flowchart)
- If the old cache line's data differs from the new cache line's data, update the data in the cache line.
- If the old cache line's dirty bit differs from the new cache line's dirty bit, update the dirty bit in the cache line.
- If the old cache line's shared bit differs from the new cache line's shared bit, update the shared bit in the cache line.

Replace Operation

(Flowchart)
- Is the cache line shared? If yes, replace it with the new cache line.
- If not, is the cache line clean? If yes, replace it with the new cache line.
- Otherwise, the last added node is replaced with the new cache line.
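The replacement priority above (shared lines first, then clean lines, else the last added line) can be sketched as follows (Python for illustration; the list-of-dicts representation and the assumption that the last element is the most recently added line are mine, not the project's):

```python
def choose_victim(lines):
    """Pick a cache line to evict, following the slide's priority.
    `lines` is a non-empty list of dicts with 'shared'/'dirty' flags;
    the last element is assumed to be the most recently added line."""
    for line in lines:
        if line["shared"]:
            return line            # shared lines are replaced first
    for line in lines:
        if not line["dirty"]:
            return line            # then clean (not dirty) lines
    return lines[-1]               # else evict the last added line
```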

Proposed Algorithm 2: Splay Tree

- A modified version of the binary search tree, called the splay tree, is used to implement the data structure.
- A splay tree is a self-balancing binary search tree. A balanced binary search tree has uniform height in both subtrees.
- In addition to the self-balancing property, the splay tree has the property that whenever a new address is added, it is brought to the root node.
- This process of bringing the added address to the root is called splaying.
- So in a splay tree the time required to access the most recently used addresses is very low, as they are near the root.

Splaying

The three types of splay steps, for a node x with parent p and grandparent g, are:
- Zig step: performed when p is the root. The tree is rotated on the edge between x and p.
- Zig-zig step: performed when p is not the root and x and p are either both right children or both left children. The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p.
- Zig-zag step: performed when p is not the root and x is a right child while p is a left child, or vice versa. The tree is rotated on the edge between x and p, then rotated on the edge between x and its new parent g.
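The three steps can be sketched with the classic recursive splay (Python for illustration; the project's version is in SystemVerilog, and this compact formulation is one common way to implement splaying, not necessarily the project's):

```python
class Node:
    def __init__(self, addr):
        self.addr = addr
        self.left = None
        self.right = None

def rotate_right(y):
    # Lift y's left child above y (one "zig" rotation).
    x = y.left
    y.left = x.right
    x.right = y
    return x

def rotate_left(x):
    # Mirror image: lift x's right child above x.
    y = x.right
    x.right = y.left
    y.left = x
    return y

def splay(root, addr):
    """Bring the node holding `addr` (or the last node touched while
    searching for it) to the root via zig / zig-zig / zig-zag steps."""
    if root is None or root.addr == addr:
        return root
    if addr < root.addr:                      # target is in the left subtree
        if root.left is None:
            return root
        if addr < root.left.addr:             # zig-zig: rotate grandparent first
            root.left.left = splay(root.left.left, addr)
            root = rotate_right(root)
        elif addr > root.left.addr:           # zig-zag: rotate child first
            root.left.right = splay(root.left.right, addr)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)   # final zig
    else:                                     # mirror case: right subtree
        if root.right is None:
            return root
        if addr > root.right.addr:            # zig-zig
            root.right.right = splay(root.right.right, addr)
            root = rotate_left(root)
        elif addr < root.right.addr:          # zig-zag
            root.right.left = splay(root.right.left, addr)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)
```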

(Diagrams: the zig-zig and zig-zag steps rotating x above p and g, with subtrees A, B, C, and D reattached.)

Search Operation

- This is the most important operation. The given address is compared with the root p, its left child, and its right child.
- The next stage of comparisons has four possibilities:
- If the address being searched is less than p and less than the left child, then the address, if it exists, is in the subtree represented by A.
- If the address is less than p and greater than the left child, it is in the subtree represented by B.
- If the address is greater than p and less than the right child, it is in the subtree represented by C.
- If the address is greater than p and greater than the right child, it is in the subtree represented by D.

(Diagram for the search operation: root p with left child LC and right child RC; subtrees A and B under LC, C and D under RC.)

Add Operation

(Flowchart) Starting with the current node set to the root, compare the address with the current node and its children:
- If Addr > current and Addr > right child: if the right child's right slot is empty, add the address there; otherwise make the right child the current node and repeat.
- If Addr > current and Addr < right child: if the right child's left slot is empty, add the address there; otherwise make the right child the current node and repeat.
- If Addr < current and Addr > left child: if the left child's right slot is empty, add the address there; otherwise make the left child the current node and repeat.
- If Addr < current and Addr < left child: if the left child's left slot is empty, add the address there; otherwise make the left child the current node and repeat.
- After the address is added, splay the tree if required.

Delete Operation

(Flowchart)
- Is the address to be deleted the root node? If yes, find the in-order successor of the root node, replace the root node with it, and delete the duplicate entry.
- Is the address to be deleted an intermediate node? If yes, find the in-order successor of the intermediate node and replace the node with it.
- Is the address to be deleted a leaf node? If yes, delete the leaf node.
- Splay the tree if required.

Cache Line

Field  | Description
Key    | Stores the address. The address is matched in the search operation. It is the unique part of the cache line which distinguishes it from other cache lines.
Data   | Stores the data associated with the particular address. This may be consistent with the main memory or may be provided by the master.
Shared | Flag bit. When set, indicates that the cache line is shared among other masters.
Dirty  | Flag bit. When set, indicates that the data in the cache line is dirty; when the cache line is evicted, the data must be written back to the main memory.
Valid  | Flag bit. When set, indicates that the data in the cache line is valid.

Data Structure Class

- The add task is used to add a cache line to the data structure.
- The delete operation for the binary search tree is also implemented. Temporary nodes are used to find the in-order successor or the in-order predecessor; in this implementation the in-order successor method is used.
- The update operation is common to both algorithms.
- The splay task is implemented only for the second algorithm. In this implementation we splay the data structure when 3, 5, 7, 9, and 15 elements have been added.

Splay For 3 Element Tree

(Diagram: splaying a 3-element tree containing addresses 8, 16, and 24.)

Splay For 5 Element Tree

(Diagram: splaying a 5-element tree containing addresses 16, 24, 32, and 40.)

Algorithm Class

Binary Search Tree
- The algorithm class takes the data structure object as a parameter.
- This means that even if the data structure changes, we need not change the algorithm.

Splay Tree
- In the splay tree implementation the whole data structure is divided into eight binary trees.
- A hash function selects the bank where each address is stored.
- The pipelined scheme saves cycles compared with a non-pipelined algorithm, in which the add and delete stages would sit idle; here search, add, and delete all work in parallel, saving clock cycles.
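The eight-bank scheme can be sketched as follows (Python for illustration; the modulus hash and the use of plain sets in place of splay trees are assumptions made to keep the banking logic visible -- the slides only say that a hash function selects the bank):

```python
class BankedStore:
    """Sketch of the eight-bank splay-tree store."""
    NUM_BANKS = 8

    def __init__(self):
        # Each bank would hold one splay tree; plain sets stand in here.
        self.banks = [set() for _ in range(self.NUM_BANKS)]

    def _bank(self, addr):
        # Hash the address to one of the eight banks (assumed modulus hash).
        return addr % self.NUM_BANKS

    def add(self, addr):
        self.banks[self._bank(addr)].add(addr)

    def search(self, addr):
        # Only one of the eight trees is ever probed per request.
        return addr in self.banks[self._bank(addr)]
```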

Main Memory Class

- Another important component in the model is the main memory controller.
- The read task accepts an address and returns the data, with a data-valid signal, after 20 cycles.
- Similarly, the write task accepts the address and the data to be written to the main memory.
- Another task initializes the locations of the main memory for simulation purposes.

Test Cases for Evaluation of Model 1

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits (4 Masters) | Snoop Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 0 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 0 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 0 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0 | 0
Local 0.25, Snoop 0.25 | 10.97 | 10.58 | 32 | 32 | 64
Local 0.5, Snoop 0.25 | 6.19 | 5.67 | 64 | 32 | 32
Local 0.5, Snoop 0.5 | 2.5 | 1.09 | 64 | 64 | 0
Local 0.25, Snoop 0.75 | 2.64 | 1.61 | 32 | 96 | 0
Local 0.75, Snoop 0.25 | 1.23 | 0.99 | 96 | 32 | 0
Snoop Hit Rate 1 | 4.31 | 2.24 | 0 | 128 | 0
Local 0.25, Snoop 0.5 | 6.02 | 6.13 | 32 | 64 | 32

Test Cases for Evaluation of Model 2

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 0 | 0 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 0 | 0 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 0 | 0 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0 | 0 | 0
Local 0.25, Victim 0.25 | 10.80 | 11.13 | 32 | 0 | 32 | 64
Local 0.25, Snoop 0.25, Victim 0.25 | 6.67 | 6.66 | 32 | 32 | 32 | 32
Local 0.25, Snoop 0.25 | 11.09 | 10.66 | 32 | 32 | 0 | 64
Local 0.5, Snoop 0.25 | 6.29 | 5.75 | 64 | 32 | 0 | 32
Local 0.5, Snoop 0.5 | 2.65 | 1.16 | 64 | 64 | 0 | 0

Test Cases for Evaluation of Model 2 (continued)

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses
Local 0.25, Snoop 0.75 | 3.08 | 1.63 | 32 | 96 | 0 | 0
Local 0.75, Snoop 0.25 | 1.41 | 1.03 | 96 | 32 | 0 | 0
Snoop Hit Rate 1 | 5.14 | 2.29 | 0 | 128 | 0 | 0
Local 0.25, Snoop 0.5 | 6.22 | 6.15 | 32 | 64 | 0 | 32
Snoop 0.25, Victim 0.25 | 11.40 | 11.77 | 0 | 32 | 32 | 64
Victim Hit Rate 0.25 | 15.70 | 16.01 | 0 | 0 | 32 | 96

Test Cases for Evaluation of Model 3

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0

Test Cases for Evaluation of Model 4

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 20 | 20 | 96
Local Hit Rate 0.5 | 20 | 20 | 64
Local Hit Rate 0.75 | 20 | 20 | 32
Local Hit Rate 1 | 20 | 20 | 0

Graph for 4 Models

(Graph: clock cycles, from 1 to 20, versus input address for the four cache models; markers distinguish local cache hits, snoop cache hits, victim cache hits, and main memory accesses.)

Conclusion and Scope for Future Work

- Various search algorithms were studied for the implementation of the cache controller.
- Two search algorithms were implemented in SystemVerilog. The algorithms are used by the cache models developed.
- The model can be enhanced by incorporating more search algorithms; users may plug in their own search algorithm.
- We can also use different replacement policies for the cache controller. The cache architecture itself can be of different types, such as direct mapped or set associative.

References

1) Hennessy, John L. and David A. Patterson, Computer Architecture: A Quantitative Approach.
2) Kim, Changkyu, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. et al., "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs", SIGMOD '10.
3) Shaffer, John H., "Designing Very Large Content-Addressable Memories", University of Pennsylvania.
4) Allan, Stephen J., "Splay Tree".
5) AMBA AXI and ACE Protocol Specification, ARM.
6) SystemVerilog 3.1a Language Reference Manual.

Thank You
