
www.whizchip.com

Whiz, Wizard - a person with extraordinary skill or accomplishment

Design And Evaluation Of Cache Performance Evaluation System


(To Implement Search Algorithms and Evaluate Simulation Models)

Tanmay Rao M, Reg No 101002014, MS VLSI CAD, tanmayr@whizchip.com

Confidential information of WhizChip Design Technologies (www.whizchip.com). Contains WhizChip's or its customers' proprietary and business-sensitive data.

Contents

- Introduction
- Terms Related to Cache
- Cache States
- Project Specifications
- Need For Search Algorithm
- Proposed Algorithm 1: BST
- Proposed Algorithm 2: Splay Tree
- Class Architecture
- Conclusion and Scope for Future Work
- References

Introduction

- Cache plays a vital role in improving performance by providing data to the requesting master in an SoC within a few clock cycles. Cache reduces the need for frequent access to the main memory, which typically takes 50 to 100 clock cycles.

- The importance of cache in a typical SoC containing several masters is determined by adding cache at different levels.

- Four simulation models are designed to determine the importance of cache in an SoC. The local caching of data introduces the cache coherence problem.

- The cache coherence problem is solved by implementing a cache coherency protocol.

- Search algorithms are used to implement the cache controllers. Two search algorithms are implemented and evaluated on the basis of their performance.

Terms Related to Cache

- Cache Entries
- Cache Performance
- Replacement Policies
- Locality
- Write Policies
- Master Stalls
- Flag Bits
- Cache Miss
- Cache Hierarchy
- Victim Cache

Cache States

- Valid, Invalid: When valid, the cache line is present in the cache; when invalid, the cache line is not present in the cache.
- Unique, Shared: When unique, the cache line exists in only one cache; when shared, the cache line exists in more than one cache.
- Clean, Dirty: When clean, the cache line is unchanged, so there is no need to update the main memory when the line is replaced. When dirty, the cache line has been changed, and the main memory must be updated when the line is replaced.
        | Unique             | Shared
Dirty   | UD (Unique Dirty)  | SD (Shared Dirty)
Clean   | UC (Unique Clean)  | SC (Shared Clean)
Invalid | I (Invalid)        |
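As a rough illustration of these five states, here is a minimal Python sketch (the project itself is written in SystemVerilog; the enum and helper names below are invented for illustration, not taken from the project):

```python
from enum import Enum

class CacheState(Enum):
    """The five stable states from the table above (hypothetical encoding)."""
    UD = "Unique Dirty"    # only copy, modified: must write back on eviction
    SD = "Shared Dirty"    # shared; this cache owns the dirty data
    UC = "Unique Clean"    # only copy, matches main memory
    SC = "Shared Clean"    # shared, matches main memory
    I  = "Invalid"         # line not present in the cache

def is_valid(s):  return s is not CacheState.I
def is_dirty(s):  return s in (CacheState.UD, CacheState.SD)
def is_unique(s): return s in (CacheState.UD, CacheState.UC)

def needs_writeback(s):
    # A dirty line needs a write-back (or a move to L2) when evicted.
    return is_dirty(s)
```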

Project Specifications
Main Memory Controller
- This controller accepts an address on the shared bus.
- It also has a delay to mimic the real-time latency seen in a typical main memory access.
- Typical delays are in the range of 50-100 clock cycles.


L1 Cache Controller
- The cache controller serves the master's read and write requests.
- On a cache miss it fetches the data from the main memory or gets it from another snooped cache.
- The additional responsibility of the cache controller is supporting snooping, which can have conflicting effects on its normal operation.

Project Specifications
L2 Cache Controller
- When a dirty cache line from any of the L1 caches is to be replaced, it is moved to the L2 cache.
- This saves the clock cycles needed to write it back to the main memory.

Search Algorithm
- Two search algorithms have been implemented in this project.

Cache Simulation Models
- Four simulation models were developed. I have evaluated the models using test cases.

Model 1

(Block diagram: four masters M1-M4, each with a private L1 cache, connected by a snoop channel and a main memory channel to the main memory.)

Model 2

(Block diagram: four masters M1-M4, each with a private L1 cache, plus a shared L2 cache; connected by a snoop channel, an L2 cache channel, and a main memory channel to the main memory.)

Model 3

(Block diagram: four masters M1-M4, each with a private L1 cache, connected by a main memory channel to the main memory; no snoop channel.)

Model 4

(Block diagram: four masters M1-M4 with no caches, connected by a main memory channel to the main memory.)

Constraints and Assumptions

Constraints
- The whole model is written in SystemVerilog.
- The memory leaks associated with C++ do not exist in SystemVerilog.

Assumptions
- The delay associated with the main memory is modeled as 100 cycles; it can be varied by the user.
- The replacement algorithm for the cache is not a standard policy.
- The read and write channels for the main memory are separate.
- No particular snooping protocol is used.

Need For Search Algorithm

- The basic requirement is to search for the requested address in the cache. Many algorithms have been devised for efficient search.
- Along with search, we need to add and delete addresses, and to update and replace cache lines with new lines.
- The algorithm should add and delete addresses in a manner that does not drastically affect the search.
- We need a suitable data structure that can store the addresses in an effective way.
- The memory footprint of the data structure should also be optimal, so that it does not consume too many memory resources.

Background Study

Hash Coding
- Hash coding is a process in which a search key is transformed, through the use of a hash function, into an actual address for the associated data. A very simple hash function is the modulus function.

Pseudo CAMs
- Since fully associative memories are difficult and expensive to build relative to normal main memory, a method of building a large random access memory with associative access would be advantageous.
- The pseudo CAM uses a multiple-memory-bank architecture in which a key is hashed to an address that is valid within every bank.

Pre-Computation Technique
- Here extra information is stored along with the tag, derived from the stored bits. For an input tag we first compute the number of ones/zeros and compare it with the stored count.
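The hash-coding idea can be sketched in a few lines of Python (illustrative only; the class and method names are invented, and the modulus hash is the "very simple" function mentioned above):

```python
class HashBank:
    """Minimal sketch of hash coding: a search key (the address) is
    transformed by a hash function into an index for the bucket
    holding the associated data."""

    def __init__(self, num_buckets=8):
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key):
        # The "very simple" modulus hash function.
        return key % self.num_buckets

    def add(self, key, data):
        self.buckets[self._index(key)].append((key, data))

    def search(self, key):
        # Only one bucket is probed, not the whole store.
        for k, d in self.buckets[self._index(key)]:
            if k == key:
                return d
        return None
```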

Proposed Algorithm 1: BST

- The data structure used for storing the addresses is a binary search tree.
- Properties of a binary search tree:
- The left subtree of a node contains only values less than the node's value.
- The right subtree of a node contains only values greater than the node's value.
- Both the left and right subtrees must themselves be binary search trees.
- The capacity of the data structure depends on the number of levels in the binary search tree: as the number of levels increases, the number of elements increases. A binary search tree with n levels holds at most 2^n - 1 elements.

Search Operation

(Flowchart)
1. Is the root node equal to the address? If yes, the search is successful.
2. If not, is the address equal to the left node? If yes, the search is successful.
3. If not, is the address equal to the right node? If yes, the search is successful; otherwise descend and repeat.

Add Operation

(Flowchart)
1. Is the root empty? If yes, add the address to the tree as the root.
2. If the address is less than the current node: is the left child empty? If yes, add the address there; otherwise descend to the left child and repeat.
3. If the address is greater than the current node: is the right child empty? If yes, add the address there; otherwise descend to the right child and repeat.

Delete Operation

(Flowchart)
- Is the address to be deleted the root node? If yes, find the in-order successor of the root node, replace the root node with it, and delete the duplicate entry.
- Is the address to be deleted an intermediate node? If yes, find the in-order successor of the intermediate node and replace the node with it.
- Is the address to be deleted a leaf node? If yes, delete the leaf node.
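The in-order-successor deletion described above can be sketched as follows (Python for illustration; the project's version is in SystemVerilog, and the `Node` class here is a minimal stand-in):

```python
class Node:
    """Minimal BST node for the sketch."""
    def __init__(self, addr):
        self.addr = addr
        self.left = None
        self.right = None

def delete(root, addr):
    """Remove `addr` from the BST, using the in-order successor
    for nodes that have two children; returns the (new) root."""
    if root is None:
        return None
    if addr < root.addr:
        root.left = delete(root.left, addr)
    elif addr > root.addr:
        root.right = delete(root.right, addr)
    else:
        # Leaf or single-child node: splice it out.
        if root.left is None:
            return root.right
        if root.right is None:
            return root.left
        # Two children: copy in the in-order successor (leftmost node
        # of the right subtree), then delete the duplicate entry below.
        succ = root.right
        while succ.left is not None:
            succ = succ.left
        root.addr = succ.addr
        root.right = delete(root.right, succ.addr)
    return root
```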

Update Operation

(Flowchart)
- If the old cache line's data differs from the new cache line's data, update the data in the cache line.
- If the old cache line's dirty bit differs from the new cache line's dirty bit, update the dirty bit in the cache line.
- If the old cache line's shared bit differs from the new cache line's shared bit, update the shared bit in the cache line.

Replace Operation

(Flowchart)
- Is the cache line shared? If yes, replace it with the new cache line.
- If not, is the cache line clean? If yes, replace it with the new cache line.
- Otherwise, the last added node is replaced with the new cache line.
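The replacement priority above (shared lines first, then clean lines, else the last added line) can be sketched as follows (Python for illustration; the list-of-dicts representation and the assumption that the last element is the most recently added line are mine, not the project's):

```python
def choose_victim(lines):
    """Pick a cache line to evict, following the slide's priority.
    `lines` is a non-empty list of dicts with 'shared'/'dirty' flags;
    the last element is assumed to be the most recently added line."""
    for line in lines:
        if line["shared"]:
            return line            # shared lines are replaced first
    for line in lines:
        if not line["dirty"]:
            return line            # then clean (not dirty) lines
    return lines[-1]               # else evict the last added line
```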

Proposed Algorithm 2: Splay Tree

- A modified version of the binary search tree, called the splay tree, is used to implement the data structure.
- A splay tree is a self-balancing binary search tree. A balanced binary search tree has uniform height in both subtrees.
- In addition to the self-balancing property, the splay tree has the property that whenever a new address is added, it is brought to the root node.
- This process of bringing the added address to the root is called splaying.
- So in a splay tree the time required to access the most recently used addresses is very low, as they are near the root.

Splaying

The three types of splay steps, for a node x with parent p and grandparent g, are:
- Zig step: performed when p is the root. The tree is rotated on the edge between x and p.
- Zig-zig step: performed when p is not the root and x and p are either both right children or both left children. The tree is rotated on the edge joining p with its parent g, then rotated on the edge joining x with p.
- Zig-zag step: performed when p is not the root and x is a right child while p is a left child, or vice versa. The tree is rotated on the edge between x and p, then rotated on the edge between x and its new parent g.
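The three steps can be sketched with the classic recursive splay (Python for illustration; the project's version is in SystemVerilog, and this compact formulation is one common way to implement splaying, not necessarily the project's):

```python
class Node:
    def __init__(self, addr):
        self.addr = addr
        self.left = None
        self.right = None

def rotate_right(y):
    # Lift y's left child above y (one "zig" rotation).
    x = y.left
    y.left = x.right
    x.right = y
    return x

def rotate_left(x):
    # Mirror image: lift x's right child above x.
    y = x.right
    x.right = y.left
    y.left = x
    return y

def splay(root, addr):
    """Bring the node holding `addr` (or the last node touched while
    searching for it) to the root via zig / zig-zig / zig-zag steps."""
    if root is None or root.addr == addr:
        return root
    if addr < root.addr:                      # target is in the left subtree
        if root.left is None:
            return root
        if addr < root.left.addr:             # zig-zig: rotate grandparent first
            root.left.left = splay(root.left.left, addr)
            root = rotate_right(root)
        elif addr > root.left.addr:           # zig-zag: rotate child first
            root.left.right = splay(root.left.right, addr)
            if root.left.right is not None:
                root.left = rotate_left(root.left)
        return root if root.left is None else rotate_right(root)   # final zig
    else:                                     # mirror case: right subtree
        if root.right is None:
            return root
        if addr > root.right.addr:            # zig-zig
            root.right.right = splay(root.right.right, addr)
            root = rotate_left(root)
        elif addr < root.right.addr:          # zig-zag
            root.right.left = splay(root.right.left, addr)
            if root.right.left is not None:
                root.right = rotate_right(root.right)
        return root if root.right is None else rotate_left(root)
```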

(Diagrams: the zig-zig and zig-zag steps rotating x above p and g, with subtrees A, B, C, and D reattached.)

Search Operation

- This is the most important operation. The given address is compared with the root p, its left child, and its right child.
- The next stage of comparisons has four possibilities:
- If the address being searched is less than p and less than the left child, then the address, if it exists, is in the subtree represented by A.
- If the address is less than p and greater than the left child, it is in the subtree represented by B.
- If the address is greater than p and less than the right child, it is in the subtree represented by C.
- If the address is greater than p and greater than the right child, it is in the subtree represented by D.

(Diagram for the search operation: root p with left child LC and right child RC; subtrees A and B under LC, C and D under RC.)

Add Operation

(Flowchart) Starting with the current node set to the root, compare the address with the current node and its children:
- If Addr > current and Addr > right child: if the right child's right slot is empty, add the address there; otherwise make the right child the current node and repeat.
- If Addr > current and Addr < right child: if the right child's left slot is empty, add the address there; otherwise make the right child the current node and repeat.
- If Addr < current and Addr > left child: if the left child's right slot is empty, add the address there; otherwise make the left child the current node and repeat.
- If Addr < current and Addr < left child: if the left child's left slot is empty, add the address there; otherwise make the left child the current node and repeat.
- After the address is added, splay the tree if required.

Delete Operation

(Flowchart)
- Is the address to be deleted the root node? If yes, find the in-order successor of the root node, replace the root node with it, and delete the duplicate entry.
- Is the address to be deleted an intermediate node? If yes, find the in-order successor of the intermediate node and replace the node with it.
- Is the address to be deleted a leaf node? If yes, delete the leaf node.
- Splay the tree if required.

Cache Line

Field  | Description
Key    | Stores the address. The address is matched in the search operation. It is the unique part of the cache line which distinguishes it from other cache lines.
Data   | Stores the data associated with the particular address. This may be consistent with the main memory or may be provided by the master.
Shared | Flag bit. When set, indicates that the cache line is shared among other masters.
Dirty  | Flag bit. When set, indicates that the data in the cache line is dirty; when the cache line is evicted, the data must be written back to the main memory.
Valid  | Flag bit. When set, indicates that the data in the cache line is valid.

Data Structure Class

- The add task is used to add a cache line to the data structure.
- The delete operation for the binary search tree is also implemented. Temporary nodes are used to find the in-order successor or the in-order predecessor; in this implementation the in-order successor method is used.
- The update operation is common to both algorithms.
- The splay task is implemented only for the second algorithm. In this implementation we splay the data structure when 3, 5, 7, 9, and 15 elements have been added.

Splay For 3 Element Tree

(Diagram: splaying a 3-element tree containing addresses 8, 16, and 24.)

Splay For 5 Element Tree

(Diagram: splaying a 5-element tree containing addresses 16, 24, 32, and 40.)

Algorithm Class

Binary Search Tree
- The algorithm class takes the data structure object as a parameter.
- This means that even if the data structure changes, we need not change the algorithm.

Splay Tree
- In the splay tree implementation the whole data structure is divided into eight binary trees.
- A hash function selects the bank where each address is stored.
- The pipelined scheme saves cycles compared with a non-pipelined algorithm, in which the add and delete stages would sit idle; here search, add, and delete all work in parallel, saving clock cycles.
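The eight-bank scheme can be sketched as follows (Python for illustration; the modulus hash and the use of plain sets in place of splay trees are assumptions made to keep the banking logic visible -- the slides only say that a hash function selects the bank):

```python
class BankedStore:
    """Sketch of the eight-bank splay-tree store."""
    NUM_BANKS = 8

    def __init__(self):
        # Each bank would hold one splay tree; plain sets stand in here.
        self.banks = [set() for _ in range(self.NUM_BANKS)]

    def _bank(self, addr):
        # Hash the address to one of the eight banks (assumed modulus hash).
        return addr % self.NUM_BANKS

    def add(self, addr):
        self.banks[self._bank(addr)].add(addr)

    def search(self, addr):
        # Only one of the eight trees is ever probed per request.
        return addr in self.banks[self._bank(addr)]
```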

Main Memory Class

- Another important component in the model is the main memory controller.
- The read task accepts an address and returns the data, with a data-valid signal, after 20 cycles.
- Similarly, the write task accepts the address and the data to be written to the main memory.
- Another task initializes the locations of the main memory for simulation purposes.

Test Cases for Evaluation of Model 1

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits (4 Masters) | Snoop Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 0 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 0 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 0 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0 | 0
Local 0.25, Snoop 0.25 | 10.97 | 10.58 | 32 | 32 | 64
Local 0.5, Snoop 0.25 | 6.19 | 5.67 | 64 | 32 | 32
Local 0.5, Snoop 0.5 | 2.5 | 1.09 | 64 | 64 | 0
Local 0.25, Snoop 0.75 | 2.64 | 1.61 | 32 | 96 | 0
Local 0.75, Snoop 0.25 | 1.23 | 0.99 | 96 | 32 | 0
Snoop Hit Rate 1 | 4.31 | 2.24 | 0 | 128 | 0
Local 0.25, Snoop 0.5 | 6.02 | 6.13 | 32 | 64 | 32

Test Cases for Evaluation of Model 2

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 0 | 0 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 0 | 0 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 0 | 0 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0 | 0 | 0
Local 0.25, Victim 0.25 | 10.80 | 11.13 | 32 | 0 | 32 | 64
Local 0.25, Snoop 0.25, Victim 0.25 | 6.67 | 6.66 | 32 | 32 | 32 | 32
Local 0.25, Snoop 0.25 | 11.09 | 10.66 | 32 | 32 | 0 | 64
Local 0.5, Snoop 0.25 | 6.29 | 5.75 | 64 | 32 | 0 | 32
Local 0.5, Snoop 0.5 | 2.65 | 1.16 | 64 | 64 | 0 | 0

Test Cases for Evaluation of Model 2 (continued)

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits | Snoop Cache Hits | Victim Cache Hits | Main Memory Accesses
Local 0.25, Snoop 0.75 | 3.08 | 1.63 | 32 | 96 | 0 | 0
Local 0.75, Snoop 0.25 | 1.41 | 1.03 | 96 | 32 | 0 | 0
Snoop Hit Rate 1 | 5.14 | 2.29 | 0 | 128 | 0 | 0
Local 0.25, Snoop 0.5 | 6.22 | 6.15 | 32 | 64 | 0 | 32
Snoop 0.25, Victim 0.25 | 11.40 | 11.77 | 0 | 32 | 32 | 64
Victim Hit Rate 0.25 | 15.70 | 16.01 | 0 | 0 | 32 | 96

Test Cases for Evaluation of Model 3

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Local Cache Hits (4 Masters) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 15.11 | 15.15 | 32 | 96
Local Hit Rate 0.5 | 10.32 | 10.25 | 64 | 64
Local Hit Rate 0.75 | 5.66 | 5.35 | 96 | 32
Local Hit Rate 1 | 1.24 | 0.65 | 128 | 0

Test Cases for Evaluation of Model 4

Description | Avg Cycles (BST) | Avg Cycles (Splay) | Main Memory Accesses (4 Masters)
Local Hit Rate 0.25 | 20 | 20 | 96
Local Hit Rate 0.5 | 20 | 20 | 64
Local Hit Rate 0.75 | 20 | 20 | 32
Local Hit Rate 1 | 20 | 20 | 0

Graph for 4 Models

(Graph: clock cycles, from 1 to 20, versus input address for the four cache models; markers distinguish local cache hits, snoop cache hits, victim cache hits, and main memory accesses.)

Conclusion and Scope for Future Work

- Various search algorithms were studied for the implementation of the cache controller.
- Two search algorithms were implemented in SystemVerilog. The algorithms are used by the cache models developed.
- The model can be enhanced by incorporating more search algorithms; users may plug in their own search algorithm.
- We can also use different replacement policies for the cache controller. The cache architecture itself can be of different types, such as direct mapped or set associative.

References

1) Hennessy, John L. and David A. Patterson, Computer Architecture: A Quantitative Approach.
2) Kim, Changkyu, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D. et al., "FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs", SIGMOD '10.
3) Shaffer, John H., "Designing Very Large Content-Addressable Memories", University of Pennsylvania.
4) Allan, Stephen J., "Splay Tree".
5) AMBA AXI and ACE Protocol Specification, ARM.
6) SystemVerilog 3.1a Language Reference Manual.

Thank You
