
Stall-Time Fair

Memory Access
Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Multi-Core Systems

[Figure: a multi-core chip with four cores (CORE 0-3), each with a private L2 cache, sharing a DRAM memory controller and DRAM Banks 0-7 (the shared DRAM memory system). Contention in this shared system causes unfairness.]

DRAM Bank Operation

[Figure: a DRAM bank as a 2D array of rows and columns with a row decoder, a row buffer, and a column decoder. An access to (Row 0, Column 0) loads Row 0 into the row buffer; subsequent accesses to (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs served directly from the row buffer. An access to (Row 1, Column 0) is a row-buffer CONFLICT: the bank must close Row 0 and open Row 1 before the column access.]
3

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access

Current controllers take advantage of the row buffer

Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA'00]
  (1) Row-hit (column) first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first

This scheduling policy aims to maximize DRAM throughput
But it is unfair when multiple threads share the DRAM system
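The two-rule FR-FCFS policy above can be sketched in a few lines of Python. This is an illustrative model, not the paper's hardware: the `Request` fields and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # smaller = older

def fr_fcfs_pick(requests, open_rows):
    """FR-FCFS sketch: among outstanding requests, prefer row hits,
    then the oldest request. `open_rows` maps bank -> currently open row."""
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        # Tuples compare element by element: hits (False) sort before
        # misses (True), and ties break on age.
        return (not row_hit, req.arrival_time)
    return min(requests, key=priority)
```

Because Python compares tuples lexicographically, `(not row_hit, arrival_time)` encodes "row-hit first, then oldest-first" directly, which is why a younger row hit can starve an older row conflict.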

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

The Problem

Multiple threads share the DRAM controller

DRAM controllers are designed to maximize DRAM throughput
DRAM scheduling policies are thread-unaware and unfair

Row-hit first: unfairly prioritizes threads with high row-buffer locality
  Streaming threads
  Threads that keep accessing the same row

Oldest-first: unfairly prioritizes memory-intensive threads
6

The Problem

[Figure: request buffer example with two threads. T0 is a streaming thread issuing row-hit requests to Row 0; with an 8KB row and 64B cache blocks, that is 128 requests to one row. T1 is a non-streaming thread with requests to Rows 5, 111, and 16. Under row-hit-first scheduling, all 128 of T0's requests are serviced before T1's.]
7

Consequences of Unfairness in DRAM

[Figure: memory slowdowns of four threads range from 1.05 to 7.74 (1.05, 1.85, 4.72, 7.74), even though DRAM is the only shared resource.]

Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
System throughput loss
Priority inversion at the system/OS level
Poor performance predictability
8

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

Fairness in Shared DRAM Systems

A thread's DRAM performance depends on its inherent
  Row-buffer locality
  Bank parallelism
Interference between threads can destroy either or both

A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance
  Not solely bandwidth or solely request latency

Observation: A thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads
10

Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it slows down equal-priority threads equally
  Compared to when each thread is run alone on the same system
  Fairness notion similar to SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, Micro'06], and shared caches [Kim, PACT'04]

Tshared: DRAM-related stall-time when the thread is running with other threads
Talone: DRAM-related stall-time when the thread is running alone
Memory-slowdown = Tshared / Talone

The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
  Considers the inherent DRAM performance of each thread
11

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

12

STFM Scheduling Algorithm (1)

For each thread, the DRAM controller
  Tracks Tshared
  Estimates Talone

At the beginning of a scheduling cycle, the DRAM controller
  Computes Slowdown = Tshared / Talone for each thread with an outstanding legal request
  Computes unfairness = MAX Slowdown / MIN Slowdown

If unfairness < α
  Use the DRAM-throughput-oriented baseline scheduling policy
  (1) row-hit first
  (2) oldest-first
13

STFM Scheduling Algorithm (2)

If unfairness ≥ α
  Use the fairness-oriented scheduling policy
  (1) requests from the thread with MAX Slowdown first
  (2) row-hit first
  (3) oldest-first

Maximizes DRAM throughput if it cannot improve fairness
  Does NOT waste useful bandwidth to improve fairness
  If a request does not interfere with any other, it is scheduled
14
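The two scheduling modes above can be combined into one decision function. This is an illustrative Python sketch, not the hardware implementation: the α value, the `Request` fields, and the counter representation are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # smaller = older

ALPHA = 1.10  # α: maximum tolerable unfairness (value here is illustrative)

def stfm_pick(requests, open_rows, t_shared, t_interference):
    """One STFM scheduling decision (sketch). t_shared and t_interference
    are per-thread stall-time counters; Talone is estimated indirectly as
    Tshared - Tinterference."""
    slowdowns = {}
    for tid in {r.thread_id for r in requests}:
        t_alone = max(t_shared[tid] - t_interference[tid], 1e-9)
        slowdowns[tid] = t_shared[tid] / t_alone
    unfairness = max(slowdowns.values()) / min(slowdowns.values())

    def row_hit(r):
        return open_rows.get(r.bank) == r.row

    if unfairness < ALPHA:
        # Baseline throughput-oriented policy: row-hit first, then oldest.
        key = lambda r: (not row_hit(r), r.arrival_time)
    else:
        # Fairness-oriented: most-slowed thread first, then row-hit, then oldest.
        key = lambda r: (-slowdowns[r.thread_id], not row_hit(r), r.arrival_time)
    return min(requests, key=key)
```

Note how the fairness-oriented key only prepends the slowdown criterion: within the most-slowed thread's requests, the throughput-oriented ordering is preserved.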

How Does STFM Prevent Unfairness?

[Figure: animated example of the request buffer and row buffer with interleaved requests from T0 (Row 0) and T1 (Rows 5, 111, 16). After each scheduled request, STFM updates T0's slowdown (1.00-1.10), T1's slowdown (1.00-1.14), and the resulting unfairness (1.00-1.06); when unfairness would exceed the threshold, STFM services the most-slowed thread's request instead of another row hit.]
15

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

16

Implementation

Tracking Tshared
  Relatively easy
  The processor increments a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access

Estimating Talone
  More involved because the thread is not running alone
  Difficult to estimate directly
  Observation: Talone = Tshared - Tinterference
  Estimate Tinterference: the extra stall-time due to interference
17

Estimating Tinterference (1)

When a DRAM request from thread C is scheduled
  Thread C can incur extra stall time:
  The request's row-buffer hit status might be affected by interference
    Estimate the row that would have been in the row buffer if the thread were running alone
    Estimate the extra bank access latency the request incurs

  Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)

  Extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism)
18

Estimating Tinterference (2)

When a DRAM request from thread C is scheduled
  Any other thread C' with outstanding requests incurs extra stall time

  Interference in the DRAM data bus:
  Tinterference(C') += Bus Transfer Latency of Scheduled Request

  Interference in the DRAM bank (see paper):
  Tinterference(C') += Bank Access Latency of Scheduled Request / (# Banks Needed by C''s Requests * K)

19
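The interference updates on the two slides above can be sketched together in Python. This is an illustrative model: the function signature is an assumption, and while K appears in the slide's formula, its value here is a placeholder, not the paper's constant.

```python
def update_t_interference(sched_tid, other_tids, t_interference,
                          extra_bank_latency, bus_latency, bank_latency,
                          banks_servicing, banks_needed, K=2):
    """Update per-thread interference estimates when one request from
    thread C (sched_tid) is scheduled. K is a scaling constant; the
    default here is illustrative."""
    # (1) Thread C itself: extra bank access latency caused by interference,
    #     amortized over the banks concurrently servicing C's requests
    #     (its memory-level parallelism).
    t_interference[sched_tid] += extra_bank_latency / max(banks_servicing, 1)
    # (2) Every other thread C' with outstanding requests:
    for tid in other_tids:
        # Data-bus interference while C's data occupies the bus.
        t_interference[tid] += bus_latency
        # Bank interference, amortized over C''s bank parallelism, scaled by K.
        t_interference[tid] += bank_latency / (max(banks_needed[tid], 1) * K)
```

The amortization terms are what make the estimate parallelism-aware: a stall that overlaps with accesses to several banks costs each thread less than a serialized one.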

Hardware Cost

<2KB storage cost for an 8-core system with a 128-entry memory request buffer

Arithmetic operations approximated
  Fixed-point arithmetic
  Divisions using lookup tables

Not on the critical path
  The scheduler makes a decision only every DRAM cycle

More details in the paper

20

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

21

Support for System Software

Supporting system-level thread weights/priorities
  Thread weights communicated to the memory controller
  Larger-weight threads should be slowed down less
  Each thread's slowdown is scaled by its weight
  Weighted slowdown used for scheduling
    Favors threads with larger weights
  OS can choose thread weights to satisfy QoS requirements

α: maximum tolerable unfairness, set by system software
  Don't need fairness? Set α large.
  Need strict fairness? Set α close to 1.
  Other values of α trade off fairness and throughput
22
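The weighted slowdown can be sketched as below. Caveat: the slide only says each thread's slowdown is "scaled by its weight"; multiplying the measured slowdown by the weight is one plausible realization chosen here for illustration, and may not be the paper's exact formulation.

```python
def scheduling_slowdown(t_shared, t_alone, weight=1.0):
    """Weighted slowdown used for prioritization (sketch). Multiplying by
    the weight is an assumption: a larger-weight thread appears more
    slowed down, so a slowdown-ordered scheduler favors it."""
    return (t_shared / t_alone) * weight
```

With equal weights this reduces to the plain Memory-slowdown = Tshared / Talone used elsewhere in the talk.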

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

23

Evaluation Methodology

2-, 4-, 8-, 16-core systems
  x86 processor model based on Intel Pentium M
  4 GHz processor, 128-entry instruction window
  512-Kbyte private L2 cache per core

Detailed DRAM model based on Micron DDR2-800
  128-entry memory request buffer
  8 banks, 2-Kbyte row buffer
  Row-hit round-trip latency: 35ns (140 cycles)
  Row-conflict round-trip latency: 70ns (280 cycles)

Benchmarks
  SPEC CPU2006 and some Windows desktop applications
  256, 32, 3 benchmark combinations for the 4-, 8-, 16-core experiments
24

Comparison with Related Work

Baseline FR-FCFS [Rixner et al., ISCA'00]
  Unfairly penalizes non-intensive threads with low row-buffer locality

FR-FCFS+Cap
  Static cap on how many younger row-hits can bypass older accesses
  Unfairly penalizes non-intensive threads

FCFS
  Low DRAM throughput
  Unfairly penalizes non-intensive threads

Network Fair Queueing (NFQ) [Nesbit et al., Micro'06]
  Per-thread virtual-time based scheduling
    A thread's private virtual time increases when its request is scheduled
    Prioritizes requests from the thread with the earliest virtual time
    Equalizes bandwidth across equal-priority threads
  Does not consider the inherent performance of each thread
  Unfairly prioritizes threads with non-bursty access patterns (idleness problem)
  Unfairly penalizes threads with unbalanced bank usage (in paper)
25
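NFQ's virtual-time rule, and the idleness problem it causes, can be sketched in a few lines. This is an illustrative Python model of the rule as described on this slide, not Nesbit et al.'s actual design; the function and parameter names are assumptions.

```python
def nfq_pick(ready_threads, virtual_time, service_time, share):
    """NFQ sketch: service the ready thread with the earliest private
    virtual time, then advance its clock by the service time divided by
    its allocated bandwidth share (e.g., 1/N for equal priority)."""
    tid = min(ready_threads, key=lambda t: virtual_time[t])
    virtual_time[tid] += service_time / share[tid]
    return tid
```

The idleness problem falls out directly: a non-bursty thread that uses otherwise-idle DRAM keeps accumulating virtual time, so when bursty threads arrive later with smaller virtual times, they are prioritized over it.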

Idleness/Burstiness Problem in Fair Queueing

[Figure: Thread 1 is serviced alone at first, so its virtual time increases even though no other thread needs DRAM. In the later intervals [t1,t2], [t2,t3], and [t3,t4], only the bursty Threads 2, 3, and 4 are serviced, since their virtual times are smaller than Thread 1's.]

The non-bursty thread suffers a large performance loss
  even though it fairly utilized DRAM when no other thread needed it
26

Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

[Figure: unfairness results showing improvements of 1.27X, 1.81X, and 1.26X on the 4-, 8-, and 16-core systems, respectively.]

27

System Performance

[Figure: system performance improvements of 5.8%, 4.1%, and 4.6% on the 4-, 8-, and 16-core systems, respectively.]

28

Hmean-speedup (Throughput-Fairness Balance)

[Figure: hmean-speedup improvements of 10.8%, 9.5%, and 11.2% on the 4-, 8-, and 16-core systems, respectively.]

29

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

30

Conclusions

A new definition of DRAM fairness: stall-time fairness
  Equal-priority threads should experience equal memory-related slowdowns
  Takes into account the inherent memory performance of threads

New DRAM scheduling algorithm enforces this definition
  Flexible and configurable fairness substrate
  Supports system-level thread priorities/weights → QoS policies

Results across a wide range of workloads and systems show:
  Improving DRAM fairness also improves system throughput
  STFM provides better fairness and system performance than previously-proposed DRAM schedulers
31

Thank you. Questions?

Stall-Time Fair
Memory Access
Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Backup

Structure of the STFM Controller

[Figure: block diagram of the STFM controller.]

35

Comparison using NFQ QoS Metrics

Nesbit et al. [Micro'06] proposed the following target for quality of service:
  A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  Baseline: memory bandwidth scaled down by N

We compared different DRAM schedulers' effectiveness using this metric
  Number of violations of the above QoS target
  Harmonic mean of IPC normalized to the above baseline
36

Violations of the NFQ QoS Target

[Figure: number of violations of the NFQ QoS target for each scheduler.]

37

Hmean Normalized IPC using NFQ Baseline

[Figure: improvements of 7.3%, 5.9%, 5.1%, 10.3%, 9.1%, and 7.8% across the evaluated configurations.]

38

Shortcomings of the NFQ QoS Target

Low baseline (easily achievable target) for equal-priority threads
  With N equal-priority threads, a thread should do better than on a system with 1/Nth of the memory bandwidth
  This target is usually very easy to achieve

Unachievable target in some cases
  Especially when N is large
  Consider two threads always accessing the same bank in an interleaved fashion → too much interference

Baseline performance very difficult to determine in a real system
  Cannot scale memory frequency arbitrarily
  Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)
39

A Case Study

[Figure: per-thread memory slowdowns under each scheduler; unfairness values: 7.28, 2.07, 2.08, 1.87, 1.27.]

40

Windows Desktop Workloads

41

Enforcing Thread Weights

42

Effect of α

43

Effect of Banks and Row Buffer Size

44
