
Stall-Time Fair

Memory Access
Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Multi-Core Systems

[Figure: a multi-core chip with four cores (CORE 0-3), each with a private L2 cache, sharing a DRAM memory controller and DRAM Banks 0-7 (the shared DRAM memory system). Contention in this shared system causes unfairness.]

DRAM Bank Operation

[Figure: a DRAM bank as a 2D array of rows and columns with a row decoder, a row buffer, and a column decoder. An access to (Row 0, Column 0) loads Row 0 into the row buffer; subsequent accesses to (Row 0, Column 1) and (Row 0, Column 9) are row-buffer HITs served directly from the row buffer. An access to (Row 1, Column 0) is a row-buffer CONFLICT: the bank must close Row 0 and open Row 1 before the column access.]
3

DRAM Controllers

A row-conflict memory access takes significantly longer than a row-hit access

Current controllers take advantage of the row buffer

Commonly used scheduling policy (FR-FCFS) [Rixner, ISCA'00]
  (1) Row-hit (column) first: service row-hit memory accesses first
  (2) Oldest-first: then service older accesses first

This scheduling policy aims to maximize DRAM throughput
But it is unfair when multiple threads share the DRAM system
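The two-rule FR-FCFS policy above can be sketched in a few lines of Python. This is an illustrative model, not the paper's hardware: the `Request` fields and function names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # smaller = older

def fr_fcfs_pick(requests, open_rows):
    """FR-FCFS sketch: among outstanding requests, prefer row hits,
    then the oldest request. `open_rows` maps bank -> currently open row."""
    def priority(req):
        row_hit = open_rows.get(req.bank) == req.row
        # Tuples compare element by element: hits (False) sort before
        # misses (True), and ties break on age.
        return (not row_hit, req.arrival_time)
    return min(requests, key=priority)
```

Because Python compares tuples lexicographically, `(not row_hit, arrival_time)` encodes "row-hit first, then oldest-first" directly, which is why a younger row hit can starve an older row conflict.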

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

The Problem

Multiple threads share the DRAM controller

DRAM controllers are designed to maximize DRAM throughput
DRAM scheduling policies are thread-unaware and unfair

Row-hit first: unfairly prioritizes threads with high row-buffer locality
  Streaming threads
  Threads that keep accessing the same row

Oldest-first: unfairly prioritizes memory-intensive threads
6

The Problem

[Figure: request buffer example with two threads. T0 is a streaming thread issuing row-hit requests to Row 0; with an 8KB row and 64B cache blocks, that is 128 requests to one row. T1 is a non-streaming thread with requests to Rows 5, 111, and 16. Under row-hit-first scheduling, all 128 of T0's requests are serviced before T1's.]
7

Consequences of Unfairness in DRAM

[Figure: memory slowdowns of four threads range from 1.05 to 7.74 (1.05, 1.85, 4.72, 7.74), even though DRAM is the only shared resource.]

Vulnerability to denial of service [Moscibroda & Mutlu, USENIX Security'07]
System throughput loss
Priority inversion at the system/OS level
Poor performance predictability
8

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

Fairness in Shared DRAM Systems

A thread's DRAM performance depends on its inherent
  Row-buffer locality
  Bank parallelism
Interference between threads can destroy either or both

A fair DRAM scheduler should take into account all factors affecting each thread's DRAM performance
  Not solely bandwidth or solely request latency

Observation: A thread's performance degradation due to interference in DRAM is mainly characterized by the extra memory-related stall-time it incurs due to contention with other threads
10

Stall-Time Fairness in Shared DRAM Systems

A DRAM system is fair if it slows down equal-priority threads equally
  Compared to when each thread is run alone on the same system
  Fairness notion similar to SMT [Cazorla, IEEE Micro'04][Luo, ISPASS'01], SoEMT [Gabor, Micro'06], and shared caches [Kim, PACT'04]

Tshared: DRAM-related stall-time when the thread is running with other threads
Talone: DRAM-related stall-time when the thread is running alone
Memory-slowdown = Tshared / Talone

The goal of the Stall-Time Fair Memory scheduler (STFM) is to equalize Memory-slowdown for all threads, without sacrificing performance
  Considers the inherent DRAM performance of each thread
11

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

12

STFM Scheduling Algorithm (1)

For each thread, the DRAM controller
  Tracks Tshared
  Estimates Talone

At the beginning of a scheduling cycle, the DRAM controller
  Computes Slowdown = Tshared / Talone for each thread with an outstanding legal request
  Computes unfairness = MAX Slowdown / MIN Slowdown

If unfairness < α
  Use the DRAM-throughput-oriented baseline scheduling policy
  (1) row-hit first
  (2) oldest-first
13

STFM Scheduling Algorithm (2)

If unfairness ≥ α
  Use the fairness-oriented scheduling policy
  (1) requests from the thread with MAX Slowdown first
  (2) row-hit first
  (3) oldest-first

Maximizes DRAM throughput if it cannot improve fairness
  Does NOT waste useful bandwidth to improve fairness
  If a request does not interfere with any other, it is scheduled
14
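The two scheduling modes above can be combined into one decision function. This is an illustrative Python sketch, not the hardware implementation: the α value, the `Request` fields, and the counter representation are assumptions.

```python
from dataclasses import dataclass

@dataclass
class Request:
    thread_id: int
    bank: int
    row: int
    arrival_time: int  # smaller = older

ALPHA = 1.10  # α: maximum tolerable unfairness (value here is illustrative)

def stfm_pick(requests, open_rows, t_shared, t_interference):
    """One STFM scheduling decision (sketch). t_shared and t_interference
    are per-thread stall-time counters; Talone is estimated indirectly as
    Tshared - Tinterference."""
    slowdowns = {}
    for tid in {r.thread_id for r in requests}:
        t_alone = max(t_shared[tid] - t_interference[tid], 1e-9)
        slowdowns[tid] = t_shared[tid] / t_alone
    unfairness = max(slowdowns.values()) / min(slowdowns.values())

    def row_hit(r):
        return open_rows.get(r.bank) == r.row

    if unfairness < ALPHA:
        # Baseline throughput-oriented policy: row-hit first, then oldest.
        key = lambda r: (not row_hit(r), r.arrival_time)
    else:
        # Fairness-oriented: most-slowed thread first, then row-hit, then oldest.
        key = lambda r: (-slowdowns[r.thread_id], not row_hit(r), r.arrival_time)
    return min(requests, key=key)
```

Note how the fairness-oriented key only prepends the slowdown criterion: within the most-slowed thread's requests, the throughput-oriented ordering is preserved.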

How Does STFM Prevent Unfairness?

[Figure: animated example of the request buffer and row buffer with interleaved requests from T0 (Row 0) and T1 (Rows 5, 111, 16). After each scheduled request, STFM updates T0's slowdown (1.00-1.10), T1's slowdown (1.00-1.14), and the resulting unfairness (1.00-1.06); when unfairness would exceed the threshold, STFM services the most-slowed thread's request instead of another row hit.]
15

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

16

Implementation

Tracking Tshared
  Relatively easy
  The processor increments a counter if the thread cannot commit instructions because the oldest instruction requires DRAM access

Estimating Talone
  More involved because the thread is not running alone
  Difficult to estimate directly
  Observation: Talone = Tshared - Tinterference
  Estimate Tinterference: the extra stall-time due to interference
17

Estimating Tinterference (1)

When a DRAM request from thread C is scheduled
  Thread C can incur extra stall time:
  The request's row-buffer hit status might be affected by interference
    Estimate the row that would have been in the row buffer if the thread were running alone
    Estimate the extra bank access latency the request incurs

  Tinterference(C) += Extra Bank Access Latency / (# Banks Servicing C's Requests)

  Extra latency is amortized across the outstanding accesses of thread C (memory-level parallelism)
18

Estimating Tinterference (2)

When a DRAM request from thread C is scheduled
  Any other thread C' with outstanding requests incurs extra stall time

  Interference in the DRAM data bus:
  Tinterference(C') += Bus Transfer Latency of Scheduled Request

  Interference in the DRAM bank (see paper):
  Tinterference(C') += Bank Access Latency of Scheduled Request / (# Banks Needed by C''s Requests * K)

19
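The interference updates on the two slides above can be sketched together in Python. This is an illustrative model: the function signature is an assumption, and while K appears in the slide's formula, its value here is a placeholder, not the paper's constant.

```python
def update_t_interference(sched_tid, other_tids, t_interference,
                          extra_bank_latency, bus_latency, bank_latency,
                          banks_servicing, banks_needed, K=2):
    """Update per-thread interference estimates when one request from
    thread C (sched_tid) is scheduled. K is a scaling constant; the
    default here is illustrative."""
    # (1) Thread C itself: extra bank access latency caused by interference,
    #     amortized over the banks concurrently servicing C's requests
    #     (its memory-level parallelism).
    t_interference[sched_tid] += extra_bank_latency / max(banks_servicing, 1)
    # (2) Every other thread C' with outstanding requests:
    for tid in other_tids:
        # Data-bus interference while C's data occupies the bus.
        t_interference[tid] += bus_latency
        # Bank interference, amortized over C''s bank parallelism, scaled by K.
        t_interference[tid] += bank_latency / (max(banks_needed[tid], 1) * K)
```

The amortization terms are what make the estimate parallelism-aware: a stall that overlaps with accesses to several banks costs each thread less than a serialized one.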

Hardware Cost

<2KB storage cost for an 8-core system with a 128-entry memory request buffer

Arithmetic operations approximated
  Fixed-point arithmetic
  Divisions using lookup tables

Not on the critical path
  The scheduler makes a decision only every DRAM cycle

More details in the paper

20

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

21

Support for System Software

Supporting system-level thread weights/priorities
  Thread weights communicated to the memory controller
  Larger-weight threads should be slowed down less
  Each thread's slowdown is scaled by its weight
  Weighted slowdown used for scheduling
    Favors threads with larger weights
  OS can choose thread weights to satisfy QoS requirements

α: maximum tolerable unfairness, set by system software
  Don't need fairness? Set α large.
  Need strict fairness? Set α close to 1.
  Other values of α trade off fairness and throughput
22
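The weighted slowdown can be sketched as below. Caveat: the slide only says each thread's slowdown is "scaled by its weight"; multiplying the measured slowdown by the weight is one plausible realization chosen here for illustration, and may not be the paper's exact formulation.

```python
def scheduling_slowdown(t_shared, t_alone, weight=1.0):
    """Weighted slowdown used for prioritization (sketch). Multiplying by
    the weight is an assumption: a larger-weight thread appears more
    slowed down, so a slowdown-ordered scheduler favors it."""
    return (t_shared / t_alone) * weight
```

With equal weights this reduces to the plain Memory-slowdown = Tshared / Talone used elsewhere in the talk.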

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

23

Evaluation Methodology

2-, 4-, 8-, 16-core systems
  x86 processor model based on Intel Pentium M
  4 GHz processor, 128-entry instruction window
  512-Kbyte private L2 cache per core

Detailed DRAM model based on Micron DDR2-800
  128-entry memory request buffer
  8 banks, 2-Kbyte row buffer
  Row-hit round-trip latency: 35ns (140 cycles)
  Row-conflict round-trip latency: 70ns (280 cycles)

Benchmarks
  SPEC CPU2006 and some Windows desktop applications
  256, 32, 3 benchmark combinations for the 4-, 8-, 16-core experiments
24

Comparison with Related Work

Baseline FR-FCFS [Rixner et al., ISCA'00]
  Unfairly penalizes non-intensive threads with low row-buffer locality

FR-FCFS+Cap
  Static cap on how many younger row-hits can bypass older accesses
  Unfairly penalizes non-intensive threads

FCFS
  Low DRAM throughput
  Unfairly penalizes non-intensive threads

Network Fair Queueing (NFQ) [Nesbit et al., Micro'06]
  Per-thread virtual-time based scheduling
    A thread's private virtual time increases when its request is scheduled
    Prioritizes requests from the thread with the earliest virtual time
    Equalizes bandwidth across equal-priority threads
  Does not consider the inherent performance of each thread
  Unfairly prioritizes threads with non-bursty access patterns (idleness problem)
  Unfairly penalizes threads with unbalanced bank usage (in paper)
25
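NFQ's virtual-time rule, and the idleness problem it causes, can be sketched in a few lines. This is an illustrative Python model of the rule as described on this slide, not Nesbit et al.'s actual design; the function and parameter names are assumptions.

```python
def nfq_pick(ready_threads, virtual_time, service_time, share):
    """NFQ sketch: service the ready thread with the earliest private
    virtual time, then advance its clock by the service time divided by
    its allocated bandwidth share (e.g., 1/N for equal priority)."""
    tid = min(ready_threads, key=lambda t: virtual_time[t])
    virtual_time[tid] += service_time / share[tid]
    return tid
```

The idleness problem falls out directly: a non-bursty thread that uses otherwise-idle DRAM keeps accumulating virtual time, so when bursty threads arrive later with smaller virtual times, they are prioritized over it.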

Idleness/Burstiness Problem in Fair Queueing

[Figure: Thread 1 is serviced alone at first, so its virtual time increases even though no other thread needs DRAM. In the later intervals [t1,t2], [t2,t3], and [t3,t4], only the bursty Threads 2, 3, and 4 are serviced, since their virtual times are smaller than Thread 1's.]

The non-bursty thread suffers a large performance loss
  even though it fairly utilized DRAM when no other thread needed it
26

Unfairness on 4-, 8-, 16-core Systems

Unfairness = MAX Memory Slowdown / MIN Memory Slowdown

[Figure: unfairness results showing improvements of 1.27X, 1.81X, and 1.26X on the 4-, 8-, and 16-core systems, respectively.]

27

System Performance

[Figure: system performance improvements of 5.8%, 4.1%, and 4.6% on the 4-, 8-, and 16-core systems, respectively.]

28

Hmean-speedup (Throughput-Fairness Balance)

[Figure: hmean-speedup improvements of 10.8%, 9.5%, and 11.2% on the 4-, 8-, and 16-core systems, respectively.]

29

Outline

The Problem
  Unfair DRAM Scheduling
Stall-Time Fair Memory Scheduling
  Fairness definition
  Algorithm
  Implementation
  System software support
Experimental Evaluation
Conclusions

30

Conclusions

A new definition of DRAM fairness: stall-time fairness
  Equal-priority threads should experience equal memory-related slowdowns
  Takes into account the inherent memory performance of threads

New DRAM scheduling algorithm enforces this definition
  Flexible and configurable fairness substrate
  Supports system-level thread priorities/weights → QoS policies

Results across a wide range of workloads and systems show:
  Improving DRAM fairness also improves system throughput
  STFM provides better fairness and system performance than previously-proposed DRAM schedulers
31

Thank you. Questions?

Stall-Time Fair
Memory Access
Scheduling
Onur Mutlu and Thomas Moscibroda
Computer Architecture Group
Microsoft Research

Backup

Structure of the STFM Controller

[Figure: block diagram of the STFM controller.]

35

Comparison using NFQ QoS Metrics

Nesbit et al. [Micro'06] proposed the following target for quality of service:
  A thread that is allocated 1/Nth of the memory system bandwidth will run no slower than the same thread on a private memory system running at 1/Nth of the frequency of the shared physical memory system
  Baseline: memory bandwidth scaled down by N

We compared different DRAM schedulers' effectiveness using this metric
  Number of violations of the above QoS target
  Harmonic mean of IPC normalized to the above baseline
36

Violations of the NFQ QoS Target

[Figure: number of violations of the NFQ QoS target for each scheduler.]

37

Hmean Normalized IPC using NFQ Baseline

[Figure: improvements of 7.3%, 5.9%, 5.1%, 10.3%, 9.1%, and 7.8% across the evaluated configurations.]

38

Shortcomings of the NFQ QoS Target

Low baseline (easily achievable target) for equal-priority threads
  With N equal-priority threads, a thread should do better than on a system with 1/Nth of the memory bandwidth
  This target is usually very easy to achieve

Unachievable target in some cases
  Especially when N is large
  Consider two threads always accessing the same bank in an interleaved fashion → too much interference

Baseline performance very difficult to determine in a real system
  Cannot scale memory frequency arbitrarily
  Not knowing baseline performance makes it difficult to set thread priorities (how much bandwidth to assign to each thread)
39

A Case Study

[Figure: per-thread memory slowdowns under each scheduler; unfairness values: 7.28, 2.07, 2.08, 1.87, 1.27.]

40

Windows Desktop Workloads

41

Enforcing Thread Weights

42

Effect of α

43

Effect of Banks and Row Buffer Size

44
