
Data Center TCP (DCTCP)

Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye,
Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan

Microsoft Research / Stanford University

Data Center Packet Transport


Large purpose-built DCs
Huge investment: R&D, business

Transport inside the DC
TCP rules (99.9% of traffic)

How's TCP doing?
2

TCP in the Data Center


We'll see TCP does not meet the demands of apps.
Suffers from bursty packet drops, Incast [SIGCOMM '09], ...
Builds up large queues:
Adds significant latency.
Wastes precious buffers, esp. bad with shallow-buffered switches.

Operators work around TCP problems.


Ad-hoc, inefficient, often expensive solutions
No solid understanding of consequences, tradeoffs

Roadmap
What's really going on?
Interviews with developers and operators
Analysis of applications
Switches: shallow-buffered vs deep-buffered
Measurements

A systematic study of transport in Microsoft's DCs


Identify impairments
Identify requirements

Our solution: Data Center TCP


4

Case Study: Microsoft Bing


Measurements from a 6000-server production cluster

Instrumentation passively collects logs
Application-level
Socket-level
Selected packet-level

More than 150TB of compressed data over a month



5

Partition/Aggregate Application Structure


[Figure: a query (e.g., "Picasso") enters at a top-level aggregator (TLA), is partitioned across mid-level aggregators (MLAs), and fanned out to Worker Nodes; partial results (quote snippets) are aggregated back up the tree.]

Strict deadlines (SLAs) at each level, e.g. 50 ms and 10 ms.
A missed deadline means a lower quality result.

Generality of Partition/Aggregate
The foundation for many large-scale web applications.
Web search, social network composition, ad selection, etc.

Example: Facebook

Partition/Aggregate ~ Multiget
Aggregators: Web Servers
Workers: Memcached Servers

[Figure: Internet → Web Servers → (Memcached Protocol) → Memcached Servers]
7

Workloads

Partition/Aggregate (Query)                    Delay-sensitive

Short messages [50KB-1MB]
(Coordination, control state)                  Delay-sensitive

Large flows [1MB-50MB]
(Data update)                                  Throughput-sensitive

Impairments
Incast
Queue Buildup
Buffer Pressure

Incast

Synchronized mice collide.
Caused by Partition/Aggregate.

[Figure: Workers 1-4 send their responses to the Aggregator at the same time; the synchronized burst overflows the switch port, and a dropped response stalls on a TCP timeout (RTOmin = 300 ms).]

10
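
A back-of-the-envelope sketch of why the collision ends in a timeout rather than a fast retransmit (the 40-worker fan-out and 10 KB response size are hypothetical; the per-port buffer assumes the 4 MB, 48-port shallow-buffered switch from the evaluation section, shared evenly):

\[
\underbrace{40 \times 10\,\mathrm{KB}}_{\text{synchronized responses}} = 400\,\mathrm{KB}
\;\gg\;
\underbrace{4\,\mathrm{MB} / 48 \approx 85\,\mathrm{KB}}_{\text{per-port buffer share}}
\]

Several responses lose most or all of their packets, so there are too few duplicate ACKs to trigger fast retransmit, and the aggregator waits out a full RTOmin of 300 ms, which dwarfs the millisecond-scale deadlines.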

Incast Really Happens

[Plot: MLA Query Completion Time (ms) over the course of a morning.]

Requests are jittered over a 10 ms window.
Jittering trades off the median against high percentiles.
The 99.9th percentile is being tracked.
Jittering was switched off around 8:30 am.
11

Queue Buildup

Big flows build up queues.
Increased latency for short flows.

[Figure: Sender 1 and Sender 2 share a switch port to the Receiver; the big flow's queue buildup delays the short flow's packets.]

Measurements in Bing cluster

For 90% of packets: RTT < 1 ms
For 10% of packets: 1 ms < RTT < 15 ms
12

Data Center Transport Requirements


1. High Burst Tolerance
Incast due to Partition/Aggregate is common.

2. Low Latency
Short flows, queries

3. High Throughput
Continuous data updates, large file transfers

The challenge is to achieve these three together.


13

Tension Between Requirements


High Burst Tolerance  vs.  Low Latency  vs.  High Throughput

Deep Buffers:
Queuing delays increase latency.

Shallow Buffers:
Bad for bursts & throughput.

Reduced RTOmin (SIGCOMM '09):
Doesn't help latency.

AQM RED:
Average queue not fast enough for Incast.

Objective: Low queue occupancy & high throughput → DCTCP

14

The DCTCP Algorithm


15

Review: The TCP/ECN Control Loop


ECN = Explicit Congestion Notification

[Figure: Sender 1 and Sender 2 send through a congested switch to the Receiver; the switch sets a 1-bit ECN mark on packets, and the receiver echoes the mark back to the senders in its ACKs.]

16

Small Queues & TCP Throughput:
The Buffer Sizing Story

Bandwidth-delay product rule of thumb:
A single flow needs C × RTT of buffering for 100% throughput.

Appenzeller rule of thumb (SIGCOMM '04):
Large # of flows: C × RTT / √N is enough.

Can't rely on the stat-mux benefit in the DC.
Measurements show typically 1-2 big flows at each server, at most 4.

Real Rule of Thumb:
Low variance in sending rate → small buffers suffice.

[Figure: cwnd sawtooth above a buffer of size B; throughput stays at 100%.]
17
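
Spelling the two rules out, with what they imply for a data center (the flow counts below are illustrative, not measurements):

\[
B_{\text{single flow}} = C \cdot RTT, \qquad
B_{\text{many flows}} = \frac{C \cdot RTT}{\sqrt{N}}
\]

\[
\frac{1}{\sqrt{N}} \approx
\begin{cases}
0.01 & N = 10{,}000 \text{ flows (a busy WAN router)} \\
0.71 & N = 2 \text{ flows (a typical DC server, per the measurements above)}
\end{cases}
\]

With only 1-2 big flows per server there is essentially no statistical-multiplexing discount, which is why the "real rule of thumb" falls back on reducing the variance of the sending rate itself.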

Two Key Ideas


1. React in proportion to the extent of congestion, not its presence.
Reduces variance in sending rates, lowering queuing requirements.

ECN Marks              TCP                 DCTCP
1 0 1 1 1 1 0 1 1 1    Cut window by 50%   Cut window by 40%
0 0 0 0 0 0 0 0 0 1    Cut window by 50%   Cut window by 5%

2. Mark based on instantaneous queue length.
Fast feedback to better deal with bursts.

18
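
As a quick check of the numbers in the table, DCTCP's cut is half the marking fraction (treating α here as simply the fraction of marked packets among the ten shown, rather than the running average defined on the next slide):

\[
\text{cut} = \frac{\alpha}{2}: \qquad
\alpha = \tfrac{8}{10} \Rightarrow \text{cut} = 40\%, \qquad
\alpha = \tfrac{1}{10} \Rightarrow \text{cut} = 5\%
\]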

Data Center TCP Algorithm


Switch side:
Mark packets when Queue Length > K.

[Figure: switch buffer of size B with a marking threshold K; packets arriving while the queue is above K are marked, packets below K are not.]

Sender side:
Maintain a running average of the fraction of packets marked (α):
in each RTT, α ← (1 − g) α + g F, where F is the fraction of packets marked in that RTT.

Adaptive window decrease:
W ← W (1 − α/2)

Note: the decrease factor is between 1 and 2.

19
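
A minimal sketch of the sender-side bookkeeping above, counting in packets (the class and variable names are ours; g = 1/16 is the gain the paper reports using):

```python
class DctcpSender:
    """Toy model of DCTCP's per-RTT congestion-window update."""

    def __init__(self, cwnd=10.0, g=1.0 / 16):
        self.cwnd = cwnd    # congestion window, in packets
        self.alpha = 0.0    # running estimate of the fraction of marked packets
        self.g = g          # EWMA gain for alpha

    def on_rtt_end(self, acked, marked):
        """Call once per RTT with the number of ACKed and ECN-marked packets."""
        F = marked / acked if acked else 0.0          # fraction marked this RTT
        self.alpha = (1 - self.g) * self.alpha + self.g * F
        if marked:
            # React in proportion to the extent of congestion, at most a halving.
            self.cwnd = max(1.0, self.cwnd * (1 - self.alpha / 2))
        else:
            # No marks: grow as in normal congestion avoidance.
            self.cwnd += 1.0
```

If every packet is marked for several RTTs, α converges to 1 and the cut approaches TCP's halving; with only occasional marks the cut stays small, which is the low-variance behavior the previous slide illustrates.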

DCTCP in Action

Setup: Win 7, Broadcom 1 Gbps switch
Scenario: 2 long-lived flows, K = 30 KB

[Plot: instantaneous queue length (KBytes) over time.]

20
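
As an aside (our arithmetic, not from the slide): K = 30 KB corresponds to about 20 full-size 1500-byte packets, which is the 1 Gbps marking threshold the DCTCP paper reports using:

\[
K \approx \frac{30\,\mathrm{KB}}{1.5\,\mathrm{KB/packet}} = 20 \text{ packets}
\]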

Why it Works
1. High Burst Tolerance
Large buffer headroom → bursts fit.
Aggressive marking → sources react before packets are dropped.

2. Low Latency
Small buffer occupancies → low queuing delay.

3. High Throughput
ECN averaging → smooth rate adjustments, low variance.

21

Analysis
How low can DCTCP maintain queues without loss of throughput?
How do we set the DCTCP parameters?

Need to quantify queue size oscillations (Stability).

[Figure: sawtooth of the window size over time, oscillating between (W* + 1)(1 − α/2) and W* + 1; the packets sent in the last RTT before the peak are marked.]

85% less buffer than TCP.

22
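
A sketch of the single-flow reasoning behind the buffer savings, following the paper's analysis (W* is the window size at which the queue just reaches the marking threshold K):

\[
W \in \big[(W^*+1)(1-\tfrac{\alpha}{2}),\; W^*+1\big]
\;\Rightarrow\;
\text{oscillation amplitude} = (W^*+1)\,\frac{\alpha}{2}
\]

Since only the packets of the last RTT of each sawtooth are marked, the steady-state marking fraction works out to roughly \(\alpha \approx \sqrt{2/W^*}\), so the amplitude is about \(\sqrt{W^*/2}\): the queue swing grows with the square root of the bandwidth-delay product rather than linearly, as TCP's halving does. That is the source of the "85% less buffer" figure.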

Evaluation
Implemented in Windows stack.
Real hardware, 1 Gbps and 10 Gbps experiments

90-server testbed
Broadcom Triumph: 48 1G ports, 4 MB shared memory
Cisco Cat4948: 48 1G ports, 16 MB shared memory
Broadcom Scorpion: 24 10G ports, 4 MB shared memory

Numerous micro-benchmarks
Throughput and Queue Length
Fairness and Convergence
Multi-hop
Incast
Static vs Dynamic Buffer Mgmt
Queue Buildup
Buffer Pressure

Cluster traffic benchmark


23

Cluster Traffic Benchmark


Emulate traffic within 1 rack of the Bing cluster
45 1G servers, one 10G server for external traffic

Generate query and background traffic


Flow sizes and arrival times follow distributions seen in Bing

Metric:
Flow completion time for queries and background flows.
We use RTOmin = 10 ms for both TCP & DCTCP.
24

Baseline

[Plots: flow completion times for Background Flows and Query Flows.]

Low latency for short flows.
High throughput for long flows.
High burst tolerance for query flows.

25

Scaled Background & Query


10x Background, 10x Query

[Plots: completion times for short messages and query traffic at 10x load.]

26

Conclusions
DCTCP satisfies all our requirements for Data Center packet transport.
Handles bursts well
Keeps queuing delays low
Achieves high throughput

Features:
Very simple change to TCP and a single switch parameter.
Based on mechanisms already available in silicon (see the sketch below).

27
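
A rough sketch of what the "single switch parameter" amounts to in practice: the paper notes that the marking scheme can be realized by re-purposing the RED/ECN support switches already ship, configured degenerately. The field names below are hypothetical, not any vendor's API:

```python
# Illustrative only: DCTCP's switch-side marking expressed as a degenerate
# RED/ECN profile. Field names are hypothetical, not a real switch API.
K = 30 * 1024  # marking threshold in bytes (e.g. the K = 30 KB used earlier)

red_ecn_profile = {
    "min_threshold_bytes": K,        # start marking as soon as the queue exceeds K
    "max_threshold_bytes": K,        # min == max: no probabilistic ramp
    "mark_probability": 1.0,         # always mark above K
    "average_queue_length": False,   # use the instantaneous queue, not an EWMA
    "ecn_mark_instead_of_drop": True,
}
```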
