
Communication Architectures and Protocols (APC)

Transport Layer Protocols
End-to-end data transport
[Figure: Two hosts, each running applications (file transfer with FTP, e-mail with SMTP, POP, IMAP, Web apps with HTTP, other apps) on top of a full protocol stack (TL, NL, DL, PHY), communicate through one or more routers, which implement only NL, DL and PHY. The transport layer (TL) operates end to end, host to host.]

⚫ Transport layer
  ⚫ Controls end-to-end transfer.
  ⚫ IP philosophy: minimum functionality in the network (best effort service).
  ⇒ Important role reserved for transport layer protocols.
© Octavian Catrina 2 
Transport layer: main functions
⚫ Addressing
Identify data transport endpoints ("applications").
Transport address = Network address + Transport selector (port).
⚫ Error control (end-to-end)
⚫ Connectionless: Detect and discard damaged data units.
⚫ Connection-oriented: Ensure data stream integrity. Detect and
correct lost, damaged, reordered data units.
⚫ Flow control (end-to-end)
Connection-oriented: Adapt the transmitter's data rate to the
receiver's data rate.
⚫ Congestion control
Connection-oriented: Limit the transmitter's data rate to avoid
network congestion.

© Octavian Catrina 3 
Review: TCP/IP protocol stack

[Figure: TCP/IP protocol stack. User space: Web browser, e-mail and other user applications, using application protocols such as HTTP, SMTP, FTP; they reach the transport layer through an Application Programming Interface (API). OS kernel: Transport layer (TCP, UDP); Network layer (IP, together with ICMP, IGMP, RIP, OSPF, ARP, RARP); Data Link layer (LAN and WAN DL technologies).]
© Octavian Catrina 4 
Addressing
[Figure: Two hosts communicating. Applications are identified by IP address + TCP/UDP port; the transport protocol (TCP, UDP) is identified by IP address + protocol id; the host is identified by the IP address.]

⚫ Which host?
  ⚫ IP address. 32 bits (IPv4), in the IP packet header.
⚫ Which transport protocol?
  ⚫ Protocol id. 8 bits (IPv4), in the IP packet header.
⚫ Which application process (and communication endpoint)?
  ⚫ TCP/UDP port number. 16 bits, in the TCP/UDP header (see the sketch below).
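A small illustration of how these identifiers appear at the socket API (a sketch; the loopback address and port 8080 are arbitrary choices for the example):

    import socket

    # Transport address = network address + transport selector (port).
    addr = ("127.0.0.1", 8080)

    # The transport protocol is selected by the socket type:
    tcp_sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # TCP (protocol id 6)
    udp_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)   # UDP (protocol id 17)

    # Both sockets can be bound to the same (IP, port) pair, because the
    # protocol id in the IP header keeps the TCP and UDP port spaces separate.
    tcp_sock.bind(addr)
    udp_sock.bind(addr)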

© Octavian Catrina 5 
TCP/UDP ports and the Client-Server model
Distributed application

[Figure: Client-server interaction over a transport protocol. Server: wait for service request, receive request, serve request, send result, then wait again. Client: start, send request, receive result, stop.]

⚫ Server
  ⚫ Listens continuously for requests on a port known to clients.
    Reserved server ports: < 1024 (see RFC 1700). However, many
    servers use port numbers > 1024.
⚫ Client
  ⚫ A currently unused port (≥ 1024) is dynamically allocated for the
    duration of the client's execution. It issues requests to the known
    server port. (A minimal socket sketch follows below.)
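A minimal sketch of this model with Python's socket API (loopback address and port 5000 are illustrative assumptions, not part of the original slide):

    import socket

    SERVER_ADDR = ("127.0.0.1", 5000)    # port known to clients

    def server():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(SERVER_ADDR)
        s.listen()                        # wait for service requests
        conn, client_addr = s.accept()    # client_addr = (client IP, ephemeral port)
        request = conn.recv(1024)         # receive request
        conn.sendall(b"result for " + request)   # serve request, send result
        conn.close()

    def client():
        c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        c.connect(SERVER_ADDR)            # OS allocates an unused local port >= 1024
        c.sendall(b"request")             # send request
        print(c.recv(1024))               # receive result
        c.close()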
© Octavian Catrina 6 
Example: HTTP server and clients
[Figure: HTTP server on hugo.int.fr (139.29.100.11) listening on port 80, with two concurrent HTTP sessions: a client on neptun.elc.ro (141.85.43.8) using port 3135 and a client on zola.int.fr (139.29.35.18) using port 5768. Each session is a TCP connection carried in IP datagrams; the connections are identified by the address pairs (141.85.43.8:3135, 139.29.100.11:80) and (139.29.35.18:5768, 139.29.100.11:80).]

⚫ Clients connect to the "well-known" HTTP server port 80.
⚫ A server can handle connections with multiple clients concurrently.
  They can always be distinguished, because the client-side addresses
  differ (different client IP addresses or port numbers), as illustrated
  by the demultiplexing sketch below.
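A sketch of how a server-side implementation could demultiplex segments to connections using the full address pair; the dictionary is purely illustrative (the addresses are those of the figure), not an actual TCP implementation:

    # (client_ip, client_port, server_ip, server_port) -> per-connection state
    connections = {}

    def demux(client_ip, client_port, server_ip, server_port):
        key = (client_ip, client_port, server_ip, server_port)
        return connections.setdefault(key, {"state": "ESTABLISHED", "rcv_buf": b""})

    # Two clients of the same server port 80 map to distinct connections:
    c1 = demux("141.85.43.8", 3135, "139.29.100.11", 80)
    c2 = demux("139.29.35.18", 5768, "139.29.100.11", 80)
    assert c1 is not c2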
© Octavian Catrina 7 
UDP datagrams

[Figure: UDP datagram carried in an IP packet (IP header: 20 bytes + options). Pseudo-header, taken from the IP header: Source IP address, Destination IP address, zero, Protocol (17), UDP datagram length. UDP header (8 bytes): Source UDP port, Destination UDP port, Length, Checksum (16 bits each). The UDP data follows the header.]

⚫ Pseudo-header
⚫ Part of IP header contents. Accompanies UDP datagram at the
interface between UDP and IP.
⚫ UDP checksum
  ⚫ Covers the UDP datagram and the pseudo-header (a computation sketch follows below).
  ⚫ Checksum computation is optional.
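A sketch of the UDP checksum computation (one's complement sum over the pseudo-header and the datagram); the addresses and ports below are made-up example values:

    import socket, struct

    def ones_complement_sum(data: bytes) -> int:
        if len(data) % 2:                         # pad to an even number of bytes
            data += b"\x00"
        total = 0
        for (word,) in struct.iter_unpack("!H", data):
            total += word
            total = (total & 0xFFFF) + (total >> 16)   # end-around carry
        return total

    def udp_checksum(src_ip, dst_ip, udp_header_and_data: bytes) -> int:
        pseudo = (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip) +
                  struct.pack("!BBH", 0, 17, len(udp_header_and_data)))
        csum = 0xFFFF ^ ones_complement_sum(pseudo + udp_header_and_data)
        return csum or 0xFFFF     # a computed value of 0 is transmitted as all ones

    # Example: UDP header (ports 5000 -> 53, length 12, checksum field 0) + 4 data bytes.
    segment = struct.pack("!HHHH", 5000, 53, 12, 0) + b"test"
    print(hex(udp_checksum("192.0.2.1", "192.0.2.2", segment)))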
© Octavian Catrina 8 
TCP packets ("segments")
[Figure: TCP segment carried in an IP packet (IP header: 20 octets + options). Pseudo-header, taken from the IP header: Source IP address, Destination IP address, zero, Protocol (6), TCP segment length. TCP header (20 octets + options): Source TCP port, Destination TCP port (16 bits each); Sequence number (32 bits); Acknowledgement number (32 bits, cumulative); Header length (4 bits), reserved bits, Control bits, Window size (16 bits); Checksum and Urgent pointer (16 bits each); Options (if any); Data (if any).

Control bits (flags): URG, ACK, PSH, RST, SYN, FIN.]

⚫ Checksum covers TCP header, data and pseudo-header.


⚫ Options: Selective acknowledgments (SACK), Max. Segment Size (MSS), etc.
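A sketch of parsing the fixed 20-byte TCP header with Python's struct module, following the field layout above (illustrative only):

    import struct

    def parse_tcp_header(segment: bytes) -> dict:
        """Parse the fixed part of a TCP header (20 bytes)."""
        (src_port, dst_port, seq, ack,
         offset_flags, window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
        header_len = (offset_flags >> 12) * 4      # 4-bit data offset, in 32-bit words
        flags = offset_flags & 0x3F                # URG ACK PSH RST SYN FIN
        return {
            "src_port": src_port, "dst_port": dst_port,
            "seq": seq, "ack": ack, "header_len": header_len,
            "URG": bool(flags & 0x20), "ACK": bool(flags & 0x10),
            "PSH": bool(flags & 0x08), "RST": bool(flags & 0x04),
            "SYN": bool(flags & 0x02), "FIN": bool(flags & 0x01),
            "window": window, "checksum": checksum, "urgent_ptr": urgent,
            "options": segment[20:header_len],
            "data": segment[header_len:],
        }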
© Octavian Catrina 9 
TCP header fields
⚫ Source Port: 16 bits. The source port number.
⚫ Destination Port: 16 bits. The destination port number.
⚫ Sequence Number: 32 bits. The sequence number of the first data octet in this
segment (except when SYN is present). If SYN is present the sequence number is
the initial sequence number (ISN) and the first data octet is ISN+1.
⚫ Acknowledgment Number: 32 bits. If the ACK control bit is set this field contains
the value of the next sequence number the sender of the segment is expecting to
receive. Once a connection is established this is always sent.
⚫ Header Length (Data Offset): 4 bits. The number of 32-bit words in the TCP
header. This indicates where the data begins. The TCP header (even one including
options) is an integral multiple of 32 bits long.
⚫ Reserved: 6 bits. Reserved for future use. Must be zero.
⚫ Control Bits: 6 bits (from left to right): URG: Urgent Pointer field significant. ACK:
Acknowledgment field significant. PSH: Push Function. RST: Reset the connection.
SYN: Synchronize sequence numbers. FIN: No more data from sender.
⚫ Window (size): 16 bits. The number of data octets beginning with the one indicated
in the acknowledgment field which the sender of this segment is willing to accept.
⚫ Checksum: 16 bits. The checksum field is the 16-bit one's complement of the one's
complement sum of all 16-bit words in the pseudo-header, the header, and the payload.

© Octavian Catrina 10
Overview of TCP operation

[Message sequence chart: User A / TCP A (initiator) and TCP B / User B (listener), both initially CLOSED.
1. A: Open-Active, TCP A sends SYN and enters SYN-SENT; B: Open-Passive, TCP B enters LISTEN.
2. TCP B receives the SYN, replies SYN+ACK and enters SYN-RCVD.
3. TCP A receives the SYN+ACK, replies ACK, signals Open-Success and enters ESTABLISHED; TCP B receives the ACK, signals Open-Success and enters ESTABLISHED.
4. Data transfer: Send(dt[100]) at A; the 100 octets are carried to TCP B, delivered to user B (Receive(dt[100])) and acknowledged.
5. Close at A: TCP A sends FIN and enters FIN-WAIT-1; TCP B acknowledges, signals Closing to user B and enters CLOSE-WAIT; TCP A enters FIN-WAIT-2.
6. Close at B: TCP B sends FIN and enters LAST-ACK; TCP A acknowledges, enters TIME-WAIT and later CLOSED (Terminate); TCP B receives the ACK and enters CLOSED (Terminate).]

© Octavian Catrina 11 
TCP state machine

[Figure: the TCP state machine (left) and an example exchange between two endpoints, U-1/TCP-1 and TCP-2/U-2 (right).
States: CLOSED, LISTEN, SYN-SENT, SYN-RCVD, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSING, CLOSE-WAIT, LAST-ACK, TIME-WAIT. Transitions are labeled event/action, e.g., Open-Passive/, Open-Active/SYN, SYN/SYN+ACK, SYN+ACK/ACK, ACK/, RST/, Close/FIN, FIN/ACK, FIN+ACK/ACK, and the TIME-WAIT timer expiry (2*MSL) back to CLOSED.
Example exchange: TCP-1 performs an active open ([SYN, ...]), TCP-2 a passive open; [SYN+ACK, ...] and [ACK, ...] complete the handshake (Open-Success, ESTABLISHED on both sides). Send(100) at U-1 transfers 100 octets ([..., data(100)]), which TCP-2 delivers (Deliver(100)) and acknowledges. Each side then closes: [FIN, ...]/[ACK, ...] exchanges take the endpoints through FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, LAST-ACK and TIME-WAIT to CLOSED.]
© Octavian Catrina 12 
TCP state machine (cont.)
⚫ LISTEN - waiting for a connection request (CTRL=SYN) from any remote TCP.
⚫ SYN-SENT - waiting for a matching reply after having sent a connection request
(CTRL=SYN).
⚫ SYN-RECEIVED - waiting for a matching connection acknowledgment
(CTRL=ACK) after having received a connection request (CTRL=SYN) and
replied (CTRL=SYN+ACK).
⚫ ESTABLISHED - connection is open, user data can be sent and received.
⚫ FIN-WAIT-1 - waiting for connection termination request from the remote TCP
(CTRL=FIN+ACK), or acknowledgment of the termination request previously sent.
⚫ FIN-WAIT-2 - waiting for a connection termination request from the remote TCP.
⚫ CLOSE-WAIT - waiting for a connection termination request from the local user.
⚫ CLOSING - waiting for connection termination request acknowledgment from
remote TCP.
⚫ LAST-ACK - waiting for acknowledgment of the connection termination request
previously sent to the remote TCP (after acknowledging its termination request).
⚫ TIME-WAIT - waiting for enough time to pass to be sure the remote TCP received
the acknowledgment of its connection termination request.
⚫ CLOSED - no connection active or pending.
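A compact sketch of part of this state machine as a transition table (events and actions simplified; not a complete TCP implementation):

    # (current state, event) -> (action, next state); a simplified subset.
    TRANSITIONS = {
        ("CLOSED",      "open_passive"):  (None,           "LISTEN"),
        ("CLOSED",      "open_active"):   ("send SYN",     "SYN-SENT"),
        ("LISTEN",      "rcv SYN"):       ("send SYN+ACK", "SYN-RCVD"),
        ("SYN-SENT",    "rcv SYN+ACK"):   ("send ACK",     "ESTABLISHED"),
        ("SYN-RCVD",    "rcv ACK"):       (None,           "ESTABLISHED"),
        ("ESTABLISHED", "close"):         ("send FIN",     "FIN-WAIT-1"),
        ("ESTABLISHED", "rcv FIN"):       ("send ACK",     "CLOSE-WAIT"),
        ("FIN-WAIT-1",  "rcv ACK"):       (None,           "FIN-WAIT-2"),
        ("FIN-WAIT-2",  "rcv FIN"):       ("send ACK",     "TIME-WAIT"),
        ("CLOSE-WAIT",  "close"):         ("send FIN",     "LAST-ACK"),
        ("LAST-ACK",    "rcv ACK"):       (None,           "CLOSED"),
        ("TIME-WAIT",   "timeout 2*MSL"): (None,           "CLOSED"),
    }

    def step(state, event):
        action, new_state = TRANSITIONS[(state, event)]
        return action, new_state

    # Example: the initiator's path through a normal open and close.
    state = "CLOSED"
    for ev in ["open_active", "rcv SYN+ACK", "close", "rcv ACK", "rcv FIN", "timeout 2*MSL"]:
        action, state = step(state, ev)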

© Octavian Catrina, International University in Germany 13 


Connection establishment
[Message sequence chart: three-way handshake. A (initiator, CLOSED) performs an active open, sends SYN, seq=x, wnd=w1 and enters SYN-SENT. B (listener, CLOSED) has performed a passive open and is in LISTEN; it replies SYN+ACK, seq=y, ack=x+1, wnd=w2 and enters SYN-RCVD. A signals Open-Success, enters ESTABLISHED and replies ACK, seq=x+1, ack=y+1; B then signals Open-Success and enters ESTABLISHED.]

⚫ Passive open: Listener (e.g., a server) is ready to communicate.
  ⚫ Accepts incoming connections on a specified port number.
⚫ Active open: User (e.g., a client) initiates the communication.
  ⚫ SYN segment: Requests the establishment of a new connection.
  ⚫ SYN+ACK segment: Confirms (accepts) the connection establishment.
⚫ Three-way handshake procedure
  Together with the choice of initial sequence numbers, it avoids
  connection establishment anomalies.
© Octavian Catrina 14 
Example: Connection establishment
[Packet capture: TCP connection establishment from 1.1.3.2 to 1.1.4.2 with negotiation of TCP options. TCP options in this example: Maximum segment size = 1460 octets (MSS option); Selective acknowledgments (SACK option). The SYN segment details show that the sender supports (and wants to use) the selective acknowledgement (SACK) mechanism.]
© Octavian Catrina 15
Data transfer
⚫ Reliable transfer of byte streams
⚫ Same octet values, same octet count and order.
Checksum to detect and drop segments with bit errors.
Sequence numbers associated with data octets.
⚫ Structure of the submitted data stream is not preserved.
⚫ Urgent data service
⚫ User can request immediate delivery of a subset of data
in the byte stream (URG flag + pointer).

⚫ Main functions
⚫ Error control.
⚫ Flow control.
⚫ Congestion control.

© Octavian Catrina 16 
Error control: Data acknowledgement
[Message sequence chart: TCP A and TCP B, both ESTABLISHED. User A submits data[500], data[300] and data[200]; TCP A sends them as segments seq=s1, seq=s1+500 and seq=s1+800, each also acknowledging B's data (ack=s2). TCP B delivers the data to user B and returns cumulative acknowledgements ack=s1+500, ack=s1+800 and ack=s1+1000. User B also submits data[400], which TCP B sends with seq=s2 (piggybacked on an acknowledgement); TCP A delivers it and acknowledges with ack=s2+400. Every segment carries a sequence number for its own data and an acknowledgement for the reverse direction.]

Basic error control: cumulative acknowledgements and retransmission timer.
The receiver returns an acknowledgement segment for every received data
segment. The acknowledgement field indicates the current in-order received data.
© Octavian Catrina 17 
Error control: Sequence numbers

[Figure: Send sequence space: data sent and acknowledged | data sent, unacknowledged | data not sent; the boundaries are SND.UNA (← SEG.ACK of incoming segments) and SND.NXT (→ SEG.SEQ of outgoing segments). Receive sequence space: data received | data not received; the boundary is RCV.NXT (→ SEG.ACK of outgoing segments), advanced by SEG.SEQ+SEG.LEN of in-order incoming segments. The gap between the two views corresponds to acknowledgements sent but not yet received and data sent but not yet received.]

TCP segment header fields used for error control:
• Sequence number (SEG.SEQ)
• Acknowledgment number (SEG.ACK)
• Checksum

TCP state variables used for error control:
• Send sequence variables: SND.UNA - Send Unacknowledged; SND.NXT - Send Next
• Receive sequence variables: RCV.NXT - Receive Next
(A bookkeeping sketch follows below.)
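A sketch of the sender-side bookkeeping implied by these variables (simplified: segment loss and wraparound of 32-bit sequence numbers are ignored):

    class SenderState:
        """Simplified send-sequence bookkeeping (no retransmission, no wraparound)."""
        def __init__(self, iss):
            self.snd_una = iss       # oldest unacknowledged sequence number
            self.snd_nxt = iss       # next sequence number to send

        def on_send(self, length):
            seg_seq = self.snd_nxt   # SEG.SEQ of the outgoing segment
            self.snd_nxt += length
            return seg_seq

        def on_ack(self, seg_ack):
            # Accept only acknowledgements for data actually sent.
            if self.snd_una < seg_ack <= self.snd_nxt:
                self.snd_una = seg_ack

    s = SenderState(iss=1000)
    s.on_send(500); s.on_send(300)
    s.on_ack(1500)
    print(s.snd_una, s.snd_nxt)      # 1500 1800: 300 octets still unacknowledged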
© Octavian Catrina 18 
Basic data retransmission
[Message sequence chart: TCP A sends three data segments, seq=s1 (500 octets), seq=s1+500 (300 octets) and seq=s1+800 (200 octets). The first segment is lost; TCP B stores the two out-of-order segments in its buffer (gap of 500, then 300 and 200 octets) and keeps returning cumulative acknowledgements with ack=s1. When the retransmission timer expires, TCP A retransmits the data from s1 to s1+500; TCP B can then deliver all 1000 octets to user B and acknowledges with ack=s1+1000.]

Error control using cumulative acknowledgements and a retransmission timer.
The receiver may save out-of-order data (beyond RCV.NXT) in its buffer to
reduce retransmissions. Cumulative acks do not provide precise information
about lost data. Inefficient, especially when multiple data segments are lost.
© Octavian Catrina 19 
Dynamic timer adjustment

[Figure: RTT measurement. For each data segment k, the sender records the transmission time Tdata[k]; when the acknowledgement covering that segment arrives at time Tack[k], the sample is RTT[k] = Tack[k] - Tdata[k]. Example: seq=s1, data[500] acknowledged by ack=s1+500, then seq=s1+500, data[300] acknowledged by ack=s1+800.]

⚫ Permanent round-trip time (RTT) measurements: RTT[k].
⚫ Computation of smoothed RTT: SRTT[k].
⚫ Computation of smoothed RTT deviation: SDEV[k].
⚫ New timer value: RTO[k] = SRTT[k] + 4 · SDEV[k].
  (Details in the annex; a sketch follows below.)
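A sketch of the RTT estimator and RTO computation along the lines of RFC 6298 (usual constants alpha = 1/8, beta = 1/4; the 1-second minimum RTO and timer backoff are omitted):

    class RtoEstimator:
        def __init__(self):
            self.srtt = None          # smoothed RTT
            self.sdev = None          # smoothed RTT deviation
            self.rto = 1.0            # initial RTO, seconds

        def on_rtt_sample(self, rtt):
            if self.srtt is None:     # first measurement
                self.srtt = rtt
                self.sdev = rtt / 2
            else:
                self.sdev = 0.75 * self.sdev + 0.25 * abs(self.srtt - rtt)
                self.srtt = 0.875 * self.srtt + 0.125 * rtt
            self.rto = self.srtt + 4 * self.sdev
            return self.rto

    est = RtoEstimator()
    for sample in [0.10, 0.12, 0.30, 0.11]:    # RTT samples in seconds
        print(round(est.on_rtt_sample(sample), 3))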

© Octavian Catrina 20 
Selective retransmission
[Message sequence chart: same scenario as before (the first segment, seq=s1, 500 octets, is lost; the 300- and 200-octet segments arrive and are buffered), but TCP B's acknowledgements now carry SACK blocks: ack=s1, sack=(s1+500 → s1+800), then ack=s1, sack=(s1+500 → s1+1000). TCP A retransmits only the data from s1 to s1+500 as soon as the selective acknowledgements arrive, without waiting for a timeout; TCP B then delivers all 1000 octets and acknowledges ack=s1+1000. Faster recovery.]

If the SACK option is supported and enabled, the receiver uses selective
acknowledgments to tell the sender what out-of-order data (beyond RCV.NXT)
is saved in its buffer, and hence what data has to be retransmitted.
Faster recovery, especially when multiple data segments are lost.
© Octavian Catrina 21 
Example: Fast retransmission, no SACK
[Packet capture: HTTP data transfer over a TCP connection from 1.1.4.2 to 1.1.3.2 without SACK. Data segment length: 1024 octets; a single packet is lost (seq ∈ [78354, 79378), 1 data segment). Fast retransmission: the sender retransmits after receiving ≥ 3 duplicate acknowledgements. The capture shows the retransmission of the lost data segment, after which the lost data has been recovered.]

© Octavian Catrina 22
Selective acknowledgment (SACK)

[Packet capture detail: an acknowledgement segment carrying a SACK option. Cumulative acknowledgment: all data has been received up to sequence number 162232. Data is missing from sequence number 162232 to 163346 (1024 octets). The selective acknowledgment option (SACK) reports that further data has been received from sequence number 163346 to 164370.]
© Octavian Catrina 23
Example: Selective retransmission (1)
[Packet capture: HTTP data transfer over a TCP connection from 1.1.4.2 to 1.1.3.2 with selective acknowledgements (SACK). Data segment length: 1024 octets; SLE: SACK Left Edge; SRE: SACK Right Edge. A single packet is lost (seq ∈ [162232, 163346), 1 data segment). The capture shows the retransmission of the lost data segment, after which all the lost data has been recovered.]

© Octavian Catrina 24
Example: Selective retransmission (2)
[Packet capture: HTTP data transfer over a TCP connection from 1.1.4.2 to 1.1.1.2 with selective acknowledgements (SACK). Data segment length: 1024 octets; SLE: SACK Left Edge; SRE: SACK Right Edge. Congested network path, multiple lost packets: seq ∈ [40466, 43538), 3 data segments (congestion), and seq ∈ [45586, 46610), 1 data segment. The capture shows the retransmission of the 4 lost data segments, after which all the lost data has been recovered.]

© Octavian Catrina 25
Flow control
⚫ Allows the receiver to slow down a faster transmitter
⚫ End-to-end flow control using sliding-window mechanism.

⚫ Flow control window


⚫ Upper limit for the amount of data that a transmitter can send
(beyond the acknowledged sequence).
⚫ Explicitly indicated by the receiver.

[Figure: Sender and receiver (the bottleneck). Data segments carry sequence numbers from the sender to the receiver; acknowledgement segments carry the acknowledgement number and window size back, so the receiver indicates its window and the sender updates its transmitter window accordingly.]

© Octavian Catrina 26 
Flow control: Sender/receiver windows

[Figure: Send sequence space: data sent and acknowledged | data sent, unacknowledged | can be sent (sender window) | cannot be sent (out of window). The boundaries are SND.UNA (← SEG.ACK), SND.NXT (→ SEG.SEQ) and SND.UNA + SND.WND, where SND.WND is the last advertised window (SND.WND ← SEG.WND). Receive sequence space: data received | can be received (receiver window) | cannot be received. The boundaries are RCV.NXT (→ SEG.ACK) and RCV.NXT + RCV.WND; the advertised window size is RCV.WND (→ SEG.WND).]

(A sketch of the sender-side window check follows below.)
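A sketch of how the sender enforces the advertised window; the usable window is SND.UNA + SND.WND - SND.NXT (names follow the variables above):

    def usable_window(snd_una, snd_nxt, snd_wnd):
        """Octets the sender may still transmit without exceeding the advertised window."""
        return snd_una + snd_wnd - snd_nxt

    def send_allowed(snd_una, snd_nxt, snd_wnd, seg_len):
        return seg_len <= usable_window(snd_una, snd_nxt, snd_wnd)

    # Example: 800 octets in flight, advertised window 1000 -> 200 octets usable.
    print(usable_window(snd_una=1000, snd_nxt=1800, snd_wnd=1000))                  # 200
    print(send_allowed(snd_una=1000, snd_nxt=1800, snd_wnd=1000, seg_len=500))      # False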

© Octavian Catrina 27 
Flow control: example

[Message sequence chart: TCP B initially advertises a window of 1000 octets (ack=s1, wnd=1000), matching its receive buffer. User A submits data[500], data[300] and data[400]; TCP A can send only 500+300+200 = 1000 octets (seq=s1, s1+500, s1+800), so the remaining 200 octets must wait for credit. As the data fills the receive buffer, TCP B's acknowledgements shrink the window: ack=s1+500, wnd=500; ack=s1+800, wnd=200; ack=s1+1000, wnd=0. With a zero window, TCP A stops the retransmission timer and starts the persist timer. When user B consumes 800 octets (Receive(data[800])), TCP B advertises ack=s1+1000, wnd=800; TCP A stops the persist timer and sends the remaining 200 octets (seq=s1+1000, data[200]), acknowledged with ack=s1+1200, wnd=600.]

© Octavian Catrina 28 
Congestion control
⚫ Limits transmission to avoid network congestion
⚫ As congestion is building up, IP routers start dropping packets.
Also, the transfer delay increases (due to queuing delay).
⚫ TCP congestion control adjusts the transmission rate according
to implicit congestion signals from the network.
Assumes that packets are lost due to congestion rather than bit errors.
Details in next section.

[Figure: Two LANs interconnected through an IP network containing a bottleneck. Packets dropped at the bottleneck are detected by the error control mechanism and interpreted as a signal for the TCP transmitter to slow down.]
© Octavian Catrina 29 
Graceful close
[Message sequence chart: both endpoints ESTABLISHED. User A invokes Close; TCP A sends FIN, ACK, seq=s1, ack=s2 (closing the stream A→B) and enters FIN-WAIT-1. TCP B replies ACK, seq=s2, ack=s1+1, enters CLOSE-WAIT and signals Closing / "stream A→B closed" to user B; TCP A enters FIN-WAIT-2. When user B later invokes Close, TCP B sends FIN, ACK, seq=s2, ack=s1+1 (closing the stream B→A) and enters LAST-ACK. TCP A replies ACK, seq=s1+1, ack=s2+1, enters TIME-WAIT and signals that both streams are closed; on receiving the ACK, TCP B terminates. Both endpoints eventually reach CLOSED.]

⚫ TCP closes each of the two data streams independently and
  ensures complete data delivery for each data stream.
Example: The stream A → B. User A asks TCP to close the stream A → B by invoking Close (no
more data to send). After sending all buffered data, TCP A sends a FIN segment, to inform B that
it closes the stream A → B and TCP B should have received all data up to seq = s1. After
receiving all the data (with retransmissions, if necessary), TCP B sends an ACK segment with
ack = s1 + 1, to confirm it received the data up to s1, as well as the FIN segment.

© Octavian Catrina 30 
Closing: avoiding anomalies
⚫ TIME-WAIT state
⚫ The TCP endpoint that sends the last ACK during the
closing procedure must delay the release of the
connection's state.
⚫ Duration: 2 × MSL, where MSL = Maximum Segment
Lifetime (e.g., 1-2 min.).
⚫ Purpose of the TIME-WAIT state
⚫ Allow recovery of the last closing handshake (when the
last ACK is lost and hence the last FIN is retransmitted).
⚫ Prevent the reuse of the connection's address pair as
long as its packets can survive (2 × MSL) in the network.
⇒ Avoid interference between successive connection
instances.

© Octavian Catrina 31 
Congestion in IP networks
Non-responsive Flows

© Octavian Catrina 32
Packet forwarding model

[Figure: Routers R1, R2, R3, R4, R5, R6; R3-R4 is the bottleneck link. Each output interface has a packet queue (buffer) of limited size; packets are dropped when the queue overflows.]

⚫ A router determines the interface on which a packet has to be


sent out, and appends the packet to its queue.
⚫ Packet queues allow routers to handle short term overload, i.e.,
received packet bursts exceeding the link bandwidth.
⚫ In case of persistent overload (or large bursts), queues overflow
and packets are discarded (a tail-drop queue sketch follows below).
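A sketch of the FIFO, tail-drop queue behavior assumed in this model (the capacity value is arbitrary):

    from collections import deque

    class TailDropQueue:
        """FIFO output queue with tail-drop: arrivals that find the queue full are discarded."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.queue = deque()
            self.dropped = 0

        def enqueue(self, packet):
            if len(self.queue) >= self.capacity:
                self.dropped += 1        # queue overflow: drop the arriving packet
                return False
            self.queue.append(packet)
            return True

        def dequeue(self):               # called when the link can transmit the next packet
            return self.queue.popleft() if self.queue else None

    q = TailDropQueue(capacity=3)
    for p in range(5):                   # a burst of 5 packets hits a 3-packet buffer
        q.enqueue(p)
    print(len(q.queue), q.dropped)       # 3 queued, 2 dropped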

© Octavian Catrina 33 
Example: Congestion in IP networks

[Figure: Routers R1-R6. Flow 1 (blue) enters at R1 with r1=10 over a link of capacity 100 and is destined to R5 over a link of capacity 10. Flow 2 (red) enters at R2 with r2=90 over a link of capacity 100 and is destined to R6 over a link of capacity 1. Both share the R3-R4 link of capacity 20, where they obtain r1=2 and r2=18; at R4, flow 2 is further reduced to r2=1. Overall throughput: r1+r2 = 2+1 = 3, but we could get r1+r2 = 10+1 = 11.]

⚫ How? Overload combined with waste of resources


⚫ E.g., bottleneck links: insufficient bandwidth on links R3-R4-R6
for the red flow; overloaded link R3-R4 for the blue flow.
⚫ R3 and R4 drop many packets, because the bandwidth of the
links R3-R4 and R4-R6 is largely exceeded.
⚫ The red flow throttles the blue flow: its packets consume a lot
of bandwidth on R3-R4, and are dropped later at R4!
⚫ Very inefficient and unfair operation
© Octavian Catrina 34 
Congestion
⚫ Behavior under heavy load
  ⚫ Knee: point after which
    ⚫ throughput increases slowly.
    ⚫ delay increases quickly.
  ⚫ Cliff: point after which
    ⚫ throughput decreases quickly to zero - congestion collapse.
    ⚫ delay goes to infinity.
⚫ Cause of congestion collapse
  ⚫ Resources are consumed by useless packets, e.g., packets discarded
    later, repeated retransmissions.
⚫ Congestion avoidance: stay at the knee.
⚫ Congestion control: stay to the left of the cliff.

[Figure: Throughput vs. load and delay vs. load. Throughput rises in the underutilization region, increases slowly past the knee (saturation), and drops towards zero past the cliff (overutilization, congestion collapse); the delay grows slowly up to the knee and goes to infinity at the cliff.]
© Octavian Catrina 35 
Congestion experiments (1)
Test 1: H0 and H1 send at r0 = r1 = 8 (UDP).

[Figure: Hosts H0 and H1 (access links BW0 = BW1 = 100) send through routers R1 and R2, connected by a bottleneck link of BW = 20, to hosts H5 and H6 (access links BW5 = BW6 = 10). Measured rates: r0' ≈ 8 and r1' ≈ 8 entering the bottleneck (r3 = 16), r5 = 8 and r6 = 8 at the receivers. The flow H0→H5 starts at 0, the flow H1→H6 at 20.]

Data rates are measured at H0, H1, H5, H6 and on the bottleneck link (the color code is shown in the figure). No congestion, no data lost: the network delivers r = 16 (out of up to 20). The measured throughput includes 20% encapsulation overhead.

© Octavian Catrina 36
Congestion experiments (2)
Test 2: H1 sends at r1 = 80 (UDP).

[Figure: Same topology. H0 sends at r0 = 8, H1 at r1 = 80. Measured rates: r0' ≈ 1.82 and r1' ≈ 18.18 on the bottleneck (r3 = 20), r5 ≈ 1.8 and r6 ≈ 10 at the receivers.]

Data rates are measured at H0, H1, H5, H6 and on the bottleneck link (the color code is shown in the figure). Congestion: inefficient, unfair network utilization.
- H1 sends at r = 80 on a path with BW = 10; only r = 10 is delivered.
- H1 takes most of the bandwidth on the bottleneck link. Almost nothing is left for H0, and almost all of its data is lost.
- The network delivers r ≈ 11.8, although r = 8 + 10 = 18 < 20 is possible.

© Octavian Catrina 37
Congestion experiments (3)
Test 3: H1 sends at r1 = 80 (UDP), BW6 = 1.

[Figure: Same topology, but the access link to H6 now has BW6 = 1 (a second bottleneck). H0 sends at r0 = 8, H1 at r1 = 80. Measured rates: r0' ≈ 1.82 and r1' ≈ 18.18 on the R1-R2 bottleneck (r3 = 20), r5 ≈ 1.8 and r6 ≈ 1 at the receivers.]

Data rates are measured at H0, H1, H5, H6 and on the bottleneck link (the color code is shown in the figure). Congestion collapse: hardly anything is delivered.
- The network delivers r ≈ 2.8, although r = 8 + 1 = 9 < 11 is possible.

© Octavian Catrina 38
Congestion in IP networks
TCP Congestion Control

© Octavian Catrina 39
IP and TCP congestion control
⚫ IP network behavior
  When congestion builds up, packets accumulate in the routers' packet
  queues. The transfer delay increases and the routers start dropping packets.
⚫ IP congestion control
  Routers use queue management mechanisms to control the queue size and to
  decide which packets to forward or drop, and when.
⚫ TCP host behavior
  TCP monitors the amount of data in transit, the round-trip time, and the
  lost packets. It assumes that lost packets are congestion symptoms.
⚫ TCP congestion control
  TCP limits its transmission using a congestion window, dynamically
  adjusted based on congestion symptoms.

Goals
⚫ Efficiency: Avoid overload (collapse) as well as underutilization.
⚫ Fairness: Allocate a fair share of resources to all flows.
⚫ Smooth convergence (low oscillations) to efficiency and fairness.
© Octavian Catrina 40

TCP data transfer (1/3)
Amount of data in the pipe: N = R·D

[Figure: A data pipe with rate R and delay D; the source transmits at rate R'.]

⚫ The network path used by a TCP connection is (roughly) a data pipe
  with rate R and delay D.
  ⚫ The rate R is limited by the slowest link. The delay D is the sum of
    per-hop transmission, propagation and queuing delays.
⚫ Ideally, the source sends at rate R' ≈ R and, once the data pipe fills
  up, data flows at the maximum throughput.
⚫ If the source sends faster, the pipe "leaks" and some data is lost
  (i.e., discarded by the routers).
⚫ Problem: R and D are variable (depending on network traffic) and the
  TCP sender does not know them.
© Octavian Catrina 41 
TCP data transfer (2/3)
Amount of data in the pipe: N = R·D

[Figure and message sequence chart: the source keeps the pipe full by sending 1000-octet segments (seq=s1, s1+1000, s1+2000, ...) while the acknowledgements (ack=s1+1000, s1+2000, ...) return after one round-trip time (RTT).]

⚫ The TCP acknowledgment mechanism allows TCP to fill the pipe (handle
  multiple unacknowledged data segments) and to estimate the current RTT
  and the current amount of data in the pipe:
  data sent during one RTT ≈ unacknowledged data = SND.NXT - SND.UNA.
© Octavian Catrina 42 
TCP data transfer (3/3)
Amount of data in the pipe: N = R·D

[Figure: The same data pipe. The TCP sender does not know R and D, but maintains a congestion window CWND ≈ R·RTT and transmits at rate R' ≈ CWND / RTT.]

⚫ TCP adjusts its transmission rate to the available data rate on the
  network path:
  ⚫ Maintains a congestion window CWND which approximates R·RTT.
  ⚫ Limits transmission such that the amount of unacknowledged data
    (SND.NXT - SND.UNA) is less than CWND (and also less than the
    window advertised by the receiver, for end-to-end flow control).
  ⚫ Advances the window (and sends more data) when ACKs arrive,
    indicating that some data was delivered (hence exited the pipe).
    Therefore, ACKs also provide transmission timing (self-clocking).
⚫ How to dynamically adjust CWND so as to satisfy the goals
  (efficiency, fairness, smooth convergence)?
© Octavian Catrina 43 
Efficiency and fairness
Goals: Efficiency and fairness

[Figure 1: Control system model. Sources 1..n send at rates x1, x2, ..., xn; the network returns binary feedback to each source: decrease xk (too much traffic) or increase xk. The adjustment rules are:
  increase: xk(t+1) = aI·xk(t) + bI
  decrease: xk(t+1) = aD·xk(t) + bD
Figure 2: The 2-flow case in the (x1, x2) plane, with the efficiency line x1+x2 = C, the fairness line x1 = x2, and the under-/over-utilization regions (x1+x2 < C, x1+x2 > C). A multiplicative decrease moves the operating point from (x1, x2) to (aD·x1, aD·x2), towards the origin; a subsequent additive increase moves it to (aD·x1+bI, aD·x2+bI), parallel to the fairness line, so repeated steps converge towards the fairness line.]

⚫ This system converges to data rates meeting the efficiency and fairness
  goals only for additive or multiplicative increase and multiplicative
  decrease.
⚫ Best: Additive Increase & Multiplicative Decrease (AIMD):
  Additive increase: xk(t+1) = xk(t) + bI
  Multiplicative decrease: xk(t+1) = aD·xk(t)
⚫ Basic solution used by TCP congestion control mechanisms
  (+ enhancements); a small simulation sketch follows below.
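A tiny simulation of AIMD for two flows sharing a link of capacity C (the values of C, bI and aD are arbitrary choices for the illustration):

    # Two flows sharing a link of capacity C, applying AIMD on binary feedback.
    C, b_I, a_D = 100.0, 1.0, 0.5       # capacity, additive step, decrease factor

    x1, x2 = 10.0, 70.0                 # initial (unfair) rates
    for t in range(200):
        if x1 + x2 > C:                 # congestion feedback: multiplicative decrease
            x1, x2 = a_D * x1, a_D * x2
        else:                           # no congestion: additive increase
            x1, x2 = x1 + b_I, x2 + b_I

    print(round(x1, 1), round(x2, 1))   # x1 ~ x2: the rates have converged to fairness,
                                        # oscillating below the efficiency line x1+x2 = C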
© Octavian Catrina 44 
TCP congestion control
⚫ Congestion window (CWND) adjustment
  ⚫ At steady state, CWND oscillates around the current optimal value,
    CWND ≈ R·RTT, for the throughput R that the network path can offer
    to the TCP flow and the current RTT.

[Figure: sawtooth variation of the congestion window over time: additive increase followed by multiplicative decrease.]

⚫ Basic algorithm components
  ⚫ Additive increase: "Congestion avoidance".
  ⚫ Multiplicative decrease: After detecting packet loss.
  ⚫ "Slow Start": Gradual increase of CWND from 1 to SSTHRESH
    (slow start threshold) after connection setup and timeouts.
  ⚫ Enhancements: "Fast Retransmit" and "Fast Recovery".
© Octavian Catrina 45 
Slow Start, Congestion Avoidance (1/3)
[Figure: cwnd over time. Slow Start (fast additive increase) up to ssthresh, then Congestion Avoidance (slow additive increase); on a timeout, ssthresh is halved (multiplicative decrease) and the cycle restarts with Slow Start followed by Congestion Avoidance.]

⚫ Slow start: if cwnd < ssthresh - additive increase
  Increment cwnd by one segment for each (non-duplicate) received ACK.
  The congestion window starts from 1 segment and doubles every RTT.
⚫ Congestion avoidance: if cwnd ≥ ssthresh - additive increase
  Increment cwnd by 1/cwnd for each (non-duplicate) ACK.
  The congestion window grows linearly, with one segment every RTT.
⚫ Timeout (congestion symptom) - multiplicative decrease
  ssthresh ← max(SentUnacked/2, 2 segments); cwnd ← 1 segment.
© Octavian Catrina 46 
Slow Start, Congestion Avoidance (2/3)

(Simplified - see RFC 2581)

[Figure: Variation of the congestion window during Slow Start and Congestion Avoidance, plotted in segments vs. round-trip times: cwnd doubles each RTT (1, 2, 4, 8) up to ssthresh = 8, then grows by one segment per RTT (congestion avoidance). cwnd = congestion window (segments); ssthresh = slow-start threshold (segments).]

Initially:
    cwnd = 1;
    ssthresh = large;

Ack received (not a duplicate):
    // Additive increase
    if (cwnd < ssthresh)        // Slow Start
        cwnd = cwnd + 1;
    else                        // Congestion Avoidance
        cwnd = cwnd + 1/cwnd;

Timeout:
    // Multiplicative decrease
    ssthresh = SentUnacked/2;
    cwnd = 1;

© Octavian Catrina 47 
Slow Start, Congestion Avoidance (3/3)

[Figure: cwnd over time with timeouts during Slow Start. Starting from the initial ssthresh, Slow Start (fast additive increase) is interrupted by a timeout (multiplicative decrease of ssthresh); Slow Start restarts and hands over to Congestion Avoidance (slow additive increase) at the new ssthresh; a second timeout halves ssthresh again.]

⚫ Another scenario: timeout during slow start.
  Note how ssthresh is adjusted after each timeout:
  ssthresh ← max(SentUnacked/2, 2 segments); cwnd ← 1 segment.

© Octavian Catrina 48 
Data retransmission
[Two message sequence charts comparing retransmission triggers. In both, the segment seq=s1 (1000 octets) is lost, while seq=s1+1000, s1+2000 and s1+3000 arrive and each produces a duplicate ACK with ack=s1. Left: the sender waits until the retransmission timer expires, then retransmits seq=s1 and receives ack=s1+3000 (timeout retransmission). Right: the duplicate ACKs could trigger an earlier retransmission of seq=s1; RFC 2581 requires the reception of three duplicate ACKs before fast retransmission.]

⚫ Timeout retransmission
⚫ The timer is adjusted based on RTT measurements, but set to
a conservative value (substantially larger than RTT).
⚫ Duplicate ACKs
⚫ When data arrives out of order due to the loss of previous
segments, the receiver returns an ACK indicating the expected
sequence number. Can be used to trigger earlier retransmission.
© Octavian Catrina 49 
Fast Retransmit, Fast Recovery
[Figure: cwnd over time for TCP Tahoe vs. TCP Reno. Tahoe: on fast retransmit, cwnd drops to 1 and Slow Start restarts before Congestion Avoidance. Reno: on fast retransmit + fast recovery, cwnd is halved and Congestion Avoidance continues.]

⚫ Fast Retransmit
  ⚫ Add another congestion symptom event: three duplicate ACKs.
  ⚫ Faster than waiting for a timeout. Introduced in TCP Tahoe.
⚫ Fast Recovery
  ⚫ Duplicate ACKs are received ⇒ the network still delivers data
    ⇒ light congestion ⇒ do not empty the pipe, just reduce the amount
    of data in transit to half: set CWND = ssthresh = SentUnacked/2.
  ⚫ See details in RFC 2581 and RFC 2582. Added in TCP Reno.
    (A sketch of the sender's loss reactions follows below.)
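A sketch of a Reno-style sender's congestion reactions (segment-based cwnd; constants and structure simplified relative to RFC 2581):

    class RenoCwnd:
        def __init__(self):
            self.cwnd = 1.0
            self.ssthresh = 64.0
            self.dup_acks = 0

        def on_new_ack(self):
            self.dup_acks = 0
            if self.cwnd < self.ssthresh:      # Slow Start
                self.cwnd += 1
            else:                              # Congestion Avoidance
                self.cwnd += 1 / self.cwnd

        def on_duplicate_ack(self, sent_unacked):
            self.dup_acks += 1
            if self.dup_acks == 3:             # Fast Retransmit + Fast Recovery
                self.ssthresh = max(sent_unacked / 2, 2)
                self.cwnd = self.ssthresh      # do not empty the pipe
                # (retransmit the missing segment here)

        def on_timeout(self, sent_unacked):    # severe congestion symptom
            self.ssthresh = max(sent_unacked / 2, 2)
            self.cwnd = 1                      # restart with Slow Start
            self.dup_acks = 0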
© Octavian Catrina 50 
Analysis
Approximation of the TCP behavior

⚫ TCP transmission rate: R ≈ k·L / (T·q^(1/2)) (bps), where
  L = packet length; T = round-trip time; q = packet loss rate;
  k = (3/2)^(1/2) ≈ 1.22. (A numeric check follows below.)

[Figure: idealized sawtooth of the congestion window, oscillating between W/2 and W over time - the model behind the formula.]
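A quick numeric check of this approximation with hypothetical values (L = 1500 bytes, T = 100 ms, q = 1%):

    L = 1500 * 8          # packet length in bits
    T = 0.1               # round-trip time in seconds
    q = 0.01              # packet loss rate
    k = (3 / 2) ** 0.5    # ~1.22

    R = k * L / (T * q ** 0.5)
    print(f"R = {R/1e6:.2f} Mbit/s")   # roughly 1.47 Mbit/s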

⚫ Limitations of TCP congestion control mechanisms


⚫ Vulnerable to non-congestion related loss (e.g., wireless).
⚫ Flows with very long RTT are penalized (lower throughput).
⚫ All sources must cooperate. Otherwise, the sources that respond
to congestion are locked out by the sources that do not.
⚫ TCP friendly applications
⚫ Some applications do not use TCP (e.g., voice/video).
⚫ IETF solution: Applications must be "TCP friendly", i.e., use an
adaptive transmission algorithm with the same data rate as a
TCP connection experiencing the same packet loss.
© Octavian Catrina 51 
Simulation setup

[Figure: Simulation topology. Several traffic sources (TCP or UDP/CBR) send through router R1, across the bottleneck link, to router R2 and the traffic sinks. The monitors record the R1 queue size, the bottleneck link throughput, the TCP cwnd and the end-to-end throughput, and trace the packets dropped when the queue overflows. Queue management: FIFO, tail-drop.]

© Octavian Catrina 52
TCP Tahoe

[Simulation plots: TCP Tahoe. Bottleneck link and TCP throughput, R1 queue size (capacity: 5000) and ssthresh over time, with the Slow Start and Congestion Avoidance phases marked.]
© Octavian Catrina 53
TCP Reno

[Simulation plots: TCP Reno. Bottleneck link and TCP throughput, R1 queue size (capacity: 5000) and ssthresh over time, with the Slow Start, Congestion Avoidance and Fast Recovery phases marked.]

© Octavian Catrina 54
TCP (Reno) + UDP/CBR

[Simulation plots: TCP (Reno) competing with a UDP/CBR flow. Link, TCP and CBR throughput (the CBR flow starts at t = 80) and R1 queue size (≤ 5000) over time.]

© Octavian Catrina 55
