Filename: 324-rtt-congestion-control.txt
Title: RTT-based Congestion Control for Tor
Author: Mike Perry
Created: 02 July 2020
Status: Open
0. Motivation [MOTIVATION]
This proposal specifies how to incrementally deploy RTT-based congestion
control and improved queue management in Tor. It is written to allow us
to first deploy the system only at Exit relays, and then incrementally
improve the system by upgrading intermediate relays.
Lack of congestion control is the reason why Tor has an inherent speed
limit of about 500KB/sec for downloads and uploads via Exits, and even
slower for onion services. Because our stream SENDME windows are fixed
at 500 cells per stream, and only ~500 bytes can be sent in one cell,
the max speed of a single Tor stream is 500*500/circuit_latency. This
works out to about 500KB/sec max sustained throughput for a single
download, even if circuit latency is as low as 500ms.
Because onion service paths are more than twice the length of Exit
paths (and thus have more than twice the circuit latency), onion service
circuits will always have less than half the throughput of Exit
circuits, until we deploy proper congestion control with dynamic
windows.
Proper congestion control will remove this speed limit for both Exits
and onion services, as well as reduce memory requirements for fast Tor
relays, by reducing queue lengths.
The high-level plan is to use Round Trip Time (RTT) as a primary
congestion signal, and compare the performance of two different
congestion window update algorithms that both use RTT as a congestion
signal.
The combination of RTT-based congestion signaling, a congestion window
update algorithm, and Circuit-EWMA will get us most, if not all, of the
benefits we seek, and only requires clients and Exits to upgrade to
use it. Once this is deployed, circuit bandwidth will no longer be
capped at ~500KB/sec by the fixed window sizes of SENDME; queue latency
will fall significantly; memory requirements at relays should plummet;
and transient bottlenecks in the network should dissipate.
Extended background information on the choices made in this proposal can
be found at:
https://lists.torproject.org/pipermail/tor-dev/2020-June/014343.html
https://lists.torproject.org/pipermail/tor-dev/2020-January/014140.html
An exhaustive list of citations for further reading is in Section
[CITATIONS].
A glossary of common congestion control acronyms and terminology is in
Section [GLOSSARY].
1. Overview [OVERVIEW]
This proposal has seven main sections, after this overview. These
sections are referenced [IN_ALL_CAPS] rather than by number, for easy
searching.
Section [CONGESTION_SIGNALS] specifies how to use Tor's SENDME flow
control cells to measure circuit RTT, for use as an implicit congestion
signal. It also mentions an explicit congestion signal, which can be
used as a future optimization once all relays upgrade.
Section [CONTROL_ALGORITHMS] specifies three candidate congestion window
update algorithms, which will be compared for performance in simulation
in Shadow, as well as evaluated on the live network, and tuned via
consensus parameters listed in [CONSENSUS_PARAMETERS].
Section [FLOW_CONTROL] specifies how to handle back-pressure when one of
the endpoints stops reading data, but data is still arriving. In
particular, it specifies what to do with streams that are not being read
by an application, but still have data arriving on them.
Section [SYSTEM_INTERACTIONS] describes how congestion control will
interact with onion services, circuit padding, and conflux-style traffic
splitting.
Section [EVALUATION] describes how we will evaluate and tune our
options for control algorithms and their parameters.
Section [PROTOCOL_SPEC] describes the specific cell formats and
descriptor changes needed by this proposal.
Section [SECURITY_ANALYSIS] provides information about the DoS and
traffic analysis properties of congestion control.
2. Congestion Signals [CONGESTION_SIGNALS]
In order to detect congestion at relays on a circuit, Tor will use
circuit Round Trip Time (RTT) measurement. This signal will be used in
slightly different ways in our various [CONTROL_ALGORITHMS], which will
be compared against each other for optimum performance in Shadow and on
the live network.
To facilitate this, we will also change SENDME accounting logic
slightly. These changes only require clients, exits, and dirauths to
update.
As a future optimization, it is possible to send a direct ECN congestion
signal. This signal *will* require all relays on a circuit to upgrade to
support it, but it will reduce congestion by making the first congestion event
on a circuit much faster to detect.
To reduce confusion and complexity of this proposal, this signal has been
moved to the ideas repository, under xxx-backward-ecn.txt [BACKWARD_ECN].
2.1 RTT measurement
Recall that Tor clients, exits, and onion services send
RELAY_COMMAND_SENDME relay cells every CIRCWINDOW_INCREMENT (100) cells
of received RELAY_COMMAND_DATA.
This allows those endpoints to measure the current circuit RTT, by
measuring the amount of time between sending a RELAY_COMMAND_DATA cell
that would trigger a SENDME from the other endpoint, and the arrival of
that SENDME cell. This means that RTT is measured every 'cc_sendme_inc'
data cells.
Circuits will record the minimum and maximum RTT measurement, as well as
a smoothed value representing the current RTT. The smoothing for the
current RTT is performed as specified in [N_EWMA_SMOOTHING].
Algorithms that make use of this RTT measurement for congestion
window update are specified in [CONTROL_ALGORITHMS].
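As a rough illustration, the per-circuit RTT bookkeeping described above can
be sketched as follows. This is a minimal Python sketch, not the C Tor
implementation: the class and method names are hypothetical,
time.monotonic() stands in for monotime_absolute_usec(), and the smoothing
and clock heuristics from the following subsections are omitted.

  import time

  class CircuitRTT:
      def __init__(self, cc_sendme_inc):
          self.cc_sendme_inc = cc_sendme_inc  # data cells per SENDME (consensus)
          self.sent_since_sendme = 0
          self.pending_send_times = []        # send times awaiting a SENDME
          self.rtt_min = None
          self.rtt_max = None

      def note_data_cell_sent(self):
          # Record the send time of each data cell that will trigger a
          # SENDME from the other endpoint.
          self.sent_since_sendme += 1
          if self.sent_since_sendme == self.cc_sendme_inc:
              self.pending_send_times.append(time.monotonic())
              self.sent_since_sendme = 0

      def note_sendme_received(self):
          # One RTT sample per SENDME ack: arrival time minus the send time
          # of the cell that triggered it.
          rtt = time.monotonic() - self.pending_send_times.pop(0)
          self.rtt_min = rtt if self.rtt_min is None else min(self.rtt_min, rtt)
          self.rtt_max = rtt if self.rtt_max is None else max(self.rtt_max, rtt)
          return rtt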
2.1.1. Clock Jump Heuristics [CLOCK_HEURISTICS]
The timestamps for RTT (and BDP) are measured using Tor's
monotime_absolute_usec() API. This API is designed to provide a monotonic
clock that only moves forward. However, depending on the underlying system
clock, this may result in the same timestamp value being returned for long
periods of time, which would result in RTT 0-values. Alternatively, the clock
may jump forward, resulting in abnormally large RTT values.
To guard against this, we perform a series of heuristic checks on the time delta
measured by the RTT estimator, and if these heuristics detect a stall or a jump,
we do not use that value to update RTT or BDP, nor do we update any congestion
control algorithm information that round.
If the time delta is 0, that is always treated as a clock stall.
If we have measured at least 'cc_bwe_min' RTT values or we have successfully
exited slow start, then every sendme ACK, the new candidate RTT is compared to
the stored EWMA RTT. If the new RTT is either 5000 times larger than the EWMA
RTT, or 5000 times smaller than the stored EWMA RTT, then we do not record that
estimate, and do not update BDP or the congestion control algorithms for that
SENDME ack.
Moreover, if a clock stall is detected by *any* circuit, this fact is
cached, and this cached value is used on circuits for which we do not
have enough data to compute the above heuristics. This cached value is
also exported for use by the edge connection rate calculations done by
[XON_ADVISORY].
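A minimal sketch of these heuristics, assuming the caller supplies the raw
time delta and the current smoothed RTT; the function name, the module-level
cache, and the exact fallback behavior are illustrative, not the C Tor API:

  clock_stalled_cached = False  # shared across circuits, per the text above

  def rtt_is_stalled_or_jumped(delta_usec, ewma_rtt_usec, num_rtt_samples,
                               in_slow_start, cc_bwe_min):
      global clock_stalled_cached

      # A zero time delta is always treated as a clock stall.
      if delta_usec == 0:
          clock_stalled_cached = True
          return True

      # Apply the ratio check only once we have enough samples, or after
      # slow start has been exited.
      if num_rtt_samples >= cc_bwe_min or not in_slow_start:
          if (delta_usec > 5000 * ewma_rtt_usec or
              delta_usec * 5000 < ewma_rtt_usec):
              clock_stalled_cached = True
              return True
          clock_stalled_cached = False
          return False

      # Not enough data on this circuit: fall back to the cached value
      # from other circuits.
      return clock_stalled_cached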
2.1.2. N_EWMA Smoothing [N_EWMA_SMOOTHING]
Both RTT estimation and SENDME BDP estimation require smoothing, to
reduce the effects of packet jitter.
This smoothing is performed using N_EWMA[27], which is an Exponential
Moving Average with alpha = 2/(N+1):
N_EWMA = BDP*2/(N+1) + N_EWMA_prev*(N-1)/(N+1).
Flow control rate limiting also uses this smoothing function, as
described in [XON_ADVISORY].
For both RTT and SENDME BDP estimation, N is the number of SENDME acks
between congestion window updates, divided by the value of consensus
parameter 'cc_ewma_cwnd_pct', and then capped at a max of 'cc_ewma_max',
but always at least 2:
N = MAX(MIN(CWND_UPDATE_RATE(cc)*cc_ewma_cwnd_pct/100, cc_ewma_max), 2);
CWND_UPDATE_RATE is normally just round(CWND/cc_sendme_inc), but after
slow start, it is round(CWND/(cc_cwnd_inc_rate*cc_sendme_inc)).
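For illustration, the smoothing step can be written as below. This is a
sketch under the definitions above; the function name and the handling of
the first sample are assumptions rather than the C Tor behavior:

  def n_ewma_update(prev_ewma, new_value, cwnd, in_slow_start,
                    cc_sendme_inc, cc_cwnd_inc_rate,
                    cc_ewma_cwnd_pct, cc_ewma_max):
      # CWND_UPDATE_RATE: SENDME acks between congestion window updates.
      if in_slow_start:
          update_rate = round(cwnd / cc_sendme_inc)
      else:
          update_rate = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))

      # N = MAX(MIN(CWND_UPDATE_RATE*cc_ewma_cwnd_pct/100, cc_ewma_max), 2)
      n = max(min(update_rate * cc_ewma_cwnd_pct // 100, cc_ewma_max), 2)

      if prev_ewma is None:
          return new_value  # first sample: no previous average to blend
      # N_EWMA = value*2/(N+1) + N_EWMA_prev*(N-1)/(N+1)
      return (new_value * 2 + prev_ewma * (n - 1)) / (n + 1)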
2.2. SENDME behavior changes
We will make four major changes to SENDME behavior to aid in computing
and using RTT as a congestion signal.
First, we will need to establish a ProtoVer of "FlowCtrl=2" to signal
support by Exits for the new SENDME format and congestion control
algorithm mechanisms. We will need a similar announcement in the onion
service descriptors of services that support congestion control.
Second, we will turn CIRCWINDOW_INCREMENT into a consensus parameter
cc_sendme_inc, instead of using a hardcoded value of 100 cells. It is
likely that more frequent SENDME cells will provide quicker reaction to
congestion, since the RTT will be measured more often. If
experimentation in Shadow shows that more frequent SENDMEs reduce
congestion and improve performance but add significant overhead, we can
reduce SENDME overhead by allowing SENDME cells to carry stream data, as
well, using Proposal 325. The method for negotiating a common value of
cc_sendme_inc on a circuit is covered in [ONION_NEGOTIATION] and
[EXIT_NEGOTIATION].
Third, authenticated SENDMEs can remain as-is in terms of protocol
behavior, but will require some implementation updates to account for
variable window sizes and variable SENDME pacing. In particular, the
sendme_last_digests list for auth sendmes needs updated checks for
larger windows and CIRCWINDOW_INCREMENT changes. Other functions to
examine include:
- circuit_sendme_cell_is_next()
- sendme_record_cell_digest_on_circ()
- sendme_record_received_cell_digest()
- sendme_record_sending_cell_digest()
- send_randomness_after_n_cells
Fourth, stream level SENDMEs will be eliminated. Details on handling
streams and backpressure are covered in [FLOW_CONTROL].
3. Congestion Window Update Algorithms [CONTROL_ALGORITHMS]
In general, the goal of congestion control is to ensure full and fair
utilization of the capacity of a network path -- in the case of Tor the spare
capacity of the circuit. This is accomplished by setting the congestion window
to target the Bandwidth-Delay Product[28] (BDP) of the circuit in one way or
another, so that the total data outstanding is roughly equal to the actual
transit capacity of the circuit.
There are several ways to update a congestion window to target the BDP. Some
use direct BDP estimation, whereas others use backoff properties to achieve
this. We specify three BDP estimation algorithms in the [BDP_ESTIMATION]
sub-section, and three congestion window update algorithms in [TOR_WESTWOOD],
[TOR_VEGAS], and [TOR_NOLA].
Note that the congestion window update algorithms differ slightly from the
background tor-dev mails[1,2], due to corrections and improvements. Hence they
have been given different names than in those two mails. The third algorithm,
[TOR_NOLA], simply uses the latest BDP estimate directly as its congestion
window.
These algorithms will be evaluated by running Shadow simulations, to
help determine parameter ranges, but experimentation on the live network
will be required to determine which of these algorithms performs best
when in competition with our current SENDME behavior, as used by real
network traffic. This experimentation and tuning is detailed in section
[EVALUATION].
All of these algorithms have rules to update 'cwnd' - the current congestion
window, which starts out at a value controlled by consensus parameter
'cc_cwnd_init'. The algorithms also keep track of 'inflight', which is a count
of the number of cells currently not yet acked by a SENDME. The algorithm MUST
ensure that cells cease being sent if 'cwnd - inflight <= 0'. Note that this
value CAN become negative in the case where the cwnd is reduced while packets
are inflight.
While these algorithms are in use, updates and checks of the current
'package_window' field are disabled. Where a 'package_window' value is
still needed, for example by cell packaging schedulers, 'cwnd - inflight' is
used (with checks to return 0 in the event of negative values).
The 'deliver_window' field is still used to decide when to send a SENDME. In C
tor, the deliver window is initially set at 1000, but it never gets below 900,
because authenticated sendmes (Proposal 289) require that we must send only
one SENDME at a time, and send it immediately after 100 cells are received.
This property turns out to be very useful for [BDP_ESTIMATION].
Implementation of the different algorithms should be very simple: each
algorithm has its own congestion window update function, selected by the
consensus parameter 'cc_alg'.
For C Tor's current flow control, these functions are defined in sendme.c,
and are called by relay.c:
- sendme_note_circuit_data_packaged()
- sendme_circuit_data_received()
- sendme_circuit_consider_sending()
- sendme_process_circuit_level()
Despite the complexity of the following algorithms in their TCP
implementations, their Tor equivalents are extremely simple, each being
just a handful of lines of C. This simplicity is possible because Tor
does not have to deal with out-of-order delivery, packet drops,
duplicate packets, and other network issues at the circuit layer, due to
the fact that Tor circuits already have reliability and in-order
delivery at that layer.
We are also removing the aspects of TCP that cause the congestion
algorithm to reset into slow start after being idle for too long, or
after too many congestion signals. These are deliberate choices that
simplify the algorithms and also should provide better performance for
Tor workloads.
In all cases, variables in these sections are either consensus parameters
specified in [CONSENSUS_PARAMETERS], or scoped to the circuit. Consensus
parameters for congestion control are all prefixed by cc_. Everything else
is circuit-scoped.
3.1. Estimating Bandwidth-Delay Product [BDP_ESTIMATION]
At a high-level, there are three main ways to estimate the Bandwidth-Delay
Product: by using the current congestion window and RTT, by using the inflight
cells and RTT, and by measuring SENDME arrival rate.
All three estimators are updated every SENDME ack arrival.
The SENDME arrival rate is the most accurate way to estimate BDP, but it
requires averaging over multiple SENDME acks to do so. The congestion window
and inflight estimates rely on the congestion algorithm more or less correctly
tracking an approximation of the BDP, and then use current and minimum RTT to
compensate for overshoot.
The SENDME estimator tends to be accurate after ~3-5 SENDME acks. The cwnd and
inflight estimators tend to be accurate once the congestion window exceeds
BDP.
We specify all three because they are all useful in various cases. These cases
are broken up and combined to form the Piecewise BDP estimator.
3.1.1. SENDME arrival BDP estimation
It is possible to directly measure BDP via the amount of time between SENDME
acks. In this period of time, we know that the endpoint successfully received
'cc_sendme_inc' cells.
This means that the bandwidth of the circuit is then calculated as:
BWE = cc_sendme_inc/sendme_ack_timestamp_delta
The bandwidth delay product of the circuit is calculated by multiplying this
bandwidth estimate by the *minimum* RTT time of the circuit (to avoid counting
queue time):
BDP = BWE * RTT_min
In order to minimize the effects of ack compression (aka SENDME responses
becoming close to one another due to queue delay on the return), we
maintain a history of a full congestion window's worth of previous SENDME
timestamps.
With this, the calculation becomes:
BWE = (num_sendmes-1) * cc_sendme_inc / num_sendme_timestamp_delta
BDP = BWE * RTT_min
Note that because we are counting the number of cells *between* the first
and last sendme of the congestion window, we must subtract 1 from the number
of sendmes actually received. Over the time period between the first and last
sendme of the congestion window, the other endpoint successfully read
(num_sendmes-1) * cc_sendme_inc cells.
Furthermore, because the timestamps are microseconds, to avoid integer
truncation, we compute the BDP using multiplication first:
BDP = (num_sendmes-1) * cc_sendme_inc * RTT_min / num_sendme_timestamp_delta
After all of this, the BDP is smoothed using [N_EWMA_SMOOTHING].
This smoothing means that the SENDME BDP estimation will only work after two
(2) SENDME acks have been received. Additionally, it tends not to be stable
unless at least 'cc_bwe_min' SENDMEs are used, as governed by that consensus
parameter. Finally, if [CLOCK_HEURISTICS] have detected a clock jump or stall,
this estimator is not updated.
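A sketch of this estimator, assuming the caller keeps the last congestion
window's worth of SENDME arrival timestamps in microseconds; the function
name and return convention are illustrative, and the result would then be
fed into [N_EWMA_SMOOTHING]:

  def sendme_bdp_estimate(sendme_timestamps_usec, cc_sendme_inc, rtt_min_usec):
      num_sendmes = len(sendme_timestamps_usec)
      if num_sendmes < 2:
          return None  # need at least two SENDME acks, per the text above

      delta_usec = sendme_timestamps_usec[-1] - sendme_timestamps_usec[0]
      if delta_usec == 0:
          return None  # clock stall, per [CLOCK_HEURISTICS]

      # Multiply before dividing, to avoid integer truncation:
      # BDP = (num_sendmes-1) * cc_sendme_inc * RTT_min / timestamp_delta
      return (num_sendmes - 1) * cc_sendme_inc * rtt_min_usec // delta_usec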
If all edge connections no longer have data available to send on a circuit
and all circuit queues have drained without blocking the local orconn, we stop
updating this BDP estimate and discard old timestamps. However, we retain the
actual estimator value.
Unfortunately, even after all of this, SENDME BDP estimation proved unreliable
in Shadow simulation, due to ack compression.
3.1.2. Congestion Window BDP Estimation
Assuming that the current congestion window is at or above the current BDP,
the bandwidth estimate is the current congestion window size divided by the
RTT estimate:
BWE = cwnd / RTT_current_ewma
The BDP estimate is computed by multiplying the Bandwidth estimate by
the *minimum* circuit latency:
BDP = BWE * RTT_min
Simplifying:
BDP = cwnd * RTT_min / RTT_current_ewma
The net effect of this estimation is to correct for any overshoot of
the cwnd over the actual BDP. It will obviously underestimate BDP if cwnd
is below BDP.
3.1.3. Inflight BDP Estimation
Similar to the congestion window based estimation, the inflight estimation
uses the current inflight packet count to derive BDP. It also subtracts local
circuit queue use from the inflight packet count. This means it will be strictly
less than or equal to the cwnd version:
BDP = (inflight - circ.chan_cells.n) * RTT_min / RTT_current_ewma
If all edge connections no longer have data available to send on a circuit
and all circuit queues have drained without blocking the local orconn, we stop
updating this BDP estimate, because there are not sufficient inflight cells
to properly estimate BDP.
3.1.4. Piecewise BDP estimation
The piecewise BDP estimation is used to help respond more quickly in the event
the local OR connection is blocked, which indicates congestion somewhere along
the path from the client to the guard (or between Exit and Middle). In this
case, it takes the minimum of the inflight and SENDME estimators.
When the local OR connection is not blocked, this estimator uses the max of
the SENDME and cwnd estimator values.
When the SENDME estimator has not gathered enough data, or has cleared its
estimates based on lack of edge connection use, this estimator uses the
Congestion Window BDP estimator value.
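The piecewise combination can be sketched as follows. This is a minimal
sketch; representing a missing or cleared SENDME estimate as None is an
assumed convention, not the C Tor representation:

  def piecewise_bdp(bdp_sendme, bdp_cwnd, bdp_inflight, orconn_blocked):
      # SENDME estimator has no data (or was cleared): fall back to the
      # congestion window estimator.
      if bdp_sendme is None:
          return bdp_cwnd
      if orconn_blocked:
          # Congestion on the local leg: respond quickly with the smaller
          # of the inflight and SENDME estimates.
          return min(bdp_inflight, bdp_sendme)
      # Otherwise use the larger of the SENDME and cwnd estimates.
      return max(bdp_sendme, bdp_cwnd)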
3.2. Tor Westwood: TCP Westwood using RTT signaling [TOR_WESTWOOD]
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-westwood
http://nrlweb.cs.ucla.edu/nrlweb/publication/download/99/2001-mobicom-0.pdf
http://cpham.perso.univ-pau.fr/TCP/ccr_v31.pdf
https://c3lab.poliba.it/images/d/d7/Westwood_linux.pdf
Recall that TCP Westwood is basically TCP Reno, but it uses BDP estimates
for "Fast recovery" after a congestion signal arrives.
We will also be using the RTT congestion signal as per BOOTLEG_RTT_TOR
here, from the Options mail[1] and Defenestrator paper[3].
This system must keep track of RTT measurements per circuit: RTT_min, RTT_max,
and RTT_current. These are measured using the time delta between every
'cc_sendme_inc' relay cells and the SENDME response. The first RTT_min can be
measured arbitrarily, so long as it is larger than what we would get from
SENDME.
RTT_current is N-EWMA smoothed over 'cc_ewma_cwnd_pct' percent of
congestion windows worth of SENDME acks, up to a max of 'cc_ewma_max' acks, as
described in [N_EWMA_SMOOTHING].
Recall that BOOTLEG_RTT_TOR emits a congestion signal when the current
RTT rises above a threshold set at the fraction 'cc_westwood_rtt_thresh' of
the way between RTT_min and RTT_max. Equivalently, the congestion window may
grow only while:

    RTT_current < (1 - cc_westwood_rtt_thresh)*RTT_min
                   + cc_westwood_rtt_thresh*RTT_max
Additionally, if the local OR connection is blocked at the time of SENDME ack
arrival, this is treated as an immediate congestion signal.
(We can also optionally use the ECN signal described in
ideas/xxx-backward-ecn.txt, to exit Slow Start.)
Congestion signals from RTT, blocked OR connections, or ECN are processed only
once per congestion window. This is achieved through the next_cc_event flag,
which is initialized to a cwnd worth of SENDME acks, and is decremented
each ack. Congestion signals are only evaluated when it reaches 0.
Note that because the congestion signal threshold of TOR_WESTWOOD is a
function of RTT_max, and excessive queuing can cause an increase in RTT_max,
TOR_WESTWOOD may have runaway conditions. Additionally, if stream activity is
constant, but of a lower bandwidth than the circuit, this will not drive the
RTT upwards, and this can result in a congestion window that continues to
increase in the absence of any other concurrent activity.
Here is the complete congestion window algorithm for Tor Westwood. This will run
each time we get a SENDME (aka sendme_process_circuit_level()):
    # Update acked cells
    inflight -= cc_sendme_inc

    if next_cc_event:
      next_cc_event--

    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    if next_cc_event == 0:
      # BOOTLEG_RTT_TOR threshold; can also be BACKWARD_ECN check:
      if (RTT_current <
          (100 - cc_westwood_rtt_thresh)*RTT_min/100 +
           cc_westwood_rtt_thresh*RTT_max/100) and not orconn_blocked:
        if in_slow_start:
          cwnd += cwnd * cc_cwnd_inc_pct_ss        # Exponential growth
        else:
          cwnd = cwnd + cc_cwnd_inc                # Linear growth
      else:
        if cc_westwood_backoff_min:
          cwnd = min(cwnd * cc_westwood_cwnd_m, BDP)  # Window shrink
        else:
          cwnd = max(cwnd * cc_westwood_cwnd_m, BDP)  # Window shrink
        in_slow_start = 0

        # Back off RTT_max (in case of runaway RTT_max)
        RTT_max = RTT_min + cc_westwood_rtt_m * (RTT_max - RTT_min)

      cwnd = MAX(cwnd, cc_circwindow_min)
      next_cc_event = cwnd / (cc_cwnd_inc_rate * cc_sendme_inc)
3.3. Tor Vegas: TCP Vegas with Aggressive Slow Start [TOR_VEGAS]
http://intronetworks.cs.luc.edu/1/html/newtcps.html#tcp-vegas
http://pages.cs.wisc.edu/~akella/CS740/F08/740-Papers/BOP94.pdf
http://www.mathcs.richmond.edu/~lbarnett/cs332/assignments/brakmo_peterson_vegas.pdf
ftp://ftp.cs.princeton.edu/techreports/2000/628.pdf
The TCP Vegas control algorithm estimates the queue lengths at relays by
subtracting the current BDP estimate from the current congestion window.
Assuming the BDP estimate is accurate, any amount by which the congestion
window exceeds the BDP will cause data to queue.
Thus, Vegas estimates the queue use caused by congestion as:

    queue_use = cwnd - BDP
Original TCP Vegas used a cwnd BDP estimator only. We added the ability to
switch this BDP estimator in the implementation, and experimented with various
options. We also parameterized this queue_use calculation as a tunable
weighted average between the cwnd-based BDP estimate and the piecewise
estimate (consensus parameter 'cc_vegas_bdp_mix'). After much testing of
various ways to compute BDP, we were still unable to do much better than the
original cwnd estimator. So while this capability to change the BDP estimator
remains in the C implementation, we do not expect it to be used.
However, it was useful to use a local OR connection block at the time of
SENDME ack arrival, as an immediate congestion signal.
(As an additional optimization, we could also use the ECN signal described in
ideas/xxx-backward-ecn.txt, but this is not implemented. It is likely only of
any benefit during Slow Start, and even that benefit is likely small.)
Congestion signals from RTT, blocked OR connections, or ECN are processed only
once per congestion window. This is achieved through the next_cc_event flag,
which is initialized to a cwnd worth of SENDME acks, and is decremented
each ack. Congestion signals are only evaluated when it reaches 0.
Here is the complete pseudocode for TOR_VEGAS, which is run every time
an endpoint receives a SENDME ack:
    # Update acked cells
    inflight -= cc_sendme_inc

    if next_cc_event:
      next_cc_event--

    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    if next_cc_event == 0:
      if BDP > cwnd:
        queue_use = 0
      else:
        queue_use = cwnd - BDP

      if in_slow_start:
        if queue_use < cc_vegas_gamma and not orconn_blocked:
          # Increment by slow start %, or at least 2 sendme_inc's worth
          cwnd = cwnd + MAX(cwnd * cc_cwnd_inc_pct_ss, 2*cc_sendme_inc)
          # If our BDP estimator thinks the BDP is still larger, use that
          cwnd = MAX(cwnd, BDP)
        else:
          cwnd = BDP + cc_vegas_gamma
          in_slow_start = 0
      else:
        if queue_use > cc_vegas_delta:
          cwnd = BDP + cc_vegas_delta - cc_cwnd_inc
        elif queue_use > cc_vegas_beta or orconn_blocked:
          cwnd -= cc_cwnd_inc
        elif queue_use < cc_vegas_alpha:
          cwnd += cc_cwnd_inc

      cwnd = MAX(cwnd, cc_circwindow_min)

      # Count the number of sendme acks until next update of cwnd,
      # rounded to nearest integer
      if in_slow_start:
        next_cc_event = round(cwnd / cc_sendme_inc)
      else:
        # Never increment faster in slow start, only steady state.
        next_cc_event = round(cwnd / (cc_cwnd_inc_rate * cc_sendme_inc))
3.4. Tor NOLA: Direct BDP tracker [TOR_NOLA]
Based on the theory that congestion control should track the BDP,
the simplest possible congestion control algorithm could just set the
congestion window directly to its current BDP estimate, every SENDME ack.
Such an algorithm would need to overshoot the BDP slightly, especially in the
presence of competing algorithms. But other than that, it can be exceedingly
simple. Like Vegas, but without putting on airs. Just enough strung together.
After meditating on this for a while, it also occurred to me that no one has
named a congestion control algorithm after New Orleans. We have Reno, Vegas,
and scores of others. What's up with that?
Here's the pseudocode for TOR_NOLA that runs on every SENDME ack:
    # Do not update anything if we detected a clock stall or jump,
    # as per [CLOCK_HEURISTICS]
    if clock_stalled_or_jumped:
      return

    # If the orconn is blocked, do not overshoot BDP
    if orconn_blocked:
      cwnd = BDP
    else:
      cwnd = BDP + cc_nola_overshoot

    cwnd = MAX(cwnd, cc_circwindow_min)
4. Flow Control [FLOW_CONTROL]
Flow control provides what is known as "pushback" -- the property that
if one endpoint stops reading data, the other endpoint stops sending
data. This prevents data from accumulating at points in the network, if
it is not being read fast enough by an application.
Because Tor must multiplex many streams onto one circuit, and each
stream is mapped to another TCP socket, Tor's current pushback is rather
complicated and under-specified. In C Tor, it is implemented in the
following functions:
- circuit_consider_stop_edge_reading()
- connection_edge_package_raw_inbuf()
- circuit_resume_edge_reading()
The decision on when a stream is blocked is performed in:
- sendme_note_stream_data_packaged()
- sendme_stream_data_received()
- sendme_connection_edge_consider_sending()
- sendme_process_stream_level()
Tor currently maintains separate windows for each stream on a circuit,
to provide individual stream flow control. Circuit windows are SENDME
acked as soon as a relay data cell is decrypted and recognized. Stream
windows are only SENDME acked if the data can be delivered to an active
edge connection. This allows the circuit to continue to operate if an
endpoint refuses to read data off of one of the streams on the circuit.
Because Tor streams can connect to many different applications and
endpoints per circuit, it is important to preserve the property that if
only one endpoint edge connection is inactive, it does not stall the
whole circuit, in case one of those endpoints is malfunctioning or
malicious.
However, window-based stream flow control also imposes a speed limit on
individual streams. If the stream window size is below the circuit
congestion window size, then it becomes the speed limit of a download,
as we saw in the [MOTIVATION] section of this proposal.
So for performance, it is optimal that each stream window is the same
size as the circuit's congestion window. However, large stream windows
are a vector for OOM attacks, because malicious clients can force Exits
to buffer a full stream window for each stream while connecting to a
malicious site and uploading data that the site does not read from its
socket. This attack is significantly easier to perform at the stream
level than on the circuit level, because of the multiplier effects of
only needing to establish a single fast circuit to perform the attack on
a very large number of streams.
This catch-22 means that if we use windows for stream flow control, we
either have to commit to allocating a full congestion window's worth of
memory for each stream, or impose a speed limit on our streams.
Hence, we will discard stream windows entirely, and instead use a
simpler buffer-based design that uses XON/XOFF to signal when this
buffer is too large. Additionally, the XON cell will contain advisory
rate information based on the rate at which that edge connection can
write data while it has data to write. The other endpoint can rate limit
sending data for that stream to the rate advertised in the XON, to avoid
excessive XON/XOFF chatter and sub-optimal behavior.
This will allow us to make full use of the circuit congestion window for
every stream in combination, while still avoiding buffer buildup inside
the network.
4.1. Stream Flow Control Without Windows [WINDOWLESS_FLOW]
Each endpoint (client, Exit, or onion service) sends circuit-level
SENDME acks for all circuit cells as soon as they are decrypted and
recognized, but *before* delivery to their edge connections.
This means that if the edge connection is blocked because an
application's SOCKS connection or a destination site's TCP connection is
not reading, data will build up in a queue at that endpoint,
specifically in the edge connection's outbuf.
Consensus parameters will govern the length of this queue that
determines when XON and XOFF cells are sent, as well as when advisory
XON cells that contain rate information can be sent. These parameters
are separate for the queue lengths of exits, and of clients/services.
(Because clients and services will typically have localhost connections
for their edges, they will need similar buffering limits. Exits may have
different properties, since their edges will be remote.)
The trunnel relay cell payload definitions for XON and XOFF are:
   struct xoff_cell {
     u8 version IN [0x00];
   }

   struct xon_cell {
     u8 version IN [0x00];
     u32 kbps_ewma;
   }
4.1.1. XON/XOFF behavior
If the length of an edge outbuf queue exceeds the size provided in the
appropriate client or exit XOFF consensus parameter, a
RELAY_COMMAND_STREAM_XOFF will be sent, which instructs the other endpoint to
stop sending from that edge connection.
Once the queue is expected to empty, a RELAY_COMMAND_STREAM_XON will be sent,
which allows the other end to resume reading on that edge connection. This XON
also indicates the average rate of queue drain since the XOFF.
Advisory XON cells are also sent whenever the edge connection's drain
rate changes by more than 'cc_xon_change_pct' percent compared to
the previously sent XON cell's value.
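A sketch of the per-stream decision, using a fully drained outbuf as the
stand-in for "expected to empty" and ignoring the rate limits from
[XON_ADVISORY]; the function name, arguments, and return convention are all
illustrative:

  def xon_xoff_action(outbuf_len, xoff_sent, drain_rate_kbps,
                      last_xon_rate_kbps, xoff_limit_bytes, cc_xon_change_pct):
      # Queue too long: tell the other endpoint to stop sending.
      if not xoff_sent and outbuf_len > xoff_limit_bytes:
          return ("XOFF", None)
      # Queue drained: allow sending again, advertising the drain rate.
      if xoff_sent and outbuf_len == 0:
          return ("XON", drain_rate_kbps)
      # Advisory XON: drain rate moved by more than cc_xon_change_pct percent.
      if last_xon_rate_kbps:
          change_pct = (abs(drain_rate_kbps - last_xon_rate_kbps) * 100
                        // last_xon_rate_kbps)
          if change_pct > cc_xon_change_pct:
              return ("XON", drain_rate_kbps)
      return (None, None)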
4.1.2. Edge bandwidth rate advertisement [XON_ADVISORY]
As noted above, the XON cell provides a field to indicate the N_EWMA rate at
which edge connections drain their outgoing buffers.
To compute the drain rate, we maintain a timestamp and a byte count of how many
bytes were written onto the socket from the connection outbuf.
In order to measure the drain rate of a connection, we need to measure the time
it took between flushing N bytes on the socket and when the socket is available
for writing again. In other words, we are measuring the time it took for the
kernel to send N bytes between the first flush on the socket and the next
poll() write event.
For example, let's say we just wrote 100 bytes on the socket at time t = 0s,
and at time t = 2s the socket becomes writable again. We then estimate the
rate of the socket to be 100 bytes / 2s, i.e. 50 bytes/sec.
To make this measurement, we start the timer by recording a timestamp as soon
as data begins to accumulate in an edge connection's outbuf, currently 16KB (32
cells). We use this value for now because Tor writes up to 32 cells at once to
a connection outbuf, so this burst of data is an indicator that bytes are
starting to accumulate.
After 'cc_xon_rate' cells worth of stream data, we use N_EWMA to average this
rate into a running EWMA average, with N specified by consensus parameter
'cc_xon_ewma_cnt'. Every EWMA update, the byte count is set to 0 and a new
timestamp is recorded. In this way, the EWMA counter is averaging N counts of
'cc_xon_rate' cells worth of bytes each.
If the buffers are non-zero, and we have sent an XON before, and the N_EWMA
rate has changed by more than 'cc_xon_change_pct' since the last XON, we send
an updated rate. Because the EWMA rate is only updated every 'cc_xon_rate'
cells worth of bytes, such advisory XON updates cannot be sent more frequently
than this, and should be sent much less often in practice.
When the outbuf completely drains to 0, and has been 0 for 'cc_xon_rate' cells
worth of data, we double the EWMA rate. We continue to double it while the
outbuf is 0, every 'cc_xon_rate' cells. The measurement timestamp is also set
back to 0.
When an XOFF is sent, the EWMA rate is reset to 0, to allow fresh calculation
upon drain.
If a clock stall or jump is detected by [CLOCK_HEURISTICS], we also
clear the measurement timestamp and byte count, but do not fold that
measurement into the EWMA.
NOTE: Because our timestamps are microseconds, we chose to compute and
transmit both of these rates as 1000 byte/sec units, as this reduces the
number of multiplications and divisions and avoids precision loss.
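For illustration, folding one measurement interval into the advertised rate
might look like the sketch below, keeping rates in 1000 byte/sec units as
noted above. The function name and signature are assumptions; the doubling on
an empty outbuf and the reset on XOFF are handled by the caller:

  def update_drain_rate_ewma(bytes_flushed, elapsed_usec, prev_ewma_kbps,
                             cc_xon_ewma_cnt):
      if elapsed_usec == 0:
          return prev_ewma_kbps  # clock stall: skip, per [CLOCK_HEURISTICS]

      # bytes/usec * 1e6 = bytes/sec; divide by 1000 for 1000-byte/sec units.
      rate_kbps = bytes_flushed * 1000 // elapsed_usec

      if prev_ewma_kbps is None:
          return rate_kbps
      # Same N_EWMA form as [N_EWMA_SMOOTHING], with N = cc_xon_ewma_cnt.
      n = cc_xon_ewma_cnt
      return (rate_kbps * 2 + prev_ewma_kbps * (n - 1)) // (n + 1)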
4.1.3. Oomkiller behavior
A malicious client can attempt to exhaust memory in an Exit's outbufs, by
ignoring XOFF and advisory XONs. Implementations MAY choose to close specific
streams with outbufs that grow too large, but since the exit does not know
with certainty the client's congestion window, it is non-trivial to determine
the exact upper limit a well-behaved client might send on a blocked stream.
Implementations MUST close the streams with the oldest chunks present in their
outbufs, while under global memory pressure, until memory pressure is
relieved.
4.1.4. Sidechannel mitigation
In order to mitigate DropMark attacks[28], both XOFF and advisory XON
transmission must be restricted. Because DropMark attacks are most severe
before data is sent, clients MUST ensure that an XOFF does not arrive before
it has sent the appropriate XOFF limit of bytes on a stream ('cc_xoff_exit'
for exits, 'cc_xoff_client' for onions).
Clients also SHOULD ensure that advisory XONs do not arrive before the
minimum of the XOFF limit and 'cc_xon_rate' full cells worth of bytes have
been transmitted.
Clients SHOULD ensure that advisory XONs do not arrive more frequently than
every 'cc_xon_rate' cells worth of sent data. Clients also SHOULD ensure that
XOFFs do not arrive more frequently than every XOFF limit worth of sent data.
Implementations SHOULD close the circuit if these limits are violated on the
client-side, to detect and resist dropmark attacks[28].
Additionally, because edges no longer use stream SENDME windows, we alter the
half-closed connection handling to be time based instead of data quantity
based. Half-closed connections are allowed to receive data up to the larger
value of the congestion control max_rtt field or the circuit build timeout
(for onion service circuits, we use twice the circuit build timeout). Any data
or relay cells after this point are considered invalid data on the circuit.
Recall that all of the dropped cell enforcement in C-Tor is performed by
accounting data provided through the control port CIRC_BW fields, currently
enforced only by using the vanguards addon[29].
The C-Tor implementation exposes all of these properties to CIRC_BW for
vanguards to enforce, but does not enforce them itself. So violations of any
of these limits do not cause circuit closure unless that addon is used (as
with the rest of the dropped cell side channel handling in C-Tor).
5. System Interactions [SYSTEM_INTERACTIONS]
Tor's circuit-level SENDME system currently has special cases in the
following situations: Intropoints, HSDirs, onion services, and circuit
padding. Additionally, proper congestion control will allow us to very
easily implement conflux (circuit traffic splitting).
This section details those special cases and interactions of congestion
control with other components of Tor.
5.1. HSDirs
Because HSDirs use the tunneled dirconn mechanism and thus also use
RELAY_COMMAND_DATA, they are already subject to Tor's flow control.
We may want to make sure the initial circuit window for HSDir circuits
is sized appropriately for those circuit types, so that a SENDME is not
required to fetch long descriptors. This will ensure HSDir descriptors can be
fetched in one RTT.
5.2. Introduction Points
Introduction Points are not currently subject to any flow control.
Because Intropoints accept INTRODUCE1 cells from many client circuits
and then relay them down a single circuit to the service as INTRODUCE2
cells, we cannot provide end-to-end congestion control all the way from
client to service for these cells.
We can run congestion control from the service to the Intropoint, and probably
should, since this is already subject to congestion control.
As an optimization, if that congestion window reaches zero (because the
service is overwhelmed), then we start sending NACKS back to the clients (or
begin requiring proof-of-work), rather than just let clients wait for timeout.
5.3. Rendezvous Points
Rendezvous points are already subject to end-to-end SENDME control,
because all relay cells are sent end-to-end via the rendezvous circuit
splice in circuit_receive_relay_cell().
This means that rendezvous circuits will use end-to-end congestion
control, as soon as individual onion clients and onion services upgrade
to support it. There is no need for intermediate relays to upgrade at
all.
5.4. Circuit Padding
Recall that circuit padding is negotiated between a client and a middle
relay, with one or more state machines running on circuits at the middle
relay that decide when to add padding.
https://github.com/torproject/tor/blob/master/doc/HACKING/CircuitPaddingDevelopment.md
This means that the middle relay can send padding traffic towards the
client that contributes to congestion, and the client may also send
padding towards the middle relay, which also creates congestion.
For low-traffic padding machines, such as the currently deployed circuit
setup obfuscation, this padding is inconsequential.
However, higher traffic circuit padding machines that are designed to
defend against website traffic fingerprinting will need additional care
to avoid inducing additional congestion, especially after the client or
the exit experiences a congestion signal.
The current overhead percentage rate limiting features of the circuit
padding system should handle this in some cases, but in other cases, an
XON/XOFF circuit padding flow control command may be required, so that
clients may signal to the machine that congestion is occurring.
5.5. Conflux
Conflux (aka multi-circuit traffic splitting) becomes significantly
easier to implement once we have congestion control. However, much like
congestion control, it will require experimentation to tune properly.
Recall that Conflux uses a 256-bit UUID to bind two circuits together at
the Exit or onion service. The original Conflux paper specified an
equation based on RTT to choose which circuit to send cells on.
https://www.cypherpunks.ca/~iang/pubs/conflux-pets.pdf
However, with congestion control, we will already know which circuit has
the larger congestion window, and thus has the most available cells in
its current congestion window. This will also be the faster circuit.
Thus, the decision of which circuit to send a cell on only requires
comparing congestion windows (and choosing the circuit with more packets
remaining in its window).
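In other words, the circuit choice reduces to something like the following
sketch, assuming each circuit object exposes its current 'cwnd' and
'inflight' counts (attribute names are illustrative):

  def choose_conflux_circuit(circuits):
      # Pick the circuit with the most room left in its congestion window.
      return max(circuits, key=lambda c: c.cwnd - c.inflight)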
Conflux will require sequence numbers on data cells, to ensure that the
two circuits' data is properly re-assembled. The resulting out-of-order
buffer can potentially be as large as an entire congestion window, if
the circuits are very desynced (or one of them closes). It will be very
expensive for Exits to maintain this much memory, and exposes them to
OOM attacks.
This is not as much of a concern in the client download direction, since
clients will typically only have a small number of these out-of-order
buffers to keep around. But for the upload direction, Exits will need
to send some form of early XOFF on the faster circuit if this
out-of-order buffer begins to grow too large, since simply halting the
delivery of SENDMEs will still allow a full congestion window full of
data to arrive. This will also require tuning and experimentation, and
optimum results will vary between simulator and live network.
6. Performance Evaluation [EVALUATION]
Congestion control for Tor will be easy to implement, but difficult to
tune to ensure optimal behavior.
6.1. Congestion Signal Experiments
Our first experiments were to conduct client-side experiments to
determine how stable the RTT measurements of circuits are across the
live Tor network, to determine if we need more frequent SENDMEs, and/or
need to use any RTT smoothing or averaging.
These experiments were performed using onion service clients and services on
the live Tor network. From these experiments, we tuned the RTT and BDP
estimators, and arrived at reasonable values for EWMA smoothing and the
minimum number of SENDME acks required to estimate BDP.
Additionally, we specified that the algorithms maintain previous congestion
window estimates in the event that a circuit goes idle, rather than revert to
slow start. We experimented with intermittent idle/active live onion clients
to make sure that this behavior is acceptable, and it appeared to be.
In Shadow experimentation, the primary thing to test will be if the OR conn on
Exit relays blocks too frequently when under load, thus causing excessive
congestion signals, and overuse of the Inflight BDP estimator as opposed
to SENDME or CWND BDP. It may also be the case that this behavior is optimal,
even if it does happen.
Finally, we should check small variations in the EWMA smoothing and minimum BDP ack
counts in Shadow experimentation, to check for high variability in these
estimates, and other surprises.
6.2. Congestion Algorithm Experiments
In order to evaluate performance of congestion control algorithms, we will
need to implement [TOR_WESTWOOD], [TOR_VEGAS], and [TOR_NOLA]. We will need to
simulate their use in the Shadow Tor network simulator.
Simulation runs will need to evaluate performance on networks that use
only one algorithm, as well as on networks that run a combination of
algorithms - particularly each type of congestion control in combination
with Tor's current flow control. Depending upon the number of current
flow control clients, more aggressive parameters of these algorithms may
need to be set, but this will result in additional queueing as well as
sub-optimal behavior once all clients upgrade.
In particular, during live onion service testing, we noticed that these
algorithms required particularly aggressive default values to compete against
a network full of current clients. As more clients upgrade, we may be able
to lower these defaults. We should get a good idea of what values we can
choose at what upgrade point, from mixed Shadow simulation.
If Tor's current flow control is so aggressive that it causes problems with
any amount of remaining old clients, we can experiment with kneecapping these
legacy flow control Tor clients by setting a low 'circwindow' consensus
parameter for them. This will allow us to set more reasonable parameter
values, without waiting for all clients to upgrade.
Because custom congestion control can be deployed by any Exit or onion
service that desires better service, we will need to be particularly careful
about how congestion control algorithms interact with rogue implementations
that more aggressively increase their window sizes. During these
adversarial-style experiments, we must verify that cheaters do not get