linux-hardened/net
Eric Dumazet afe4fd0624 pkt_sched: fq: Fair Queue packet scheduler
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel)
- Uses the new_flow/old_flow separation from FQ_codel
- New flows get an initial credit allowing IW10 without added delay.
- Special FIFO queue for high prio packets (no need for PRIO + FQ)
- Uses a hash table of RB trees to locate the flows at enqueue() time
- Smart on demand gc (at enqueue() time, RB tree lookup evicts old
  unused flows)
- Dynamic memory allocations.
- Designed to allow millions of concurrent flows per Qdisc.
- Small memory footprint : ~8K per Qdisc, and 104 bytes per flow.
- Single high resolution timer for throttled flows (if any).
- One RB tree to link throttled flows.
- Ability to have a max rate per flow. We might add a socket option
  to add per socket limitation.

Attempts have been made to add TCP pacing in TCP stack, but this
seems to add complex code to an already complex stack.

TCP pacing is welcomed for flows having idle times, as the cwnd
permits TCP stack to queue a possibly large number of packets.

This removes the 'slow start after idle' choice, hitting badly
large BDP flows, and applications delivering chunks of data
as video streams.

Nicely spaced packets :
Here interface is 10Gbit, but flow bottleneck is ~20Mbit

cwin is big, yet FQ avoids the typical bursts generated by TCP
(as in netperf TCP_RR -- -r 100000,100000)

15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115>
15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115>
15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115>
15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115>
15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115>
15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115>
15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115>
15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805>
15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115>
15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>
15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115>

TCP timestamps show that most packets from B were queued in the same ms
timeframe (TSval 1159799{3,4}), but FQ managed to send them right
in time to avoid a big burst.

In slow start or steady state, very few packets are throttled [1]

FQ gets a bunch of tunables as :

  limit : max number of packets on whole Qdisc (default 10000)

  flow_limit : max number of packets per flow (default 100)

  quantum : the credit per RR round (default is 2 MTU)

  initial_quantum : initial credit for new flows (default is 10 MTU)

  maxrate : max per flow rate (default : unlimited)

  buckets : number of RB trees (default : 1024) in hash table.
               (consumes 8 bytes per bucket)

  [no]pacing : disable/enable pacing (default is enable)

All of them can be changed on a live qdisc.

$ tc qd add dev eth0 root fq help
Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ]
              [ quantum BYTES ] [ initial_quantum BYTES ]
              [ maxrate RATE  ] [ buckets NUMBER ]
              [ [no]pacing ]

$ tc -s -d qd
qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140
 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14)
 backlog 0b 0p requeues 14
  511 flows, 511 inactive, 0 throttled
  110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit

[1] Except if initial srtt is overestimated, as if using
cached srtt in tcp metrics. We'll provide a fix for this issue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29 21:38:31 -04:00
..
9p 9p: client: remove unused code and any reference to "cancelled" function 2013-07-30 15:54:28 -07:00
802 net/802/mrp: fix lockdep splat 2013-05-14 13:02:30 -07:00
8021q Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
appletalk net: proc_fs: trivial: print UIDs as unsigned int 2013-08-15 14:37:46 -07:00
atm net: always pass struct netdev_notifier_info to netdevice notifiers 2013-05-28 21:58:54 -07:00
ax25 net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
batman-adv batman-adv: send GW_DEL event when the gw client mode is deselected 2013-08-28 11:33:00 +02:00
bluetooth Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth 2013-07-31 15:11:50 -04:00
bridge bridge: inherit slave devices needed_headroom 2013-08-29 15:17:09 -04:00
caif net: pass info struct via netdevice notifier 2013-05-28 13:11:01 -07:00
can net: pass info struct via netdevice notifier 2013-05-28 13:11:01 -07:00
ceph net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
core net: add netdev_upper_get_next_dev_rcu(dev, iter) 2013-08-29 16:19:42 -04:00
dcb rtnetlink: Remove passing of attributes into rtnl_doit functions 2013-03-22 10:31:16 -04:00
dccp net: add sk_stream_is_writeable() helper 2013-07-24 17:54:48 -07:00
decnet net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
dns_resolver net: strict_strtoul is obsolete, use kstrtoul instead 2013-07-12 16:09:14 -07:00
dsa dsa: fix freeing of sparse port allocation 2013-03-25 12:23:41 -04:00
ethernet net: Fix sysfs_format_mac() code duplication. 2013-07-16 17:09:22 -07:00
ieee802154 6lowpan: handle context based source address 2013-08-20 13:23:12 -07:00
ipv4 tcp: TSO packets automatic sizing 2013-08-29 15:50:06 -04:00
ipv6 ipv6: drop fragmented ndisc packets by default (RFC 6980) 2013-08-29 15:32:08 -04:00
ipx net: proc_fs: trivial: print UIDs as unsigned int 2013-08-15 14:37:46 -07:00
irda net/irda: fixed style issues in irttp 2013-07-19 17:34:40 -07:00
iucv net: delete __cpuinit usage from all net files 2013-07-14 19:36:58 -04:00
key xfrm: Remove rebundant address family checking 2013-08-07 10:12:58 +02:00
l2tp l2tp: make datapath resilient to packet loss when sequence numbers enabled 2013-07-02 16:33:25 -07:00
lapb
llc net: proc_fs: trivial: print UIDs as unsigned int 2013-08-15 14:37:46 -07:00
mac80211 Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2013-08-09 15:08:10 -04:00
mac802154 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-04-30 03:55:20 -04:00
mpls MPLS: Add limited GSO support 2013-05-27 22:50:59 -07:00
netfilter netfilter: ctnetlink: fix uninitialized variable 2013-08-28 00:28:19 +02:00
netlabel netlabel: use domain based selectors when address based selectors are not available 2013-08-02 16:57:01 -07:00
netlink Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-26 16:37:08 -04:00
netrom net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
nfc NFC: netlink: Rename CMD_FW_UPLOAD to CMD_FW_DOWNLOAD 2013-07-31 01:19:43 +02:00
openvswitch openvswitch: optimize flow compare and mask functions 2013-08-27 13:13:09 -07:00
packet net: packet: use reciprocal_divide in fanout_demux_hash 2013-08-29 16:43:29 -04:00
phonet net: proc_fs: trivial: print UIDs as unsigned int 2013-08-15 14:37:46 -07:00
rds net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
rfkill Merge branch 'for-john' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next 2013-04-22 14:58:14 -04:00
rose net: Convert uses of typedef ctl_table to struct ctl_table 2013-06-13 02:36:09 -07:00
rxrpc
sched pkt_sched: fq: Fair Queue packet scheduler 2013-08-29 21:38:31 -04:00
sctp net: sctp: sctp_verify_init: clean up mandatory checks and add comment 2013-08-29 15:54:48 -04:00
sunrpc Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
tipc tipc: avoid possible deadlock while enable and disable bearer 2013-08-11 21:58:41 -07:00
unix af_unix: fix bug on large send() 2013-08-11 22:02:36 -07:00
vmw_vsock Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-16 15:37:26 -07:00
wimax
wireless Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-26 16:37:08 -04:00
x25 x25: Fix broken locking in ioctl error paths. 2013-07-01 18:15:25 -07:00
xfrm xfrm: remove irrelevant comment in xfrm_input(). 2013-08-19 12:45:16 +02:00
compat.c net: Unbreak compat_sys_{send,recv}msg 2013-06-06 11:52:14 -07:00
Kconfig Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2013-08-03 21:36:46 -07:00
Makefile MPLS: Add limited GSO support 2013-05-27 22:50:59 -07:00
nonet.c
socket.c net: rename CONFIG_NET_LL_RX_POLL to CONFIG_NET_RX_BUSY_POLL 2013-08-01 15:11:17 -07:00
sysctl_net.c