linux-hardened/net/sched
Daniel Borkmann 1f211a1b92 net, sched: add clsact qdisc
This work adds a generalization of the ingress qdisc as a qdisc holding
only classifiers. The clsact qdisc works on ingress, but also on egress.
In both cases, it's execution happens without taking the qdisc lock, and
the main difference for the egress part compared to prior version of [1]
is that this can be applied with _any_ underlying real egress qdisc (also
classless ones).

Besides solving the use-case of [1], that is, allowing for more programmability
on assigning skb->priority for the mqprio case that is supported by most
popular 10G+ NICs, it also opens up a lot more flexibility for other tc
applications. The main work on classification can already be done at clsact
egress time if the use-case allows and state stored for later retrieval
f.e. again in skb->priority with major/minors (which is checked by most
classful qdiscs before consulting tc_classify()) and/or in other skb fields
like skb->tc_index for some light-weight post-processing to get to the
eventual classid in case of a classful qdisc. Another use case is that
the clsact egress part allows to have a central egress counterpart to
the ingress classifiers, so that classifiers can easily share state (e.g.
in cls_bpf via eBPF maps) for ingress and egress.

Currently, default setups like mq + pfifo_fast would require for this to
use, for example, prio qdisc instead (to get a tc_classify() run) and to
duplicate the egress classifier for each queue. With clsact, it allows
for leaving the setup as is, it can additionally assign skb->priority to
put the skb in one of pfifo_fast's bands and it can share state with maps.
Moreover, we can access the skb's dst entry (f.e. to retrieve tclassid)
w/o the need to perform a skb_dst_force() to hold on to it any longer. In
lwt case, we can also use this facility to setup dst metadata via cls_bpf
(bpf_skb_set_tunnel_key()) without needing a real egress qdisc just for
that (case of IFF_NO_QUEUE devices, for example).

The realization can be done without any changes to the scheduler core
framework. All it takes is that we have two a-priori defined minors/child
classes, where we can mux between ingress and egress classifier list
(dev->ingress_cl_list and dev->egress_cl_list, latter stored close to
dev->_tx to avoid extra cacheline miss for moderate loads). The egress
part is a bit similar modelled to handle_ing() and patched to a noop in
case the functionality is not used. Both handlers are now called
sch_handle_ingress() and sch_handle_egress(), code sharing among the two
doesn't seem practical as there are various minor differences in both
paths, so that making them conditional in a single handler would rather
slow things down.

Full compatibility to ingress qdisc is provided as well. Since both
piggyback on TC_H_CLSACT, only one of them (ingress/clsact) can exist
per netdevice, and thus ingress qdisc specific behaviour can be retained
for user space. This means, either a user does 'tc qdisc add dev foo ingress'
and configures ingress qdisc as usual, or the 'tc qdisc add dev foo clsact'
alternative, where both, ingress and egress classifier can be configured
as in the below example. ingress qdisc supports attaching classifier to any
minor number whereas clsact has two fixed minors for muxing between the
lists, therefore to not break user space setups, they are better done as
two separate qdiscs.

I decided to extend the sch_ingress module with clsact functionality so
that commonly used code can be reused, the module is being aliased with
sch_clsact so that it can be auto-loaded properly. Alternative would have been
to add a flag when initializing ingress to alter its behaviour plus aliasing
to a different name (as it's more than just ingress). However, the first would
end up, based on the flag, choosing the new/old behaviour by calling different
function implementations to handle each anyway, the latter would require to
register ingress qdisc once again under different alias. So, this really begs
to provide a minimal, cleaner approach to have Qdisc_ops and Qdisc_class_ops
by its own that share callbacks used by both.

Example, adding qdisc:

   # tc qdisc add dev foo clsact
   # tc qdisc show dev foo
   qdisc mq 0: root
   qdisc pfifo_fast 0: parent :1 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :3 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc pfifo_fast 0: parent :4 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
   qdisc clsact ffff: parent ffff:fff1

Adding filters (deleting, etc works analogous by specifying ingress/egress):

   # tc filter add dev foo ingress bpf da obj bar.o sec ingress
   # tc filter add dev foo egress  bpf da obj bar.o sec egress
   # tc filter show dev foo ingress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[ingress] direct-action
   # tc filter show dev foo egress
   filter protocol all pref 49152 bpf
   filter protocol all pref 49152 bpf handle 0x1 bar.o:[egress] direct-action

A 'tc filter show dev foo' or 'tc filter show dev foo parent ffff:' will
show an empty list for clsact. Either using the parent names (ingress/egress)
or specifying the full major/minor will then show the related filter lists.

Prior work on a mqprio prequeue() facility [1] was done mainly by John Fastabend.

  [1] http://patchwork.ozlabs.org/patch/512949/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-10 22:13:15 -05:00
..
act_api.c net_sched: make tcf_hash_destroy() static 2015-08-26 11:01:44 -07:00
act_bpf.c bpf: add bpf_redirect() helper 2015-09-17 21:09:07 -07:00
act_connmark.c netfilter: nf_conntrack: Add a struct net parameter to l4_pkt_to_tuple 2015-09-18 22:00:04 +02:00
act_csum.c net: sched: add percpu stats to actions 2015-07-08 13:50:41 -07:00
act_gact.c net_sched: act_gact: remove spinlock in fast path 2015-07-08 13:50:42 -07:00
act_ipt.c netfilter: x_tables: Pass struct net in xt_action_param 2015-09-18 21:58:14 +02:00
act_mirred.c act_mirred: clear sender cpu before sending to tx 2015-10-08 04:59:04 -07:00
act_nat.c net: Change pseudohdr argument of inet_proto_csum_replace* to be a bool 2015-08-17 21:33:06 -07:00
act_pedit.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-07-31 23:52:20 -07:00
act_police.c sched: fix act file names in header comment 2014-11-06 15:04:41 -05:00
act_simple.c net: sched: add percpu stats to actions 2015-07-08 13:50:41 -07:00
act_skbedit.c net: sched: add percpu stats to actions 2015-07-08 13:50:41 -07:00
act_vlan.c net: sched: add percpu stats to actions 2015-07-08 13:50:41 -07:00
cls_api.c net: sched: fix call_rcu() race on classifier module unloads 2015-05-21 18:48:18 -04:00
cls_basic.c net_sched: destroy proto tp when all filters are gone 2015-03-09 15:35:55 -04:00
cls_bpf.c net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
cls_cgroup.c cls_cgroup: factor out classid retrieval 2015-07-20 12:41:30 -07:00
cls_flow.c sched: cls_flow: use skb_to_full_sk() helper 2015-11-08 20:56:39 -05:00
cls_flower.c flow_dissector: Add flags argument to skb_flow_dissector functions 2015-09-01 15:06:22 -07:00
cls_fw.c net: revert "net_sched: move tp->root allocation into fw_init()" 2015-09-24 14:33:30 -07:00
cls_route.c net_sched: destroy proto tp when all filters are gone 2015-03-09 15:35:55 -04:00
cls_rsvp.c
cls_rsvp.h net_sched: convert rsvp to call tcf_exts_destroy from rcu callback 2015-08-26 11:01:45 -07:00
cls_rsvp6.c
cls_tcindex.c net_sched: convert tcindex to call tcf_exts_destroy from rcu callback 2015-08-26 11:01:44 -07:00
cls_u32.c cls_u32: complete the check for non-forced case in u32_destroy() 2015-08-25 17:02:48 -07:00
em_canid.c net: sched: remove tcf_proto from ematch calls 2014-10-06 18:02:32 -04:00
em_cmp.c
em_ipset.c netfilter: x_tables: Pass struct net in xt_action_param 2015-09-18 21:58:14 +02:00
em_meta.c net_sched: em_meta: use skb_to_full_sk() helper 2015-11-08 20:56:39 -05:00
em_nbyte.c net: sched: remove tcf_proto from ematch calls 2014-10-06 18:02:32 -04:00
em_text.c net: Remove state argument from skb_find_text() 2015-02-22 15:59:54 -05:00
em_u32.c
ematch.c ematch: Fix auto-loading of ematch modules. 2015-02-20 15:30:56 -05:00
Kconfig net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
Makefile tc: introduce Flower classifier 2015-05-13 15:19:48 -04:00
sch_api.c net_sched: make qdisc_tree_decrease_qlen() work for non mq 2015-12-15 13:41:52 -05:00
sch_atm.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_blackhole.c net/sched: make sch_blackhole.c explicitly non-modular 2015-10-09 07:52:28 -07:00
sch_cbq.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_choke.c net: sched: kill dead code in sch_choke.c 2015-11-03 13:30:47 -05:00
sch_codel.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2015-05-13 14:31:43 -04:00
sch_drr.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_dsmark.c sch_dsmark: improve memory locality 2015-09-17 22:37:19 -07:00
sch_fifo.c net: sched: drop all special handling of tx_queue_len == 0 2015-08-18 11:55:08 -07:00
sch_fq.c net: synack packets can be attached to request sockets 2015-10-11 05:05:06 -07:00
sch_fq_codel.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_generic.c net: sched: fix missing free per cpu on qstats 2016-01-06 01:40:21 -05:00
sch_gred.c net: sched: drop all special handling of tx_queue_len == 0 2015-08-18 11:55:08 -07:00
sch_hfsc.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_hhf.c sch_hhf: fix return value of hhf_drop() 2015-10-11 04:49:33 -07:00
sch_htb.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_ingress.c net, sched: add clsact qdisc 2016-01-10 22:13:15 -05:00
sch_mq.c net_sched: fix qdisc_tree_decrease_qlen() races 2015-12-03 14:59:05 -05:00
sch_mqprio.c net_sched: fix qdisc_tree_decrease_qlen() races 2015-12-03 14:59:05 -05:00
sch_multiq.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_netem.c net: sched: deprecate enqueue_root() 2015-05-11 14:17:32 -04:00
sch_pie.c sch_pie: schedule the timer after all init succeed 2014-10-29 14:28:01 -04:00
sch_plug.c net: sched: drop all special handling of tx_queue_len == 0 2015-08-18 11:55:08 -07:00
sch_prio.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_qfq.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_red.c net: sched: implement qstat helper routines 2014-09-30 01:02:26 -04:00
sch_sfb.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_sfq.c net: sched: consolidate tc_classify{,_compat} 2015-08-27 14:18:48 -07:00
sch_tbf.c net: sched: avoid costly atomic operation in fq_dequeue() 2014-10-06 00:55:10 -04:00
sch_teql.c net: sched: fix skb->protocol use in case of accelerated vlan path 2015-01-13 17:51:08 -05:00