linux-hardened/net/core
Martin KaFai Lau 2dbb9b9e6d bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.

BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].

At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
point,  it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.

For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb().  Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.

Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.

The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).

In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages.  Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.

The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").

The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()".  Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case).  The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want.  This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match).  The bpf prog can decide if it wants to do SK_DROP as its
discretion.

When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk.  If not, the kernel
will select one from "reuse->socks[]" (as before this patch).

The SK_DROP and SK_PASS handling logic will be in the next patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
..
datagram.c net: simplify sock_poll_wait 2018-07-30 09:10:25 -07:00
dev.c net: check extack._msg before print 2018-08-05 17:24:45 -07:00
dev_addr_lists.c net: change the comment of dev_mc_init 2018-04-19 12:58:20 -04:00
dev_ioctl.c net: remove redundant input checks in SIOCSIFTXQLEN case of dev_ifsioc 2018-07-24 11:36:15 -07:00
devlink.c devlink: Add generic parameters region_snapshot 2018-07-12 17:37:13 -07:00
drop_monitor.c treewide: setup_timer() -> timer_setup() 2017-11-21 15:57:07 -08:00
dst.c netfilter: nf_tables: add tunnel support 2018-08-03 21:12:12 +02:00
dst_cache.c net: core: dst_cache_set_ip6: Rename 'addr' parameter to 'saddr' for consistency 2018-03-05 12:52:45 -05:00
ethtool.c net: Add TLS RX offload feature 2018-07-16 00:12:09 -07:00
failover.c net: Introduce generic failover module 2018-05-28 22:59:54 -04:00
fib_notifier.c net: Fix fib notifer to return errno 2018-03-29 14:10:30 -04:00
fib_rules.c fib_rules: NULL check before kfree is not needed 2018-07-30 09:44:06 -07:00
filter.c bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT 2018-08-11 01:58:46 +02:00
flow_dissector.c flow_dissector: allow dissection of tunnel options from metadata 2018-08-07 12:22:14 -07:00
gen_estimator.c net_sched: gen_estimator: fix broken estimators based on percpu stats 2018-02-23 12:35:46 -05:00
gen_stats.c gen_stats: Fix netlink stats dumping in the presence of padding 2018-07-04 14:44:45 +09:00
gro_cells.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
hwbm.c net: hwbm: Fix unbalanced spinlock in error case 2016-05-25 12:35:09 -07:00
link_watch.c net: link_watch: mark bonding link events urgent 2018-01-23 19:43:30 -05:00
lwt_bpf.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2018-08-07 11:02:05 -07:00
lwtunnel.c ipv6: sr: define core operations for seg6local lightweight tunnel 2017-08-07 14:16:22 -07:00
Makefile net: Introduce generic failover module 2018-05-28 22:59:54 -04:00
neighbour.c net: remove blank lines at end of file 2018-07-24 14:10:43 -07:00
net-procfs.c proc: introduce proc_create_net{,_data} 2018-05-16 07:24:30 +02:00
net-sysfs.c net: create reusable function for getting ownership info of sysfs inodes 2018-07-20 23:44:36 -07:00
net-sysfs.h License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
net-traces.c net/ipv6: Udate fib6_table_lookup tracepoint 2018-05-24 23:01:15 -04:00
net_namespace.c net: create reusable function for getting ownership info of sysfs inodes 2018-07-20 23:44:36 -07:00
netclassid_cgroup.c cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS 2017-07-21 11:14:51 -04:00
netevent.c
netpoll.c netpoll: Use lockdep to assert IRQs are disabled/enabled 2017-11-08 11:13:54 +01:00
netprio_cgroup.c net: remove duplicate includes 2017-12-13 13:18:46 -05:00
page_pool.c net/page_pool: Fix inconsistent lock state warning 2018-07-19 23:23:01 -07:00
pktgen.c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next 2018-07-27 09:33:37 -07:00
ptp_classifier.c ptp: Change ptp_class to a proper bitmask 2015-11-03 11:08:22 -05:00
request_sock.c ipv4: Namespaceify tcp_max_syn_backlog knob 2016-12-29 11:38:31 -05:00
rtnetlink.c net: report invalid mtu value via netlink extack 2018-07-29 12:57:26 -07:00
scm.c sched/headers: Prepare for new header dependencies before moving code to <linux/sched/user.h> 2017-03-02 08:42:29 +01:00
secure_seq.c tcp: Namespaceify sysctl_tcp_timestamps 2017-06-08 10:53:29 -04:00
skbuff.c net: Export skb_headers_offset_update 2018-08-10 16:12:20 +02:00
sock.c net: avoid unnecessary sock_flag() check when enable timestamp 2018-08-06 10:42:48 -07:00
sock_diag.c net: Drop pernet_operations::async 2018-03-27 13:18:09 -04:00
sock_reuseport.c bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT 2018-08-11 01:58:46 +02:00
stream.c vfs: do bulk POLL* -> EPOLL* replacement 2018-02-11 14:34:03 -08:00
sysctl_net_core.c headers: untangle kmemleak.h from mm.h 2018-04-05 21:36:27 -07:00
timestamping.c
tso.c License cleanup: add SPDX GPL-2.0 license identifier to files with no license 2017-11-02 11:10:55 +01:00
utils.c net: Remove some unneeded semicolon 2018-08-04 13:05:39 -07:00
xdp.c xdp: Helpers for disabling napi_direct of xdp_return_frame 2018-08-10 16:12:21 +02:00