Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2018-05-17

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Provide a new BPF helper for doing a FIB and neighbor lookup
   in the kernel tables from an XDP or tc BPF program. The helper
   provides a fast-path for forwarding packets. The API supports
   IPv4, IPv6 and MPLS protocols, but currently IPv4 and IPv6 are
   implemented in this initial work, from David (Ahern).

2) Just a tiny diff but huge feature enabled for nfp driver by
   extending the BPF offload beyond a pure host processing offload.
   Offloaded XDP programs are allowed to set the RX queue index and
   thus opening the door for defining a fully programmable RSS/n-tuple
   filter replacement. Once BPF decided on a queue already, the device
   data-path will skip the conventional RSS processing completely,
   from Jakub.

3) The original sockmap implementation was array based similar to
   devmap. However unlike devmap where an ifindex has a 1:1 mapping
   into the map there are use cases with sockets that need to be
   referenced using longer keys. Hence, sockhash map is added reusing
   as much of the sockmap code as possible, from John.

4) Introduce BTF ID. The ID is allocatd through an IDR similar as
   with BPF maps and progs. It also makes BTF accessible to user
   space via BPF_BTF_GET_FD_BY_ID and adds exposure of the BTF data
   through BPF_OBJ_GET_INFO_BY_FD, from Martin.

5) Enable BPF stackmap with build_id also in NMI context. Due to the
   up_read() of current->mm->mmap_sem build_id cannot be parsed.
   This work defers the up_read() via a per-cpu irq_work so that
   at least limited support can be enabled, from Song.

6) Various BPF JIT follow-up cleanups and fixups after the LD_ABS/LD_IND
   JIT conversion as well as implementation of an optimized 32/64 bit
   immediate load in the arm64 JIT that allows to reduce the number of
   emitted instructions; in case of tested real-world programs they
   were shrinking by three percent, from Daniel.

7) Add ifindex parameter to the libbpf loader in order to enable
   BPF offload support. Right now only iproute2 can load offloaded
   BPF and this will also enable libbpf for direct integration into
   other applications, from David (Beckett).

8) Convert the plain text documentation under Documentation/bpf/ into
   RST format since this is the appropriate standard the kernel is
   moving to for all documentation. Also add an overview README.rst,
   from Jesper.

9) Add __printf verification attribute to the bpf_verifier_vlog()
   helper. Though it uses va_list we can still allow gcc to check
   the format string, from Mathieu.

10) Fix a bash reference in the BPF selftest's Makefile. The '|& ...'
    is a bash 4.0+ feature which is not guaranteed to be available
    when calling out to shell, therefore use a more portable variant,
    from Joe.

11) Fix a 64 bit division in xdp_umem_reg() by using div_u64()
    instead of relying on the gcc built-in, from Björn.

12) Fix a sock hashmap kmalloc warning reported by syzbot when an
    overly large key size is used in hashmap then causing overflows
    in htab->elem_size. Reject bogus attr->key_size early in the
    sock_hash_alloc(), from Yonghong.

13) Ensure in BPF selftests when urandom_read is being linked that
    --build-id is always enabled so that test_stacktrace_build_id[_nmi]
    won't be failing, from Alexei.

14) Add bitsperlong.h as well as errno.h uapi headers into the tools
    header infrastructure which point to one of the arch specific
    uapi headers. This was needed in order to fix a build error on
    some systems for the BPF selftests, from Sirio.

15) Allow for short options to be used in the xdp_monitor BPF sample
    code. And also a bpf.h tools uapi header sync in order to fix a
    selftest build failure. Both from Prashant.

16) More formally clarify the meaning of ID in the direct packet access
    section of the BPF documentation, from Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
This commit is contained in:
David S. Miller 2018-05-16 22:47:11 -04:00
commit b9f672af14
114 changed files with 4601 additions and 1866 deletions

View file

@ -0,0 +1,36 @@
=================
BPF documentation
=================
This directory contains documentation for the BPF (Berkeley Packet
Filter) facility, with a focus on the extended BPF version (eBPF).
This kernel side documentation is still work in progress. The main
textual documentation is (for historical reasons) described in
`Documentation/networking/filter.txt`_, which describe both classical
and extended BPF instruction-set.
The Cilium project also maintains a `BPF and XDP Reference Guide`_
that goes into great technical depth about the BPF Architecture.
The primary info for the bpf syscall is available in the `man-pages`_
for `bpf(2)`_.
Frequently asked questions (FAQ)
================================
Two sets of Questions and Answers (Q&A) are maintained.
* QA for common questions about BPF see: bpf_design_QA_
* QA for developers interacting with BPF subsystem: bpf_devel_QA_
.. Links:
.. _bpf_design_QA: bpf_design_QA.rst
.. _bpf_devel_QA: bpf_devel_QA.rst
.. _Documentation/networking/filter.txt: ../networking/filter.txt
.. _man-pages: https://www.kernel.org/doc/man-pages/
.. _bpf(2): http://man7.org/linux/man-pages/man2/bpf.2.html
.. _BPF and XDP Reference Guide: http://cilium.readthedocs.io/en/latest/bpf/

View file

@ -0,0 +1,221 @@
==============
BPF Design Q&A
==============
BPF extensibility and applicability to networking, tracing, security
in the linux kernel and several user space implementations of BPF
virtual machine led to a number of misunderstanding on what BPF actually is.
This short QA is an attempt to address that and outline a direction
of where BPF is heading long term.
.. contents::
:local:
:depth: 3
Questions and Answers
=====================
Q: Is BPF a generic instruction set similar to x64 and arm64?
-------------------------------------------------------------
A: NO.
Q: Is BPF a generic virtual machine ?
-------------------------------------
A: NO.
BPF is generic instruction set *with* C calling convention.
-----------------------------------------------------------
Q: Why C calling convention was chosen?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because BPF programs are designed to run in the linux kernel
which is written in C, hence BPF defines instruction set compatible
with two most used architectures x64 and arm64 (and takes into
consideration important quirks of other architectures) and
defines calling convention that is compatible with C calling
convention of the linux kernel on those architectures.
Q: can multiple return values be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF allows only register R0 to be used as return value.
Q: can more than 5 function arguments be supported in the future?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike x64 ISA that allows msft, cdecl and other conventions)
Q: can BPF programs access instruction pointer or return address?
-----------------------------------------------------------------
A: NO.
Q: can BPF programs access stack pointer ?
------------------------------------------
A: NO.
Only frame pointer (register R10) is accessible.
From compiler point of view it's necessary to have stack pointer.
For example LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.
Q: Does C-calling convention diminishes possible use cases?
-----------------------------------------------------------
A: YES.
BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets kernel call into
BPF programs and programs call kernel helpers with zero overhead.
As all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.
Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
------------------------------------------------------------------------
A: Soft yes.
At least for now until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read only sections and all other normal constructs
that C code can produce.
Q: Can loops be supported in a safe way?
----------------------------------------
A: It's not clear yet.
BPF developers are trying to find a way to
support bounded loops where the verifier can guarantee that
the program terminates in less than 4096 instructions.
Instruction level questions
---------------------------
Q: LD_ABS and LD_IND instructions vs C code
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: How come LD_ABS and LD_IND instruction are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?
A: This is artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.
Q: BPF instructions mapping not one-to-one to native CPU
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example why BPF_JNE and other compare and jumps are not cpu-like?
A: This was necessary to avoid introducing flags into ISA which are
impossible to make generic and efficient across CPU architectures.
Q: why BPF_DIV instruction doesn't map to x64 div?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because if we picked one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs div-by-zero runtime check.
Q: why there is no BPF_SDIV for signed divide operation?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because it would be rarely used. llvm errors in such case and
prints a suggestion to use unsigned divide instead
Q: Why BPF has implicit prologue and epilogue?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because architectures like sparc have register windows and in general
there are enough subtle differences between architectures, so naive
store return address into stack won't work. Another reason is BPF has
to be safe from division by zero (and legacy exception path
of LD_ABS insn). Those instructions need to invoke epilogue and
return implicitly.
Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
A: Because classic BPF didn't have them and BPF authors felt that compiler
workaround would be acceptable. Turned out that programs lose performance
due to lack of these compare instructions and they were added.
These two instructions is a perfect example what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have one-to-one mapping to HW instructions
will not be accepted.
Q: BPF 32-bit subregister requirements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
registers which makes BPF inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?
A: NO. The first thing to improve performance on 32-bit archs is to teach
LLVM to generate code that uses 32-bit subregisters. Then second step
is to teach verifier to mark operations where zero-ing upper bits
is unnecessary. Then JITs can take advantage of those markings and
drastically reduce size of generated code and improve performance.
Q: Does BPF have a stable ABI?
------------------------------
A: YES. BPF instructions, arguments to BPF programs, set of helper
functions and their arguments, recognized return codes are all part
of ABI. However when tracing programs are using bpf_probe_read() helper
to walk kernel internal datastructures and compile with kernel
internal headers these accesses can and will break with newer
kernels. The union bpf_attr -> kern_version is checked at load time
to prevent accidentally loading kprobe-based bpf programs written
for a different kernel. Networking programs don't do kern_version check.
Q: How much stack space a BPF program uses?
-------------------------------------------
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used
and both interpreter and most JITed code consume necessary amount.
Q: Can BPF be offloaded to HW?
------------------------------
A: YES. BPF HW offload is supported by NFP driver.
Q: Does classic BPF interpreter still exist?
--------------------------------------------
A: NO. Classic BPF programs are converted into extend BPF instructions.
Q: Can BPF call arbitrary kernel functions?
-------------------------------------------
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
---------------------------------------------
A: NO.
Tracing bpf programs can *read* arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.
Q: Can BPF overwrite arbitrary user memory?
-------------------------------------------
A: Sort-of.
Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
program is loaded the kernel will print warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.
Q: bpf_trace_printk() helper warning
------------------------------------
Q: When bpf_trace_printk() helper is used the kernel prints nasty
warning message. Why is that?
A: This is done to nudge program authors into better interfaces when
programs need to pass data to user space. Like bpf_perf_event_output()
can be used to efficiently stream data via perf ring buffer.
BPF maps can be used for asynchronous data sharing between kernel
and user space. bpf_trace_printk() should only be used for debugging.
Q: New functionality via kernel modules?
----------------------------------------
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
A: NO.

View file

@ -1,156 +0,0 @@
BPF extensibility and applicability to networking, tracing, security
in the linux kernel and several user space implementations of BPF
virtual machine led to a number of misunderstanding on what BPF actually is.
This short QA is an attempt to address that and outline a direction
of where BPF is heading long term.
Q: Is BPF a generic instruction set similar to x64 and arm64?
A: NO.
Q: Is BPF a generic virtual machine ?
A: NO.
BPF is generic instruction set _with_ C calling convention.
Q: Why C calling convention was chosen?
A: Because BPF programs are designed to run in the linux kernel
which is written in C, hence BPF defines instruction set compatible
with two most used architectures x64 and arm64 (and takes into
consideration important quirks of other architectures) and
defines calling convention that is compatible with C calling
convention of the linux kernel on those architectures.
Q: can multiple return values be supported in the future?
A: NO. BPF allows only register R0 to be used as return value.
Q: can more than 5 function arguments be supported in the future?
A: NO. BPF calling convention only allows registers R1-R5 to be used
as arguments. BPF is not a standalone instruction set.
(unlike x64 ISA that allows msft, cdecl and other conventions)
Q: can BPF programs access instruction pointer or return address?
A: NO.
Q: can BPF programs access stack pointer ?
A: NO. Only frame pointer (register R10) is accessible.
From compiler point of view it's necessary to have stack pointer.
For example LLVM defines register R11 as stack pointer in its
BPF backend, but it makes sure that generated code never uses it.
Q: Does C-calling convention diminishes possible use cases?
A: YES. BPF design forces addition of major functionality in the form
of kernel helper functions and kernel objects like BPF maps with
seamless interoperability between them. It lets kernel call into
BPF programs and programs call kernel helpers with zero overhead.
As all of them were native C code. That is particularly the case
for JITed BPF programs that are indistinguishable from
native kernel C code.
Q: Does it mean that 'innovative' extensions to BPF code are disallowed?
A: Soft yes. At least for now until BPF core has support for
bpf-to-bpf calls, indirect calls, loops, global variables,
jump tables, read only sections and all other normal constructs
that C code can produce.
Q: Can loops be supported in a safe way?
A: It's not clear yet. BPF developers are trying to find a way to
support bounded loops where the verifier can guarantee that
the program terminates in less than 4096 instructions.
Q: How come LD_ABS and LD_IND instruction are present in BPF whereas
C code cannot express them and has to use builtin intrinsics?
A: This is artifact of compatibility with classic BPF. Modern
networking code in BPF performs better without them.
See 'direct packet access'.
Q: It seems not all BPF instructions are one-to-one to native CPU.
For example why BPF_JNE and other compare and jumps are not cpu-like?
A: This was necessary to avoid introducing flags into ISA which are
impossible to make generic and efficient across CPU architectures.
Q: why BPF_DIV instruction doesn't map to x64 div?
A: Because if we picked one-to-one relationship to x64 it would have made
it more complicated to support on arm64 and other archs. Also it
needs div-by-zero runtime check.
Q: why there is no BPF_SDIV for signed divide operation?
A: Because it would be rarely used. llvm errors in such case and
prints a suggestion to use unsigned divide instead
Q: Why BPF has implicit prologue and epilogue?
A: Because architectures like sparc have register windows and in general
there are enough subtle differences between architectures, so naive
store return address into stack won't work. Another reason is BPF has
to be safe from division by zero (and legacy exception path
of LD_ABS insn). Those instructions need to invoke epilogue and
return implicitly.
Q: Why BPF_JLT and BPF_JLE instructions were not introduced in the beginning?
A: Because classic BPF didn't have them and BPF authors felt that compiler
workaround would be acceptable. Turned out that programs lose performance
due to lack of these compare instructions and they were added.
These two instructions is a perfect example what kind of new BPF
instructions are acceptable and can be added in the future.
These two already had equivalent instructions in native CPUs.
New instructions that don't have one-to-one mapping to HW instructions
will not be accepted.
Q: BPF 32-bit subregisters have a requirement to zero upper 32-bits of BPF
registers which makes BPF inefficient virtual machine for 32-bit
CPU architectures and 32-bit HW accelerators. Can true 32-bit registers
be added to BPF in the future?
A: NO. The first thing to improve performance on 32-bit archs is to teach
LLVM to generate code that uses 32-bit subregisters. Then second step
is to teach verifier to mark operations where zero-ing upper bits
is unnecessary. Then JITs can take advantage of those markings and
drastically reduce size of generated code and improve performance.
Q: Does BPF have a stable ABI?
A: YES. BPF instructions, arguments to BPF programs, set of helper
functions and their arguments, recognized return codes are all part
of ABI. However when tracing programs are using bpf_probe_read() helper
to walk kernel internal datastructures and compile with kernel
internal headers these accesses can and will break with newer
kernels. The union bpf_attr -> kern_version is checked at load time
to prevent accidentally loading kprobe-based bpf programs written
for a different kernel. Networking programs don't do kern_version check.
Q: How much stack space a BPF program uses?
A: Currently all program types are limited to 512 bytes of stack
space, but the verifier computes the actual amount of stack used
and both interpreter and most JITed code consume necessary amount.
Q: Can BPF be offloaded to HW?
A: YES. BPF HW offload is supported by NFP driver.
Q: Does classic BPF interpreter still exist?
A: NO. Classic BPF programs are converted into extend BPF instructions.
Q: Can BPF call arbitrary kernel functions?
A: NO. BPF programs can only call a set of helper functions which
is defined for every program type.
Q: Can BPF overwrite arbitrary kernel memory?
A: NO. Tracing bpf programs can _read_ arbitrary memory with bpf_probe_read()
and bpf_probe_read_str() helpers. Networking programs cannot read
arbitrary memory, since they don't have access to these helpers.
Programs can never read or write arbitrary memory directly.
Q: Can BPF overwrite arbitrary user memory?
A: Sort-of. Tracing BPF programs can overwrite the user memory
of the current task with bpf_probe_write_user(). Every time such
program is loaded the kernel will print warning message, so
this helper is only useful for experiments and prototypes.
Tracing BPF programs are root only.
Q: When bpf_trace_printk() helper is used the kernel prints nasty
warning message. Why is that?
A: This is done to nudge program authors into better interfaces when
programs need to pass data to user space. Like bpf_perf_event_output()
can be used to efficiently stream data via perf ring buffer.
BPF maps can be used for asynchronous data sharing between kernel
and user space. bpf_trace_printk() should only be used for debugging.
Q: Can BPF functionality such as new program or map types, new
helpers, etc be added out of kernel module code?
A: NO.

View file

@ -0,0 +1,640 @@
=================================
HOWTO interact with BPF subsystem
=================================
This document provides information for the BPF subsystem about various
workflows related to reporting bugs, submitting patches, and queueing
patches for stable kernels.
For general information about submitting patches, please refer to
`Documentation/process/`_. This document only describes additional specifics
related to BPF.
.. contents::
:local:
:depth: 2
Reporting bugs
==============
Q: How do I report bugs for BPF kernel code?
--------------------------------------------
A: Since all BPF kernel development as well as bpftool and iproute2 BPF
loader development happens through the netdev kernel mailing list,
please report any found issues around BPF to the following mailing
list:
netdev@vger.kernel.org
This may also include issues related to XDP, BPF tracing, etc.
Given netdev has a high volume of traffic, please also add the BPF
maintainers to Cc (from kernel MAINTAINERS_ file):
* Alexei Starovoitov <ast@kernel.org>
* Daniel Borkmann <daniel@iogearbox.net>
In case a buggy commit has already been identified, make sure to keep
the actual commit authors in Cc as well for the report. They can
typically be identified through the kernel's git tree.
**Please do NOT report BPF issues to bugzilla.kernel.org since it
is a guarantee that the reported issue will be overlooked.**
Submitting patches
==================
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the netdev kernel mailing list:
netdev@vger.kernel.org
Historically, BPF came out of networking and has always been maintained
by the kernel networking community. Although these days BPF touches
many other subsystems as well, the patches are still routed mainly
through the networking community.
In case your patch has changes in various different subsystems (e.g.
tracing, security, etc), make sure to Cc the related kernel mailing
lists and maintainers from there as well, so they are able to review
the changes and provide their Acked-by's to the patches.
Q: Where can I find patches currently under discussion for BPF subsystem?
-------------------------------------------------------------------------
A: All patches that are Cc'ed to netdev are queued for review under netdev
patchwork project:
http://patchwork.ozlabs.org/project/netdev/list/
Those patches which target BPF, are assigned to a 'bpf' delegate for
further processing from BPF maintainers. The current queue with
patches under review can be found at:
https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
Once the patches have been reviewed by the BPF community as a whole
and approved by the BPF maintainers, their status in patchwork will be
changed to 'Accepted' and the submitter will be notified by mail. This
means that the patches look good from a BPF perspective and have been
applied to one of the two BPF kernel trees.
In case feedback from the community requires a respin of the patches,
their status in patchwork will be set to 'Changes Requested', and purged
from the current review queue. Likewise for cases where patches would
get rejected or are not applicable to the BPF trees (but assigned to
the 'bpf' delegate).
Q: How do the changes make their way into Linux?
------------------------------------------------
A: There are two BPF kernel trees (git repositories). Once patches have
been accepted by the BPF maintainers, they will be applied to one
of the two BPF trees:
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
The bpf tree itself is for fixes only, whereas bpf-next for features,
cleanups or other kind of improvements ("next-like" content). This is
analogous to net and net-next trees for networking. Both bpf and
bpf-next will only have a master branch in order to simplify against
which branch patches should get rebased to.
Accumulated BPF patches in the bpf tree will regularly get pulled
into the net kernel tree. Likewise, accumulated BPF patches accepted
into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see
the `netdev FAQ`_ under:
`Documentation/networking/netdev-FAQ.txt`_
Occasionally, to prevent merge conflicts, we might send pull requests
to other trees (e.g. tracing) with a small subset of the patches, but
net and net-next are always the main trees targeted for integration.
The pull requests will contain a high-level summary of the accumulated
patches and can be searched on netdev kernel mailing list through the
following subject lines (``yyyy-mm-dd`` is the date of the pull
request)::
pull-request: bpf yyyy-mm-dd
pull-request: bpf-next yyyy-mm-dd
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be applied to?
---------------------------------------------------------------------------------
A: The process is the very same as described in the `netdev FAQ`_, so
please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next.
For fixes eventually landing in bpf -> net tree, the subject must
look like::
git format-patch --subject-prefix='PATCH bpf' start..finish
For features/improvements/etc that should eventually land in
bpf-next -> net-next, the subject must look like::
git format-patch --subject-prefix='PATCH bpf-next' start..finish
If unsure whether the patch or patch series should go into bpf
or net directly, or bpf-next or net-next directly, it is not a
problem either if the subject line says net or net-next as target.
It is eventually up to the maintainers to do the delegation of
the patches.
If it is clear that patches should go into bpf or bpf-next tree,
please make sure to rebase the patches against those trees in
order to reduce potential conflicts.
In case the patch or patch series has to be reworked and sent out
again in a second or later revision, it is also required to add a
version number (``v2``, ``v3``, ...) into the subject prefix::
git format-patch --subject-prefix='PATCH net-next v2' start..finish
When changes have been requested to the patch series, always send the
whole patch series again with the feedback incorporated (never send
individual diffs on top of the old series).
Q: What does it mean when a patch gets applied to bpf or bpf-next tree?
-----------------------------------------------------------------------
A: It means that the patch looks good for mainline inclusion from
a BPF point of view.
Be aware that this is not a final verdict that the patch will
automatically get accepted into net or net-next trees eventually:
On the netdev kernel mailing list reviews can come in at any point
in time. If discussions around a patch conclude that they cannot
get included as-is, we will either apply a follow-up fix or drop
them from the trees entirely. Therefore, we also reserve to rebase
the trees when deemed necessary. After all, the purpose of the tree
is to:
i) accumulate and stage BPF patches for integration into trees
like net and net-next, and
ii) run extensive BPF test suite and
workloads on the patches before they make their way any further.
Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the
`netdev FAQ`_ for additional information e.g. on how often they are
merged to mainline.
Q: How long do I need to wait for feedback on my BPF patches?
-------------------------------------------------------------
A: We try to keep the latency low. The usual time to feedback will
be around 2 or 3 business days. It may vary depending on the
complexity of changes and current patch load.
Q: How often do you send pull requests to major kernel trees like net or net-next?
----------------------------------------------------------------------------------
A: Pull requests will be sent out rather often in order to not
accumulate too many patches in bpf or bpf-next.
As a rule of thumb, expect pull requests for each tree regularly
at the end of the week. In some cases pull requests could additionally
come also in the middle of the week depending on the current patch
load or urgency.
Q: Are patches applied to bpf-next when the merge window is open?
-----------------------------------------------------------------
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
so feel free to read up on the `netdev FAQ`_ about further details.
During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released
a ``v*-rc1`` after the merge window, we continue processing of bpf-next.
For non-subscribers to kernel mailing lists, there is also a status
page run by David S. Miller on net-next that provides guidance:
http://vger.kernel.org/~davem/net-next.html
Q: Verifier changes and test cases
----------------------------------
Q: I made a BPF verifier change, do I need to add test cases for
BPF kernel selftests_?
A: If the patch has changes to the behavior of the verifier, then yes,
it is absolutely necessary to add test cases to the BPF kernel
selftests_ suite. If they are not present and we think they are
needed, then we might ask for them before accepting any changes.
In particular, test_verifier.c is tracking a high number of BPF test
cases, including a lot of corner cases that LLVM BPF back end may
generate out of the restricted C code. Thus, adding test cases is
absolutely crucial to make sure future changes do not accidentally
affect prior use-cases. Thus, treat those test cases as: verifier
behavior that is not tracked in test_verifier.c could potentially
be subject to change.
Q: samples/bpf preference vs selftests?
---------------------------------------
Q: When should I add code to `samples/bpf/`_ and when to BPF kernel
selftests_ ?
A: In general, we prefer additions to BPF kernel selftests_ rather than
`samples/bpf/`_. The rationale is very simple: kernel selftests are
regularly run by various bots to test for kernel regressions.
The more test cases we add to BPF selftests, the better the coverage
and the less likely it is that those could accidentally break. It is
not that BPF kernel selftests cannot demo how a specific feature can
be used.
That said, `samples/bpf/`_ may be a good place for people to get started,
so it might be advisable that simple demos of features could go into
`samples/bpf/`_, but advanced functional and corner-case testing rather
into kernel selftests.
If your sample looks like a test case, then go for BPF kernel selftests
instead!
Q: When should I add code to the bpftool?
-----------------------------------------
A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
a central user space tool for debugging and introspection of BPF programs
and maps that are active in the kernel. If UAPI changes related to BPF
enable for dumping additional information of programs or maps, then
bpftool should be extended as well to support dumping them.
Q: When should I add code to iproute2's BPF loader?
---------------------------------------------------
A: For UAPI changes related to the XDP or tc layer (e.g. ``cls_bpf``),
the convention is that those control-path related changes are added to
iproute2's BPF loader as well from user space side. This is not only
useful to have UAPI changes properly designed to be usable, but also
to make those changes available to a wider user base of major
downstream distributions.
Q: Do you accept patches as well for iproute2's BPF loader?
-----------------------------------------------------------
A: Patches for the iproute2's BPF loader have to be sent to:
netdev@vger.kernel.org
While those patches are not processed by the BPF kernel maintainers,
please keep them in Cc as well, so they can be reviewed.
The official git repository for iproute2 is run by Stephen Hemminger
and can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/
The patches need to have a subject prefix of '``[PATCH iproute2
master]``' or '``[PATCH iproute2 net-next]``'. '``master``' or
'``net-next``' describes the target branch where the patch should be
applied to. Meaning, if kernel changes went into the net-next kernel
tree, then the related iproute2 changes need to go into the iproute2
net-next branch, otherwise they can be targeted at master branch. The
iproute2 net-next branch will get merged into the master branch after
the current iproute2 version from master has been released.
Like BPF, the patches end up in patchwork under the netdev project and
are delegated to 'shemminger' for further processing:
http://patchwork.ozlabs.org/project/netdev/list/?delegate=389
Q: What is the minimum requirement before I submit my BPF patches?
------------------------------------------------------------------
A: When submitting patches, always take the time and properly test your
patches *prior* to submission. Never rush them! If maintainers find
that your patches have not been properly tested, it is a good way to
get them grumpy. Testing patch submissions is a hard requirement!
Note, fixes that go to bpf tree *must* have a ``Fixes:`` tag included.
The same applies to fixes that target bpf-next, where the affected
commit is in net-next (or in some cases bpf-next). The ``Fixes:`` tag is
crucial in order to identify follow-up commits and tremendously helps
for people having to do backporting, so it is a must have!
We also don't accept patches with an empty commit message. Take your
time and properly write up a high quality commit message, it is
essential!
Think about it this way: other developers looking at your code a month
from now need to understand *why* a certain change has been done that
way, and whether there have been flaws in the analysis or assumptions
that the original author did. Thus providing a proper rationale and
describing the use-case for the changes is a must.
Patch submissions with >1 patch must have a cover letter which includes
a high level description of the series. This high level summary will
then be placed into the merge commit by the BPF maintainers such that
it is also accessible from the git log for future reference.
Q: Features changing BPF JIT and/or LLVM
----------------------------------------
Q: What do I need to consider when adding a new instruction or feature
that would require BPF JIT and/or LLVM integration as well?
A: We try hard to keep all BPF JITs up to date such that the same user
experience can be guaranteed when running BPF programs on different
architectures without having the program punt to the less efficient
interpreter in case the in-kernel BPF JIT is enabled.
If you are unable to implement or test the required JIT changes for
certain architectures, please work together with the related BPF JIT
developers in order to get the feature implemented in a timely manner.
Please refer to the git log (``arch/*/net/``) to locate the necessary
people for helping out.
Also always make sure to add BPF test cases (e.g. test_bpf.c and
test_verifier.c) for new instructions, so that they can receive
broad test coverage and help run-time testing the various BPF JITs.
In case of new BPF instructions, once the changes have been accepted
into the Linux kernel, please implement support into LLVM's BPF back
end. See LLVM_ section below for further information.
Stable submission
=================
Q: I need a specific BPF commit in stable kernels. What should I do?
--------------------------------------------------------------------
A: In case you need a specific fix in stable kernels, first check whether
the commit has already been applied in the related ``linux-*.y`` branches:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/
If not the case, then drop an email to the BPF maintainers with the
netdev kernel mailing list in Cc and ask for the fix to be queued up:
netdev@vger.kernel.org
The process in general is the same as on netdev itself, see also the
`netdev FAQ`_ document.
Q: Do you also backport to kernels not currently maintained as stable?
----------------------------------------------------------------------
A: No. If you need a specific BPF commit in kernels that are currently not
maintained by the stable maintainers, then you are on your own.
The current stable and longterm stable kernels are all listed here:
https://www.kernel.org/
Q: The BPF patch I am about to submit needs to go to stable as well
-------------------------------------------------------------------
What should I do?
A: The same rules apply as with netdev patch submissions in general, see
`netdev FAQ`_ under:
`Documentation/networking/netdev-FAQ.txt`_
Never add "``Cc: stable@vger.kernel.org``" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
with a note, for example, under the ``---`` part of the patch which does
not go into the git log. Alternatively, this can be done as a simple
request by mail instead.
Q: Queue stable patches
-----------------------
Q: Where do I find currently queued BPF patches that will be submitted
to stable?
A: Once patches that fix critical bugs got applied into the bpf tree, they
are queued up for stable submission under:
http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*
They will be on hold there at minimum until the related commit made its
way into the mainline kernel tree.
After having been under broader exposure, the queued patches will be
submitted by the BPF maintainers to the stable maintainers.
Testing patches
===============
Q: How to run BPF selftests
---------------------------
A: After you have booted into the newly compiled kernel, navigate to
the BPF selftests_ suite in order to test BPF functionality (current
working directory points to the root of the cloned git tree)::
$ cd tools/testing/selftests/bpf/
$ make
To run the verifier tests::
$ sudo ./test_verifier
The verifier tests print out all the current checks being
performed. The summary at the end of running all tests will dump
information of test successes and failures::
Summary: 418 PASSED, 0 FAILED
In order to run through all BPF selftests, the following command is
needed::
$ sudo make run_tests
See the kernels selftest `Documentation/dev-tools/kselftest.rst`_
document for further documentation.
Q: Which BPF kernel selftests version should I run my kernel against?
---------------------------------------------------------------------
A: If you run a kernel ``xyz``, then always run the BPF kernel selftests
from that kernel ``xyz`` as well. Do not expect that the BPF selftest
from the latest mainline tree will pass all the time.
In particular, test_bpf.c and test_verifier.c have a large number of
test cases and are constantly updated with new BPF test sequences, or
existing ones are adapted to verifier changes e.g. due to verifier
becoming smarter and being able to better track certain things.
LLVM
====
Q: Where do I find LLVM with BPF support?
-----------------------------------------
A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
All major distributions these days ship LLVM with BPF back end enabled,
so for the majority of use-cases it is not required to compile LLVM by
hand anymore, just install the distribution provided package.
LLVM's static compiler lists the supported targets through
``llc --version``, make sure BPF targets are listed. Example::
$ llc --version
LLVM (http://llvm.org/):
LLVM version 6.0.0svn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
For developers in order to utilize the latest features added to LLVM's
BPF back end, it is advisable to run the latest LLVM releases. Support
for new BPF kernel features such as additions to the BPF instruction
set are often developed together.
All LLVM releases can be found at: http://releases.llvm.org/
Q: Got it, so how do I build LLVM manually anyway?
--------------------------------------------------
A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
that set up, proceed with building the latest LLVM and clang version
from the git repositories::
$ git clone http://llvm.org/git/llvm.git
$ cd llvm/tools
$ git clone --depth 1 http://llvm.org/git/clang.git
$ cd ..; mkdir build; cd build
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_BUILD_RUNTIME=OFF
$ make -j $(getconf _NPROCESSORS_ONLN)
The built binaries can then be found in the build/bin/ directory, where
you can point the PATH variable to.
Q: Reporting LLVM BPF issues
----------------------------
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
generation back end or about LLVM generated code that the verifier
refuses to accept?
A: Yes, please do!
LLVM's BPF back end is a key piece of the whole BPF
infrastructure and it ties deeply into verification of programs from the
kernel side. Therefore, any issues on either side need to be investigated
and fixed whenever necessary.
Therefore, please make sure to bring them up at netdev kernel mailing
list and Cc BPF maintainers for LLVM and kernel bits:
* Yonghong Song <yhs@fb.com>
* Alexei Starovoitov <ast@kernel.org>
* Daniel Borkmann <daniel@iogearbox.net>
LLVM also has an issue tracker where BPF related bugs can be found:
https://bugs.llvm.org/buglist.cgi?quicksearch=bpf
However, it is better to reach out through mailing lists with having
maintainers in Cc.
Q: New BPF instruction for kernel and LLVM
------------------------------------------
Q: I have added a new BPF instruction to the kernel, how can I integrate
it into LLVM?
A: LLVM has a ``-mcpu`` selector for the BPF back end in order to allow
the selection of BPF instruction set extensions. By default the
``generic`` processor target is used, which is the base instruction set
(v1) of BPF.
LLVM has an option to select ``-mcpu=probe`` where it will probe the host
kernel for supported BPF instruction set extensions and selects the
optimal set automatically.
For cross-compilation, a specific version can be select manually as well ::
$ llc -march bpf -mcpu=help
Available CPUs for this target:
generic - Select the generic processor.
probe - Select the probe processor.
v1 - Select the v1 processor.
v2 - Select the v2 processor.
[...]
Newly added BPF instructions to the Linux kernel need to follow the same
scheme, bump the instruction set version and implement probing for the
extensions such that ``-mcpu=probe`` users can benefit from the
optimization transparently when upgrading their kernels.
If you are unable to implement support for the newly added BPF instruction
please reach out to BPF developers for help.
By the way, the BPF kernel selftests run with ``-mcpu=probe`` for better
test coverage.
Q: clang flag for target bpf?
-----------------------------
Q: In some cases clang flag ``-target bpf`` is used but in other cases the
default clang target, which matches the underlying architecture, is used.
What is the difference and when I should use which?
A: Although LLVM IR generation and optimization try to stay architecture
independent, ``-target <arch>`` still has some impact on generated code:
- BPF program may recursively include header file(s) with file scope
inline assembly codes. The default target can handle this well,
while ``bpf`` target may fail if bpf backend assembler does not
understand these assembly codes, which is true in most cases.
- When compiled without ``-g``, additional elf sections, e.g.,
.eh_frame and .rela.eh_frame, may be present in the object file
with default target, but not with ``bpf`` target.
- The default target may turn a C switch statement into a switch table
lookup and jump operation. Since the switch table is placed
in the global readonly section, the bpf program will fail to load.
The bpf target does not support switch table optimization.
The clang option ``-fno-jump-tables`` can be used to disable
switch table generation.
- For clang ``-target bpf``, it is guaranteed that pointer or long /
unsigned long types will always have a width of 64 bit, no matter
whether underlying clang binary or default target (or kernel) is
32 bit. However, when native clang target is used, then it will
compile these types based on the underlying architecture's conventions,
meaning in case of 32 bit architecture, pointer or long / unsigned
long types e.g. in BPF context structure will have width of 32 bit
while the BPF LLVM back end still operates in 64 bit. The native
target is mostly needed in tracing for the case of walking ``pt_regs``
or other kernel structures where CPU's register width matters.
Otherwise, ``clang -target bpf`` is generally recommended.
You should use default target when:
- Your program includes a header file, e.g., ptrace.h, which eventually
pulls in some header files containing file scope host assembly codes.
- You can add ``-fno-jump-tables`` to work around the switch table issue.
Otherwise, you can use ``bpf`` target. Additionally, you *must* use bpf target
when:
- Your program uses data structures with pointer or long / unsigned long
types that interface with BPF helpers or context data structures. Access
into these structures is verified by the BPF verifier and may result
in verification failures if the native architecture is not aligned with
the BPF architecture, e.g. 64-bit. An example of this is
BPF_PROG_TYPE_SK_MSG require ``-target bpf``
.. Links
.. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
.. _MAINTAINERS: ../../MAINTAINERS
.. _Documentation/networking/netdev-FAQ.txt: ../networking/netdev-FAQ.txt
.. _netdev FAQ: ../networking/netdev-FAQ.txt
.. _samples/bpf/: ../../samples/bpf/
.. _selftests: ../../tools/testing/selftests/bpf/
.. _Documentation/dev-tools/kselftest.rst:
https://www.kernel.org/doc/html/latest/dev-tools/kselftest.html
Happy BPF hacking!

View file

@ -1,570 +0,0 @@
This document provides information for the BPF subsystem about various
workflows related to reporting bugs, submitting patches, and queueing
patches for stable kernels.
For general information about submitting patches, please refer to
Documentation/process/. This document only describes additional specifics
related to BPF.
Reporting bugs:
---------------
Q: How do I report bugs for BPF kernel code?
A: Since all BPF kernel development as well as bpftool and iproute2 BPF
loader development happens through the netdev kernel mailing list,
please report any found issues around BPF to the following mailing
list:
netdev@vger.kernel.org
This may also include issues related to XDP, BPF tracing, etc.
Given netdev has a high volume of traffic, please also add the BPF
maintainers to Cc (from kernel MAINTAINERS file):
Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann <daniel@iogearbox.net>
In case a buggy commit has already been identified, make sure to keep
the actual commit authors in Cc as well for the report. They can
typically be identified through the kernel's git tree.
Please do *not* report BPF issues to bugzilla.kernel.org since it
is a guarantee that the reported issue will be overlooked.
Submitting patches:
-------------------
Q: To which mailing list do I need to submit my BPF patches?
A: Please submit your BPF patches to the netdev kernel mailing list:
netdev@vger.kernel.org
Historically, BPF came out of networking and has always been maintained
by the kernel networking community. Although these days BPF touches
many other subsystems as well, the patches are still routed mainly
through the networking community.
In case your patch has changes in various different subsystems (e.g.
tracing, security, etc), make sure to Cc the related kernel mailing
lists and maintainers from there as well, so they are able to review
the changes and provide their Acked-by's to the patches.
Q: Where can I find patches currently under discussion for BPF subsystem?
A: All patches that are Cc'ed to netdev are queued for review under netdev
patchwork project:
http://patchwork.ozlabs.org/project/netdev/list/
Those patches which target BPF, are assigned to a 'bpf' delegate for
further processing from BPF maintainers. The current queue with
patches under review can be found at:
https://patchwork.ozlabs.org/project/netdev/list/?delegate=77147
Once the patches have been reviewed by the BPF community as a whole
and approved by the BPF maintainers, their status in patchwork will be
changed to 'Accepted' and the submitter will be notified by mail. This
means that the patches look good from a BPF perspective and have been
applied to one of the two BPF kernel trees.
In case feedback from the community requires a respin of the patches,
their status in patchwork will be set to 'Changes Requested', and purged
from the current review queue. Likewise for cases where patches would
get rejected or are not applicable to the BPF trees (but assigned to
the 'bpf' delegate).
Q: How do the changes make their way into Linux?
A: There are two BPF kernel trees (git repositories). Once patches have
been accepted by the BPF maintainers, they will be applied to one
of the two BPF trees:
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git/
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/
The bpf tree itself is for fixes only, whereas bpf-next for features,
cleanups or other kind of improvements ("next-like" content). This is
analogous to net and net-next trees for networking. Both bpf and
bpf-next will only have a master branch in order to simplify against
which branch patches should get rebased to.
Accumulated BPF patches in the bpf tree will regularly get pulled
into the net kernel tree. Likewise, accumulated BPF patches accepted
into the bpf-next tree will make their way into net-next tree. net and
net-next are both run by David S. Miller. From there, they will go
into the kernel mainline tree run by Linus Torvalds. To read up on the
process of net and net-next being merged into the mainline tree, see
the netdev FAQ under:
Documentation/networking/netdev-FAQ.txt
Occasionally, to prevent merge conflicts, we might send pull requests
to other trees (e.g. tracing) with a small subset of the patches, but
net and net-next are always the main trees targeted for integration.
The pull requests will contain a high-level summary of the accumulated
patches and can be searched on netdev kernel mailing list through the
following subject lines (yyyy-mm-dd is the date of the pull request):
pull-request: bpf yyyy-mm-dd
pull-request: bpf-next yyyy-mm-dd
Q: How do I indicate which tree (bpf vs. bpf-next) my patch should be
applied to?
A: The process is the very same as described in the netdev FAQ, so
please read up on it. The subject line must indicate whether the
patch is a fix or rather "next-like" content in order to let the
maintainers know whether it is targeted at bpf or bpf-next.
For fixes eventually landing in bpf -> net tree, the subject must
look like:
git format-patch --subject-prefix='PATCH bpf' start..finish
For features/improvements/etc that should eventually land in
bpf-next -> net-next, the subject must look like:
git format-patch --subject-prefix='PATCH bpf-next' start..finish
If unsure whether the patch or patch series should go into bpf
or net directly, or bpf-next or net-next directly, it is not a
problem either if the subject line says net or net-next as target.
It is eventually up to the maintainers to do the delegation of
the patches.
If it is clear that patches should go into bpf or bpf-next tree,
please make sure to rebase the patches against those trees in
order to reduce potential conflicts.
In case the patch or patch series has to be reworked and sent out
again in a second or later revision, it is also required to add a
version number (v2, v3, ...) into the subject prefix:
git format-patch --subject-prefix='PATCH net-next v2' start..finish
When changes have been requested to the patch series, always send the
whole patch series again with the feedback incorporated (never send
individual diffs on top of the old series).
Q: What does it mean when a patch gets applied to bpf or bpf-next tree?
A: It means that the patch looks good for mainline inclusion from
a BPF point of view.
Be aware that this is not a final verdict that the patch will
automatically get accepted into net or net-next trees eventually:
On the netdev kernel mailing list reviews can come in at any point
in time. If discussions around a patch conclude that they cannot
get included as-is, we will either apply a follow-up fix or drop
them from the trees entirely. Therefore, we also reserve to rebase
the trees when deemed necessary. After all, the purpose of the tree
is to i) accumulate and stage BPF patches for integration into trees
like net and net-next, and ii) run extensive BPF test suite and
workloads on the patches before they make their way any further.
Once the BPF pull request was accepted by David S. Miller, then
the patches end up in net or net-next tree, respectively, and
make their way from there further into mainline. Again, see the
netdev FAQ for additional information e.g. on how often they are
merged to mainline.
Q: How long do I need to wait for feedback on my BPF patches?
A: We try to keep the latency low. The usual time to feedback will
be around 2 or 3 business days. It may vary depending on the
complexity of changes and current patch load.
Q: How often do you send pull requests to major kernel trees like
net or net-next?
A: Pull requests will be sent out rather often in order to not
accumulate too many patches in bpf or bpf-next.
As a rule of thumb, expect pull requests for each tree regularly
at the end of the week. In some cases pull requests could additionally
come also in the middle of the week depending on the current patch
load or urgency.
Q: Are patches applied to bpf-next when the merge window is open?
A: For the time when the merge window is open, bpf-next will not be
processed. This is roughly analogous to net-next patch processing,
so feel free to read up on the netdev FAQ about further details.
During those two weeks of merge window, we might ask you to resend
your patch series once bpf-next is open again. Once Linus released
a v*-rc1 after the merge window, we continue processing of bpf-next.
For non-subscribers to kernel mailing lists, there is also a status
page run by David S. Miller on net-next that provides guidance:
http://vger.kernel.org/~davem/net-next.html
Q: I made a BPF verifier change, do I need to add test cases for
BPF kernel selftests?
A: If the patch has changes to the behavior of the verifier, then yes,
it is absolutely necessary to add test cases to the BPF kernel
selftests suite. If they are not present and we think they are
needed, then we might ask for them before accepting any changes.
In particular, test_verifier.c is tracking a high number of BPF test
cases, including a lot of corner cases that LLVM BPF back end may
generate out of the restricted C code. Thus, adding test cases is
absolutely crucial to make sure future changes do not accidentally
affect prior use-cases. Thus, treat those test cases as: verifier
behavior that is not tracked in test_verifier.c could potentially
be subject to change.
Q: When should I add code to samples/bpf/ and when to BPF kernel
selftests?
A: In general, we prefer additions to BPF kernel selftests rather than
samples/bpf/. The rationale is very simple: kernel selftests are
regularly run by various bots to test for kernel regressions.
The more test cases we add to BPF selftests, the better the coverage
and the less likely it is that those could accidentally break. It is
not that BPF kernel selftests cannot demo how a specific feature can
be used.
That said, samples/bpf/ may be a good place for people to get started,
so it might be advisable that simple demos of features could go into
samples/bpf/, but advanced functional and corner-case testing rather
into kernel selftests.
If your sample looks like a test case, then go for BPF kernel selftests
instead!
Q: When should I add code to the bpftool?
A: The main purpose of bpftool (under tools/bpf/bpftool/) is to provide
a central user space tool for debugging and introspection of BPF programs
and maps that are active in the kernel. If UAPI changes related to BPF
enable for dumping additional information of programs or maps, then
bpftool should be extended as well to support dumping them.
Q: When should I add code to iproute2's BPF loader?
A: For UAPI changes related to the XDP or tc layer (e.g. cls_bpf), the
convention is that those control-path related changes are added to
iproute2's BPF loader as well from user space side. This is not only
useful to have UAPI changes properly designed to be usable, but also
to make those changes available to a wider user base of major
downstream distributions.
Q: Do you accept patches as well for iproute2's BPF loader?
A: Patches for the iproute2's BPF loader have to be sent to:
netdev@vger.kernel.org
While those patches are not processed by the BPF kernel maintainers,
please keep them in Cc as well, so they can be reviewed.
The official git repository for iproute2 is run by Stephen Hemminger
and can be found at:
https://git.kernel.org/pub/scm/linux/kernel/git/shemminger/iproute2.git/
The patches need to have a subject prefix of '[PATCH iproute2 master]'
or '[PATCH iproute2 net-next]'. 'master' or 'net-next' describes the
target branch where the patch should be applied to. Meaning, if kernel
changes went into the net-next kernel tree, then the related iproute2
changes need to go into the iproute2 net-next branch, otherwise they
can be targeted at master branch. The iproute2 net-next branch will get
merged into the master branch after the current iproute2 version from
master has been released.
Like BPF, the patches end up in patchwork under the netdev project and
are delegated to 'shemminger' for further processing:
http://patchwork.ozlabs.org/project/netdev/list/?delegate=389
Q: What is the minimum requirement before I submit my BPF patches?
A: When submitting patches, always take the time and properly test your
patches *prior* to submission. Never rush them! If maintainers find
that your patches have not been properly tested, it is a good way to
get them grumpy. Testing patch submissions is a hard requirement!
Note, fixes that go to bpf tree *must* have a Fixes: tag included. The
same applies to fixes that target bpf-next, where the affected commit
is in net-next (or in some cases bpf-next). The Fixes: tag is crucial
in order to identify follow-up commits and tremendously helps for people
having to do backporting, so it is a must have!
We also don't accept patches with an empty commit message. Take your
time and properly write up a high quality commit message, it is
essential!
Think about it this way: other developers looking at your code a month
from now need to understand *why* a certain change has been done that
way, and whether there have been flaws in the analysis or assumptions
that the original author did. Thus providing a proper rationale and
describing the use-case for the changes is a must.
Patch submissions with >1 patch must have a cover letter which includes
a high level description of the series. This high level summary will
then be placed into the merge commit by the BPF maintainers such that
it is also accessible from the git log for future reference.
Q: What do I need to consider when adding a new instruction or feature
that would require BPF JIT and/or LLVM integration as well?
A: We try hard to keep all BPF JITs up to date such that the same user
experience can be guaranteed when running BPF programs on different
architectures without having the program punt to the less efficient
interpreter in case the in-kernel BPF JIT is enabled.
If you are unable to implement or test the required JIT changes for
certain architectures, please work together with the related BPF JIT
developers in order to get the feature implemented in a timely manner.
Please refer to the git log (arch/*/net/) to locate the necessary
people for helping out.
Also always make sure to add BPF test cases (e.g. test_bpf.c and
test_verifier.c) for new instructions, so that they can receive
broad test coverage and help run-time testing the various BPF JITs.
In case of new BPF instructions, once the changes have been accepted
into the Linux kernel, please implement support into LLVM's BPF back
end. See LLVM section below for further information.
Stable submission:
------------------
Q: I need a specific BPF commit in stable kernels. What should I do?
A: In case you need a specific fix in stable kernels, first check whether
the commit has already been applied in the related linux-*.y branches:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git/
If not the case, then drop an email to the BPF maintainers with the
netdev kernel mailing list in Cc and ask for the fix to be queued up:
netdev@vger.kernel.org
The process in general is the same as on netdev itself, see also the
netdev FAQ document.
Q: Do you also backport to kernels not currently maintained as stable?
A: No. If you need a specific BPF commit in kernels that are currently not
maintained by the stable maintainers, then you are on your own.
The current stable and longterm stable kernels are all listed here:
https://www.kernel.org/
Q: The BPF patch I am about to submit needs to go to stable as well. What
should I do?
A: The same rules apply as with netdev patch submissions in general, see
netdev FAQ under:
Documentation/networking/netdev-FAQ.txt
Never add "Cc: stable@vger.kernel.org" to the patch description, but
ask the BPF maintainers to queue the patches instead. This can be done
with a note, for example, under the "---" part of the patch which does
not go into the git log. Alternatively, this can be done as a simple
request by mail instead.
Q: Where do I find currently queued BPF patches that will be submitted
to stable?
A: Once patches that fix critical bugs got applied into the bpf tree, they
are queued up for stable submission under:
http://patchwork.ozlabs.org/bundle/bpf/stable/?state=*
They will be on hold there at minimum until the related commit made its
way into the mainline kernel tree.
After having been under broader exposure, the queued patches will be
submitted by the BPF maintainers to the stable maintainers.
Testing patches:
----------------
Q: Which BPF kernel selftests version should I run my kernel against?
A: If you run a kernel xyz, then always run the BPF kernel selftests from
that kernel xyz as well. Do not expect that the BPF selftest from the
latest mainline tree will pass all the time.
In particular, test_bpf.c and test_verifier.c have a large number of
test cases and are constantly updated with new BPF test sequences, or
existing ones are adapted to verifier changes e.g. due to verifier
becoming smarter and being able to better track certain things.
LLVM:
-----
Q: Where do I find LLVM with BPF support?
A: The BPF back end for LLVM is upstream in LLVM since version 3.7.1.
All major distributions these days ship LLVM with BPF back end enabled,
so for the majority of use-cases it is not required to compile LLVM by
hand anymore, just install the distribution provided package.
LLVM's static compiler lists the supported targets through 'llc --version',
make sure BPF targets are listed. Example:
$ llc --version
LLVM (http://llvm.org/):
LLVM version 6.0.0svn
Optimized build.
Default target: x86_64-unknown-linux-gnu
Host CPU: skylake
Registered Targets:
bpf - BPF (host endian)
bpfeb - BPF (big endian)
bpfel - BPF (little endian)
x86 - 32-bit X86: Pentium-Pro and above
x86-64 - 64-bit X86: EM64T and AMD64
For developers in order to utilize the latest features added to LLVM's
BPF back end, it is advisable to run the latest LLVM releases. Support
for new BPF kernel features such as additions to the BPF instruction
set are often developed together.
All LLVM releases can be found at: http://releases.llvm.org/
Q: Got it, so how do I build LLVM manually anyway?
A: You need cmake and gcc-c++ as build requisites for LLVM. Once you have
that set up, proceed with building the latest LLVM and clang version
from the git repositories:
$ git clone http://llvm.org/git/llvm.git
$ cd llvm/tools
$ git clone --depth 1 http://llvm.org/git/clang.git
$ cd ..; mkdir build; cd build
$ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" \
-DBUILD_SHARED_LIBS=OFF \
-DCMAKE_BUILD_TYPE=Release \
-DLLVM_BUILD_RUNTIME=OFF
$ make -j $(getconf _NPROCESSORS_ONLN)
The built binaries can then be found in the build/bin/ directory, where
you can point the PATH variable to.
Q: Should I notify BPF kernel maintainers about issues in LLVM's BPF code
generation back end or about LLVM generated code that the verifier
refuses to accept?
A: Yes, please do! LLVM's BPF back end is a key piece of the whole BPF
infrastructure and it ties deeply into verification of programs from the
kernel side. Therefore, any issues on either side need to be investigated
and fixed whenever necessary.
Therefore, please make sure to bring them up at netdev kernel mailing
list and Cc BPF maintainers for LLVM and kernel bits:
Yonghong Song <yhs@fb.com>
Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann <daniel@iogearbox.net>
LLVM also has an issue tracker where BPF related bugs can be found:
https://bugs.llvm.org/buglist.cgi?quicksearch=bpf
However, it is better to reach out through mailing lists with having
maintainers in Cc.
Q: I have added a new BPF instruction to the kernel, how can I integrate
it into LLVM?
A: LLVM has a -mcpu selector for the BPF back end in order to allow the
selection of BPF instruction set extensions. By default the 'generic'
processor target is used, which is the base instruction set (v1) of BPF.
LLVM has an option to select -mcpu=probe where it will probe the host
kernel for supported BPF instruction set extensions and selects the
optimal set automatically.
For cross-compilation, a specific version can be select manually as well.
$ llc -march bpf -mcpu=help
Available CPUs for this target:
generic - Select the generic processor.
probe - Select the probe processor.
v1 - Select the v1 processor.
v2 - Select the v2 processor.
[...]
Newly added BPF instructions to the Linux kernel need to follow the same
scheme, bump the instruction set version and implement probing for the
extensions such that -mcpu=probe users can benefit from the optimization
transparently when upgrading their kernels.
If you are unable to implement support for the newly added BPF instruction
please reach out to BPF developers for help.
By the way, the BPF kernel selftests run with -mcpu=probe for better
test coverage.
Q: In some cases clang flag "-target bpf" is used but in other cases the
default clang target, which matches the underlying architecture, is used.
What is the difference and when I should use which?
A: Although LLVM IR generation and optimization try to stay architecture
independent, "-target <arch>" still has some impact on generated code:
- BPF program may recursively include header file(s) with file scope
inline assembly codes. The default target can handle this well,
while bpf target may fail if bpf backend assembler does not
understand these assembly codes, which is true in most cases.
- When compiled without -g, additional elf sections, e.g.,
.eh_frame and .rela.eh_frame, may be present in the object file
with default target, but not with bpf target.
- The default target may turn a C switch statement into a switch table
lookup and jump operation. Since the switch table is placed
in the global readonly section, the bpf program will fail to load.
The bpf target does not support switch table optimization.
The clang option "-fno-jump-tables" can be used to disable
switch table generation.
- For clang -target bpf, it is guaranteed that pointer or long /
unsigned long types will always have a width of 64 bit, no matter
whether underlying clang binary or default target (or kernel) is
32 bit. However, when native clang target is used, then it will
compile these types based on the underlying architecture's conventions,
meaning in case of 32 bit architecture, pointer or long / unsigned
long types e.g. in BPF context structure will have width of 32 bit
while the BPF LLVM back end still operates in 64 bit. The native
target is mostly needed in tracing for the case of walking pt_regs
or other kernel structures where CPU's register width matters.
Otherwise, clang -target bpf is generally recommended.
You should use default target when:
- Your program includes a header file, e.g., ptrace.h, which eventually
pulls in some header files containing file scope host assembly codes.
- You can add "-fno-jump-tables" to work around the switch table issue.
Otherwise, you can use bpf target. Additionally, you _must_ use bpf target
when:
- Your program uses data structures with pointer or long / unsigned long
types that interface with BPF helpers or context data structures. Access
into these structures is verified by the BPF verifier and may result
in verification failures if the native architecture is not aligned with
the BPF architecture, e.g. 64-bit. An example of this is
BPF_PROG_TYPE_SK_MSG require '-target bpf'
Happy BPF hacking!

View file

@ -1142,6 +1142,7 @@ into a register from memory, the register's top 56 bits are known zero, while
the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
0x1ff), because of potential carries.
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
@ -1150,14 +1151,16 @@ BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding some variable to a packet pointer, if you then copy it to
another register and (say) add a constant 4, both registers will share the same
'id' but one will have a fixed offset of +4. Then if it is bounds-checked and
found to be less than a PTR_TO_PACKET_END, the other register is now known to
have a safe range of at least 4 bytes. See 'Direct packet access', below, for
more on PTR_TO_PACKET ranges.
checks: after adding a variable to a packet pointer register A, if you then copy
it to another register B and then add a constant 4 to A, both registers will
share the same 'id' but the A will have a fixed offset of +4. Then if A is
bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
now known to have a safe range of at least 4 bytes. See 'Direct packet access',
below, for more on PTR_TO_PACKET ranges.
The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.

View file

@ -234,18 +234,11 @@ static void jit_fill_hole(void *area, unsigned int size)
#define SCRATCH_SIZE 80
/* total stack size used in JITed code */
#define _STACK_SIZE \
(ctx->prog->aux->stack_depth + \
+ SCRATCH_SIZE + \
+ 4 /* extra for skb_copy_bits buffer */)
#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
#define _STACK_SIZE (ctx->prog->aux->stack_depth + SCRATCH_SIZE)
#define STACK_SIZE ALIGN(_STACK_SIZE, STACK_ALIGNMENT)
/* Get the offset of eBPF REGISTERs stored on scratch space. */
#define STACK_VAR(off) (STACK_SIZE-off-4)
/* Offset of skb_copy_bits buffer */
#define SKB_BUFFER STACK_VAR(SCRATCH_SIZE)
#define STACK_VAR(off) (STACK_SIZE - off)
#if __LINUX_ARM_ARCH__ < 7

View file

@ -21,7 +21,6 @@
#include <linux/bpf.h>
#include <linux/filter.h>
#include <linux/printk.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <asm/byteorder.h>
@ -80,37 +79,6 @@ static inline void emit(const u32 insn, struct jit_ctx *ctx)
ctx->idx++;
}
static inline void emit_a64_mov_i64(const int reg, const u64 val,
struct jit_ctx *ctx)
{
u64 tmp = val;
int shift = 0;
emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
tmp >>= 16;
shift += 16;
while (tmp) {
if (tmp & 0xffff)
emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
tmp >>= 16;
shift += 16;
}
}
static inline void emit_addr_mov_i64(const int reg, const u64 val,
struct jit_ctx *ctx)
{
u64 tmp = val;
int shift = 0;
emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
for (;shift < 48;) {
tmp >>= 16;
shift += 16;
emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
}
}
static inline void emit_a64_mov_i(const int is64, const int reg,
const s32 val, struct jit_ctx *ctx)
{
@ -122,7 +90,8 @@ static inline void emit_a64_mov_i(const int is64, const int reg,
emit(A64_MOVN(is64, reg, (u16)~lo, 0), ctx);
} else {
emit(A64_MOVN(is64, reg, (u16)~hi, 16), ctx);
emit(A64_MOVK(is64, reg, lo, 0), ctx);
if (lo != 0xffff)
emit(A64_MOVK(is64, reg, lo, 0), ctx);
}
} else {
emit(A64_MOVZ(is64, reg, lo, 0), ctx);
@ -131,6 +100,59 @@ static inline void emit_a64_mov_i(const int is64, const int reg,
}
}
static int i64_i16_blocks(const u64 val, bool inverse)
{
return (((val >> 0) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
(((val >> 16) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
(((val >> 32) & 0xffff) != (inverse ? 0xffff : 0x0000)) +
(((val >> 48) & 0xffff) != (inverse ? 0xffff : 0x0000));
}
static inline void emit_a64_mov_i64(const int reg, const u64 val,
struct jit_ctx *ctx)
{
u64 nrm_tmp = val, rev_tmp = ~val;
bool inverse;
int shift;
if (!(nrm_tmp >> 32))
return emit_a64_mov_i(0, reg, (u32)val, ctx);
inverse = i64_i16_blocks(nrm_tmp, true) < i64_i16_blocks(nrm_tmp, false);
shift = max(round_down((inverse ? (fls64(rev_tmp) - 1) :
(fls64(nrm_tmp) - 1)), 16), 0);
if (inverse)
emit(A64_MOVN(1, reg, (rev_tmp >> shift) & 0xffff, shift), ctx);
else
emit(A64_MOVZ(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
shift -= 16;
while (shift >= 0) {
if (((nrm_tmp >> shift) & 0xffff) != (inverse ? 0xffff : 0x0000))
emit(A64_MOVK(1, reg, (nrm_tmp >> shift) & 0xffff, shift), ctx);
shift -= 16;
}
}
/*
* This is an unoptimized 64 immediate emission used for BPF to BPF call
* addresses. It will always do a full 64 bit decomposition as otherwise
* more complexity in the last extra pass is required since we previously
* reserved 4 instructions for the address.
*/
static inline void emit_addr_mov_i64(const int reg, const u64 val,
struct jit_ctx *ctx)
{
u64 tmp = val;
int shift = 0;
emit(A64_MOVZ(1, reg, tmp & 0xffff, shift), ctx);
for (;shift < 48;) {
tmp >>= 16;
shift += 16;
emit(A64_MOVK(1, reg, tmp & 0xffff, shift), ctx);
}
}
static inline int bpf2a64_offset(int bpf_to, int bpf_from,
const struct jit_ctx *ctx)
{
@ -163,7 +185,7 @@ static inline int epilogue_offset(const struct jit_ctx *ctx)
/* Tail call offset to jump into */
#define PROLOGUE_OFFSET 7
static int build_prologue(struct jit_ctx *ctx)
static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
{
const struct bpf_prog *prog = ctx->prog;
const u8 r6 = bpf2a64[BPF_REG_6];
@ -188,7 +210,7 @@ static int build_prologue(struct jit_ctx *ctx)
* | ... | BPF prog stack
* | |
* +-----+ <= (BPF_FP - prog->aux->stack_depth)
* |RSVD | JIT scratchpad
* |RSVD | padding
* current A64_SP => +-----+ <= (BPF_FP - ctx->stack_size)
* | |
* | ... | Function call stack
@ -210,19 +232,19 @@ static int build_prologue(struct jit_ctx *ctx)
/* Set up BPF prog stack base register */
emit(A64_MOV(1, fp, A64_SP), ctx);
/* Initialize tail_call_cnt */
emit(A64_MOVZ(1, tcc, 0, 0), ctx);
if (!ebpf_from_cbpf) {
/* Initialize tail_call_cnt */
emit(A64_MOVZ(1, tcc, 0, 0), ctx);
cur_offset = ctx->idx - idx0;
if (cur_offset != PROLOGUE_OFFSET) {
pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
cur_offset, PROLOGUE_OFFSET);
return -1;
cur_offset = ctx->idx - idx0;
if (cur_offset != PROLOGUE_OFFSET) {
pr_err_once("PROLOGUE_OFFSET = %d, expected %d!\n",
cur_offset, PROLOGUE_OFFSET);
return -1;
}
}
/* 4 byte extra for skb_copy_bits buffer */
ctx->stack_size = prog->aux->stack_depth + 4;
ctx->stack_size = STACK_ALIGN(ctx->stack_size);
ctx->stack_size = STACK_ALIGN(prog->aux->stack_depth);
/* Set up function call stack */
emit(A64_SUB_I(1, A64_SP, A64_SP, ctx->stack_size), ctx);
@ -786,6 +808,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
struct bpf_prog *tmp, *orig_prog = prog;
struct bpf_binary_header *header;
struct arm64_jit_data *jit_data;
bool was_classic = bpf_prog_was_classic(prog);
bool tmp_blinded = false;
bool extra_pass = false;
struct jit_ctx ctx;
@ -840,7 +863,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
goto out_off;
}
if (build_prologue(&ctx)) {
if (build_prologue(&ctx, was_classic)) {
prog = orig_prog;
goto out_off;
}
@ -863,7 +886,7 @@ struct bpf_prog *bpf_int_jit_compile(struct bpf_prog *prog)
skip_init_ctx:
ctx.idx = 0;
build_prologue(&ctx);
build_prologue(&ctx, was_classic);
if (build_body(&ctx)) {
bpf_jit_binary_free(header);

View file

@ -95,7 +95,6 @@ enum reg_val_type {
* struct jit_ctx - JIT context
* @skf: The sk_filter
* @stack_size: eBPF stack size
* @tmp_offset: eBPF $sp offset to 8-byte temporary memory
* @idx: Instruction index
* @flags: JIT flags
* @offsets: Instruction offsets
@ -105,7 +104,6 @@ enum reg_val_type {
struct jit_ctx {
const struct bpf_prog *skf;
int stack_size;
int tmp_offset;
u32 idx;
u32 flags;
u32 *offsets;
@ -293,7 +291,6 @@ static int gen_int_prologue(struct jit_ctx *ctx)
locals_size = (ctx->flags & EBPF_SEEN_FP) ? MAX_BPF_STACK : 0;
stack_adjust += locals_size;
ctx->tmp_offset = locals_size;
ctx->stack_size = stack_adjust;
@ -399,7 +396,6 @@ static void gen_imm_to_reg(const struct bpf_insn *insn, int reg,
emit_instr(ctx, lui, reg, upper >> 16);
emit_instr(ctx, addiu, reg, reg, lower);
}
}
static int gen_imm_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
@ -547,28 +543,6 @@ static int gen_imm_insn(const struct bpf_insn *insn, struct jit_ctx *ctx,
return 0;
}
static void * __must_check
ool_skb_header_pointer(const struct sk_buff *skb, int offset,
int len, void *buffer)
{
return skb_header_pointer(skb, offset, len, buffer);
}
static int size_to_len(const struct bpf_insn *insn)
{
switch (BPF_SIZE(insn->code)) {
case BPF_B:
return 1;
case BPF_H:
return 2;
case BPF_W:
return 4;
case BPF_DW:
return 8;
}
return 0;
}
static void emit_const_to_reg(struct jit_ctx *ctx, int dst, u64 value)
{
if (value >= 0xffffffffffff8000ull || value < 0x8000ull) {

View file

@ -894,7 +894,6 @@ static int build_insn(const struct bpf_insn *insn, struct jit_ctx *ctx)
const int i = insn - ctx->prog->insnsi;
const s16 off = insn->off;
const s32 imm = insn->imm;
u32 *func;
if (insn->src_reg == BPF_REG_FP)
ctx->saw_frame_pointer = true;

View file

@ -301,9 +301,9 @@ do { \
* jmp *%edx for x86_32
*/
#ifdef CONFIG_RETPOLINE
#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 17
# define RETPOLINE_RAX_BPF_JIT() \
# ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 17
# define RETPOLINE_RAX_BPF_JIT() \
do { \
EMIT1_off32(0xE8, 7); /* callq do_rop */ \
/* spec_trap: */ \
@ -314,8 +314,8 @@ do { \
EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */ \
EMIT1(0xC3); /* retq */ \
} while (0)
#else
# define RETPOLINE_EDX_BPF_JIT() \
# else /* !CONFIG_X86_64 */
# define RETPOLINE_EDX_BPF_JIT() \
do { \
EMIT1_off32(0xE8, 7); /* call do_rop */ \
/* spec_trap: */ \
@ -326,17 +326,16 @@ do { \
EMIT3(0x89, 0x14, 0x24); /* mov %edx,(%esp) */ \
EMIT1(0xC3); /* ret */ \
} while (0)
#endif
# endif
#else /* !CONFIG_RETPOLINE */
#ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 2
# define RETPOLINE_RAX_BPF_JIT() \
EMIT2(0xFF, 0xE0); /* jmp *%rax */
#else
# define RETPOLINE_EDX_BPF_JIT() \
EMIT2(0xFF, 0xE2) /* jmp *%edx */
#endif
# ifdef CONFIG_X86_64
# define RETPOLINE_RAX_BPF_JIT_SIZE 2
# define RETPOLINE_RAX_BPF_JIT() \
EMIT2(0xFF, 0xE0); /* jmp *%rax */
# else /* !CONFIG_X86_64 */
# define RETPOLINE_EDX_BPF_JIT() \
EMIT2(0xFF, 0xE2) /* jmp *%edx */
# endif
#endif
#endif /* _ASM_X86_NOSPEC_BRANCH_H_ */

View file

@ -50,6 +50,7 @@ enum bpf_cap_tlv_type {
NFP_BPF_CAP_TYPE_ADJUST_HEAD = 2,
NFP_BPF_CAP_TYPE_MAPS = 3,
NFP_BPF_CAP_TYPE_RANDOM = 4,
NFP_BPF_CAP_TYPE_QUEUE_SELECT = 5,
};
struct nfp_bpf_cap_tlv_func {

View file

@ -42,6 +42,7 @@
#include "main.h"
#include "../nfp_asm.h"
#include "../nfp_net_ctrl.h"
/* --- NFP prog --- */
/* Foreach "multiple" entries macros provide pos and next<n> pointers.
@ -1470,6 +1471,38 @@ nfp_perf_event_output(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
return 0;
}
static int
nfp_queue_select(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
{
u32 jmp_tgt;
jmp_tgt = nfp_prog_current_offset(nfp_prog) + 5;
/* Make sure the queue id fits into FW field */
emit_alu(nfp_prog, reg_none(), reg_a(meta->insn.src_reg * 2),
ALU_OP_AND_NOT_B, reg_imm(0xff));
emit_br(nfp_prog, BR_BEQ, jmp_tgt, 2);
/* Set the 'queue selected' bit and the queue value */
emit_shf(nfp_prog, pv_qsel_set(nfp_prog),
pv_qsel_set(nfp_prog), SHF_OP_OR, reg_imm(1),
SHF_SC_L_SHF, PKT_VEL_QSEL_SET_BIT);
emit_ld_field(nfp_prog,
pv_qsel_val(nfp_prog), 0x1, reg_b(meta->insn.src_reg * 2),
SHF_SC_NONE, 0);
/* Delay slots end here, we will jump over next instruction if queue
* value fits into the field.
*/
emit_ld_field(nfp_prog,
pv_qsel_val(nfp_prog), 0x1, reg_imm(NFP_NET_RXR_MAX),
SHF_SC_NONE, 0);
if (!nfp_prog_confirm_current_offset(nfp_prog, jmp_tgt))
return -EINVAL;
return 0;
}
/* --- Callbacks --- */
static int mov_reg64(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
{
@ -2160,6 +2193,17 @@ mem_stx_stack(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
false, wrp_lmem_store);
}
static int mem_stx_xdp(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
{
switch (meta->insn.off) {
case offsetof(struct xdp_md, rx_queue_index):
return nfp_queue_select(nfp_prog, meta);
}
WARN_ON_ONCE(1); /* verifier should have rejected bad accesses */
return -EOPNOTSUPP;
}
static int
mem_stx(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
unsigned int size)
@ -2186,6 +2230,9 @@ static int mem_stx2(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
static int mem_stx4(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
{
if (meta->ptr.type == PTR_TO_CTX)
if (nfp_prog->type == BPF_PROG_TYPE_XDP)
return mem_stx_xdp(nfp_prog, meta);
return mem_stx(nfp_prog, meta, 4);
}

View file

@ -334,6 +334,13 @@ nfp_bpf_parse_cap_random(struct nfp_app_bpf *bpf, void __iomem *value,
return 0;
}
static int
nfp_bpf_parse_cap_qsel(struct nfp_app_bpf *bpf, void __iomem *value, u32 length)
{
bpf->queue_select = true;
return 0;
}
static int nfp_bpf_parse_capabilities(struct nfp_app *app)
{
struct nfp_cpp *cpp = app->pf->cpp;
@ -376,6 +383,10 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
if (nfp_bpf_parse_cap_random(app->priv, value, length))
goto err_release_free;
break;
case NFP_BPF_CAP_TYPE_QUEUE_SELECT:
if (nfp_bpf_parse_cap_qsel(app->priv, value, length))
goto err_release_free;
break;
default:
nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
break;

View file

@ -82,10 +82,16 @@ enum static_regs {
enum pkt_vec {
PKT_VEC_PKT_LEN = 0,
PKT_VEC_PKT_PTR = 2,
PKT_VEC_QSEL_SET = 4,
PKT_VEC_QSEL_VAL = 6,
};
#define PKT_VEL_QSEL_SET_BIT 4
#define pv_len(np) reg_lm(1, PKT_VEC_PKT_LEN)
#define pv_ctm_ptr(np) reg_lm(1, PKT_VEC_PKT_PTR)
#define pv_qsel_set(np) reg_lm(1, PKT_VEC_QSEL_SET)
#define pv_qsel_val(np) reg_lm(1, PKT_VEC_QSEL_VAL)
#define stack_reg(np) reg_a(STATIC_REG_STACK)
#define stack_imm(np) imm_b(np)
@ -139,6 +145,7 @@ enum pkt_vec {
* @helpers.perf_event_output: output perf event to a ring buffer
*
* @pseudo_random: FW initialized the pseudo-random machinery (CSRs)
* @queue_select: BPF can set the RX queue ID in packet vector
*/
struct nfp_app_bpf {
struct nfp_app *app;
@ -181,6 +188,7 @@ struct nfp_app_bpf {
} helpers;
bool pseudo_random;
bool queue_select;
};
enum nfp_bpf_map_use {

View file

@ -467,6 +467,30 @@ nfp_bpf_check_ptr(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
return 0;
}
static int
nfp_bpf_check_store(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
struct bpf_verifier_env *env)
{
const struct bpf_reg_state *reg = cur_regs(env) + meta->insn.dst_reg;
if (reg->type == PTR_TO_CTX) {
if (nfp_prog->type == BPF_PROG_TYPE_XDP) {
/* XDP ctx accesses must be 4B in size */
switch (meta->insn.off) {
case offsetof(struct xdp_md, rx_queue_index):
if (nfp_prog->bpf->queue_select)
goto exit_check_ptr;
pr_vlog(env, "queue selection not supported by FW\n");
return -EOPNOTSUPP;
}
}
pr_vlog(env, "unsupported store to context field\n");
return -EOPNOTSUPP;
}
exit_check_ptr:
return nfp_bpf_check_ptr(nfp_prog, meta, env, meta->insn.dst_reg);
}
static int
nfp_bpf_check_xadd(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
struct bpf_verifier_env *env)
@ -522,8 +546,8 @@ nfp_verify_insn(struct bpf_verifier_env *env, int insn_idx, int prev_insn_idx)
return nfp_bpf_check_ptr(nfp_prog, meta, env,
meta->insn.src_reg);
if (is_mbpf_store(meta))
return nfp_bpf_check_ptr(nfp_prog, meta, env,
meta->insn.dst_reg);
return nfp_bpf_check_store(nfp_prog, meta, env);
if (is_mbpf_xadd(meta))
return nfp_bpf_check_xadd(nfp_prog, meta, env);

View file

@ -183,16 +183,18 @@ enum shf_sc {
#define OP_ALU_DST_LMEXTN 0x80000000000ULL
enum alu_op {
ALU_OP_NONE = 0x00,
ALU_OP_ADD = 0x01,
ALU_OP_NOT = 0x04,
ALU_OP_ADD_2B = 0x05,
ALU_OP_AND = 0x08,
ALU_OP_SUB_C = 0x0d,
ALU_OP_ADD_C = 0x11,
ALU_OP_OR = 0x14,
ALU_OP_SUB = 0x15,
ALU_OP_XOR = 0x18,
ALU_OP_NONE = 0x00,
ALU_OP_ADD = 0x01,
ALU_OP_NOT = 0x04,
ALU_OP_ADD_2B = 0x05,
ALU_OP_AND = 0x08,
ALU_OP_AND_NOT_A = 0x0c,
ALU_OP_SUB_C = 0x0d,
ALU_OP_AND_NOT_B = 0x10,
ALU_OP_ADD_C = 0x11,
ALU_OP_OR = 0x14,
ALU_OP_SUB = 0x15,
ALU_OP_XOR = 0x18,
};
enum alu_dst_ab {

View file

@ -627,7 +627,7 @@ bool bpf_offload_dev_match(struct bpf_prog *prog, struct bpf_map *map);
#if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
{
return aux->offload_requested;
}
@ -668,6 +668,7 @@ static inline void bpf_map_offload_map_free(struct bpf_map *map)
#if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_INET)
struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key);
struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key);
int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type);
#else
static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
@ -675,6 +676,12 @@ static inline struct sock *__sock_map_lookup_elem(struct bpf_map *map, u32 key)
return NULL;
}
static inline struct sock *__sock_hash_lookup_elem(struct bpf_map *map,
void *key)
{
return NULL;
}
static inline int sock_map_prog(struct bpf_map *map,
struct bpf_prog *prog,
u32 type)
@ -724,6 +731,7 @@ extern const struct bpf_func_proto bpf_get_current_comm_proto;
extern const struct bpf_func_proto bpf_get_stackid_proto;
extern const struct bpf_func_proto bpf_get_stack_proto;
extern const struct bpf_func_proto bpf_sock_map_update_proto;
extern const struct bpf_func_proto bpf_sock_hash_update_proto;
/* Shared helpers among cBPF and eBPF. */
void bpf_user_rnd_init_once(void);

View file

@ -47,6 +47,7 @@ BPF_MAP_TYPE(BPF_MAP_TYPE_HASH_OF_MAPS, htab_of_maps_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_DEVMAP, dev_map_ops)
#if defined(CONFIG_STREAM_PARSER) && defined(CONFIG_INET)
BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKMAP, sock_map_ops)
BPF_MAP_TYPE(BPF_MAP_TYPE_SOCKHASH, sock_hash_ops)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_CPUMAP, cpu_map_ops)
#if defined(CONFIG_XDP_SOCKETS)

View file

@ -200,8 +200,8 @@ struct bpf_verifier_env {
u32 subprog_cnt;
};
void bpf_verifier_vlog(struct bpf_verifier_log *log, const char *fmt,
va_list args);
__printf(2, 0) void bpf_verifier_vlog(struct bpf_verifier_log *log,
const char *fmt, va_list args);
__printf(2, 3) void bpf_verifier_log_write(struct bpf_verifier_env *env,
const char *fmt, ...);

View file

@ -44,5 +44,7 @@ const struct btf_type *btf_type_id_size(const struct btf *btf,
u32 *ret_size);
void btf_type_seq_show(const struct btf *btf, u32 type_id, void *obj,
struct seq_file *m);
int btf_get_fd_by_id(u32 id);
u32 btf_id(const struct btf *btf);
#endif

View file

@ -515,9 +515,8 @@ struct sk_msg_buff {
int sg_end;
struct scatterlist sg_data[MAX_SKB_FRAGS];
bool sg_copy[MAX_SKB_FRAGS];
__u32 key;
__u32 flags;
struct bpf_map *map;
struct sock *sk_redir;
struct sk_buff *skb;
struct list_head list;
};

View file

@ -223,6 +223,20 @@ struct ipv6_stub {
const struct in6_addr *addr);
int (*ipv6_dst_lookup)(struct net *net, struct sock *sk,
struct dst_entry **dst, struct flowi6 *fl6);
struct fib6_table *(*fib6_get_table)(struct net *net, u32 id);
struct fib6_info *(*fib6_lookup)(struct net *net, int oif,
struct flowi6 *fl6, int flags);
struct fib6_info *(*fib6_table_lookup)(struct net *net,
struct fib6_table *table,
int oif, struct flowi6 *fl6,
int flags);
struct fib6_info *(*fib6_multipath_select)(const struct net *net,
struct fib6_info *f6i,
struct flowi6 *fl6, int oif,
const struct sk_buff *skb,
int strict);
void (*udpv6_encap_enable)(void);
void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
const struct in6_addr *solicited_addr,

View file

@ -376,9 +376,24 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
const struct sk_buff *skb,
int flags, pol_lookup_t lookup);
struct fib6_node *fib6_lookup(struct fib6_node *root,
const struct in6_addr *daddr,
const struct in6_addr *saddr);
/* called with rcu lock held; can return error pointer
* caller needs to select path
*/
struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
int flags);
/* called with rcu lock held; caller needs to select path */
struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6, int strict);
struct fib6_info *fib6_multipath_select(const struct net *net,
struct fib6_info *match,
struct flowi6 *fl6, int oif,
const struct sk_buff *skb, int strict);
struct fib6_node *fib6_node_lookup(struct fib6_node *root,
const struct in6_addr *daddr,
const struct in6_addr *saddr);
struct fib6_node *fib6_locate(struct fib6_node *root,
const struct in6_addr *daddr, int dst_len,

View file

@ -816,9 +816,8 @@ struct tcp_skb_cb {
#endif
} header; /* For incoming skbs */
struct {
__u32 key;
__u32 flags;
struct bpf_map *map;
struct sock *sk_redir;
void *data_end;
} bpf;
};

View file

@ -12,10 +12,10 @@
TRACE_EVENT(fib6_table_lookup,
TP_PROTO(const struct net *net, const struct rt6_info *rt,
TP_PROTO(const struct net *net, const struct fib6_info *f6i,
struct fib6_table *table, const struct flowi6 *flp),
TP_ARGS(net, rt, table, flp),
TP_ARGS(net, f6i, table, flp),
TP_STRUCT__entry(
__field( u32, tb_id )
@ -48,20 +48,20 @@ TRACE_EVENT(fib6_table_lookup,
in6 = (struct in6_addr *)__entry->dst;
*in6 = flp->daddr;
if (rt->rt6i_idev) {
__assign_str(name, rt->rt6i_idev->dev->name);
if (f6i->fib6_nh.nh_dev) {
__assign_str(name, f6i->fib6_nh.nh_dev);
} else {
__assign_str(name, "");
}
if (rt == net->ipv6.ip6_null_entry) {
if (f6i == net->ipv6.fib6_null_entry) {
struct in6_addr in6_zero = {};
in6 = (struct in6_addr *)__entry->gw;
*in6 = in6_zero;
} else if (rt) {
} else if (f6i) {
in6 = (struct in6_addr *)__entry->gw;
*in6 = rt->rt6i_gateway;
*in6 = f6i->fib6_nh.nh_gw;
}
),

View file

@ -96,6 +96,7 @@ enum bpf_cmd {
BPF_PROG_QUERY,
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
};
enum bpf_map_type {
@ -117,6 +118,7 @@ enum bpf_map_type {
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
};
enum bpf_prog_type {
@ -344,6 +346,7 @@ union bpf_attr {
__u32 start_id;
__u32 prog_id;
__u32 map_id;
__u32 btf_id;
};
__u32 next_id;
__u32 open_flags;
@ -1826,6 +1829,79 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
* int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
* Description
* Do FIB lookup in kernel tables using parameters in *params*.
* If lookup is successful and result shows packet is to be
* forwarded, the neighbor tables are searched for the nexthop.
* If successful (ie., FIB lookup shows forwarding and nexthop
* is resolved), the nexthop address is returned in ipv4_dst,
* ipv6_dst or mpls_out based on family, smac is set to mac
* address of egress device, dmac is set to nexthop mac address,
* rt_metric is set to metric from route.
*
* *plen* argument is the size of the passed in struct.
* *flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
*
* **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
* full lookup using FIB rules
* **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
* perspective (default is ingress)
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
*
* Return
* Egress device index on success, 0 if packet needs to continue
* up the stack for further processing or a negative error in case
* of failure.
*
* int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
* Description
* Add an entry to, or update a sockhash *map* referencing sockets.
* The *skops* is used as a new value for the entry associated to
* *key*. *flags* is one of:
*
* **BPF_NOEXIST**
* The entry for *key* must not exist in the map.
* **BPF_EXIST**
* The entry for *key* must already exist in the map.
* **BPF_ANY**
* No condition on the existence of the entry for *key*.
*
* If the *map* has eBPF programs (parser and verdict), those will
* be inherited by the socket being added. If the socket is
* already attached to eBPF programs, this results in an error.
* Return
* 0 on success, or a negative error in case of failure.
*
* int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
* Description
* This helper is used in programs implementing policies at the
* socket level. If the message *msg* is allowed to pass (i.e. if
* the verdict eBPF program returns **SK_PASS**), redirect it to
* the socket referenced by *map* (of type
* **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
* egress interfaces can be used for redirection. The
* **BPF_F_INGRESS** value in *flags* is used to make the
* distinction (ingress path is selected if the flag is present,
* egress path otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
*
* int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
* Description
* This helper is used in programs implementing policies at the
* skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
* if the verdeict eBPF program returns **SK_PASS**), redirect it
* to the socket referenced by *map* (of type
* **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
* egress interfaces can be used for redirection. The
* **BPF_F_INGRESS** value in *flags* is used to make the
* distinction (ingress path is selected if the flag is present,
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@ -1896,7 +1972,11 @@ union bpf_attr {
FN(xdp_adjust_tail), \
FN(skb_get_xfrm_state), \
FN(get_stack), \
FN(skb_load_bytes_relative),
FN(skb_load_bytes_relative), \
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
FN(sk_redirect_hash),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@ -2130,6 +2210,15 @@ struct bpf_map_info {
__u32 ifindex;
__u64 netns_dev;
__u64 netns_ino;
__u32 btf_id;
__u32 btf_key_id;
__u32 btf_value_id;
} __attribute__((aligned(8)));
struct bpf_btf_info {
__aligned_u64 btf;
__u32 btf_size;
__u32 id;
} __attribute__((aligned(8)));
/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
@ -2310,4 +2399,55 @@ struct bpf_raw_tracepoint_args {
__u64 args[0];
};
/* DIRECT: Skip the FIB rules and go to FIB table associated with device
* OUTPUT: Do lookup from egress perspective; default is ingress
*/
#define BPF_FIB_LOOKUP_DIRECT BIT(0)
#define BPF_FIB_LOOKUP_OUTPUT BIT(1)
struct bpf_fib_lookup {
/* input */
__u8 family; /* network family, AF_INET, AF_INET6, AF_MPLS */
/* set if lookup is to consider L4 data - e.g., FIB rules */
__u8 l4_protocol;
__be16 sport;
__be16 dport;
/* total length of packet from network header - used for MTU check */
__u16 tot_len;
__u32 ifindex; /* L3 device index for lookup */
union {
/* inputs to lookup */
__u8 tos; /* AF_INET */
__be32 flowlabel; /* AF_INET6 */
/* output: metric of fib result */
__u32 rt_metric;
};
union {
__be32 mpls_in;
__be32 ipv4_src;
__u32 ipv6_src[4]; /* in6_addr; network order */
};
/* input to bpf_fib_lookup, *dst is destination address.
* output: bpf_fib_lookup sets to gateway address
*/
union {
/* return for MPLS lookups */
__be32 mpls_out[4]; /* support up to 4 labels */
__be32 ipv4_dst;
__u32 ipv6_dst[4]; /* in6_addr; network order */
};
/* output */
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
__u8 smac[6]; /* ETH_ALEN */
__u8 dmac[6]; /* ETH_ALEN */
};
#endif /* _UAPI__LINUX_BPF_H__ */

View file

@ -1391,6 +1391,7 @@ config BPF_SYSCALL
bool "Enable bpf() system call"
select ANON_INODES
select BPF
select IRQ_WORK
default n
help
Enable the bpf() system call that allows to manipulate eBPF

View file

@ -11,6 +11,7 @@
#include <linux/file.h>
#include <linux/uaccess.h>
#include <linux/kernel.h>
#include <linux/idr.h>
#include <linux/bpf_verifier.h>
#include <linux/btf.h>
@ -179,6 +180,9 @@
i < btf_type_vlen(struct_type); \
i++, member++)
static DEFINE_IDR(btf_idr);
static DEFINE_SPINLOCK(btf_idr_lock);
struct btf {
union {
struct btf_header *hdr;
@ -193,6 +197,8 @@ struct btf {
u32 types_size;
u32 data_size;
refcount_t refcnt;
u32 id;
struct rcu_head rcu;
};
enum verifier_phase {
@ -598,6 +604,42 @@ static int btf_add_type(struct btf_verifier_env *env, struct btf_type *t)
return 0;
}
static int btf_alloc_id(struct btf *btf)
{
int id;
idr_preload(GFP_KERNEL);
spin_lock_bh(&btf_idr_lock);
id = idr_alloc_cyclic(&btf_idr, btf, 1, INT_MAX, GFP_ATOMIC);
if (id > 0)
btf->id = id;
spin_unlock_bh(&btf_idr_lock);
idr_preload_end();
if (WARN_ON_ONCE(!id))
return -ENOSPC;
return id > 0 ? 0 : id;
}
static void btf_free_id(struct btf *btf)
{
unsigned long flags;
/*
* In map-in-map, calling map_delete_elem() on outer
* map will call bpf_map_put on the inner map.
* It will then eventually call btf_free_id()
* on the inner map. Some of the map_delete_elem()
* implementation may have irq disabled, so
* we need to use the _irqsave() version instead
* of the _bh() version.
*/
spin_lock_irqsave(&btf_idr_lock, flags);
idr_remove(&btf_idr, btf->id);
spin_unlock_irqrestore(&btf_idr_lock, flags);
}
static void btf_free(struct btf *btf)
{
kvfree(btf->types);
@ -607,15 +649,19 @@ static void btf_free(struct btf *btf)
kfree(btf);
}
static void btf_get(struct btf *btf)
static void btf_free_rcu(struct rcu_head *rcu)
{
refcount_inc(&btf->refcnt);
struct btf *btf = container_of(rcu, struct btf, rcu);
btf_free(btf);
}
void btf_put(struct btf *btf)
{
if (btf && refcount_dec_and_test(&btf->refcnt))
btf_free(btf);
if (btf && refcount_dec_and_test(&btf->refcnt)) {
btf_free_id(btf);
call_rcu(&btf->rcu, btf_free_rcu);
}
}
static int env_resolve_init(struct btf_verifier_env *env)
@ -1977,7 +2023,7 @@ static struct btf *btf_parse(void __user *btf_data, u32 btf_data_size,
if (!err) {
btf_verifier_env_free(env);
btf_get(btf);
refcount_set(&btf->refcnt, 1);
return btf;
}
@ -2006,10 +2052,15 @@ const struct file_operations btf_fops = {
.release = btf_release,
};
static int __btf_new_fd(struct btf *btf)
{
return anon_inode_getfd("btf", &btf_fops, btf, O_RDONLY | O_CLOEXEC);
}
int btf_new_fd(const union bpf_attr *attr)
{
struct btf *btf;
int fd;
int ret;
btf = btf_parse(u64_to_user_ptr(attr->btf),
attr->btf_size, attr->btf_log_level,
@ -2018,12 +2069,23 @@ int btf_new_fd(const union bpf_attr *attr)
if (IS_ERR(btf))
return PTR_ERR(btf);
fd = anon_inode_getfd("btf", &btf_fops, btf,
O_RDONLY | O_CLOEXEC);
if (fd < 0)
ret = btf_alloc_id(btf);
if (ret) {
btf_free(btf);
return ret;
}
/*
* The BTF ID is published to the userspace.
* All BTF free must go through call_rcu() from
* now on (i.e. free by calling btf_put()).
*/
ret = __btf_new_fd(btf);
if (ret < 0)
btf_put(btf);
return fd;
return ret;
}
struct btf *btf_get_by_fd(int fd)
@ -2042,7 +2104,7 @@ struct btf *btf_get_by_fd(int fd)
}
btf = f.file->private_data;
btf_get(btf);
refcount_inc(&btf->refcnt);
fdput(f);
return btf;
@ -2052,13 +2114,55 @@ int btf_get_info_by_fd(const struct btf *btf,
const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
void __user *udata = u64_to_user_ptr(attr->info.info);
u32 copy_len = min_t(u32, btf->data_size,
attr->info.info_len);
struct bpf_btf_info __user *uinfo;
struct bpf_btf_info info = {};
u32 info_copy, btf_copy;
void __user *ubtf;
u32 uinfo_len;
if (copy_to_user(udata, btf->data, copy_len) ||
put_user(btf->data_size, &uattr->info.info_len))
uinfo = u64_to_user_ptr(attr->info.info);
uinfo_len = attr->info.info_len;
info_copy = min_t(u32, uinfo_len, sizeof(info));
if (copy_from_user(&info, uinfo, info_copy))
return -EFAULT;
info.id = btf->id;
ubtf = u64_to_user_ptr(info.btf);
btf_copy = min_t(u32, btf->data_size, info.btf_size);
if (copy_to_user(ubtf, btf->data, btf_copy))
return -EFAULT;
info.btf_size = btf->data_size;
if (copy_to_user(uinfo, &info, info_copy) ||
put_user(info_copy, &uattr->info.info_len))
return -EFAULT;
return 0;
}
int btf_get_fd_by_id(u32 id)
{
struct btf *btf;
int fd;
rcu_read_lock();
btf = idr_find(&btf_idr, id);
if (!btf || !refcount_inc_not_zero(&btf->refcnt))
btf = ERR_PTR(-ENOENT);
rcu_read_unlock();
if (IS_ERR(btf))
return PTR_ERR(btf);
fd = __btf_new_fd(btf);
if (fd < 0)
btf_put(btf);
return fd;
}
u32 btf_id(const struct btf *btf)
{
return btf->id;
}

View file

@ -1707,6 +1707,7 @@ const struct bpf_func_proto bpf_get_current_pid_tgid_proto __weak;
const struct bpf_func_proto bpf_get_current_uid_gid_proto __weak;
const struct bpf_func_proto bpf_get_current_comm_proto __weak;
const struct bpf_func_proto bpf_sock_map_update_proto __weak;
const struct bpf_func_proto bpf_sock_hash_update_proto __weak;
const struct bpf_func_proto * __weak bpf_get_trace_printk_proto(void)
{

View file

@ -48,14 +48,40 @@
#define SOCK_CREATE_FLAG_MASK \
(BPF_F_NUMA_NODE | BPF_F_RDONLY | BPF_F_WRONLY)
struct bpf_stab {
struct bpf_map map;
struct sock **sock_map;
struct bpf_sock_progs {
struct bpf_prog *bpf_tx_msg;
struct bpf_prog *bpf_parse;
struct bpf_prog *bpf_verdict;
};
struct bpf_stab {
struct bpf_map map;
struct sock **sock_map;
struct bpf_sock_progs progs;
};
struct bucket {
struct hlist_head head;
raw_spinlock_t lock;
};
struct bpf_htab {
struct bpf_map map;
struct bucket *buckets;
atomic_t count;
u32 n_buckets;
u32 elem_size;
struct bpf_sock_progs progs;
};
struct htab_elem {
struct rcu_head rcu;
struct hlist_node hash_node;
u32 hash;
struct sock *sk;
char key[0];
};
enum smap_psock_state {
SMAP_TX_RUNNING,
};
@ -63,6 +89,8 @@ enum smap_psock_state {
struct smap_psock_map_entry {
struct list_head list;
struct sock **entry;
struct htab_elem *hash_link;
struct bpf_htab *htab;
};
struct smap_psock {
@ -191,6 +219,12 @@ out:
rcu_read_unlock();
}
static void free_htab_elem(struct bpf_htab *htab, struct htab_elem *l)
{
atomic_dec(&htab->count);
kfree_rcu(l, rcu);
}
static void bpf_tcp_close(struct sock *sk, long timeout)
{
void (*close_fun)(struct sock *sk, long timeout);
@ -227,10 +261,16 @@ static void bpf_tcp_close(struct sock *sk, long timeout)
}
list_for_each_entry_safe(e, tmp, &psock->maps, list) {
osk = cmpxchg(e->entry, sk, NULL);
if (osk == sk) {
list_del(&e->list);
smap_release_sock(psock, sk);
if (e->entry) {
osk = cmpxchg(e->entry, sk, NULL);
if (osk == sk) {
list_del(&e->list);
smap_release_sock(psock, sk);
}
} else {
hlist_del_rcu(&e->hash_link->hash_node);
smap_release_sock(psock, e->hash_link->sk);
free_htab_elem(e->htab, e->hash_link);
}
}
write_unlock_bh(&sk->sk_callback_lock);
@ -461,7 +501,7 @@ static int free_curr_sg(struct sock *sk, struct sk_msg_buff *md)
static int bpf_map_msg_verdict(int _rc, struct sk_msg_buff *md)
{
return ((_rc == SK_PASS) ?
(md->map ? __SK_REDIRECT : __SK_PASS) :
(md->sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP);
}
@ -1092,7 +1132,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
* when we orphan the skb so that we don't have the possibility
* to reference a stale map.
*/
TCP_SKB_CB(skb)->bpf.map = NULL;
TCP_SKB_CB(skb)->bpf.sk_redir = NULL;
skb->sk = psock->sock;
bpf_compute_data_pointers(skb);
preempt_disable();
@ -1102,7 +1142,7 @@ static int smap_verdict_func(struct smap_psock *psock, struct sk_buff *skb)
/* Moving return codes from UAPI namespace into internal namespace */
return rc == SK_PASS ?
(TCP_SKB_CB(skb)->bpf.map ? __SK_REDIRECT : __SK_PASS) :
(TCP_SKB_CB(skb)->bpf.sk_redir ? __SK_REDIRECT : __SK_PASS) :
__SK_DROP;
}
@ -1372,7 +1412,6 @@ static int smap_init_sock(struct smap_psock *psock,
}
static void smap_init_progs(struct smap_psock *psock,
struct bpf_stab *stab,
struct bpf_prog *verdict,
struct bpf_prog *parse)
{
@ -1450,14 +1489,13 @@ static void smap_gc_work(struct work_struct *w)
kfree(psock);
}
static struct smap_psock *smap_init_psock(struct sock *sock,
struct bpf_stab *stab)
static struct smap_psock *smap_init_psock(struct sock *sock, int node)
{
struct smap_psock *psock;
psock = kzalloc_node(sizeof(struct smap_psock),
GFP_ATOMIC | __GFP_NOWARN,
stab->map.numa_node);
node);
if (!psock)
return ERR_PTR(-ENOMEM);
@ -1525,12 +1563,14 @@ free_stab:
return ERR_PTR(err);
}
static void smap_list_remove(struct smap_psock *psock, struct sock **entry)
static void smap_list_remove(struct smap_psock *psock,
struct sock **entry,
struct htab_elem *hash_link)
{
struct smap_psock_map_entry *e, *tmp;
list_for_each_entry_safe(e, tmp, &psock->maps, list) {
if (e->entry == entry) {
if (e->entry == entry || e->hash_link == hash_link) {
list_del(&e->list);
break;
}
@ -1568,7 +1608,7 @@ static void sock_map_free(struct bpf_map *map)
* to be null and queued for garbage collection.
*/
if (likely(psock)) {
smap_list_remove(psock, &stab->sock_map[i]);
smap_list_remove(psock, &stab->sock_map[i], NULL);
smap_release_sock(psock, sock);
}
write_unlock_bh(&sock->sk_callback_lock);
@ -1627,7 +1667,7 @@ static int sock_map_delete_elem(struct bpf_map *map, void *key)
if (psock->bpf_parse)
smap_stop_sock(psock, sock);
smap_list_remove(psock, &stab->sock_map[k]);
smap_list_remove(psock, &stab->sock_map[k], NULL);
smap_release_sock(psock, sock);
out:
write_unlock_bh(&sock->sk_callback_lock);
@ -1662,40 +1702,26 @@ out:
* - sock_map must use READ_ONCE and (cmp)xchg operations
* - BPF verdict/parse programs must use READ_ONCE and xchg operations
*/
static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
struct bpf_map *map,
void *key, u64 flags)
static int __sock_map_ctx_update_elem(struct bpf_map *map,
struct bpf_sock_progs *progs,
struct sock *sock,
struct sock **map_link,
void *key)
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct smap_psock_map_entry *e = NULL;
struct bpf_prog *verdict, *parse, *tx_msg;
struct sock *osock, *sock;
struct smap_psock_map_entry *e = NULL;
struct smap_psock *psock;
u32 i = *(u32 *)key;
bool new = false;
int err;
if (unlikely(flags > BPF_EXIST))
return -EINVAL;
if (unlikely(i >= stab->map.max_entries))
return -E2BIG;
sock = READ_ONCE(stab->sock_map[i]);
if (flags == BPF_EXIST && !sock)
return -ENOENT;
else if (flags == BPF_NOEXIST && sock)
return -EEXIST;
sock = skops->sk;
/* 1. If sock map has BPF programs those will be inherited by the
* sock being added. If the sock is already attached to BPF programs
* this results in an error.
*/
verdict = READ_ONCE(stab->bpf_verdict);
parse = READ_ONCE(stab->bpf_parse);
tx_msg = READ_ONCE(stab->bpf_tx_msg);
verdict = READ_ONCE(progs->bpf_verdict);
parse = READ_ONCE(progs->bpf_parse);
tx_msg = READ_ONCE(progs->bpf_tx_msg);
if (parse && verdict) {
/* bpf prog refcnt may be zero if a concurrent attach operation
@ -1703,11 +1729,11 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
* we increment the refcnt. If this is the case abort with an
* error.
*/
verdict = bpf_prog_inc_not_zero(stab->bpf_verdict);
verdict = bpf_prog_inc_not_zero(progs->bpf_verdict);
if (IS_ERR(verdict))
return PTR_ERR(verdict);
parse = bpf_prog_inc_not_zero(stab->bpf_parse);
parse = bpf_prog_inc_not_zero(progs->bpf_parse);
if (IS_ERR(parse)) {
bpf_prog_put(verdict);
return PTR_ERR(parse);
@ -1715,7 +1741,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
}
if (tx_msg) {
tx_msg = bpf_prog_inc_not_zero(stab->bpf_tx_msg);
tx_msg = bpf_prog_inc_not_zero(progs->bpf_tx_msg);
if (IS_ERR(tx_msg)) {
if (verdict)
bpf_prog_put(verdict);
@ -1748,7 +1774,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
goto out_progs;
}
} else {
psock = smap_init_psock(sock, stab);
psock = smap_init_psock(sock, map->numa_node);
if (IS_ERR(psock)) {
err = PTR_ERR(psock);
goto out_progs;
@ -1758,12 +1784,13 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
new = true;
}
e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
if (!e) {
err = -ENOMEM;
goto out_progs;
if (map_link) {
e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
if (!e) {
err = -ENOMEM;
goto out_progs;
}
}
e->entry = &stab->sock_map[i];
/* 3. At this point we have a reference to a valid psock that is
* running. Attach any BPF programs needed.
@ -1780,7 +1807,7 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
err = smap_init_sock(psock, sock);
if (err)
goto out_free;
smap_init_progs(psock, stab, verdict, parse);
smap_init_progs(psock, verdict, parse);
smap_start_sock(psock, sock);
}
@ -1789,20 +1816,14 @@ static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
* it with. Because we can only have a single set of programs if
* old_sock has a strp we can stop it.
*/
list_add_tail(&e->list, &psock->maps);
write_unlock_bh(&sock->sk_callback_lock);
osock = xchg(&stab->sock_map[i], sock);
if (osock) {
struct smap_psock *opsock = smap_psock_sk(osock);
write_lock_bh(&osock->sk_callback_lock);
smap_list_remove(opsock, &stab->sock_map[i]);
smap_release_sock(opsock, osock);
write_unlock_bh(&osock->sk_callback_lock);
if (map_link) {
e->entry = map_link;
list_add_tail(&e->list, &psock->maps);
}
return 0;
write_unlock_bh(&sock->sk_callback_lock);
return err;
out_free:
kfree(e);
smap_release_sock(psock, sock);
out_progs:
if (verdict)
@ -1816,23 +1837,73 @@ out_progs:
return err;
}
int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
static int sock_map_ctx_update_elem(struct bpf_sock_ops_kern *skops,
struct bpf_map *map,
void *key, u64 flags)
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct bpf_sock_progs *progs = &stab->progs;
struct sock *osock, *sock;
u32 i = *(u32 *)key;
int err;
if (unlikely(flags > BPF_EXIST))
return -EINVAL;
if (unlikely(i >= stab->map.max_entries))
return -E2BIG;
sock = READ_ONCE(stab->sock_map[i]);
if (flags == BPF_EXIST && !sock)
return -ENOENT;
else if (flags == BPF_NOEXIST && sock)
return -EEXIST;
sock = skops->sk;
err = __sock_map_ctx_update_elem(map, progs, sock, &stab->sock_map[i],
key);
if (err)
goto out;
osock = xchg(&stab->sock_map[i], sock);
if (osock) {
struct smap_psock *opsock = smap_psock_sk(osock);
write_lock_bh(&osock->sk_callback_lock);
smap_list_remove(opsock, &stab->sock_map[i], NULL);
smap_release_sock(opsock, osock);
write_unlock_bh(&osock->sk_callback_lock);
}
out:
return err;
}
int sock_map_prog(struct bpf_map *map, struct bpf_prog *prog, u32 type)
{
struct bpf_sock_progs *progs;
struct bpf_prog *orig;
if (unlikely(map->map_type != BPF_MAP_TYPE_SOCKMAP))
if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
progs = &stab->progs;
} else if (map->map_type == BPF_MAP_TYPE_SOCKHASH) {
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
progs = &htab->progs;
} else {
return -EINVAL;
}
switch (type) {
case BPF_SK_MSG_VERDICT:
orig = xchg(&stab->bpf_tx_msg, prog);
orig = xchg(&progs->bpf_tx_msg, prog);
break;
case BPF_SK_SKB_STREAM_PARSER:
orig = xchg(&stab->bpf_parse, prog);
orig = xchg(&progs->bpf_parse, prog);
break;
case BPF_SK_SKB_STREAM_VERDICT:
orig = xchg(&stab->bpf_verdict, prog);
orig = xchg(&progs->bpf_verdict, prog);
break;
default:
return -EOPNOTSUPP;
@ -1880,21 +1951,421 @@ static int sock_map_update_elem(struct bpf_map *map,
static void sock_map_release(struct bpf_map *map)
{
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
struct bpf_sock_progs *progs;
struct bpf_prog *orig;
orig = xchg(&stab->bpf_parse, NULL);
if (map->map_type == BPF_MAP_TYPE_SOCKMAP) {
struct bpf_stab *stab = container_of(map, struct bpf_stab, map);
progs = &stab->progs;
} else {
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
progs = &htab->progs;
}
orig = xchg(&progs->bpf_parse, NULL);
if (orig)
bpf_prog_put(orig);
orig = xchg(&stab->bpf_verdict, NULL);
orig = xchg(&progs->bpf_verdict, NULL);
if (orig)
bpf_prog_put(orig);
orig = xchg(&stab->bpf_tx_msg, NULL);
orig = xchg(&progs->bpf_tx_msg, NULL);
if (orig)
bpf_prog_put(orig);
}
static struct bpf_map *sock_hash_alloc(union bpf_attr *attr)
{
struct bpf_htab *htab;
int i, err;
u64 cost;
if (!capable(CAP_NET_ADMIN))
return ERR_PTR(-EPERM);
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->value_size != 4 ||
attr->map_flags & ~SOCK_CREATE_FLAG_MASK)
return ERR_PTR(-EINVAL);
if (attr->key_size > MAX_BPF_STACK)
/* eBPF programs initialize keys on stack, so they cannot be
* larger than max stack size
*/
return ERR_PTR(-E2BIG);
err = bpf_tcp_ulp_register();
if (err && err != -EEXIST)
return ERR_PTR(err);
htab = kzalloc(sizeof(*htab), GFP_USER);
if (!htab)
return ERR_PTR(-ENOMEM);
bpf_map_init_from_attr(&htab->map, attr);
htab->n_buckets = roundup_pow_of_two(htab->map.max_entries);
htab->elem_size = sizeof(struct htab_elem) +
round_up(htab->map.key_size, 8);
err = -EINVAL;
if (htab->n_buckets == 0 ||
htab->n_buckets > U32_MAX / sizeof(struct bucket))
goto free_htab;
cost = (u64) htab->n_buckets * sizeof(struct bucket) +
(u64) htab->elem_size * htab->map.max_entries;
if (cost >= U32_MAX - PAGE_SIZE)
goto free_htab;
htab->map.pages = round_up(cost, PAGE_SIZE) >> PAGE_SHIFT;
err = bpf_map_precharge_memlock(htab->map.pages);
if (err)
goto free_htab;
err = -ENOMEM;
htab->buckets = bpf_map_area_alloc(
htab->n_buckets * sizeof(struct bucket),
htab->map.numa_node);
if (!htab->buckets)
goto free_htab;
for (i = 0; i < htab->n_buckets; i++) {
INIT_HLIST_HEAD(&htab->buckets[i].head);
raw_spin_lock_init(&htab->buckets[i].lock);
}
return &htab->map;
free_htab:
kfree(htab);
return ERR_PTR(err);
}
static inline struct bucket *__select_bucket(struct bpf_htab *htab, u32 hash)
{
return &htab->buckets[hash & (htab->n_buckets - 1)];
}
static inline struct hlist_head *select_bucket(struct bpf_htab *htab, u32 hash)
{
return &__select_bucket(htab, hash)->head;
}
static void sock_hash_free(struct bpf_map *map)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
int i;
synchronize_rcu();
/* At this point no update, lookup or delete operations can happen.
* However, be aware we can still get a socket state event updates,
* and data ready callabacks that reference the psock from sk_user_data
* Also psock worker threads are still in-flight. So smap_release_sock
* will only free the psock after cancel_sync on the worker threads
* and a grace period expire to ensure psock is really safe to remove.
*/
rcu_read_lock();
for (i = 0; i < htab->n_buckets; i++) {
struct hlist_head *head = select_bucket(htab, i);
struct hlist_node *n;
struct htab_elem *l;
hlist_for_each_entry_safe(l, n, head, hash_node) {
struct sock *sock = l->sk;
struct smap_psock *psock;
hlist_del_rcu(&l->hash_node);
write_lock_bh(&sock->sk_callback_lock);
psock = smap_psock_sk(sock);
/* This check handles a racing sock event that can get
* the sk_callback_lock before this case but after xchg
* causing the refcnt to hit zero and sock user data
* (psock) to be null and queued for garbage collection.
*/
if (likely(psock)) {
smap_list_remove(psock, NULL, l);
smap_release_sock(psock, sock);
}
write_unlock_bh(&sock->sk_callback_lock);
kfree(l);
}
}
rcu_read_unlock();
bpf_map_area_free(htab->buckets);
kfree(htab);
}
static struct htab_elem *alloc_sock_hash_elem(struct bpf_htab *htab,
void *key, u32 key_size, u32 hash,
struct sock *sk,
struct htab_elem *old_elem)
{
struct htab_elem *l_new;
if (atomic_inc_return(&htab->count) > htab->map.max_entries) {
if (!old_elem) {
atomic_dec(&htab->count);
return ERR_PTR(-E2BIG);
}
}
l_new = kmalloc_node(htab->elem_size, GFP_ATOMIC | __GFP_NOWARN,
htab->map.numa_node);
if (!l_new)
return ERR_PTR(-ENOMEM);
memcpy(l_new->key, key, key_size);
l_new->sk = sk;
l_new->hash = hash;
return l_new;
}
static struct htab_elem *lookup_elem_raw(struct hlist_head *head,
u32 hash, void *key, u32 key_size)
{
struct htab_elem *l;
hlist_for_each_entry_rcu(l, head, hash_node) {
if (l->hash == hash && !memcmp(&l->key, key, key_size))
return l;
}
return NULL;
}
static inline u32 htab_map_hash(const void *key, u32 key_len)
{
return jhash(key, key_len, 0);
}
static int sock_hash_get_next_key(struct bpf_map *map,
void *key, void *next_key)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct htab_elem *l, *next_l;
struct hlist_head *h;
u32 hash, key_size;
int i = 0;
WARN_ON_ONCE(!rcu_read_lock_held());
key_size = map->key_size;
if (!key)
goto find_first_elem;
hash = htab_map_hash(key, key_size);
h = select_bucket(htab, hash);
l = lookup_elem_raw(h, hash, key, key_size);
if (!l)
goto find_first_elem;
next_l = hlist_entry_safe(
rcu_dereference_raw(hlist_next_rcu(&l->hash_node)),
struct htab_elem, hash_node);
if (next_l) {
memcpy(next_key, next_l->key, key_size);
return 0;
}
/* no more elements in this hash list, go to the next bucket */
i = hash & (htab->n_buckets - 1);
i++;
find_first_elem:
/* iterate over buckets */
for (; i < htab->n_buckets; i++) {
h = select_bucket(htab, i);
/* pick first element in the bucket */
next_l = hlist_entry_safe(
rcu_dereference_raw(hlist_first_rcu(h)),
struct htab_elem, hash_node);
if (next_l) {
/* if it's not empty, just return it */
memcpy(next_key, next_l->key, key_size);
return 0;
}
}
/* iterated over all buckets and all elements */
return -ENOENT;
}
static int sock_hash_ctx_update_elem(struct bpf_sock_ops_kern *skops,
struct bpf_map *map,
void *key, u64 map_flags)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct bpf_sock_progs *progs = &htab->progs;
struct htab_elem *l_new = NULL, *l_old;
struct smap_psock_map_entry *e = NULL;
struct hlist_head *head;
struct smap_psock *psock;
u32 key_size, hash;
struct sock *sock;
struct bucket *b;
int err;
sock = skops->sk;
if (sock->sk_type != SOCK_STREAM ||
sock->sk_protocol != IPPROTO_TCP)
return -EOPNOTSUPP;
if (unlikely(map_flags > BPF_EXIST))
return -EINVAL;
e = kzalloc(sizeof(*e), GFP_ATOMIC | __GFP_NOWARN);
if (!e)
return -ENOMEM;
WARN_ON_ONCE(!rcu_read_lock_held());
key_size = map->key_size;
hash = htab_map_hash(key, key_size);
b = __select_bucket(htab, hash);
head = &b->head;
err = __sock_map_ctx_update_elem(map, progs, sock, NULL, key);
if (err)
goto err;
/* bpf_map_update_elem() can be called in_irq() */
raw_spin_lock_bh(&b->lock);
l_old = lookup_elem_raw(head, hash, key, key_size);
if (l_old && map_flags == BPF_NOEXIST) {
err = -EEXIST;
goto bucket_err;
}
if (!l_old && map_flags == BPF_EXIST) {
err = -ENOENT;
goto bucket_err;
}
l_new = alloc_sock_hash_elem(htab, key, key_size, hash, sock, l_old);
if (IS_ERR(l_new)) {
err = PTR_ERR(l_new);
goto bucket_err;
}
psock = smap_psock_sk(sock);
if (unlikely(!psock)) {
err = -EINVAL;
goto bucket_err;
}
e->hash_link = l_new;
e->htab = container_of(map, struct bpf_htab, map);
list_add_tail(&e->list, &psock->maps);
/* add new element to the head of the list, so that
* concurrent search will find it before old elem
*/
hlist_add_head_rcu(&l_new->hash_node, head);
if (l_old) {
psock = smap_psock_sk(l_old->sk);
hlist_del_rcu(&l_old->hash_node);
smap_list_remove(psock, NULL, l_old);
smap_release_sock(psock, l_old->sk);
free_htab_elem(htab, l_old);
}
raw_spin_unlock_bh(&b->lock);
return 0;
bucket_err:
raw_spin_unlock_bh(&b->lock);
err:
kfree(e);
psock = smap_psock_sk(sock);
if (psock)
smap_release_sock(psock, sock);
return err;
}
static int sock_hash_update_elem(struct bpf_map *map,
void *key, void *value, u64 flags)
{
struct bpf_sock_ops_kern skops;
u32 fd = *(u32 *)value;
struct socket *socket;
int err;
socket = sockfd_lookup(fd, &err);
if (!socket)
return err;
skops.sk = socket->sk;
if (!skops.sk) {
fput(socket->file);
return -EINVAL;
}
err = sock_hash_ctx_update_elem(&skops, map, key, flags);
fput(socket->file);
return err;
}
static int sock_hash_delete_elem(struct bpf_map *map, void *key)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct hlist_head *head;
struct bucket *b;
struct htab_elem *l;
u32 hash, key_size;
int ret = -ENOENT;
key_size = map->key_size;
hash = htab_map_hash(key, key_size);
b = __select_bucket(htab, hash);
head = &b->head;
raw_spin_lock_bh(&b->lock);
l = lookup_elem_raw(head, hash, key, key_size);
if (l) {
struct sock *sock = l->sk;
struct smap_psock *psock;
hlist_del_rcu(&l->hash_node);
write_lock_bh(&sock->sk_callback_lock);
psock = smap_psock_sk(sock);
/* This check handles a racing sock event that can get the
* sk_callback_lock before this case but after xchg happens
* causing the refcnt to hit zero and sock user data (psock)
* to be null and queued for garbage collection.
*/
if (likely(psock)) {
smap_list_remove(psock, NULL, l);
smap_release_sock(psock, sock);
}
write_unlock_bh(&sock->sk_callback_lock);
free_htab_elem(htab, l);
ret = 0;
}
raw_spin_unlock_bh(&b->lock);
return ret;
}
struct sock *__sock_hash_lookup_elem(struct bpf_map *map, void *key)
{
struct bpf_htab *htab = container_of(map, struct bpf_htab, map);
struct hlist_head *head;
struct htab_elem *l;
u32 key_size, hash;
struct bucket *b;
struct sock *sk;
key_size = map->key_size;
hash = htab_map_hash(key, key_size);
b = __select_bucket(htab, hash);
head = &b->head;
raw_spin_lock_bh(&b->lock);
l = lookup_elem_raw(head, hash, key, key_size);
sk = l ? l->sk : NULL;
raw_spin_unlock_bh(&b->lock);
return sk;
}
const struct bpf_map_ops sock_map_ops = {
.map_alloc = sock_map_alloc,
.map_free = sock_map_free,
@ -1905,6 +2376,15 @@ const struct bpf_map_ops sock_map_ops = {
.map_release_uref = sock_map_release,
};
const struct bpf_map_ops sock_hash_ops = {
.map_alloc = sock_hash_alloc,
.map_free = sock_hash_free,
.map_lookup_elem = sock_map_lookup,
.map_get_next_key = sock_hash_get_next_key,
.map_update_elem = sock_hash_update_elem,
.map_delete_elem = sock_hash_delete_elem,
};
BPF_CALL_4(bpf_sock_map_update, struct bpf_sock_ops_kern *, bpf_sock,
struct bpf_map *, map, void *, key, u64, flags)
{
@ -1922,3 +2402,21 @@ const struct bpf_func_proto bpf_sock_map_update_proto = {
.arg3_type = ARG_PTR_TO_MAP_KEY,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_sock_hash_update, struct bpf_sock_ops_kern *, bpf_sock,
struct bpf_map *, map, void *, key, u64, flags)
{
WARN_ON_ONCE(!rcu_read_lock_held());
return sock_hash_ctx_update_elem(bpf_sock, map, key, flags);
}
const struct bpf_func_proto bpf_sock_hash_update_proto = {
.func = bpf_sock_hash_update,
.gpl_only = false,
.pkt_access = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_PTR_TO_MAP_KEY,
.arg4_type = ARG_ANYTHING,
};

View file

@ -11,6 +11,7 @@
#include <linux/perf_event.h>
#include <linux/elf.h>
#include <linux/pagemap.h>
#include <linux/irq_work.h>
#include "percpu_freelist.h"
#define STACK_CREATE_FLAG_MASK \
@ -32,6 +33,23 @@ struct bpf_stack_map {
struct stack_map_bucket *buckets[];
};
/* irq_work to run up_read() for build_id lookup in nmi context */
struct stack_map_irq_work {
struct irq_work irq_work;
struct rw_semaphore *sem;
};
static void do_up_read(struct irq_work *entry)
{
struct stack_map_irq_work *work;
work = container_of(entry, struct stack_map_irq_work, irq_work);
up_read(work->sem);
work->sem = NULL;
}
static DEFINE_PER_CPU(struct stack_map_irq_work, up_read_work);
static inline bool stack_map_use_build_id(struct bpf_map *map)
{
return (map->map_flags & BPF_F_STACK_BUILD_ID);
@ -267,17 +285,27 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
{
int i;
struct vm_area_struct *vma;
bool in_nmi_ctx = in_nmi();
bool irq_work_busy = false;
struct stack_map_irq_work *work;
if (in_nmi_ctx) {
work = this_cpu_ptr(&up_read_work);
if (work->irq_work.flags & IRQ_WORK_BUSY)
/* cannot queue more up_read, fallback */
irq_work_busy = true;
}
/*
* We cannot do up_read() in nmi context, so build_id lookup is
* only supported for non-nmi events. If at some point, it is
* possible to run find_vma() without taking the semaphore, we
* would like to allow build_id lookup in nmi context.
* We cannot do up_read() in nmi context. To do build_id lookup
* in nmi context, we need to run up_read() in irq_work. We use
* a percpu variable to do the irq_work. If the irq_work is
* already used by another lookup, we fall back to report ips.
*
* Same fallback is used for kernel stack (!user) on a stackmap
* with build_id.
*/
if (!user || !current || !current->mm || in_nmi() ||
if (!user || !current || !current->mm || irq_work_busy ||
down_read_trylock(&current->mm->mmap_sem) == 0) {
/* cannot access current->mm, fall back to ips */
for (i = 0; i < trace_nr; i++) {
@ -299,7 +327,13 @@ static void stack_map_get_build_id_offset(struct bpf_stack_build_id *id_offs,
- vma->vm_start;
id_offs[i].status = BPF_STACK_BUILD_ID_VALID;
}
up_read(&current->mm->mmap_sem);
if (!in_nmi_ctx) {
up_read(&current->mm->mmap_sem);
} else {
work->sem = &current->mm->mmap_sem;
irq_work_queue(&work->irq_work);
}
}
BPF_CALL_3(bpf_get_stackid, struct pt_regs *, regs, struct bpf_map *, map,
@ -575,3 +609,16 @@ const struct bpf_map_ops stack_map_ops = {
.map_update_elem = stack_map_update_elem,
.map_delete_elem = stack_map_delete_elem,
};
static int __init stack_map_init(void)
{
int cpu;
struct stack_map_irq_work *work;
for_each_possible_cpu(cpu) {
work = per_cpu_ptr(&up_read_work, cpu);
init_irq_work(&work->irq_work, do_up_read);
}
return 0;
}
subsys_initcall(stack_map_init);

View file

@ -255,7 +255,6 @@ static void bpf_map_free_deferred(struct work_struct *work)
bpf_map_uncharge_memlock(map);
security_bpf_map_free(map);
btf_put(map->btf);
/* implementation dependent freeing */
map->ops->map_free(map);
}
@ -276,6 +275,7 @@ static void __bpf_map_put(struct bpf_map *map, bool do_idr_lock)
if (atomic_dec_and_test(&map->refcnt)) {
/* bpf_map_free_id() must be called first */
bpf_map_free_id(map, do_idr_lock);
btf_put(map->btf);
INIT_WORK(&map->work, bpf_map_free_deferred);
schedule_work(&map->work);
}
@ -2011,6 +2011,12 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
info.map_flags = map->map_flags;
memcpy(info.name, map->name, sizeof(map->name));
if (map->btf) {
info.btf_id = btf_id(map->btf);
info.btf_key_id = map->btf_key_id;
info.btf_value_id = map->btf_value_id;
}
if (bpf_map_is_dev_bound(map)) {
err = bpf_map_offload_info_fill(&info, map);
if (err)
@ -2024,6 +2030,21 @@ static int bpf_map_get_info_by_fd(struct bpf_map *map,
return 0;
}
static int bpf_btf_get_info_by_fd(struct btf *btf,
const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
struct bpf_btf_info __user *uinfo = u64_to_user_ptr(attr->info.info);
u32 info_len = attr->info.info_len;
int err;
err = check_uarg_tail_zero(uinfo, sizeof(*uinfo), info_len);
if (err)
return err;
return btf_get_info_by_fd(btf, attr, uattr);
}
#define BPF_OBJ_GET_INFO_BY_FD_LAST_FIELD info.info
static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
@ -2047,7 +2068,7 @@ static int bpf_obj_get_info_by_fd(const union bpf_attr *attr,
err = bpf_map_get_info_by_fd(f.file->private_data, attr,
uattr);
else if (f.file->f_op == &btf_fops)
err = btf_get_info_by_fd(f.file->private_data, attr, uattr);
err = bpf_btf_get_info_by_fd(f.file->private_data, attr, uattr);
else
err = -EINVAL;
@ -2068,6 +2089,19 @@ static int bpf_btf_load(const union bpf_attr *attr)
return btf_new_fd(attr);
}
#define BPF_BTF_GET_FD_BY_ID_LAST_FIELD btf_id
static int bpf_btf_get_fd_by_id(const union bpf_attr *attr)
{
if (CHECK_ATTR(BPF_BTF_GET_FD_BY_ID))
return -EINVAL;
if (!capable(CAP_SYS_ADMIN))
return -EPERM;
return btf_get_fd_by_id(attr->btf_id);
}
SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size)
{
union bpf_attr attr = {};
@ -2151,6 +2185,9 @@ SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, siz
case BPF_BTF_LOAD:
err = bpf_btf_load(&attr);
break;
case BPF_BTF_GET_FD_BY_ID:
err = bpf_btf_get_fd_by_id(&attr);
break;
default:
err = -EINVAL;
break;

View file

@ -2093,6 +2093,13 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
func_id != BPF_FUNC_msg_redirect_map)
goto error;
break;
case BPF_MAP_TYPE_SOCKHASH:
if (func_id != BPF_FUNC_sk_redirect_hash &&
func_id != BPF_FUNC_sock_hash_update &&
func_id != BPF_FUNC_map_delete_elem &&
func_id != BPF_FUNC_msg_redirect_hash)
goto error;
break;
default:
break;
}
@ -2130,11 +2137,14 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
break;
case BPF_FUNC_sk_redirect_map:
case BPF_FUNC_msg_redirect_map:
case BPF_FUNC_sock_map_update:
if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
goto error;
break;
case BPF_FUNC_sock_map_update:
if (map->map_type != BPF_MAP_TYPE_SOCKMAP)
case BPF_FUNC_sk_redirect_hash:
case BPF_FUNC_msg_redirect_hash:
case BPF_FUNC_sock_hash_update:
if (map->map_type != BPF_MAP_TYPE_SOCKHASH)
goto error;
break;
default:
@ -5215,7 +5225,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
}
}
if (!ops->convert_ctx_access)
if (!ops->convert_ctx_access || bpf_prog_is_dev_bound(env->prog->aux))
return 0;
insn = env->prog->insnsi + delta;

View file

@ -60,6 +60,10 @@
#include <net/xfrm.h>
#include <linux/bpf_trace.h>
#include <net/xdp_sock.h>
#include <linux/inetdevice.h>
#include <net/ip_fib.h>
#include <net/flow.h>
#include <net/arp.h>
/**
* sk_filter_trim_cap - run a packet through a socket filter
@ -2070,6 +2074,33 @@ static const struct bpf_func_proto bpf_redirect_proto = {
.arg2_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_sk_redirect_hash, struct sk_buff *, skb,
struct bpf_map *, map, void *, key, u64, flags)
{
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
/* If user passes invalid input drop the packet. */
if (unlikely(flags & ~(BPF_F_INGRESS)))
return SK_DROP;
tcb->bpf.flags = flags;
tcb->bpf.sk_redir = __sock_hash_lookup_elem(map, key);
if (!tcb->bpf.sk_redir)
return SK_DROP;
return SK_PASS;
}
static const struct bpf_func_proto bpf_sk_redirect_hash_proto = {
.func = bpf_sk_redirect_hash,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_PTR_TO_MAP_KEY,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
struct bpf_map *, map, u32, key, u64, flags)
{
@ -2079,9 +2110,10 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
if (unlikely(flags & ~(BPF_F_INGRESS)))
return SK_DROP;
tcb->bpf.key = key;
tcb->bpf.flags = flags;
tcb->bpf.map = map;
tcb->bpf.sk_redir = __sock_map_lookup_elem(map, key);
if (!tcb->bpf.sk_redir)
return SK_DROP;
return SK_PASS;
}
@ -2089,16 +2121,8 @@ BPF_CALL_4(bpf_sk_redirect_map, struct sk_buff *, skb,
struct sock *do_sk_redirect_map(struct sk_buff *skb)
{
struct tcp_skb_cb *tcb = TCP_SKB_CB(skb);
struct sock *sk = NULL;
if (tcb->bpf.map) {
sk = __sock_map_lookup_elem(tcb->bpf.map, tcb->bpf.key);
tcb->bpf.key = 0;
tcb->bpf.map = NULL;
}
return sk;
return tcb->bpf.sk_redir;
}
static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
@ -2111,6 +2135,31 @@ static const struct bpf_func_proto bpf_sk_redirect_map_proto = {
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_msg_redirect_hash, struct sk_msg_buff *, msg,
struct bpf_map *, map, void *, key, u64, flags)
{
/* If user passes invalid input drop the packet. */
if (unlikely(flags & ~(BPF_F_INGRESS)))
return SK_DROP;
msg->flags = flags;
msg->sk_redir = __sock_hash_lookup_elem(map, key);
if (!msg->sk_redir)
return SK_DROP;
return SK_PASS;
}
static const struct bpf_func_proto bpf_msg_redirect_hash_proto = {
.func = bpf_msg_redirect_hash,
.gpl_only = false,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_CONST_MAP_PTR,
.arg3_type = ARG_PTR_TO_MAP_KEY,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
struct bpf_map *, map, u32, key, u64, flags)
{
@ -2118,25 +2167,17 @@ BPF_CALL_4(bpf_msg_redirect_map, struct sk_msg_buff *, msg,
if (unlikely(flags & ~(BPF_F_INGRESS)))
return SK_DROP;
msg->key = key;
msg->flags = flags;
msg->map = map;
msg->sk_redir = __sock_map_lookup_elem(map, key);
if (!msg->sk_redir)
return SK_DROP;
return SK_PASS;
}
struct sock *do_msg_redirect_map(struct sk_msg_buff *msg)
{
struct sock *sk = NULL;
if (msg->map) {
sk = __sock_map_lookup_elem(msg->map, msg->key);
msg->key = 0;
msg->map = NULL;
}
return sk;
return msg->sk_redir;
}
static const struct bpf_func_proto bpf_msg_redirect_map_proto = {
@ -4032,6 +4073,265 @@ static const struct bpf_func_proto bpf_skb_get_xfrm_state_proto = {
};
#endif
#if IS_ENABLED(CONFIG_INET) || IS_ENABLED(CONFIG_IPV6)
static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
const struct neighbour *neigh,
const struct net_device *dev)
{
memcpy(params->dmac, neigh->ha, ETH_ALEN);
memcpy(params->smac, dev->dev_addr, ETH_ALEN);
params->h_vlan_TCI = 0;
params->h_vlan_proto = 0;
return dev->ifindex;
}
#endif
#if IS_ENABLED(CONFIG_INET)
static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
u32 flags)
{
struct in_device *in_dev;
struct neighbour *neigh;
struct net_device *dev;
struct fib_result res;
struct fib_nh *nh;
struct flowi4 fl4;
int err;
dev = dev_get_by_index_rcu(net, params->ifindex);
if (unlikely(!dev))
return -ENODEV;
/* verify forwarding is enabled on this interface */
in_dev = __in_dev_get_rcu(dev);
if (unlikely(!in_dev || !IN_DEV_FORWARD(in_dev)))
return 0;
if (flags & BPF_FIB_LOOKUP_OUTPUT) {
fl4.flowi4_iif = 1;
fl4.flowi4_oif = params->ifindex;
} else {
fl4.flowi4_iif = params->ifindex;
fl4.flowi4_oif = 0;
}
fl4.flowi4_tos = params->tos & IPTOS_RT_MASK;
fl4.flowi4_scope = RT_SCOPE_UNIVERSE;
fl4.flowi4_flags = 0;
fl4.flowi4_proto = params->l4_protocol;
fl4.daddr = params->ipv4_dst;
fl4.saddr = params->ipv4_src;
fl4.fl4_sport = params->sport;
fl4.fl4_dport = params->dport;
if (flags & BPF_FIB_LOOKUP_DIRECT) {
u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
struct fib_table *tb;
tb = fib_get_table(net, tbid);
if (unlikely(!tb))
return 0;
err = fib_table_lookup(tb, &fl4, &res, FIB_LOOKUP_NOREF);
} else {
fl4.flowi4_mark = 0;
fl4.flowi4_secid = 0;
fl4.flowi4_tun_key.tun_id = 0;
fl4.flowi4_uid = sock_net_uid(net, NULL);
err = fib_lookup(net, &fl4, &res, FIB_LOOKUP_NOREF);
}
if (err || res.type != RTN_UNICAST)
return 0;
if (res.fi->fib_nhs > 1)
fib_select_path(net, &res, &fl4, NULL);
nh = &res.fi->fib_nh[res.nh_sel];
/* do not handle lwt encaps right now */
if (nh->nh_lwtstate)
return 0;
dev = nh->nh_dev;
if (unlikely(!dev))
return 0;
if (nh->nh_gw)
params->ipv4_dst = nh->nh_gw;
params->rt_metric = res.fi->fib_priority;
/* xdp and cls_bpf programs are run in RCU-bh so
* rcu_read_lock_bh is not needed here
*/
neigh = __ipv4_neigh_lookup_noref(dev, (__force u32)params->ipv4_dst);
if (neigh)
return bpf_fib_set_fwd_params(params, neigh, dev);
return 0;
}
#endif
#if IS_ENABLED(CONFIG_IPV6)
static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
u32 flags)
{
struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
struct neighbour *neigh;
struct net_device *dev;
struct inet6_dev *idev;
struct fib6_info *f6i;
struct flowi6 fl6;
int strict = 0;
int oif;
/* link local addresses are never forwarded */
if (rt6_need_strict(dst) || rt6_need_strict(src))
return 0;
dev = dev_get_by_index_rcu(net, params->ifindex);
if (unlikely(!dev))
return -ENODEV;
idev = __in6_dev_get_safely(dev);
if (unlikely(!idev || !net->ipv6.devconf_all->forwarding))
return 0;
if (flags & BPF_FIB_LOOKUP_OUTPUT) {
fl6.flowi6_iif = 1;
oif = fl6.flowi6_oif = params->ifindex;
} else {
oif = fl6.flowi6_iif = params->ifindex;
fl6.flowi6_oif = 0;
strict = RT6_LOOKUP_F_HAS_SADDR;
}
fl6.flowlabel = params->flowlabel;
fl6.flowi6_scope = 0;
fl6.flowi6_flags = 0;
fl6.mp_hash = 0;
fl6.flowi6_proto = params->l4_protocol;
fl6.daddr = *dst;
fl6.saddr = *src;
fl6.fl6_sport = params->sport;
fl6.fl6_dport = params->dport;
if (flags & BPF_FIB_LOOKUP_DIRECT) {
u32 tbid = l3mdev_fib_table_rcu(dev) ? : RT_TABLE_MAIN;
struct fib6_table *tb;
tb = ipv6_stub->fib6_get_table(net, tbid);
if (unlikely(!tb))
return 0;
f6i = ipv6_stub->fib6_table_lookup(net, tb, oif, &fl6, strict);
} else {
fl6.flowi6_mark = 0;
fl6.flowi6_secid = 0;
fl6.flowi6_tun_key.tun_id = 0;
fl6.flowi6_uid = sock_net_uid(net, NULL);
f6i = ipv6_stub->fib6_lookup(net, oif, &fl6, strict);
}
if (unlikely(IS_ERR_OR_NULL(f6i) || f6i == net->ipv6.fib6_null_entry))
return 0;
if (unlikely(f6i->fib6_flags & RTF_REJECT ||
f6i->fib6_type != RTN_UNICAST))
return 0;
if (f6i->fib6_nsiblings && fl6.flowi6_oif == 0)
f6i = ipv6_stub->fib6_multipath_select(net, f6i, &fl6,
fl6.flowi6_oif, NULL,
strict);
if (f6i->fib6_nh.nh_lwtstate)
return 0;
if (f6i->fib6_flags & RTF_GATEWAY)
*dst = f6i->fib6_nh.nh_gw;
dev = f6i->fib6_nh.nh_dev;
params->rt_metric = f6i->fib6_metric;
/* xdp and cls_bpf programs are run in RCU-bh so rcu_read_lock_bh is
* not needed here. Can not use __ipv6_neigh_lookup_noref here
* because we need to get nd_tbl via the stub
*/
neigh = ___neigh_lookup_noref(ipv6_stub->nd_tbl, neigh_key_eq128,
ndisc_hashfn, dst, dev);
if (neigh)
return bpf_fib_set_fwd_params(params, neigh, dev);
return 0;
}
#endif
BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
{
if (plen < sizeof(*params))
return -EINVAL;
switch (params->family) {
#if IS_ENABLED(CONFIG_INET)
case AF_INET:
return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
flags);
#endif
#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
flags);
#endif
}
return 0;
}
static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
.func = bpf_xdp_fib_lookup,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
.arg4_type = ARG_ANYTHING,
};
BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
struct bpf_fib_lookup *, params, int, plen, u32, flags)
{
if (plen < sizeof(*params))
return -EINVAL;
switch (params->family) {
#if IS_ENABLED(CONFIG_INET)
case AF_INET:
return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
#endif
#if IS_ENABLED(CONFIG_IPV6)
case AF_INET6:
return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
#endif
}
return -ENOTSUPP;
}
static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
.func = bpf_skb_fib_lookup,
.gpl_only = true,
.ret_type = RET_INTEGER,
.arg1_type = ARG_PTR_TO_CTX,
.arg2_type = ARG_PTR_TO_MEM,
.arg3_type = ARG_CONST_SIZE,
.arg4_type = ARG_ANYTHING,
};
static const struct bpf_func_proto *
bpf_base_func_proto(enum bpf_func_id func_id)
{
@ -4181,6 +4481,8 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
case BPF_FUNC_skb_get_xfrm_state:
return &bpf_skb_get_xfrm_state_proto;
#endif
case BPF_FUNC_fib_lookup:
return &bpf_skb_fib_lookup_proto;
default:
return bpf_base_func_proto(func_id);
}
@ -4206,6 +4508,8 @@ xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_xdp_redirect_map_proto;
case BPF_FUNC_xdp_adjust_tail:
return &bpf_xdp_adjust_tail_proto;
case BPF_FUNC_fib_lookup:
return &bpf_xdp_fib_lookup_proto;
default:
return bpf_base_func_proto(func_id);
}
@ -4250,6 +4554,8 @@ sock_ops_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_sock_ops_cb_flags_set_proto;
case BPF_FUNC_sock_map_update:
return &bpf_sock_map_update_proto;
case BPF_FUNC_sock_hash_update:
return &bpf_sock_hash_update_proto;
default:
return bpf_base_func_proto(func_id);
}
@ -4261,6 +4567,8 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
switch (func_id) {
case BPF_FUNC_msg_redirect_map:
return &bpf_msg_redirect_map_proto;
case BPF_FUNC_msg_redirect_hash:
return &bpf_msg_redirect_hash_proto;
case BPF_FUNC_msg_apply_bytes:
return &bpf_msg_apply_bytes_proto;
case BPF_FUNC_msg_cork_bytes:
@ -4292,6 +4600,8 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
return &bpf_get_socket_uid_proto;
case BPF_FUNC_sk_redirect_map:
return &bpf_sk_redirect_map_proto;
case BPF_FUNC_sk_redirect_hash:
return &bpf_sk_redirect_hash_proto;
default:
return bpf_base_func_proto(func_id);
}
@ -4645,8 +4955,15 @@ static bool xdp_is_valid_access(int off, int size,
const struct bpf_prog *prog,
struct bpf_insn_access_aux *info)
{
if (type == BPF_WRITE)
if (type == BPF_WRITE) {
if (bpf_prog_is_dev_bound(prog->aux)) {
switch (off) {
case offsetof(struct xdp_md, rx_queue_index):
return __is_valid_xdp_access(off, size);
}
}
return false;
}
switch (off) {
case offsetof(struct xdp_md, data):

View file

@ -134,8 +134,39 @@ static int eafnosupport_ipv6_dst_lookup(struct net *net, struct sock *u1,
return -EAFNOSUPPORT;
}
static struct fib6_table *eafnosupport_fib6_get_table(struct net *net, u32 id)
{
return NULL;
}
static struct fib6_info *
eafnosupport_fib6_table_lookup(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6, int flags)
{
return NULL;
}
static struct fib6_info *
eafnosupport_fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
int flags)
{
return NULL;
}
static struct fib6_info *
eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i,
struct flowi6 *fl6, int oif,
const struct sk_buff *skb, int strict)
{
return f6i;
}
const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
.ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
.ipv6_dst_lookup = eafnosupport_ipv6_dst_lookup,
.fib6_get_table = eafnosupport_fib6_get_table,
.fib6_table_lookup = eafnosupport_fib6_table_lookup,
.fib6_lookup = eafnosupport_fib6_lookup,
.fib6_multipath_select = eafnosupport_fib6_multipath_select,
};
EXPORT_SYMBOL_GPL(ipv6_stub);

View file

@ -889,7 +889,11 @@ static struct pernet_operations inet6_net_ops = {
static const struct ipv6_stub ipv6_stub_impl = {
.ipv6_sock_mc_join = ipv6_sock_mc_join,
.ipv6_sock_mc_drop = ipv6_sock_mc_drop,
.ipv6_dst_lookup = ip6_dst_lookup,
.ipv6_dst_lookup = ip6_dst_lookup,
.fib6_get_table = fib6_get_table,
.fib6_table_lookup = fib6_table_lookup,
.fib6_lookup = fib6_lookup,
.fib6_multipath_select = fib6_multipath_select,
.udpv6_encap_enable = udpv6_encap_enable,
.ndisc_send_na = ndisc_send_na,
.nd_tbl = &nd_tbl,

View file

@ -60,6 +60,39 @@ unsigned int fib6_rules_seq_read(struct net *net)
return fib_rules_seq_read(net, AF_INET6);
}
/* called with rcu lock held; no reference taken on fib6_info */
struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
int flags)
{
struct fib6_info *f6i;
int err;
if (net->ipv6.fib6_has_custom_rules) {
struct fib_lookup_arg arg = {
.lookup_ptr = fib6_table_lookup,
.lookup_data = &oif,
.flags = FIB_LOOKUP_NOREF,
};
l3mdev_update_flow(net, flowi6_to_flowi(fl6));
err = fib_rules_lookup(net->ipv6.fib6_rules_ops,
flowi6_to_flowi(fl6), flags, &arg);
if (err)
return ERR_PTR(err);
f6i = arg.result ? : net->ipv6.fib6_null_entry;
} else {
f6i = fib6_table_lookup(net, net->ipv6.fib6_local_tbl,
oif, fl6, flags);
if (!f6i || f6i == net->ipv6.fib6_null_entry)
f6i = fib6_table_lookup(net, net->ipv6.fib6_main_tbl,
oif, fl6, flags);
}
return f6i;
}
struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
const struct sk_buff *skb,
int flags, pol_lookup_t lookup)
@ -96,8 +129,73 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
return &net->ipv6.ip6_null_entry->dst;
}
static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
int flags, struct fib_lookup_arg *arg)
static int fib6_rule_saddr(struct net *net, struct fib_rule *rule, int flags,
struct flowi6 *flp6, const struct net_device *dev)
{
struct fib6_rule *r = (struct fib6_rule *)rule;
/* If we need to find a source address for this traffic,
* we check the result if it meets requirement of the rule.
*/
if ((rule->flags & FIB_RULE_FIND_SADDR) &&
r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
struct in6_addr saddr;
if (ipv6_dev_get_saddr(net, dev, &flp6->daddr,
rt6_flags2srcprefs(flags), &saddr))
return -EAGAIN;
if (!ipv6_prefix_equal(&saddr, &r->src.addr, r->src.plen))
return -EAGAIN;
flp6->saddr = saddr;
}
return 0;
}
static int fib6_rule_action_alt(struct fib_rule *rule, struct flowi *flp,
int flags, struct fib_lookup_arg *arg)
{
struct flowi6 *flp6 = &flp->u.ip6;
struct net *net = rule->fr_net;
struct fib6_table *table;
struct fib6_info *f6i;
int err = -EAGAIN, *oif;
u32 tb_id;
switch (rule->action) {
case FR_ACT_TO_TBL:
break;
case FR_ACT_UNREACHABLE:
return -ENETUNREACH;
case FR_ACT_PROHIBIT:
return -EACCES;
case FR_ACT_BLACKHOLE:
default:
return -EINVAL;
}
tb_id = fib_rule_get_table(rule, arg);
table = fib6_get_table(net, tb_id);
if (!table)
return -EAGAIN;
oif = (int *)arg->lookup_data;
f6i = fib6_table_lookup(net, table, *oif, flp6, flags);
if (f6i != net->ipv6.fib6_null_entry) {
err = fib6_rule_saddr(net, rule, flags, flp6,
fib6_info_nh_dev(f6i));
if (likely(!err))
arg->result = f6i;
}
return err;
}
static int __fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
int flags, struct fib_lookup_arg *arg)
{
struct flowi6 *flp6 = &flp->u.ip6;
struct rt6_info *rt = NULL;
@ -134,27 +232,12 @@ static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
rt = lookup(net, table, flp6, arg->lookup_data, flags);
if (rt != net->ipv6.ip6_null_entry) {
struct fib6_rule *r = (struct fib6_rule *)rule;
err = fib6_rule_saddr(net, rule, flags, flp6,
ip6_dst_idev(&rt->dst)->dev);
/*
* If we need to find a source address for this traffic,
* we check the result if it meets requirement of the rule.
*/
if ((rule->flags & FIB_RULE_FIND_SADDR) &&
r->src.plen && !(flags & RT6_LOOKUP_F_HAS_SADDR)) {
struct in6_addr saddr;
if (err == -EAGAIN)
goto again;
if (ipv6_dev_get_saddr(net,
ip6_dst_idev(&rt->dst)->dev,
&flp6->daddr,
rt6_flags2srcprefs(flags),
&saddr))
goto again;
if (!ipv6_prefix_equal(&saddr, &r->src.addr,
r->src.plen))
goto again;
flp6->saddr = saddr;
}
err = rt->dst.error;
if (err != -EAGAIN)
goto out;
@ -172,6 +255,15 @@ out:
return err;
}
static int fib6_rule_action(struct fib_rule *rule, struct flowi *flp,
int flags, struct fib_lookup_arg *arg)
{
if (arg->lookup_ptr == fib6_table_lookup)
return fib6_rule_action_alt(rule, flp, flags, arg);
return __fib6_rule_action(rule, flp, flags, arg);
}
static bool fib6_rule_suppress(struct fib_rule *rule, struct fib_lookup_arg *arg)
{
struct rt6_info *rt = (struct rt6_info *) arg->result;

View file

@ -354,6 +354,13 @@ struct dst_entry *fib6_rule_lookup(struct net *net, struct flowi6 *fl6,
return &rt->dst;
}
/* called with rcu lock held; no reference taken on fib6_info */
struct fib6_info *fib6_lookup(struct net *net, int oif, struct flowi6 *fl6,
int flags)
{
return fib6_table_lookup(net, net->ipv6.fib6_main_tbl, oif, fl6, flags);
}
static void __net_init fib6_tables_init(struct net *net)
{
fib6_link_table(net, net->ipv6.fib6_main_tbl);
@ -1354,8 +1361,8 @@ struct lookup_args {
const struct in6_addr *addr; /* search key */
};
static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
struct lookup_args *args)
static struct fib6_node *fib6_node_lookup_1(struct fib6_node *root,
struct lookup_args *args)
{
struct fib6_node *fn;
__be32 dir;
@ -1400,7 +1407,8 @@ static struct fib6_node *fib6_lookup_1(struct fib6_node *root,
#ifdef CONFIG_IPV6_SUBTREES
if (subtree) {
struct fib6_node *sfn;
sfn = fib6_lookup_1(subtree, args + 1);
sfn = fib6_node_lookup_1(subtree,
args + 1);
if (!sfn)
goto backtrack;
fn = sfn;
@ -1422,8 +1430,9 @@ backtrack:
/* called with rcu_read_lock() held
*/
struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *daddr,
const struct in6_addr *saddr)
struct fib6_node *fib6_node_lookup(struct fib6_node *root,
const struct in6_addr *daddr,
const struct in6_addr *saddr)
{
struct fib6_node *fn;
struct lookup_args args[] = {
@ -1442,7 +1451,7 @@ struct fib6_node *fib6_lookup(struct fib6_node *root, const struct in6_addr *dad
}
};
fn = fib6_lookup_1(root, daddr ? args : args + 1);
fn = fib6_node_lookup_1(root, daddr ? args : args + 1);
if (!fn || fn->fn_flags & RTN_TL_ROOT)
fn = root;

View file

@ -419,11 +419,11 @@ static bool rt6_check_expired(const struct rt6_info *rt)
return false;
}
static struct fib6_info *rt6_multipath_select(const struct net *net,
struct fib6_info *match,
struct flowi6 *fl6, int oif,
const struct sk_buff *skb,
int strict)
struct fib6_info *fib6_multipath_select(const struct net *net,
struct fib6_info *match,
struct flowi6 *fl6, int oif,
const struct sk_buff *skb,
int strict)
{
struct fib6_info *sibling, *next_sibling;
@ -1006,7 +1006,7 @@ static struct fib6_node* fib6_backtrack(struct fib6_node *fn,
pn = rcu_dereference(fn->parent);
sn = FIB6_SUBTREE(pn);
if (sn && sn != fn)
fn = fib6_lookup(sn, NULL, saddr);
fn = fib6_node_lookup(sn, NULL, saddr);
else
fn = pn;
if (fn->fn_flags & RTN_RTINFO)
@ -1059,7 +1059,7 @@ static struct rt6_info *ip6_pol_route_lookup(struct net *net,
flags &= ~RT6_LOOKUP_F_IFACE;
rcu_read_lock();
fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
restart:
f6i = rcu_dereference(fn->leaf);
if (!f6i) {
@ -1068,8 +1068,9 @@ restart:
f6i = rt6_device_match(net, f6i, &fl6->saddr,
fl6->flowi6_oif, flags);
if (f6i->fib6_nsiblings && fl6->flowi6_oif == 0)
f6i = rt6_multipath_select(net, f6i, fl6,
fl6->flowi6_oif, skb, flags);
f6i = fib6_multipath_select(net, f6i, fl6,
fl6->flowi6_oif, skb,
flags);
}
if (f6i == net->ipv6.fib6_null_entry) {
fn = fib6_backtrack(fn, &fl6->saddr);
@ -1077,6 +1078,8 @@ restart:
goto restart;
}
trace_fib6_table_lookup(net, f6i, table, fl6);
/* Search through exception table */
rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr);
if (rt) {
@ -1095,8 +1098,6 @@ restart:
rcu_read_unlock();
trace_fib6_table_lookup(net, rt, table, fl6);
return rt;
}
@ -1799,23 +1800,14 @@ void rt6_age_exceptions(struct fib6_info *rt,
rcu_read_unlock_bh();
}
struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6,
const struct sk_buff *skb, int flags)
/* must be called with rcu lock held */
struct fib6_info *fib6_table_lookup(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6, int strict)
{
struct fib6_node *fn, *saved_fn;
struct fib6_info *f6i;
struct rt6_info *rt;
int strict = 0;
strict |= flags & RT6_LOOKUP_F_IFACE;
strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
if (net->ipv6.devconf_all->forwarding == 0)
strict |= RT6_LOOKUP_F_REACHABLE;
rcu_read_lock();
fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
saved_fn = fn;
if (fl6->flowi6_flags & FLOWI_FLAG_SKIP_NH_OIF)
@ -1823,8 +1815,6 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
redo_rt6_select:
f6i = rt6_select(net, fn, oif, strict);
if (f6i->fib6_nsiblings)
f6i = rt6_multipath_select(net, f6i, fl6, oif, skb, strict);
if (f6i == net->ipv6.fib6_null_entry) {
fn = fib6_backtrack(fn, &fl6->saddr);
if (fn)
@ -1837,11 +1827,34 @@ redo_rt6_select:
}
}
trace_fib6_table_lookup(net, f6i, table, fl6);
return f6i;
}
struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
int oif, struct flowi6 *fl6,
const struct sk_buff *skb, int flags)
{
struct fib6_info *f6i;
struct rt6_info *rt;
int strict = 0;
strict |= flags & RT6_LOOKUP_F_IFACE;
strict |= flags & RT6_LOOKUP_F_IGNORE_LINKSTATE;
if (net->ipv6.devconf_all->forwarding == 0)
strict |= RT6_LOOKUP_F_REACHABLE;
rcu_read_lock();
f6i = fib6_table_lookup(net, table, oif, fl6, strict);
if (f6i->fib6_nsiblings)
f6i = fib6_multipath_select(net, f6i, fl6, oif, skb, strict);
if (f6i == net->ipv6.fib6_null_entry) {
rt = net->ipv6.ip6_null_entry;
rcu_read_unlock();
dst_hold(&rt->dst);
trace_fib6_table_lookup(net, rt, table, fl6);
return rt;
}
@ -1852,7 +1865,6 @@ redo_rt6_select:
dst_use_noref(&rt->dst, jiffies);
rcu_read_unlock();
trace_fib6_table_lookup(net, rt, table, fl6);
return rt;
} else if (unlikely((fl6->flowi6_flags & FLOWI_FLAG_KNOWN_NH) &&
!(f6i->fib6_flags & RTF_GATEWAY))) {
@ -1878,9 +1890,7 @@ redo_rt6_select:
dst_hold(&uncached_rt->dst);
}
trace_fib6_table_lookup(net, uncached_rt, table, fl6);
return uncached_rt;
} else {
/* Get a percpu copy */
@ -1894,7 +1904,7 @@ redo_rt6_select:
local_bh_enable();
rcu_read_unlock();
trace_fib6_table_lookup(net, pcpu_rt, table, fl6);
return pcpu_rt;
}
}
@ -2425,7 +2435,7 @@ static struct rt6_info *__ip6_route_redirect(struct net *net,
*/
rcu_read_lock();
fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
fn = fib6_node_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
restart:
for_each_fib6_node_rt_rcu(fn) {
if (rt->fib6_nh.nh_flags & RTNH_F_DEAD)
@ -2479,7 +2489,7 @@ out:
rcu_read_unlock();
trace_fib6_table_lookup(net, ret, table, fl6);
trace_fib6_table_lookup(net, rt, table, fl6);
return ret;
};

View file

@ -209,7 +209,7 @@ int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
if ((addr + size) < addr)
return -EINVAL;
nframes = size / frame_size;
nframes = (unsigned int)div_u64(size, frame_size);
if (nframes == 0 || nframes > UINT_MAX)
return -EINVAL;

View file

@ -1,4 +1,8 @@
# SPDX-License-Identifier: GPL-2.0
BPF_SAMPLES_PATH ?= $(abspath $(srctree)/$(src))
TOOLS_PATH := $(BPF_SAMPLES_PATH)/../../tools
# List of programs to build
hostprogs-y := test_lru_dist
hostprogs-y += sock_example
@ -46,60 +50,61 @@ hostprogs-y += syscall_tp
hostprogs-y += cpustat
hostprogs-y += xdp_adjust_tail
hostprogs-y += xdpsock
hostprogs-y += xdp_fwd
# Libbpf dependencies
LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
LIBBPF = $(TOOLS_PATH)/lib/bpf/libbpf.a
CGROUP_HELPERS := ../../tools/testing/selftests/bpf/cgroup_helpers.o
TRACE_HELPERS := ../../tools/testing/selftests/bpf/trace_helpers.o
test_lru_dist-objs := test_lru_dist.o $(LIBBPF)
sock_example-objs := sock_example.o $(LIBBPF)
fds_example-objs := bpf_load.o $(LIBBPF) fds_example.o
sockex1-objs := bpf_load.o $(LIBBPF) sockex1_user.o
sockex2-objs := bpf_load.o $(LIBBPF) sockex2_user.o
sockex3-objs := bpf_load.o $(LIBBPF) sockex3_user.o
tracex1-objs := bpf_load.o $(LIBBPF) tracex1_user.o
tracex2-objs := bpf_load.o $(LIBBPF) tracex2_user.o
tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o $(TRACE_HELPERS)
lathist-objs := bpf_load.o $(LIBBPF) lathist_user.o
offwaketime-objs := bpf_load.o $(LIBBPF) offwaketime_user.o $(TRACE_HELPERS)
spintest-objs := bpf_load.o $(LIBBPF) spintest_user.o $(TRACE_HELPERS)
map_perf_test-objs := bpf_load.o $(LIBBPF) map_perf_test_user.o
test_overhead-objs := bpf_load.o $(LIBBPF) test_overhead_user.o
test_cgrp2_array_pin-objs := $(LIBBPF) test_cgrp2_array_pin.o
test_cgrp2_attach-objs := $(LIBBPF) test_cgrp2_attach.o
test_cgrp2_attach2-objs := $(LIBBPF) test_cgrp2_attach2.o $(CGROUP_HELPERS)
test_cgrp2_sock-objs := $(LIBBPF) test_cgrp2_sock.o
test_cgrp2_sock2-objs := bpf_load.o $(LIBBPF) test_cgrp2_sock2.o
xdp1-objs := bpf_load.o $(LIBBPF) xdp1_user.o
fds_example-objs := bpf_load.o fds_example.o
sockex1-objs := bpf_load.o sockex1_user.o
sockex2-objs := bpf_load.o sockex2_user.o
sockex3-objs := bpf_load.o sockex3_user.o
tracex1-objs := bpf_load.o tracex1_user.o
tracex2-objs := bpf_load.o tracex2_user.o
tracex3-objs := bpf_load.o tracex3_user.o
tracex4-objs := bpf_load.o tracex4_user.o
tracex5-objs := bpf_load.o tracex5_user.o
tracex6-objs := bpf_load.o tracex6_user.o
tracex7-objs := bpf_load.o tracex7_user.o
load_sock_ops-objs := bpf_load.o load_sock_ops.o
test_probe_write_user-objs := bpf_load.o test_probe_write_user_user.o
trace_output-objs := bpf_load.o trace_output_user.o $(TRACE_HELPERS)
lathist-objs := bpf_load.o lathist_user.o
offwaketime-objs := bpf_load.o offwaketime_user.o $(TRACE_HELPERS)
spintest-objs := bpf_load.o spintest_user.o $(TRACE_HELPERS)
map_perf_test-objs := bpf_load.o map_perf_test_user.o
test_overhead-objs := bpf_load.o test_overhead_user.o
test_cgrp2_array_pin-objs := test_cgrp2_array_pin.o
test_cgrp2_attach-objs := test_cgrp2_attach.o
test_cgrp2_attach2-objs := test_cgrp2_attach2.o $(CGROUP_HELPERS)
test_cgrp2_sock-objs := test_cgrp2_sock.o
test_cgrp2_sock2-objs := bpf_load.o test_cgrp2_sock2.o
xdp1-objs := xdp1_user.o
# reuse xdp1 source intentionally
xdp2-objs := bpf_load.o $(LIBBPF) xdp1_user.o
xdp_router_ipv4-objs := bpf_load.o $(LIBBPF) xdp_router_ipv4_user.o
test_current_task_under_cgroup-objs := bpf_load.o $(LIBBPF) $(CGROUP_HELPERS) \
xdp2-objs := xdp1_user.o
xdp_router_ipv4-objs := bpf_load.o xdp_router_ipv4_user.o
test_current_task_under_cgroup-objs := bpf_load.o $(CGROUP_HELPERS) \
test_current_task_under_cgroup_user.o
trace_event-objs := bpf_load.o $(LIBBPF) trace_event_user.o $(TRACE_HELPERS)
sampleip-objs := bpf_load.o $(LIBBPF) sampleip_user.o $(TRACE_HELPERS)
tc_l2_redirect-objs := bpf_load.o $(LIBBPF) tc_l2_redirect_user.o
lwt_len_hist-objs := bpf_load.o $(LIBBPF) lwt_len_hist_user.o
xdp_tx_iptunnel-objs := bpf_load.o $(LIBBPF) xdp_tx_iptunnel_user.o
test_map_in_map-objs := bpf_load.o $(LIBBPF) test_map_in_map_user.o
per_socket_stats_example-objs := $(LIBBPF) cookie_uid_helper_example.o
xdp_redirect-objs := bpf_load.o $(LIBBPF) xdp_redirect_user.o
xdp_redirect_map-objs := bpf_load.o $(LIBBPF) xdp_redirect_map_user.o
xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
cpustat-objs := bpf_load.o $(LIBBPF) cpustat_user.o
xdp_adjust_tail-objs := bpf_load.o $(LIBBPF) xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o $(LIBBPF) xdpsock_user.o
trace_event-objs := bpf_load.o trace_event_user.o $(TRACE_HELPERS)
sampleip-objs := bpf_load.o sampleip_user.o $(TRACE_HELPERS)
tc_l2_redirect-objs := bpf_load.o tc_l2_redirect_user.o
lwt_len_hist-objs := bpf_load.o lwt_len_hist_user.o
xdp_tx_iptunnel-objs := bpf_load.o xdp_tx_iptunnel_user.o
test_map_in_map-objs := bpf_load.o test_map_in_map_user.o
per_socket_stats_example-objs := cookie_uid_helper_example.o
xdp_redirect-objs := bpf_load.o xdp_redirect_user.o
xdp_redirect_map-objs := bpf_load.o xdp_redirect_map_user.o
xdp_redirect_cpu-objs := bpf_load.o xdp_redirect_cpu_user.o
xdp_monitor-objs := bpf_load.o xdp_monitor_user.o
xdp_rxq_info-objs := xdp_rxq_info_user.o
syscall_tp-objs := bpf_load.o syscall_tp_user.o
cpustat-objs := bpf_load.o cpustat_user.o
xdp_adjust_tail-objs := xdp_adjust_tail_user.o
xdpsock-objs := bpf_load.o xdpsock_user.o
xdp_fwd-objs := bpf_load.o xdp_fwd_user.o
# Tell kbuild to always build the programs
always := $(hostprogs-y)
@ -154,6 +159,7 @@ always += syscall_tp_kern.o
always += cpustat_kern.o
always += xdp_adjust_tail_kern.o
always += xdpsock_kern.o
always += xdp_fwd_kern.o
HOSTCFLAGS += -I$(objtree)/usr/include
HOSTCFLAGS += -I$(srctree)/tools/lib/
@ -162,45 +168,20 @@ HOSTCFLAGS += -I$(srctree)/tools/lib/ -I$(srctree)/tools/include
HOSTCFLAGS += -I$(srctree)/tools/perf
HOSTCFLAGS_bpf_load.o += -I$(objtree)/usr/include -Wno-unused-variable
HOSTLOADLIBES_fds_example += -lelf
HOSTLOADLIBES_sockex1 += -lelf
HOSTLOADLIBES_sockex2 += -lelf
HOSTLOADLIBES_sockex3 += -lelf
HOSTLOADLIBES_tracex1 += -lelf
HOSTLOADLIBES_tracex2 += -lelf
HOSTLOADLIBES_tracex3 += -lelf
HOSTLOADLIBES_tracex4 += -lelf -lrt
HOSTLOADLIBES_tracex5 += -lelf
HOSTLOADLIBES_tracex6 += -lelf
HOSTLOADLIBES_tracex7 += -lelf
HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
HOSTLOADLIBES_load_sock_ops += -lelf
HOSTLOADLIBES_test_probe_write_user += -lelf
HOSTLOADLIBES_trace_output += -lelf -lrt
HOSTLOADLIBES_lathist += -lelf
HOSTLOADLIBES_offwaketime += -lelf
HOSTLOADLIBES_spintest += -lelf
HOSTLOADLIBES_map_perf_test += -lelf -lrt
HOSTLOADLIBES_test_overhead += -lelf -lrt
HOSTLOADLIBES_xdp1 += -lelf
HOSTLOADLIBES_xdp2 += -lelf
HOSTLOADLIBES_xdp_router_ipv4 += -lelf
HOSTLOADLIBES_test_current_task_under_cgroup += -lelf
HOSTLOADLIBES_trace_event += -lelf
HOSTLOADLIBES_sampleip += -lelf
HOSTLOADLIBES_tc_l2_redirect += -l elf
HOSTLOADLIBES_lwt_len_hist += -l elf
HOSTLOADLIBES_xdp_tx_iptunnel += -lelf
HOSTLOADLIBES_test_map_in_map += -lelf
HOSTLOADLIBES_xdp_redirect += -lelf
HOSTLOADLIBES_xdp_redirect_map += -lelf
HOSTLOADLIBES_xdp_redirect_cpu += -lelf
HOSTLOADLIBES_xdp_monitor += -lelf
HOSTLOADLIBES_xdp_rxq_info += -lelf
HOSTLOADLIBES_syscall_tp += -lelf
HOSTLOADLIBES_cpustat += -lelf
HOSTLOADLIBES_xdp_adjust_tail += -lelf
HOSTLOADLIBES_xdpsock += -lelf -pthread
HOSTCFLAGS_trace_helpers.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_trace_output_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_offwaketime_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_spintest_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_trace_event_user.o += -I$(srctree)/tools/lib/bpf/
HOSTCFLAGS_sampleip_user.o += -I$(srctree)/tools/lib/bpf/
HOST_LOADLIBES += $(LIBBPF) -lelf
HOSTLOADLIBES_tracex4 += -lrt
HOSTLOADLIBES_trace_output += -lrt
HOSTLOADLIBES_map_perf_test += -lrt
HOSTLOADLIBES_test_overhead += -lrt
HOSTLOADLIBES_xdpsock += -pthread
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
# make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
@ -214,15 +195,16 @@ CLANG_ARCH_ARGS = -target $(ARCH)
endif
# Trick to allow make to be run from this directory
all: $(LIBBPF)
$(MAKE) -C ../../ $(CURDIR)/
all:
$(MAKE) -C ../../ $(CURDIR)/ BPF_SAMPLES_PATH=$(CURDIR)
clean:
$(MAKE) -C ../../ M=$(CURDIR) clean
@rm -f *~
$(LIBBPF): FORCE
$(MAKE) -C $(dir $@) $(notdir $@)
# Fix up variables inherited from Kbuild that tools/ build system won't like
$(MAKE) -C $(dir $@) RM='rm -rf' LDFLAGS= srctree=$(BPF_SAMPLES_PATH)/../../ O=
$(obj)/syscall_nrs.s: $(src)/syscall_nrs.c
$(call if_changed_dep,cc_s_c)
@ -253,7 +235,8 @@ verify_target_bpf: verify_cmds
exit 2; \
else true; fi
$(src)/*.c: verify_target_bpf
$(BPF_SAMPLES_PATH)/*.c: verify_target_bpf $(LIBBPF)
$(src)/*.c: verify_target_bpf $(LIBBPF)
$(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
@ -261,7 +244,8 @@ $(obj)/tracex5_kern.o: $(obj)/syscall_nrs.h
# But, there is no easy way to fix it, so just exclude it since it is
# useless for BPF samples.
$(obj)/%.o: $(src)/%.c
$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
@echo " CLANG-bpf " $@
$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
-I$(srctree)/tools/testing/selftests/bpf/ \
-D__KERNEL__ -Wno-unused-value -Wno-pointer-sign \
-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \

View file

@ -1,9 +1,7 @@
/* SPDX-License-Identifier: GPL-2.0 */
/* eBPF mini library */
#ifndef __LIBBPF_H
#define __LIBBPF_H
#include <bpf/bpf.h>
/* eBPF instruction mini library */
#ifndef __BPF_INSN_H
#define __BPF_INSN_H
struct bpf_insn;

View file

@ -24,7 +24,7 @@
#include <poll.h>
#include <ctype.h>
#include <assert.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "perf-sys.h"
@ -420,7 +420,7 @@ static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
/* Keeping compatible with ELF maps section changes
* ------------------------------------------------
* The program size of struct bpf_map_def is known by loader
* The program size of struct bpf_load_map_def is known by loader
* code, but struct stored in ELF file can be different.
*
* Unfortunately sym[i].st_size is zero. To calculate the
@ -429,7 +429,7 @@ static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
* symbols.
*/
map_sz_elf = data_maps->d_size / nr_maps;
map_sz_copy = sizeof(struct bpf_map_def);
map_sz_copy = sizeof(struct bpf_load_map_def);
if (map_sz_elf < map_sz_copy) {
/*
* Backward compat, loading older ELF file with
@ -448,8 +448,8 @@ static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
/* Memcpy relevant part of ELF maps data to loader maps */
for (i = 0; i < nr_maps; i++) {
struct bpf_load_map_def *def;
unsigned char *addr, *end;
struct bpf_map_def *def;
const char *map_name;
size_t offset;
@ -464,9 +464,9 @@ static int load_elf_maps_section(struct bpf_map_data *maps, int maps_shndx,
/* Symbol value is offset into ELF maps section data area */
offset = sym[i].st_value;
def = (struct bpf_map_def *)(data_maps->d_buf + offset);
def = (struct bpf_load_map_def *)(data_maps->d_buf + offset);
maps[i].elf_offset = offset;
memset(&maps[i].def, 0, sizeof(struct bpf_map_def));
memset(&maps[i].def, 0, sizeof(struct bpf_load_map_def));
memcpy(&maps[i].def, def, map_sz_copy);
/* Verify no newer features were requested */

View file

@ -2,12 +2,12 @@
#ifndef __BPF_LOAD_H
#define __BPF_LOAD_H
#include "libbpf.h"
#include <bpf/bpf.h>
#define MAX_MAPS 32
#define MAX_PROGS 32
struct bpf_map_def {
struct bpf_load_map_def {
unsigned int type;
unsigned int key_size;
unsigned int value_size;
@ -21,7 +21,7 @@ struct bpf_map_data {
int fd;
char *name;
size_t elf_offset;
struct bpf_map_def def;
struct bpf_load_map_def def;
};
typedef void (*fixup_map_cb)(struct bpf_map_data *map, int idx);

View file

@ -51,7 +51,7 @@
#include <sys/types.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include "libbpf.h"
#include "bpf_insn.h"
#define PORT 8888

View file

@ -17,7 +17,7 @@
#include <sys/resource.h>
#include <sys/wait.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#define MAX_CPU 8

View file

@ -12,8 +12,10 @@
#include <sys/types.h>
#include <sys/socket.h>
#include <bpf/bpf.h>
#include "bpf_insn.h"
#include "bpf_load.h"
#include "libbpf.h"
#include "sock_example.h"
#define BPF_F_PIN (1 << 0)

View file

@ -10,7 +10,7 @@
#include <stdlib.h>
#include <signal.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#define MAX_ENTRIES 20

View file

@ -8,7 +8,7 @@
#include <stdlib.h>
#include <string.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include <unistd.h>
#include <errno.h>

View file

@ -9,7 +9,7 @@
#include <errno.h>
#include <arpa/inet.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_util.h"
#define MAX_INDEX 64

View file

@ -21,7 +21,7 @@
#include <arpa/inet.h>
#include <errno.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#define TEST_BIT(t) (1U << (t))

View file

@ -26,7 +26,8 @@
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <stddef.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_insn.h"
#include "sock_example.h"
char bpf_log_buf[BPF_LOG_BUF_SIZE];

View file

@ -9,7 +9,6 @@
#include <net/if.h>
#include <linux/if_packet.h>
#include <arpa/inet.h>
#include "libbpf.h"
static inline int open_raw_sock(const char *name)
{

View file

@ -2,7 +2,7 @@
#include <stdio.h>
#include <assert.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "sock_example.h"
#include <unistd.h>

View file

@ -2,7 +2,7 @@
#include <stdio.h>
#include <assert.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "sock_example.h"
#include <unistd.h>

View file

@ -2,7 +2,7 @@
#include <stdio.h>
#include <assert.h>
#include <linux/bpf.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "sock_example.h"
#include <unistd.h>

View file

@ -16,7 +16,7 @@
#include <assert.h>
#include <stdbool.h>
#include <sys/resource.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
/* This program verifies bpf attachment to tracepoint sys_enter_* and sys_exit_*.

View file

@ -13,7 +13,7 @@
#include <string.h>
#include <errno.h>
#include "libbpf.h"
#include <bpf/bpf.h>
static void usage(void)
{

View file

@ -14,7 +14,7 @@
#include <errno.h>
#include <fcntl.h>
#include "libbpf.h"
#include <bpf/bpf.h>
static void usage(void)
{

View file

@ -28,8 +28,9 @@
#include <fcntl.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include "libbpf.h"
#include "bpf_insn.h"
enum {
MAP_KEY_PACKETS,

View file

@ -24,8 +24,9 @@
#include <unistd.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include "libbpf.h"
#include "bpf_insn.h"
#include "cgroup_helpers.h"
#define FOO "/foo"

View file

@ -21,8 +21,9 @@
#include <net/if.h>
#include <inttypes.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include "libbpf.h"
#include "bpf_insn.h"
char bpf_log_buf[BPF_LOG_BUF_SIZE];

View file

@ -19,8 +19,9 @@
#include <fcntl.h>
#include <net/if.h>
#include <linux/bpf.h>
#include <bpf/bpf.h>
#include "libbpf.h"
#include "bpf_insn.h"
#include "bpf_load.h"
static int usage(const char *argv0)

View file

@ -9,7 +9,7 @@
#include <stdio.h>
#include <linux/bpf.h>
#include <unistd.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include <linux/bpf.h>
#include "cgroup_helpers.h"

View file

@ -21,7 +21,7 @@
#include <stdlib.h>
#include <time.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_util.h"
#define min(a, b) ((a) < (b) ? (a) : (b))

View file

@ -13,7 +13,7 @@
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#define PORT_A (map_fd[0])

View file

@ -19,7 +19,7 @@
#include <string.h>
#include <time.h>
#include <sys/resource.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#define MAX_CNT 1000000

View file

@ -3,7 +3,7 @@
#include <assert.h>
#include <linux/bpf.h>
#include <unistd.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include <sys/socket.h>
#include <string.h>

View file

@ -18,7 +18,7 @@
#include <sys/mman.h>
#include <time.h>
#include <signal.h>
#include "libbpf.h"
#include <libbpf.h>
#include "bpf_load.h"
#include "perf-sys.h"
#include "trace_helpers.h"
@ -48,7 +48,7 @@ static int print_bpf_output(void *data, int size)
if (e->cookie != 0x12345678) {
printf("BUG pid %llx cookie %llx sized %d\n",
e->pid, e->cookie, size);
return PERF_EVENT_ERROR;
return LIBBPF_PERF_EVENT_ERROR;
}
cnt++;
@ -56,10 +56,10 @@ static int print_bpf_output(void *data, int size)
if (cnt == MAX_CNT) {
printf("recv %lld events per sec\n",
MAX_CNT * 1000000000ll / (time_get_ns() - start_time));
return PERF_EVENT_DONE;
return LIBBPF_PERF_EVENT_DONE;
}
return PERF_EVENT_CONT;
return LIBBPF_PERF_EVENT_CONT;
}
static void test_bpf_perf_event(void)

View file

@ -2,7 +2,7 @@
#include <stdio.h>
#include <linux/bpf.h>
#include <unistd.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
int main(int ac, char **argv)

View file

@ -7,7 +7,7 @@
#include <string.h>
#include <sys/resource.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "bpf_util.h"

View file

@ -13,7 +13,7 @@
#include <linux/bpf.h>
#include <sys/resource.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "bpf_util.h"

View file

@ -14,7 +14,7 @@
#include <linux/bpf.h>
#include <sys/resource.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
struct pair {

View file

@ -5,7 +5,7 @@
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include <sys/resource.h>

View file

@ -16,7 +16,7 @@
#include <unistd.h>
#include "bpf_load.h"
#include "libbpf.h"
#include <bpf/bpf.h>
#include "perf-sys.h"
#define SAMPLE_PERIOD 0x7fffffffffffffffULL

View file

@ -3,7 +3,7 @@
#include <stdio.h>
#include <linux/bpf.h>
#include <unistd.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
int main(int argc, char **argv)

View file

@ -16,9 +16,9 @@
#include <libgen.h>
#include <sys/resource.h>
#include "bpf_load.h"
#include "bpf_util.h"
#include "libbpf.h"
#include "bpf/bpf.h"
#include "bpf/libbpf.h"
static int ifindex;
static __u32 xdp_flags;
@ -31,7 +31,7 @@ static void int_exit(int sig)
/* simple per-protocol drop counter
*/
static void poll_stats(int interval)
static void poll_stats(int map_fd, int interval)
{
unsigned int nr_cpus = bpf_num_possible_cpus();
const unsigned int nr_keys = 256;
@ -47,7 +47,7 @@ static void poll_stats(int interval)
for (key = 0; key < nr_keys; key++) {
__u64 sum = 0;
assert(bpf_map_lookup_elem(map_fd[0], &key, values) == 0);
assert(bpf_map_lookup_elem(map_fd, &key, values) == 0);
for (i = 0; i < nr_cpus; i++)
sum += (values[i] - prev[key][i]);
if (sum)
@ -71,9 +71,14 @@ static void usage(const char *prog)
int main(int argc, char **argv)
{
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
struct bpf_prog_load_attr prog_load_attr = {
.prog_type = BPF_PROG_TYPE_XDP,
};
const char *optstr = "SN";
int prog_fd, map_fd, opt;
struct bpf_object *obj;
struct bpf_map *map;
char filename[256];
int opt;
while ((opt = getopt(argc, argv, optstr)) != -1) {
switch (opt) {
@ -102,13 +107,19 @@ int main(int argc, char **argv)
ifindex = strtoul(argv[optind], NULL, 0);
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
prog_load_attr.file = filename;
if (load_bpf_file(filename)) {
printf("%s", bpf_log_buf);
if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
return 1;
map = bpf_map__next(NULL, obj);
if (!map) {
printf("finding a map in obj file failed\n");
return 1;
}
map_fd = bpf_map__fd(map);
if (!prog_fd[0]) {
if (!prog_fd) {
printf("load_bpf_file: %s\n", strerror(errno));
return 1;
}
@ -116,12 +127,12 @@ int main(int argc, char **argv)
signal(SIGINT, int_exit);
signal(SIGTERM, int_exit);
if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
printf("link set xdp fd failed\n");
return 1;
}
poll_stats(2);
poll_stats(map_fd, 2);
return 0;
}

View file

@ -18,9 +18,8 @@
#include <netinet/ether.h>
#include <unistd.h>
#include <time.h>
#include "bpf_load.h"
#include "libbpf.h"
#include "bpf_util.h"
#include "bpf/bpf.h"
#include "bpf/libbpf.h"
#define STATS_INTERVAL_S 2U
@ -36,7 +35,7 @@ static void int_exit(int sig)
/* simple "icmp packet too big sent" counter
*/
static void poll_stats(unsigned int kill_after_s)
static void poll_stats(unsigned int map_fd, unsigned int kill_after_s)
{
time_t started_at = time(NULL);
__u64 value = 0;
@ -46,7 +45,7 @@ static void poll_stats(unsigned int kill_after_s)
while (!kill_after_s || time(NULL) - started_at <= kill_after_s) {
sleep(STATS_INTERVAL_S);
assert(bpf_map_lookup_elem(map_fd[0], &key, &value) == 0);
assert(bpf_map_lookup_elem(map_fd, &key, &value) == 0);
printf("icmp \"packet too big\" sent: %10llu pkts\n", value);
}
@ -66,14 +65,17 @@ static void usage(const char *cmd)
int main(int argc, char **argv)
{
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
struct bpf_prog_load_attr prog_load_attr = {
.prog_type = BPF_PROG_TYPE_XDP,
};
unsigned char opt_flags[256] = {};
unsigned int kill_after_s = 0;
const char *optstr = "i:T:SNh";
struct rlimit r = {RLIM_INFINITY, RLIM_INFINITY};
int i, prog_fd, map_fd, opt;
struct bpf_object *obj;
struct bpf_map *map;
char filename[256];
int opt;
int i;
for (i = 0; i < strlen(optstr); i++)
if (optstr[i] != 'h' && 'a' <= optstr[i] && optstr[i] <= 'z')
@ -115,13 +117,19 @@ int main(int argc, char **argv)
}
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
prog_load_attr.file = filename;
if (load_bpf_file(filename)) {
printf("%s", bpf_log_buf);
if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
return 1;
map = bpf_map__next(NULL, obj);
if (!map) {
printf("finding a map in obj file failed\n");
return 1;
}
map_fd = bpf_map__fd(map);
if (!prog_fd[0]) {
if (!prog_fd) {
printf("load_bpf_file: %s\n", strerror(errno));
return 1;
}
@ -129,12 +137,12 @@ int main(int argc, char **argv)
signal(SIGINT, int_exit);
signal(SIGTERM, int_exit);
if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
printf("link set xdp fd failed\n");
return 1;
}
poll_stats(kill_after_s);
poll_stats(map_fd, kill_after_s);
bpf_set_link_xdp_fd(ifindex, -1, xdp_flags);

138
samples/bpf/xdp_fwd_kern.c Normal file
View file

@ -0,0 +1,138 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#define KBUILD_MODNAME "foo"
#include <uapi/linux/bpf.h>
#include <linux/in.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <linux/if_vlan.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include "bpf_helpers.h"
#define IPV6_FLOWINFO_MASK cpu_to_be32(0x0FFFFFFF)
struct bpf_map_def SEC("maps") tx_port = {
.type = BPF_MAP_TYPE_DEVMAP,
.key_size = sizeof(int),
.value_size = sizeof(int),
.max_entries = 64,
};
/* from include/net/ip.h */
static __always_inline int ip_decrease_ttl(struct iphdr *iph)
{
u32 check = (__force u32)iph->check;
check += (__force u32)htons(0x0100);
iph->check = (__force __sum16)(check + (check >= 0xFFFF));
return --iph->ttl;
}
static __always_inline int xdp_fwd_flags(struct xdp_md *ctx, u32 flags)
{
void *data_end = (void *)(long)ctx->data_end;
void *data = (void *)(long)ctx->data;
struct bpf_fib_lookup fib_params;
struct ethhdr *eth = data;
struct ipv6hdr *ip6h;
struct iphdr *iph;
int out_index;
u16 h_proto;
u64 nh_off;
nh_off = sizeof(*eth);
if (data + nh_off > data_end)
return XDP_DROP;
__builtin_memset(&fib_params, 0, sizeof(fib_params));
h_proto = eth->h_proto;
if (h_proto == htons(ETH_P_IP)) {
iph = data + nh_off;
if (iph + 1 > data_end)
return XDP_DROP;
if (iph->ttl <= 1)
return XDP_PASS;
fib_params.family = AF_INET;
fib_params.tos = iph->tos;
fib_params.l4_protocol = iph->protocol;
fib_params.sport = 0;
fib_params.dport = 0;
fib_params.tot_len = ntohs(iph->tot_len);
fib_params.ipv4_src = iph->saddr;
fib_params.ipv4_dst = iph->daddr;
} else if (h_proto == htons(ETH_P_IPV6)) {
struct in6_addr *src = (struct in6_addr *) fib_params.ipv6_src;
struct in6_addr *dst = (struct in6_addr *) fib_params.ipv6_dst;
ip6h = data + nh_off;
if (ip6h + 1 > data_end)
return XDP_DROP;
if (ip6h->hop_limit <= 1)
return XDP_PASS;
fib_params.family = AF_INET6;
fib_params.flowlabel = *(__be32 *)ip6h & IPV6_FLOWINFO_MASK;
fib_params.l4_protocol = ip6h->nexthdr;
fib_params.sport = 0;
fib_params.dport = 0;
fib_params.tot_len = ntohs(ip6h->payload_len);
*src = ip6h->saddr;
*dst = ip6h->daddr;
} else {
return XDP_PASS;
}
fib_params.ifindex = ctx->ingress_ifindex;
out_index = bpf_fib_lookup(ctx, &fib_params, sizeof(fib_params), flags);
/* verify egress index has xdp support
* TO-DO bpf_map_lookup_elem(&tx_port, &key) fails with
* cannot pass map_type 14 into func bpf_map_lookup_elem#1:
* NOTE: without verification that egress index supports XDP
* forwarding packets are dropped.
*/
if (out_index > 0) {
if (h_proto == htons(ETH_P_IP))
ip_decrease_ttl(iph);
else if (h_proto == htons(ETH_P_IPV6))
ip6h->hop_limit--;
memcpy(eth->h_dest, fib_params.dmac, ETH_ALEN);
memcpy(eth->h_source, fib_params.smac, ETH_ALEN);
return bpf_redirect_map(&tx_port, out_index, 0);
}
return XDP_PASS;
}
SEC("xdp_fwd")
int xdp_fwd_prog(struct xdp_md *ctx)
{
return xdp_fwd_flags(ctx, 0);
}
SEC("xdp_fwd_direct")
int xdp_fwd_direct_prog(struct xdp_md *ctx)
{
return xdp_fwd_flags(ctx, BPF_FIB_LOOKUP_DIRECT);
}
char _license[] SEC("license") = "GPL";

136
samples/bpf/xdp_fwd_user.c Normal file
View file

@ -0,0 +1,136 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2017-18 David Ahern <dsahern@gmail.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of version 2 of the GNU General Public
* License as published by the Free Software Foundation.
*
* This program is distributed in the hope that it will be useful, but
* WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
* General Public License for more details.
*/
#include <linux/bpf.h>
#include <linux/if_link.h>
#include <linux/limits.h>
#include <net/if.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <libgen.h>
#include "bpf_load.h"
#include "bpf_util.h"
#include <bpf/bpf.h>
static int do_attach(int idx, int fd, const char *name)
{
int err;
err = bpf_set_link_xdp_fd(idx, fd, 0);
if (err < 0)
printf("ERROR: failed to attach program to %s\n", name);
return err;
}
static int do_detach(int idx, const char *name)
{
int err;
err = bpf_set_link_xdp_fd(idx, -1, 0);
if (err < 0)
printf("ERROR: failed to detach program from %s\n", name);
return err;
}
static void usage(const char *prog)
{
fprintf(stderr,
"usage: %s [OPTS] interface-list\n"
"\nOPTS:\n"
" -d detach program\n"
" -D direct table lookups (skip fib rules)\n",
prog);
}
int main(int argc, char **argv)
{
char filename[PATH_MAX];
int opt, i, idx, err;
int prog_id = 0;
int attach = 1;
int ret = 0;
while ((opt = getopt(argc, argv, ":dD")) != -1) {
switch (opt) {
case 'd':
attach = 0;
break;
case 'D':
prog_id = 1;
break;
default:
usage(basename(argv[0]));
return 1;
}
}
if (optind == argc) {
usage(basename(argv[0]));
return 1;
}
if (attach) {
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
if (access(filename, O_RDONLY) < 0) {
printf("error accessing file %s: %s\n",
filename, strerror(errno));
return 1;
}
if (load_bpf_file(filename)) {
printf("%s", bpf_log_buf);
return 1;
}
if (!prog_fd[prog_id]) {
printf("load_bpf_file: %s\n", strerror(errno));
return 1;
}
}
if (attach) {
for (i = 1; i < 64; ++i)
bpf_map_update_elem(map_fd[0], &i, &i, 0);
}
for (i = optind; i < argc; ++i) {
idx = if_nametoindex(argv[i]);
if (!idx)
idx = strtoul(argv[i], NULL, 0);
if (!idx) {
fprintf(stderr, "Invalid arg\n");
return 1;
}
if (!attach) {
err = do_detach(idx, argv[i]);
if (err)
ret = err;
} else {
err = do_attach(idx, prog_fd[prog_id], argv[i]);
if (err)
ret = err;
}
}
return ret;
}

View file

@ -26,7 +26,7 @@ static const char *__doc_err_only__=
#include <net/if.h>
#include <time.h>
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "bpf_util.h"
@ -58,7 +58,7 @@ static void usage(char *argv[])
printf(" flag (internal value:%d)",
*long_options[i].flag);
else
printf("(internal short-option: -%c)",
printf("short-option: -%c",
long_options[i].val);
printf("\n");
}
@ -594,7 +594,7 @@ int main(int argc, char **argv)
snprintf(bpf_obj_file, sizeof(bpf_obj_file), "%s_kern.o", argv[0]);
/* Parse commands line args */
while ((opt = getopt_long(argc, argv, "h",
while ((opt = getopt_long(argc, argv, "hDSs:",
long_options, &longindex)) != -1) {
switch (opt) {
case 'D':

View file

@ -28,7 +28,7 @@ static const char *__doc__ =
* use bpf/libbpf.h), but cannot as (currently) needed for XDP
* attaching to a device via bpf_set_link_xdp_fd()
*/
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_load.h"
#include "bpf_util.h"

View file

@ -24,7 +24,7 @@
#include "bpf_load.h"
#include "bpf_util.h"
#include "libbpf.h"
#include <bpf/bpf.h>
static int ifindex_in;
static int ifindex_out;

View file

@ -24,7 +24,7 @@
#include "bpf_load.h"
#include "bpf_util.h"
#include "libbpf.h"
#include <bpf/bpf.h>
static int ifindex_in;
static int ifindex_out;

View file

@ -16,7 +16,7 @@
#include <sys/socket.h>
#include <unistd.h>
#include "bpf_load.h"
#include "libbpf.h"
#include <bpf/bpf.h>
#include <arpa/inet.h>
#include <fcntl.h>
#include <poll.h>

View file

@ -22,8 +22,8 @@ static const char *__doc__ = " XDP RX-queue info extract example\n\n"
#include <arpa/inet.h>
#include <linux/if_link.h>
#include "libbpf.h"
#include "bpf_load.h"
#include "bpf/bpf.h"
#include "bpf/libbpf.h"
#include "bpf_util.h"
static int ifindex = -1;
@ -32,6 +32,9 @@ static char *ifname;
static __u32 xdp_flags;
static struct bpf_map *stats_global_map;
static struct bpf_map *rx_queue_index_map;
/* Exit return codes */
#define EXIT_OK 0
#define EXIT_FAIL 1
@ -174,7 +177,7 @@ static struct datarec *alloc_record_per_cpu(void)
static struct record *alloc_record_per_rxq(void)
{
unsigned int nr_rxqs = map_data[2].def.max_entries;
unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
struct record *array;
size_t size;
@ -190,7 +193,7 @@ static struct record *alloc_record_per_rxq(void)
static struct stats_record *alloc_stats_record(void)
{
unsigned int nr_rxqs = map_data[2].def.max_entries;
unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
struct stats_record *rec;
int i;
@ -210,7 +213,7 @@ static struct stats_record *alloc_stats_record(void)
static void free_stats_record(struct stats_record *r)
{
unsigned int nr_rxqs = map_data[2].def.max_entries;
unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
int i;
for (i = 0; i < nr_rxqs; i++)
@ -254,11 +257,11 @@ static void stats_collect(struct stats_record *rec)
{
int fd, i, max_rxqs;
fd = map_data[1].fd; /* map: stats_global_map */
fd = bpf_map__fd(stats_global_map);
map_collect_percpu(fd, 0, &rec->stats);
fd = map_data[2].fd; /* map: rx_queue_index_map */
max_rxqs = map_data[2].def.max_entries;
fd = bpf_map__fd(rx_queue_index_map);
max_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
for (i = 0; i < max_rxqs; i++)
map_collect_percpu(fd, i, &rec->rxq[i]);
}
@ -304,8 +307,8 @@ static void stats_print(struct stats_record *stats_rec,
struct stats_record *stats_prev,
int action)
{
unsigned int nr_rxqs = bpf_map__def(rx_queue_index_map)->max_entries;
unsigned int nr_cpus = bpf_num_possible_cpus();
unsigned int nr_rxqs = map_data[2].def.max_entries;
double pps = 0, err = 0;
struct record *rec, *prev;
double t;
@ -419,31 +422,44 @@ static void stats_poll(int interval, int action)
int main(int argc, char **argv)
{
struct rlimit r = {10 * 1024 * 1024, RLIM_INFINITY};
struct bpf_prog_load_attr prog_load_attr = {
.prog_type = BPF_PROG_TYPE_XDP,
};
int prog_fd, map_fd, opt, err;
bool use_separators = true;
struct config cfg = { 0 };
struct bpf_object *obj;
struct bpf_map *map;
char filename[256];
int longindex = 0;
int interval = 2;
__u32 key = 0;
int opt, err;
char action_str_buf[XDP_ACTION_MAX_STRLEN + 1 /* for \0 */] = { 0 };
int action = XDP_PASS; /* Default action */
char *action_str = NULL;
snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
prog_load_attr.file = filename;
if (setrlimit(RLIMIT_MEMLOCK, &r)) {
perror("setrlimit(RLIMIT_MEMLOCK)");
return 1;
}
if (load_bpf_file(filename)) {
fprintf(stderr, "ERR in load_bpf_file(): %s", bpf_log_buf);
if (bpf_prog_load_xattr(&prog_load_attr, &obj, &prog_fd))
return EXIT_FAIL;
map = bpf_map__next(NULL, obj);
stats_global_map = bpf_map__next(map, obj);
rx_queue_index_map = bpf_map__next(stats_global_map, obj);
if (!map || !stats_global_map || !rx_queue_index_map) {
printf("finding a map in obj file failed\n");
return EXIT_FAIL;
}
map_fd = bpf_map__fd(map);
if (!prog_fd[0]) {
if (!prog_fd) {
fprintf(stderr, "ERR: load_bpf_file: %s\n", strerror(errno));
return EXIT_FAIL;
}
@ -512,7 +528,7 @@ int main(int argc, char **argv)
setlocale(LC_NUMERIC, "en_US");
/* User-side setup ifindex in config_map */
err = bpf_map_update_elem(map_fd[0], &key, &cfg, 0);
err = bpf_map_update_elem(map_fd, &key, &cfg, 0);
if (err) {
fprintf(stderr, "Store config failed (err:%d)\n", err);
exit(EXIT_FAIL_BPF);
@ -521,7 +537,7 @@ int main(int argc, char **argv)
/* Remove XDP program when program is interrupted */
signal(SIGINT, int_exit);
if (bpf_set_link_xdp_fd(ifindex, prog_fd[0], xdp_flags) < 0) {
if (bpf_set_link_xdp_fd(ifindex, prog_fd, xdp_flags) < 0) {
fprintf(stderr, "link set xdp fd failed\n");
return EXIT_FAIL_XDP;
}

View file

@ -18,7 +18,7 @@
#include <unistd.h>
#include <time.h>
#include "bpf_load.h"
#include "libbpf.h"
#include <bpf/bpf.h>
#include "bpf_util.h"
#include "xdp_tx_iptunnel_common.h"

View file

@ -38,7 +38,7 @@
#include "bpf_load.h"
#include "bpf_util.h"
#include "libbpf.h"
#include <bpf/bpf.h>
#include "xdpsock.h"

3
tools/bpf/bpftool/.gitignore vendored Normal file
View file

@ -0,0 +1,3 @@
*.d
bpftool
FEATURE-DUMP.bpftool

View file

@ -66,6 +66,7 @@ static const char * const map_type_name[] = {
[BPF_MAP_TYPE_DEVMAP] = "devmap",
[BPF_MAP_TYPE_SOCKMAP] = "sockmap",
[BPF_MAP_TYPE_CPUMAP] = "cpumap",
[BPF_MAP_TYPE_SOCKHASH] = "sockhash",
};
static bool map_is_per_cpu(__u32 type)

View file

@ -39,6 +39,7 @@ struct event_ring_info {
struct perf_event_sample {
struct perf_event_header header;
u64 time;
__u32 size;
unsigned char data[];
};
@ -49,25 +50,18 @@ static void int_exit(int signo)
stop = true;
}
static void
print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
static enum bpf_perf_event_ret print_bpf_output(void *event, void *priv)
{
struct event_ring_info *ring = priv;
struct perf_event_sample *e = event;
struct {
struct perf_event_header header;
__u64 id;
__u64 lost;
} *lost = (void *)e;
struct timespec ts;
if (clock_gettime(CLOCK_MONOTONIC, &ts)) {
perror("Can't read clock for timestamp");
return;
}
} *lost = event;
if (json_output) {
jsonw_start_object(json_wtr);
jsonw_name(json_wtr, "timestamp");
jsonw_uint(json_wtr, ts.tv_sec * 1000000000ull + ts.tv_nsec);
jsonw_name(json_wtr, "type");
jsonw_uint(json_wtr, e->header.type);
jsonw_name(json_wtr, "cpu");
@ -75,6 +69,8 @@ print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
jsonw_name(json_wtr, "index");
jsonw_uint(json_wtr, ring->key);
if (e->header.type == PERF_RECORD_SAMPLE) {
jsonw_name(json_wtr, "timestamp");
jsonw_uint(json_wtr, e->time);
jsonw_name(json_wtr, "data");
print_data_json(e->data, e->size);
} else if (e->header.type == PERF_RECORD_LOST) {
@ -89,8 +85,8 @@ print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
jsonw_end_object(json_wtr);
} else {
if (e->header.type == PERF_RECORD_SAMPLE) {
printf("== @%ld.%ld CPU: %d index: %d =====\n",
(long)ts.tv_sec, ts.tv_nsec,
printf("== @%lld.%09lld CPU: %d index: %d =====\n",
e->time / 1000000000ULL, e->time % 1000000000ULL,
ring->cpu, ring->key);
fprint_hex(stdout, e->data, e->size, " ");
printf("\n");
@ -101,60 +97,23 @@ print_bpf_output(struct event_ring_info *ring, struct perf_event_sample *e)
e->header.type, e->header.size);
}
}
return LIBBPF_PERF_EVENT_CONT;
}
static void
perf_event_read(struct event_ring_info *ring, void **buf, size_t *buf_len)
{
volatile struct perf_event_mmap_page *header = ring->mem;
__u64 buffer_size = MMAP_PAGE_CNT * get_page_size();
__u64 data_tail = header->data_tail;
__u64 data_head = header->data_head;
void *base, *begin, *end;
enum bpf_perf_event_ret ret;
asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
if (data_head == data_tail)
return;
base = ((char *)header) + get_page_size();
begin = base + data_tail % buffer_size;
end = base + data_head % buffer_size;
while (begin != end) {
struct perf_event_sample *e;
e = begin;
if (begin + e->header.size > base + buffer_size) {
long len = base + buffer_size - begin;
if (*buf_len < e->header.size) {
free(*buf);
*buf = malloc(e->header.size);
if (!*buf) {
fprintf(stderr,
"can't allocate memory");
stop = true;
return;
}
*buf_len = e->header.size;
}
memcpy(*buf, begin, len);
memcpy(*buf + len, base, e->header.size - len);
e = (void *)*buf;
begin = base + e->header.size - len;
} else if (begin + e->header.size == base + buffer_size) {
begin = base;
} else {
begin += e->header.size;
}
print_bpf_output(ring, e);
ret = bpf_perf_event_read_simple(ring->mem,
MMAP_PAGE_CNT * get_page_size(),
get_page_size(), buf, buf_len,
print_bpf_output, ring);
if (ret != LIBBPF_PERF_EVENT_CONT) {
fprintf(stderr, "perf read loop failed with %d\n", ret);
stop = true;
}
__sync_synchronize(); /* smp_mb() */
header->data_tail = data_head;
}
static int perf_mmap_size(void)
@ -185,7 +144,7 @@ static void perf_event_unmap(void *mem)
static int bpf_perf_event_open(int map_fd, int key, int cpu)
{
struct perf_event_attr attr = {
.sample_type = PERF_SAMPLE_RAW,
.sample_type = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME,
.type = PERF_TYPE_SOFTWARE,
.config = PERF_COUNT_SW_BPF_OUTPUT,
};

View file

@ -0,0 +1,18 @@
/* SPDX-License-Identifier: GPL-2.0 */
#if defined(__i386__) || defined(__x86_64__)
#include "../../arch/x86/include/uapi/asm/bitsperlong.h"
#elif defined(__aarch64__)
#include "../../arch/arm64/include/uapi/asm/bitsperlong.h"
#elif defined(__powerpc__)
#include "../../arch/powerpc/include/uapi/asm/bitsperlong.h"
#elif defined(__s390__)
#include "../../arch/s390/include/uapi/asm/bitsperlong.h"
#elif defined(__sparc__)
#include "../../arch/sparc/include/uapi/asm/bitsperlong.h"
#elif defined(__mips__)
#include "../../arch/mips/include/uapi/asm/bitsperlong.h"
#elif defined(__ia64__)
#include "../../arch/ia64/include/uapi/asm/bitsperlong.h"
#else
#include <asm-generic/bitsperlong.h>
#endif

View file

@ -0,0 +1,18 @@
/* SPDX-License-Identifier: GPL-2.0 */
#if defined(__i386__) || defined(__x86_64__)
#include "../../arch/x86/include/uapi/asm/errno.h"
#elif defined(__powerpc__)
#include "../../arch/powerpc/include/uapi/asm/errno.h"
#elif defined(__sparc__)
#include "../../arch/sparc/include/uapi/asm/errno.h"
#elif defined(__alpha__)
#include "../../arch/alpha/include/uapi/asm/errno.h"
#elif defined(__mips__)
#include "../../arch/mips/include/uapi/asm/errno.h"
#elif defined(__ia64__)
#include "../../arch/ia64/include/uapi/asm/errno.h"
#elif defined(__xtensa__)
#include "../../arch/xtensa/include/uapi/asm/errno.h"
#else
#include <asm-generic/errno.h>
#endif

View file

@ -96,6 +96,7 @@ enum bpf_cmd {
BPF_PROG_QUERY,
BPF_RAW_TRACEPOINT_OPEN,
BPF_BTF_LOAD,
BPF_BTF_GET_FD_BY_ID,
};
enum bpf_map_type {
@ -116,6 +117,8 @@ enum bpf_map_type {
BPF_MAP_TYPE_DEVMAP,
BPF_MAP_TYPE_SOCKMAP,
BPF_MAP_TYPE_CPUMAP,
BPF_MAP_TYPE_XSKMAP,
BPF_MAP_TYPE_SOCKHASH,
};
enum bpf_prog_type {
@ -343,6 +346,7 @@ union bpf_attr {
__u32 start_id;
__u32 prog_id;
__u32 map_id;
__u32 btf_id;
};
__u32 next_id;
__u32 open_flags;
@ -1825,6 +1829,79 @@ union bpf_attr {
* Return
* 0 on success, or a negative error in case of failure.
*
* int bpf_fib_lookup(void *ctx, struct bpf_fib_lookup *params, int plen, u32 flags)
* Description
* Do FIB lookup in kernel tables using parameters in *params*.
* If lookup is successful and result shows packet is to be
* forwarded, the neighbor tables are searched for the nexthop.
* If successful (ie., FIB lookup shows forwarding and nexthop
* is resolved), the nexthop address is returned in ipv4_dst,
* ipv6_dst or mpls_out based on family, smac is set to mac
* address of egress device, dmac is set to nexthop mac address,
* rt_metric is set to metric from route.
*
* *plen* argument is the size of the passed in struct.
* *flags* argument can be one or more BPF_FIB_LOOKUP_ flags:
*
* **BPF_FIB_LOOKUP_DIRECT** means do a direct table lookup vs
* full lookup using FIB rules
* **BPF_FIB_LOOKUP_OUTPUT** means do lookup from an egress
* perspective (default is ingress)
*
* *ctx* is either **struct xdp_md** for XDP programs or
* **struct sk_buff** tc cls_act programs.
*
* Return
* Egress device index on success, 0 if packet needs to continue
* up the stack for further processing or a negative error in case
* of failure.
*
* int bpf_sock_hash_update(struct bpf_sock_ops_kern *skops, struct bpf_map *map, void *key, u64 flags)
* Description
* Add an entry to, or update a sockhash *map* referencing sockets.
* The *skops* is used as a new value for the entry associated to
* *key*. *flags* is one of:
*
* **BPF_NOEXIST**
* The entry for *key* must not exist in the map.
* **BPF_EXIST**
* The entry for *key* must already exist in the map.
* **BPF_ANY**
* No condition on the existence of the entry for *key*.
*
* If the *map* has eBPF programs (parser and verdict), those will
* be inherited by the socket being added. If the socket is
* already attached to eBPF programs, this results in an error.
* Return
* 0 on success, or a negative error in case of failure.
*
* int bpf_msg_redirect_hash(struct sk_msg_buff *msg, struct bpf_map *map, void *key, u64 flags)
* Description
* This helper is used in programs implementing policies at the
* socket level. If the message *msg* is allowed to pass (i.e. if
* the verdict eBPF program returns **SK_PASS**), redirect it to
* the socket referenced by *map* (of type
* **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
* egress interfaces can be used for redirection. The
* **BPF_F_INGRESS** value in *flags* is used to make the
* distinction (ingress path is selected if the flag is present,
* egress path otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
*
* int bpf_sk_redirect_hash(struct sk_buff *skb, struct bpf_map *map, void *key, u64 flags)
* Description
* This helper is used in programs implementing policies at the
* skb socket level. If the sk_buff *skb* is allowed to pass (i.e.
* if the verdeict eBPF program returns **SK_PASS**), redirect it
* to the socket referenced by *map* (of type
* **BPF_MAP_TYPE_SOCKHASH**) using hash *key*. Both ingress and
* egress interfaces can be used for redirection. The
* **BPF_F_INGRESS** value in *flags* is used to make the
* distinction (ingress path is selected if the flag is present,
* egress otherwise). This is the only flag supported for now.
* Return
* **SK_PASS** on success, or **SK_DROP** on error.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@ -1895,7 +1972,11 @@ union bpf_attr {
FN(xdp_adjust_tail), \
FN(skb_get_xfrm_state), \
FN(get_stack), \
FN(skb_load_bytes_relative),
FN(skb_load_bytes_relative), \
FN(fib_lookup), \
FN(sock_hash_update), \
FN(msg_redirect_hash), \
FN(sk_redirect_hash),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@ -2129,6 +2210,15 @@ struct bpf_map_info {
__u32 ifindex;
__u64 netns_dev;
__u64 netns_ino;
__u32 btf_id;
__u32 btf_key_id;
__u32 btf_value_id;
} __attribute__((aligned(8)));
struct bpf_btf_info {
__aligned_u64 btf;
__u32 btf_size;
__u32 id;
} __attribute__((aligned(8)));
/* User bpf_sock_addr struct to access socket fields and sockaddr struct passed
@ -2309,4 +2399,55 @@ struct bpf_raw_tracepoint_args {
__u64 args[0];
};
/* DIRECT: Skip the FIB rules and go to FIB table associated with device
* OUTPUT: Do lookup from egress perspective; default is ingress
*/
#define BPF_FIB_LOOKUP_DIRECT BIT(0)
#define BPF_FIB_LOOKUP_OUTPUT BIT(1)
struct bpf_fib_lookup {
/* input */
__u8 family; /* network family, AF_INET, AF_INET6, AF_MPLS */
/* set if lookup is to consider L4 data - e.g., FIB rules */
__u8 l4_protocol;
__be16 sport;
__be16 dport;
/* total length of packet from network header - used for MTU check */
__u16 tot_len;
__u32 ifindex; /* L3 device index for lookup */
union {
/* inputs to lookup */
__u8 tos; /* AF_INET */
__be32 flowlabel; /* AF_INET6 */
/* output: metric of fib result */
__u32 rt_metric;
};
union {
__be32 mpls_in;
__be32 ipv4_src;
__u32 ipv6_src[4]; /* in6_addr; network order */
};
/* input to bpf_fib_lookup, *dst is destination address.
* output: bpf_fib_lookup sets to gateway address
*/
union {
/* return for MPLS lookups */
__be32 mpls_out[4]; /* support up to 4 labels */
__be32 ipv4_dst;
__u32 ipv6_dst[4]; /* in6_addr; network order */
};
/* output */
__be16 h_vlan_proto;
__be16 h_vlan_TCI;
__u8 smac[6]; /* ETH_ALEN */
__u8 dmac[6]; /* ETH_ALEN */
};
#endif /* _UAPI__LINUX_BPF_H__ */

View file

@ -69,7 +69,7 @@ FEATURE_USER = .libbpf
FEATURE_TESTS = libelf libelf-getphdrnum libelf-mmap bpf
FEATURE_DISPLAY = libelf bpf
INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi
INCLUDES = -I. -I$(srctree)/tools/include -I$(srctree)/tools/arch/$(ARCH)/include/uapi -I$(srctree)/tools/include/uapi -I$(srctree)/tools/perf
FEATURE_CHECK_CFLAGS-bpf = $(INCLUDES)
check_feat := 1

View file

@ -91,6 +91,7 @@ int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr)
attr.btf_fd = create_attr->btf_fd;
attr.btf_key_id = create_attr->btf_key_id;
attr.btf_value_id = create_attr->btf_value_id;
attr.map_ifindex = create_attr->map_ifindex;
return sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}
@ -201,6 +202,7 @@ int bpf_load_program_xattr(const struct bpf_load_program_attr *load_attr,
attr.log_size = 0;
attr.log_level = 0;
attr.kern_version = load_attr->kern_version;
attr.prog_ifindex = load_attr->prog_ifindex;
memcpy(attr.prog_name, load_attr->name,
min(name_len, BPF_OBJ_NAME_LEN - 1));
@ -458,6 +460,16 @@ int bpf_map_get_fd_by_id(__u32 id)
return sys_bpf(BPF_MAP_GET_FD_BY_ID, &attr, sizeof(attr));
}
int bpf_btf_get_fd_by_id(__u32 id)
{
union bpf_attr attr;
bzero(&attr, sizeof(attr));
attr.btf_id = id;
return sys_bpf(BPF_BTF_GET_FD_BY_ID, &attr, sizeof(attr));
}
int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len)
{
union bpf_attr attr;

View file

@ -38,6 +38,7 @@ struct bpf_create_map_attr {
__u32 btf_fd;
__u32 btf_key_id;
__u32 btf_value_id;
__u32 map_ifindex;
};
int bpf_create_map_xattr(const struct bpf_create_map_attr *create_attr);
@ -64,6 +65,7 @@ struct bpf_load_program_attr {
size_t insns_cnt;
const char *license;
__u32 kern_version;
__u32 prog_ifindex;
};
/* Recommend log buffer size */
@ -98,6 +100,7 @@ int bpf_prog_get_next_id(__u32 start_id, __u32 *next_id);
int bpf_map_get_next_id(__u32 start_id, __u32 *next_id);
int bpf_prog_get_fd_by_id(__u32 id);
int bpf_map_get_fd_by_id(__u32 id);
int bpf_btf_get_fd_by_id(__u32 id);
int bpf_obj_get_info_by_fd(int prog_fd, void *info, __u32 *info_len);
int bpf_prog_query(int target_fd, enum bpf_attach_type type, __u32 query_flags,
__u32 *attach_flags, __u32 *prog_ids, __u32 *prog_cnt);

View file

@ -31,6 +31,7 @@
#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <perf-sys.h>
#include <asm/unistd.h>
#include <linux/err.h>
#include <linux/kernel.h>
@ -177,6 +178,7 @@ struct bpf_program {
/* Index in elf obj file, for relocation use. */
int idx;
char *name;
int prog_ifindex;
char *section_name;
struct bpf_insn *insns;
size_t insns_cnt, main_prog_cnt;
@ -212,6 +214,7 @@ struct bpf_map {
int fd;
char *name;
size_t offset;
int map_ifindex;
struct bpf_map_def def;
uint32_t btf_key_id;
uint32_t btf_value_id;
@ -1090,6 +1093,7 @@ bpf_object__create_maps(struct bpf_object *obj)
int *pfd = &map->fd;
create_attr.name = map->name;
create_attr.map_ifindex = map->map_ifindex;
create_attr.map_type = def->type;
create_attr.map_flags = def->map_flags;
create_attr.key_size = def->key_size;
@ -1272,7 +1276,7 @@ static int bpf_object__collect_reloc(struct bpf_object *obj)
static int
load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
const char *name, struct bpf_insn *insns, int insns_cnt,
char *license, u32 kern_version, int *pfd)
char *license, u32 kern_version, int *pfd, int prog_ifindex)
{
struct bpf_load_program_attr load_attr;
char *log_buf;
@ -1286,6 +1290,7 @@ load_program(enum bpf_prog_type type, enum bpf_attach_type expected_attach_type,
load_attr.insns_cnt = insns_cnt;
load_attr.license = license;
load_attr.kern_version = kern_version;
load_attr.prog_ifindex = prog_ifindex;
if (!load_attr.insns || !load_attr.insns_cnt)
return -EINVAL;
@ -1367,7 +1372,8 @@ bpf_program__load(struct bpf_program *prog,
}
err = load_program(prog->type, prog->expected_attach_type,
prog->name, prog->insns, prog->insns_cnt,
license, kern_version, &fd);
license, kern_version, &fd,
prog->prog_ifindex);
if (!err)
prog->instances.fds[0] = fd;
goto out;
@ -1398,7 +1404,8 @@ bpf_program__load(struct bpf_program *prog,
err = load_program(prog->type, prog->expected_attach_type,
prog->name, result.new_insn_ptr,
result.new_insn_cnt,
license, kern_version, &fd);
license, kern_version, &fd,
prog->prog_ifindex);
if (err) {
pr_warning("Loading the %dth instance of program '%s' failed\n",
@ -1437,9 +1444,37 @@ bpf_object__load_progs(struct bpf_object *obj)
return 0;
}
static int bpf_object__validate(struct bpf_object *obj)
static bool bpf_prog_type__needs_kver(enum bpf_prog_type type)
{
if (obj->kern_version == 0) {
switch (type) {
case BPF_PROG_TYPE_SOCKET_FILTER:
case BPF_PROG_TYPE_SCHED_CLS:
case BPF_PROG_TYPE_SCHED_ACT:
case BPF_PROG_TYPE_XDP:
case BPF_PROG_TYPE_CGROUP_SKB:
case BPF_PROG_TYPE_CGROUP_SOCK:
case BPF_PROG_TYPE_LWT_IN:
case BPF_PROG_TYPE_LWT_OUT:
case BPF_PROG_TYPE_LWT_XMIT:
case BPF_PROG_TYPE_SOCK_OPS:
case BPF_PROG_TYPE_SK_SKB:
case BPF_PROG_TYPE_CGROUP_DEVICE:
case BPF_PROG_TYPE_SK_MSG:
case BPF_PROG_TYPE_CGROUP_SOCK_ADDR:
return false;
case BPF_PROG_TYPE_UNSPEC:
case BPF_PROG_TYPE_KPROBE:
case BPF_PROG_TYPE_TRACEPOINT:
case BPF_PROG_TYPE_PERF_EVENT:
case BPF_PROG_TYPE_RAW_TRACEPOINT:
default:
return true;
}
}
static int bpf_object__validate(struct bpf_object *obj, bool needs_kver)
{
if (needs_kver && obj->kern_version == 0) {
pr_warning("%s doesn't provide kernel version\n",
obj->path);
return -LIBBPF_ERRNO__KVERSION;
@ -1448,7 +1483,8 @@ static int bpf_object__validate(struct bpf_object *obj)
}
static struct bpf_object *
__bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz)
__bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz,
bool needs_kver)
{
struct bpf_object *obj;
int err;
@ -1466,7 +1502,7 @@ __bpf_object__open(const char *path, void *obj_buf, size_t obj_buf_sz)
CHECK_ERR(bpf_object__check_endianness(obj), err, out);
CHECK_ERR(bpf_object__elf_collect(obj), err, out);
CHECK_ERR(bpf_object__collect_reloc(obj), err, out);
CHECK_ERR(bpf_object__validate(obj), err, out);
CHECK_ERR(bpf_object__validate(obj, needs_kver), err, out);
bpf_object__elf_finish(obj);
return obj;
@ -1483,7 +1519,7 @@ struct bpf_object *bpf_object__open(const char *path)
pr_debug("loading %s\n", path);
return __bpf_object__open(path, NULL, 0);
return __bpf_object__open(path, NULL, 0, true);
}
struct bpf_object *bpf_object__open_buffer(void *obj_buf,
@ -1506,7 +1542,7 @@ struct bpf_object *bpf_object__open_buffer(void *obj_buf,
pr_debug("loading object '%s' from buffer\n",
name);
return __bpf_object__open(name, obj_buf, obj_buf_sz);
return __bpf_object__open(name, obj_buf, obj_buf_sz, true);
}
int bpf_object__unload(struct bpf_object *obj)
@ -2158,13 +2194,17 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
enum bpf_attach_type expected_attach_type;
enum bpf_prog_type prog_type;
struct bpf_object *obj;
struct bpf_map *map;
int section_idx;
int err;
if (!attr)
return -EINVAL;
if (!attr->file)
return -EINVAL;
obj = bpf_object__open(attr->file);
obj = __bpf_object__open(attr->file, NULL, 0,
bpf_prog_type__needs_kver(attr->prog_type));
if (IS_ERR(obj))
return -ENOENT;
@ -2174,6 +2214,7 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
* section name.
*/
prog_type = attr->prog_type;
prog->prog_ifindex = attr->ifindex;
expected_attach_type = attr->expected_attach_type;
if (prog_type == BPF_PROG_TYPE_UNSPEC) {
section_idx = bpf_program__identify_section(prog);
@ -2194,6 +2235,10 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
first_prog = prog;
}
bpf_map__for_each(map, obj) {
map->map_ifindex = attr->ifindex;
}
if (!first_prog) {
pr_warning("object file doesn't contain bpf program\n");
bpf_object__close(obj);
@ -2210,3 +2255,63 @@ int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
*prog_fd = bpf_program__fd(first_prog);
return 0;
}
enum bpf_perf_event_ret
bpf_perf_event_read_simple(void *mem, unsigned long size,
unsigned long page_size, void **buf, size_t *buf_len,
bpf_perf_event_print_t fn, void *priv)
{
volatile struct perf_event_mmap_page *header = mem;
__u64 data_tail = header->data_tail;
__u64 data_head = header->data_head;
void *base, *begin, *end;
int ret;
asm volatile("" ::: "memory"); /* in real code it should be smp_rmb() */
if (data_head == data_tail)
return LIBBPF_PERF_EVENT_CONT;
base = ((char *)header) + page_size;
begin = base + data_tail % size;
end = base + data_head % size;
while (begin != end) {
struct perf_event_header *ehdr;
ehdr = begin;
if (begin + ehdr->size > base + size) {
long len = base + size - begin;
if (*buf_len < ehdr->size) {
free(*buf);
*buf = malloc(ehdr->size);
if (!*buf) {
ret = LIBBPF_PERF_EVENT_ERROR;
break;
}
*buf_len = ehdr->size;
}
memcpy(*buf, begin, len);
memcpy(*buf + len, base, ehdr->size - len);
ehdr = (void *)*buf;
begin = base + ehdr->size - len;
} else if (begin + ehdr->size == base + size) {
begin = base;
} else {
begin += ehdr->size;
}
ret = fn(ehdr, priv);
if (ret != LIBBPF_PERF_EVENT_CONT)
break;
data_tail += ehdr->size;
}
__sync_synchronize(); /* smp_mb() */
header->data_tail = data_tail;
return ret;
}

View file

@ -52,8 +52,8 @@ enum libbpf_errno {
int libbpf_strerror(int err, char *buf, size_t size);
/*
* In include/linux/compiler-gcc.h, __printf is defined. However
* it should be better if libbpf.h doesn't depend on Linux header file.
* __printf is defined in include/linux/compiler-gcc.h. However,
* it would be better if libbpf.h didn't depend on Linux header files.
* So instead of __printf, here we use gcc attribute directly.
*/
typedef int (*libbpf_print_fn_t)(const char *, ...)
@ -92,7 +92,7 @@ int bpf_object__set_priv(struct bpf_object *obj, void *priv,
bpf_object_clear_priv_t clear_priv);
void *bpf_object__priv(struct bpf_object *prog);
/* Accessors of bpf_program. */
/* Accessors of bpf_program */
struct bpf_program;
struct bpf_program *bpf_program__next(struct bpf_program *prog,
struct bpf_object *obj);
@ -121,28 +121,28 @@ struct bpf_insn;
/*
* Libbpf allows callers to adjust BPF programs before being loaded
* into kernel. One program in an object file can be transform into
* multiple variants to be attached to different code.
* into kernel. One program in an object file can be transformed into
* multiple variants to be attached to different hooks.
*
* bpf_program_prep_t, bpf_program__set_prep and bpf_program__nth_fd
* are APIs for this propose.
* form an API for this purpose.
*
* - bpf_program_prep_t:
* It defines 'preprocessor', which is a caller defined function
* Defines a 'preprocessor', which is a caller defined function
* passed to libbpf through bpf_program__set_prep(), and will be
* called before program is loaded. The processor should adjust
* the program one time for each instances according to the number
* the program one time for each instance according to the instance id
* passed to it.
*
* - bpf_program__set_prep:
* Attachs a preprocessor to a BPF program. The number of instances
* whould be created is also passed through this function.
* Attaches a preprocessor to a BPF program. The number of instances
* that should be created is also passed through this function.
*
* - bpf_program__nth_fd:
* After the program is loaded, get resuling fds from bpf program for
* each instances.
* After the program is loaded, get resulting FD of a given instance
* of the BPF program.
*
* If bpf_program__set_prep() is not used, the program whould be loaded
* If bpf_program__set_prep() is not used, the program would be loaded
* without adjustment during bpf_object__load(). The program has only
* one instance. In this case bpf_program__fd(prog) is equal to
* bpf_program__nth_fd(prog, 0).
@ -156,7 +156,7 @@ struct bpf_prog_prep_result {
struct bpf_insn *new_insn_ptr;
int new_insn_cnt;
/* If not NULL, result fd is set to it */
/* If not NULL, result FD is written to it. */
int *pfd;
};
@ -169,8 +169,8 @@ struct bpf_prog_prep_result {
* - res: Output parameter, result of transformation.
*
* Return value:
* - Zero: pre-processing success.
* - Non-zero: pre-processing, stop loading.
* - Zero: pre-processing success.
* - Non-zero: pre-processing error, stop loading.
*/
typedef int (*bpf_program_prep_t)(struct bpf_program *prog, int n,
struct bpf_insn *insns, int insns_cnt,
@ -182,7 +182,7 @@ int bpf_program__set_prep(struct bpf_program *prog, int nr_instance,
int bpf_program__nth_fd(struct bpf_program *prog, int n);
/*
* Adjust type of bpf program. Default is kprobe.
* Adjust type of BPF program. Default is kprobe.
*/
int bpf_program__set_socket_filter(struct bpf_program *prog);
int bpf_program__set_tracepoint(struct bpf_program *prog);
@ -206,10 +206,10 @@ bool bpf_program__is_xdp(struct bpf_program *prog);
bool bpf_program__is_perf_event(struct bpf_program *prog);
/*
* We don't need __attribute__((packed)) now since it is
* unnecessary for 'bpf_map_def' because they are all aligned.
* In addition, using it will trigger -Wpacked warning message,
* and will be treated as an error due to -Werror.
* No need for __attribute__((packed)), all members of 'bpf_map_def'
* are all aligned. In addition, using __attribute__((packed))
* would trigger a -Wpacked warning message, and lead to an error
* if -Werror is set.
*/
struct bpf_map_def {
unsigned int type;
@ -220,8 +220,8 @@ struct bpf_map_def {
};
/*
* There is another 'struct bpf_map' in include/linux/map.h. However,
* it is not a uapi header so no need to consider name clash.
* The 'struct bpf_map' in include/linux/bpf.h is internal to the kernel,
* so no need to worry about a name clash.
*/
struct bpf_map;
struct bpf_map *
@ -229,7 +229,7 @@ bpf_object__find_map_by_name(struct bpf_object *obj, const char *name);
/*
* Get bpf_map through the offset of corresponding struct bpf_map_def
* in the bpf object file.
* in the BPF object file.
*/
struct bpf_map *
bpf_object__find_map_by_offset(struct bpf_object *obj, size_t offset);
@ -259,6 +259,7 @@ struct bpf_prog_load_attr {
const char *file;
enum bpf_prog_type prog_type;
enum bpf_attach_type expected_attach_type;
int ifindex;
};
int bpf_prog_load_xattr(const struct bpf_prog_load_attr *attr,
@ -267,4 +268,17 @@ int bpf_prog_load(const char *file, enum bpf_prog_type type,
struct bpf_object **pobj, int *prog_fd);
int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
enum bpf_perf_event_ret {
LIBBPF_PERF_EVENT_DONE = 0,
LIBBPF_PERF_EVENT_ERROR = -1,
LIBBPF_PERF_EVENT_CONT = -2,
};
typedef enum bpf_perf_event_ret (*bpf_perf_event_print_t)(void *event,
void *priv);
int bpf_perf_event_read_simple(void *mem, unsigned long size,
unsigned long page_size,
void **buf, size_t *buf_len,
bpf_perf_event_print_t fn, void *priv);
#endif

Some files were not shown because too many files have changed in this diff Show more