Commit graph

73249 commits

Author SHA1 Message Date
Hannes Frederic Sowa
1855b7c3e8 ipv6: introduce idgen_delay and idgen_retries knobs
This is specified by RFC 7217.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:09 -04:00
Hannes Frederic Sowa
5f40ef77ad ipv6: do retries on stable privacy addresses
If a DAD conflict is detected, we want to retry privacy stable address
generation up to idgen_retries (= 3) times with a delay of idgen_delay
(= 1 second). Add the logic to addrconf_dad_failure.

By design, we don't clean up dad failed permanent addresses.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:09 -04:00
Hannes Frederic Sowa
8e8e676d0b ipv6: collapse state_lock and lock
Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:09 -04:00
Hannes Frederic Sowa
64236f3f3d ipv6: introduce IFA_F_STABLE_PRIVACY flag
We need to mark appropriate addresses so we can do retries in case their
DAD failed.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:09 -04:00
Hannes Frederic Sowa
622c81d57b ipv6: generation of stable privacy addresses for link-local and autoconf
This patch implements the stable privacy address generation for
link-local and autoconf addresses as specified in RFC7217.

  RID = F(Prefix, Net_Iface, Network_ID, DAD_Counter, secret_key)

is the RID (random identifier). As the hash function F we chose one
round of sha1. Prefix will be either the link-local prefix or the
router advertised one. As Net_Iface we use the MAC address of the
device. DAD_Counter and secret_key are implemented as specified.

We don't use Network_ID, as it couples the code too closely to other
subsystems. It is specified as optional in the RFC.

As Net_Iface we only use the MAC address: we simply have no stable
identifier in the kernel we could possibly use: because this code might
run very early, we cannot depend on names, as they might be changed by
user space early on during the boot process.

A new address generation mode is introduced,
IN6_ADDR_GEN_MODE_STABLE_PRIVACY. With iproute2 one can switch back to
none or eui64 address configuration mode although the stable_secret is
already set.

We refuse writes to ipv6/conf/all/stable_secret but only allow
ipv6/conf/default/stable_secret and the interface specific file to be
written to. The default stable_secret is used as the parameter for the
namespace, the interface specific can overwrite the secret, e.g. when
switching a network configuration from one system to another while
inheriting the secret.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:08 -04:00
Hannes Frederic Sowa
3d1bec9932 ipv6: introduce secret_stable to ipv6_devconf
This patch implements the procfs logic for the stable_address knob:
The secret is formatted as an ipv6 address and will be stored per
interface and per namespace. We track initialized flag and return EIO
errors until the secret is set.

We don't inherit the secret to newly created namespaces.

Cc: Erik Kline <ek@google.com>
Cc: Fernando Gont <fgont@si6networks.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:12:08 -04:00
Herbert Xu
ccd57b1bd3 rhashtable: Add immediate rehash during insertion
This patch reintroduces immediate rehash during insertion.  If
we find during insertion that the table is full or the chain
length exceeds a set limit (currently 16 but may be disabled
with insecure_elasticity) then we will force an immediate rehash.
The rehash will contain an expansion if the table utilisation
exceeds 75%.

If this rehash fails then the insertion will fail.  Otherwise the
insertion will be reattempted in the new hash table.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:07:52 -04:00
Herbert Xu
b824478b21 rhashtable: Add multiple rehash support
This patch adds the missing bits to allow multiple rehashes.  The
read-side as well as remove already handle this correctly.  So it's
only the rehasher and insertion that need modification to handle
this.

Note that this patch doesn't actually enable it so for now rehashing
is still only performed by the worker thread.

This patch also disables the explicit expand/shrink interface because
the table is meant to expand and shrink automatically, and continuing
to export these interfaces unnecessarily complicates the life of the
rehasher since the rehash process is now composed of two parts.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:07:52 -04:00
Herbert Xu
31ccde2dac rhashtable: Allow hashfn to be unset
Since every current rhashtable user uses jhash as their hash
function, the fact that jhash is an inline function causes each
user to generate a copy of its code.

This function provides a solution to this problem by allowing
hashfn to be unset.  In which case rhashtable will automatically
set it to jhash.  Furthermore, if the key length is a multiple
of 4, we will switch over to jhash2.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:07:51 -04:00
Herbert Xu
de91b25c80 rhashtable: Eliminate unnecessary branch in rht_key_hashfn
When rht_key_hashfn is called from rhashtable itself and params
is equal to ht->p, there is no point in checking params.key_len
and falling back to ht->p.key_len.

For some reason gcc couldn't figure out that params is the same
as ht->p.  So let's help it by only checking params.key_len when
it's a constant.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:07:51 -04:00
David S. Miller
e167359be0 linux-can-next-for-4.1-20150323
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABCAAGBQJVD+1xAAoJECte4hHFiupUVD8P/2ymRe2lcNILciuzaErjkDjc
 GvbvsDM2rIPKfxT1ptrn52l9MhvIN/oFwVjSzdBNFKz9GmCnqvwwT+EHpV2CxdIR
 deblAbDAGuzYbDbY7uadOXU2CQWcUw7r7wCgNhIcWs1qd7ndfGVj3eBKFsJsGuvM
 5s2sSP4s7U2xgIk2Ts8RHAGdofL/YVUUqSTOWzUfTJm7rPDTtWLXHMk0e3V8kFN0
 im9yy00gOnMYpuAMfemD7fm6OClO/3L2x5gjVXNMGrf73TivfLMSrCGC3eqN3W9J
 FY7G9502c6cMxNK7eneZE31XCHxpVIOHwGD9C4AWi/b9kHGZjKdQ5HK9KTXAbCLh
 eN8h2hpnlrotS1wDXh0ThGqZYpW7LhzKyZvZ3DSW9WBnwf2NqTdHx2HtIQL6WE8X
 WAZbmUGJhOh+ZZ7hGhIBNHqAT7qz6rprMIKfjio609ZU3vZscnwEUv+8FEKipKFD
 qZTSUE1ZLTczHZg/jMQ208ecE4uD06GsmNzi1kzzHjbaCtGnRhwKfD3yl3V8dn1O
 Zz8n2nJqF+V5F+AWFHgJ6AkM5Na7UOvVhtY/8R7u1D1gn+yQStYjY4y4jxR4JdsU
 8mIUc+1PhJgIggsvbKYP1IzXvMecWtPDDfIkPIXsOaFANF4jN8PTfLXe1ZYGhpCV
 aMuUZk7XUthdJavD0FWZ
 =m6GG
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-4.1-20150323' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2015-03-23

this is a pull request of 6 patches for net-next/master.

A patch by Florian Westphal, converts the skb->destructor to use
sock_efree() instead of own destructor. Ahmed S. Darwish's patch
converts the kvaser_usb driver to use unregister_candev(). A patch by
me removes a return from a void function in the m_can driver. Yegor
Yefremov contributes a patch for combined rx/tx LED trigger support. A
sparse warning in the esd_usb2 driver was fixes by Thomas Körper. Ben
Dooks converts the at91_can driver to use endian agnostic IO accessors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:03:43 -04:00
David S. Miller
40451fd013 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next.
Basically, more incremental updates for br_netfilter from Florian
Westphal, small nf_tables updates (including one fix for rb-tree
locking) and small two-liner to add extra validation for the REJECT6
target.

More specifically, they are:

1) Use the conntrack status flags from br_netfilter to know that DNAT is
   happening. Patch for Florian Westphal.

2) nf_bridge->physoutdev == NULL already indicates that the traffic is
   bridged, so let's get rid of the BRNF_BRIDGED flag. Also from Florian.

3) Another patch to prepare voidization of seq_printf/seq_puts/seq_putc,
   from Joe Perches.

4) Consolidation of nf_tables_newtable() error path.

5) Kill nf_bridge_pad used by br_netfilter from ip_fragment(),
   from Florian Westphal.

6) Access rb-tree root node inside the lock and remove unnecessary
   locking from the get path (we already hold nfnl_lock there), from
   Patrick McHardy.

7) You cannot use a NFT_SET_ELEM_INTERVAL_END when the set doesn't
   support interval, also from Patrick.

8) Enforce IP6T_F_PROTO from ip6t_REJECT to make sure the core is
   actually restricting matches to TCP.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:02:46 -04:00
Alexander Drozdov
682f048bd4 af_packet: pass checksum validation status to the user
Introduce TP_STATUS_CSUM_VALID tp_status flag to tell the
af_packet user that at least the transport header checksum
has been already validated.

For now, the flag may be set for incoming packets only.

Signed-off-by: Alexander Drozdov <al.drozdov@gmail.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 22:01:28 -04:00
Eric Dumazet
85645bab57 ipv4: dccp: handle ICMP messages on DCCP_NEW_SYN_RECV request sockets
dccp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New dccp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
26e3736090 ipv4: tcp: handle ICMP messages on TCP_NEW_SYN_RECV request sockets
tcp_v4_err() can restrict lookups to ehash table, and not to listeners.

Note this patch creates the infrastructure, but this means that ICMP
messages for request sockets are ignored until complete conversion.

New tcp_req_err() helper is exported so that we can use it in IPv6
in following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
b282705336 net: convert syn_wait_lock to a spinlock
This is a low hanging fruit, as we'll get rid of syn_wait_lock eventually.

We hold syn_wait_lock for such small sections, that it makes no sense to use
a read/write lock. A spin lock is simply faster.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:26 -04:00
Eric Dumazet
42cb80a235 inet: remove sk_listener parameter from syn_ack_timeout()
It is not needed, and req->sk_listener points to the listener anyway.
request_sock argument can be const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:52:25 -04:00
Tadeusz Struk
66db37391d crypto: af_alg - Allow to link sgl
Allow to link af_alg sgls.

Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:41:37 -04:00
tadeusz.struk@intel.com
0345f93138 net: socket: add support for async operations
Add support for async operations.

Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-23 16:41:36 -04:00
Yegor Yefremov
c54eb70e3b can: add combined rx/tx LED trigger support
Add <ifname>-rxtx trigger, that will be activated both for tx
as rx events. This trigger mimics "activity" LED for Ethernet
devices.

Signed-off-by: Yegor Yefremov <yegorslists@googlemail.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2015-03-22 23:50:11 +01:00
Florian Westphal
2b290bbb60 can: use sock_efree instead of own destructor
It is identical to the can destructor.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2015-03-22 23:50:10 +01:00
Florian Westphal
8d0451638a netfilter: bridge: kill nf_bridge_pad
The br_netfilter frag output function calls skb_cow_head() so in
case it needs a larger headroom to e.g. re-add a previously stripped PPPOE
or VLAN header things will still work (at cost of reallocation).

We can then move nf_bridge_encap_header_len to br_netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-03-22 19:45:55 +01:00
YOSHIFUJI Hideaki/吉藤英明
8da86466b8 net: neighbour: Add mcast_resolicit to configure the number of multicast resolicitations in PROBE state.
We send unicast neighbor (ARP or NDP) solicitations ucast_probes
times in PROBE state.  Zhu Yanjun reported that some implementation
does not reply against them and the entry will become FAILED, which
is undesirable.

We had been dealt with such nodes by sending multicast probes mcast_
solicit times after unicast probes in PROBE state.  In 2003, I made
a change not to send them to improve compatibility with IPv6 NDP.

Let's introduce per-protocol per-interface sysctl knob "mcast_
reprobe" to configure the number of multicast (re)solicitation for
reconfirmation in PROBE state.  The default is 0, since we have
been doing so for 10+ years.

Reported-by: Zhu Yanjun <Yanjun.Zhu@windriver.com>
CC: Ulf Samuelsson <ulf.samuelsson@ericsson.com>
Signed-off-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:47:40 -04:00
Scott Feldman
13bb8e2eb3 switchdev: kernel-doc cleanup on swithdev ops
Suggested-by: Ben Hutchings <ben@decadent.org.uk>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:36:53 -04:00
Eric Dumazet
d3593b5cef Revert "selinux: add a skb_owned_by() hook"
This reverts commit ca10b9e9a8.

No longer needed after commit eb8895debe
("tcp: tcp_make_synack() should use sock_wmalloc")

When under SYNFLOOD, we build lot of SYNACK and hit false sharing
because of multiple modifications done on sk_listener->sk_wmem_alloc

Since tcp_make_synack() uses sock_wmalloc(), there is no need
to call skb_set_owner_w() again, as this adds two atomic operations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 21:36:53 -04:00
Daniel Borkmann
a8cb5f556b act_bpf: add initial eBPF support for actions
This work extends the "classic" BPF programmable tc action by extending
its scope also to native eBPF code!

Together with commit e2e9b6541d ("cls_bpf: add initial eBPF support
for programmable classifiers") this adds the facility to implement fully
flexible classifier and actions for tc that can be implemented in a C
subset in user space, "safely" loaded into the kernel, and being run in
native speed when JITed.

Also, since eBPF maps can be shared between eBPF programs, it offers the
possibility that cls_bpf and act_bpf can share data 1) between themselves
and 2) between user space applications. That means that, f.e. customized
runtime statistics can be collected in user space, but also more importantly
classifier and action behaviour could be altered based on map input from
the user space application.

For the remaining details on the workflow and integration, see the cls_bpf
commit e2e9b6541d. Preliminary iproute2 part can be found under [1].

  [1] http://git.breakpoint.cc/cgit/dborkman/iproute2.git/log/?h=ebpf-act

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 19:10:44 -04:00
Daniel Borkmann
94caee8c31 ebpf: add sched_act_type and map it to sk_filter's verifier ops
In order to prepare eBPF support for tc action, we need to add
sched_act_type, so that the eBPF verifier is aware of what helper
function act_bpf may use, that it can load skb data and read out
currently available skb fields.

This is bascially analogous to 96be4325f4 ("ebpf: add sched_cls_type
and map it to sk_filter's verifier ops").

BPF_PROG_TYPE_SCHED_CLS and BPF_PROG_TYPE_SCHED_ACT need to be
separate since both will have a different set of functionality in
future (classifier vs action), thus we won't run into ABI troubles
when the point in time comes to diverge functionality from the
classifier.

The future plan for act_bpf would be that it will be able to write
into skb->data and alter selected fields mirrored in struct __sk_buff.

For an initial support, it's sufficient to map it to sk_filter_ops.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jiri Pirko <jiri@resnulli.us>
Reviewed-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 19:10:44 -04:00
David S. Miller
0fa74a4be4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	drivers/net/ethernet/emulex/benet/be_main.c
	net/core/sysctl_net_core.c
	net/ipv4/inet_diag.c

The be_main.c conflict resolution was really tricky.  The conflict
hunks generated by GIT were very unhelpful, to say the least.  It
split functions in half and moved them around, when the real actual
conflict only existed solely inside of one function, that being
be_map_pci_bars().

So instead, to resolve this, I checked out be_main.c from the top
of net-next, then I applied the be_main.c changes from 'net' since
the last time I merged.  And this worked beautifully.

The inet_diag.c and sysctl_net_core.c conflicts were simple
overlapping changes, and were easily to resolve.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:51:09 -04:00
Herbert Xu
6626af6926 rhashtable: Fix undeclared EEXIST build error on ia64
We need to include linux/errno.h in rhashtable.h since it doesn't
always get included otherwise.

Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 18:18:45 -04:00
Herbert Xu
dc0ee268d8 rhashtable: Rip out obsolete out-of-line interface
Now that all rhashtable users have been converted over to the
inline interface, this patch removes the unused out-of-line
interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Herbert Xu
02fd97c3d4 rhashtable: Allow hash/comparison functions to be inlined
This patch deals with the complaint that we make indirect function
calls on the fast paths unnecessarily in rhashtable.  We resolve
it by moving the fast paths into inline functions that take struct
rhashtable_param (which obviously must be the same set of parameters
supplied to rhashtable_init) as an argument.

The only remaining indirect call is to obj_hashfn (or key_hashfn it
obj_hashfn is unset) on the rehash as well as the insert-during-
rehash slow path.

This patch also extends the support of vairable-length keys to
include those where the key is fixed but scattered in the object.
For example, in netlink we want to key off the namespace and the
portid but they're not next to each other.

This patch does this by directly using the object hash function
as the indicator of whether the key is accessible or not.  It
also adds a new function obj_cmpfn to compare a key against an
object.  This means that the caller no longer needs to supply
explicit compare functions.

All this is done in a backwards compatible manner so no existing
users are affected until they convert to the new interface.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Herbert Xu
488fb86ee9 rhashtable: Make rhashtable_init params argument const
This patch marks the rhashtable_init params argument const as
there is no reason to modify it since we will always make a copy
of it in the rhashtable.

This patch also fixes a bug where we don't actually round up the
value of min_size unless it is less than HASH_MIN_SIZE.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 16:16:24 -04:00
Eric Dumazet
becb74f0ac net: increase sk_[max_]ack_backlog
sk_ack_backlog & sk_max_ack_backlog were 16bit fields, meaning
listen() backlog was limited to 65535.

It is time to increase the width to allow much bigger backlog,
if admins change /proc/sys/net/core/somaxconn &
/proc/sys/net/ipv4/tcp_max_syn_backlog default values.

Tested:

echo 5000000 >/proc/sys/net/core/somaxconn
echo 5000000 >/proc/sys/net/ipv4/tcp_max_syn_backlog

Ran a SYNFLOOD test against a listener using listen(fd, 5000000)

myhost~# grep request_sock_TCP /proc/slabinfo
request_sock_TCP  4185642 4411940    304   13    1 : tunables   54   27    8 : slabdata 339380 339380      0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet
fa76ce7328 inet: get rid of central tcp/dccp listener timer
One of the major issue for TCP is the SYNACK rtx handling,
done by inet_csk_reqsk_queue_prune(), fired by the keepalive
timer of a TCP_LISTEN socket.

This function runs for awful long times, with socket lock held,
meaning that other cpus needing this lock have to spin for hundred of ms.

SYNACK are sent in huge bursts, likely to cause severe drops anyway.

This model was OK 15 years ago when memory was very tight.

We now can afford to have a timer per request sock.

Timer invocations no longer need to lock the listener,
and can be run from all cpus in parallel.

With following patch increasing somaxconn width to 32 bits,
I tested a listener with more than 4 million active request sockets,
and a steady SYNFLOOD of ~200,000 SYN per second.
Host was sending ~830,000 SYNACK per second.

This is ~100 times more what we could achieve before this patch.

Later, we will get rid of the listener hash and use ehash instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Eric Dumazet
52452c5425 inet: drop prev pointer handling in request sock
When request sock are put in ehash table, the whole notion
of having a previous request to update dl_next is pointless.

Also, following patch will get rid of big purge timer,
so we want to delete a request sock without holding listener lock.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-20 12:40:25 -04:00
Linus Torvalds
01d62ee520 A set of pin control fixes for the v4.0 release cycle:
- Fix up consumer return values on pin control stubs.
 - Four patches fixing up the interrupt handling and
   sleep context save in the Baytrail driver.
 - Make default output directions work properly in the
   Cherryview driver.
 - Fix interrupt locking in the AT91 driver.
 - Fix setting interrupt generating lines as input in
   the sunxi driver.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVCo4HAAoJEEEQszewGV1z+HQQAMAzKL9igh3iCDR8p3tmB1sp
 ZBTgWl4NHT9MFeA1AmilVEn1JbmUDmDetTN7P/sVC8mxaJiheY8PrFbj4bwBsvzA
 JV0mlSBSj7jw8CxNM9rzSWZxRNtdpKr45siLA1SBPdni2x711SRW1H57eK73UQnQ
 PEVW9hWzXY4cHU6Q7dX67YDJuteQsu5A1QCy6hBYX4Kyyy5gT8RJs6lAvx2f5k8g
 Gsdgs60T/bTmIAiQT3FIf6VUQezW2m1PZn2fhuJeUCZWM17ej2YjVoanQUKu3Bz5
 GPvV3wt2FpbKJsum5p4FJQPRD2qPsuq4jg7Msk6QOiVWHOl/QtL30AnS6N/iQ97z
 TlblAH3ze2t182rHeI4J8d4FIX8jRfftb9DHlgBhLZFU4k0EanMvqEWN6/8yqgmy
 n5nYUB88y6rI5RRLoGAStudlRIHqpz0fFvsU6IW5aZ7wpobIv+JPtMBUbfIKdQBV
 Xj39LDJj9W5jtI7Icl2v8q7oTknnEa7rUuH/VYbptMLkXBxndWPKG2JLFwfhQ3Py
 KMZvFdLP7E2uAR89KNvqQxbQQuYOK8wx5T2nFV57wVPX6VFv/G9sKUqdS5F7iraS
 qg8aBloBN9k79mBBWyIo/XEdluYC/zuKf2KPRvH7UhXep+e9iqCSxWyXqgbgv6F9
 MtWvNLoK9kZyxymnqbmJ
 =qp0z
 -----END PGP SIGNATURE-----

Merge tag 'pinctrl-v4.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
 "Here is a slew of pin control fixes I've accumulated for the v4.0
  kernel.  Nothing special, just driver fixes (mainly embedded Intel it
  seems) and a misunderstanding regarding the stub functions was
  reverted:

   - Fix up consumer return values on pin control stubs.
   - Four patches fixing up the interrupt handling and sleep context
     save in the Baytrail driver.
   - Make default output directions work properly in the Cherryview
     driver.
   - Fix interrupt locking in the AT91 driver.
   - Fix setting interrupt generating lines as input in the sunxi
     driver"

* tag 'pinctrl-v4.0-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: sun4i: GPIOs configured as irq must be set to input before reading
  pinctrl: at91: move lock/unlock_as_irq calls into request/release
  pinctrl: update direction_output function of cherryview driver
  pinctrl: baytrail: Save pin context over system sleep
  pinctrl: baytrail: Rework interrupt handling
  pinctrl: baytrail: Clear interrupt triggering from pins that are in GPIO mode
  pinctrl: baytrail: Relax GPIO request rules
  Revert "pinctrl: consumer: use correct retval for placeholder functions"
2015-03-19 15:52:28 -07:00
David S. Miller
970282d0e1 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-03-19

This wont the last 4.1 bluetooth-next pull request, but we've piled up
enough patches in less than a week that I wanted to save you from a
single huge "last-minute" pull somewhere closer to the merge window.

The main changes are:

 - Simultaneous LE & BR/EDR discovery support for HW that can do it
 - Complete LE OOB pairing support
 - More fine-grained mgmt-command access control (normal user can now do
   harmless read-only operations).
 - Added RF power amplifier support in cc2520 ieee802154 driver
 - Some cleanups/fixes in ieee802154 code

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-19 15:18:04 -04:00
Linus Torvalds
47226fe1b5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix packet header offset calculation in _decode_session6(), from
    Hajime Tazaki.

 2) Fix route leak in error paths of xfrm_lookup(), from Huaibin Wang.

 3) Be sure to clear state properly when scans fail in iwlwifi mvm code,
    from Luciano Coelho.

 4) iwlwifi tries to stop scans that aren't actually running, also from
    Luciano Coelho.

 5) mac80211 should drop mesh frames that are not encrypted, fix from
    Bob Copeland.

 6) Add new device ID to b43 wireless driver for BCM432228 chips, from
    Rafał Miłecki.

 7) Fix accidental addition of members after variable sized array in
    struct tc_u_hnode, from WANG Cong.

 8) Don't re-enable interrupts until after we call napi_complete() in
    ibmveth and WIZnet drivers, frm Yongbae Park.

 9) Fix regression in vlan tag handling of fec driver, from Fugang Duan.

10) If a network namespace change fails during rtnl_newlink(), we don't
    unwind the device registry properly.

11) Fix two TCP regressions, from Neal Cardwell:
  - Don't allow snd_cwnd_cnt to accumulate huge values due to missing
    test in tcp_cong_avoid_ai().
  - Restore CUBIC back to advancing cwnd by 1.5x packets per RTT.

12) Fix performance regression in xne-netback involving push TX
    notifications, from David Vrabel.

13) __skb_tstamp_tx() can be called with a NULL sk pointer, do not
    dereference blindly.  From Willem de Bruijn.

14) Fix potential stack overflow in RDS protocol stack, from Arnd
    Bergmann.

15) VXLAN_VID_MASK used incorrectly in new remote checksum offload
    support of VXLAN driver.  Fix from Alexey Kodanev.

16) Fix too small netlink SKB allocation in inet_diag layer, from Eric
    Dumazet.

17) ieee80211_check_combinations() does not count interfaces correctly,
    from Andrei Otcheretianski.

18) Hardware feature determination in bxn2x driver references a piece of
    software state that actually isn't initialized yet, fix from Michal
    Schmidt.

19) inet_csk_wait_for_connect() needs a sched_annotate_sleep()
    annoation, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (56 commits)
  Revert "net: cx82310_eth: use common match macro"
  net/mlx4_en: Set statistics bitmap at port init
  IB/mlx4: Saturate RoCE port PMA counters in case of overflow
  net/mlx4_en: Fix off-by-one in ethtool statistics display
  IB/mlx4: Verify net device validity on port change event
  act_bpf: allow non-default TC_ACT opcodes as BPF exec outcome
  Revert "smc91x: retrieve IRQ and trigger flags in a modern way"
  inet: Clean up inet_csk_wait_for_connect() vs. might_sleep()
  ip6_tunnel: fix error code when tunnel exists
  netdevice.h: fix ndo_bridge_* comments
  bnx2x: fix encapsulation features on 57710/57711
  mac80211: ignore CSA to same channel
  nl80211: ignore HT/VHT capabilities without QoS/WMM
  mac80211: ask for ECSA IE to be considered for beacon parse CRC
  mac80211: count interfaces correctly for combination checks
  isdn: icn: use strlcpy() when parsing setup options
  rxrpc: bogus MSG_PEEK test in rxrpc_recvmsg()
  caif: fix MSG_OOB test in caif_seqpkt_recvmsg()
  bridge: reset bridge mtu after deleting an interface
  can: kvaser_usb: Fix tx queue start/stop race conditions
  ...
2015-03-19 11:19:44 -07:00
Jörg Thalheim
af615762e9 bridge: add ageing_time, stp_state, priority over netlink
Signed-off-by: Jörg Thalheim <joerg@higgsboson.tk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 23:21:06 -04:00
David S. Miller
99c4a26a15 net: Fix high overhead of vlan sub-device teardown.
When a networking device is taken down that has a non-trivial number
of VLAN devices configured under it, we eat a full synchronize_net()
for every such VLAN device.

This is because of the call chain:

	NETDEV_DOWN notifier
	--> vlan_device_event()
		--> dev_change_flags()
		--> __dev_change_flags()
		--> __dev_close()
		--> __dev_close_many()
		--> dev_deactivate_many()
			--> synchronize_net()

This is kind of rediculous because we already have infrastructure for
batching doing operation X to a list of net devices so that we only
incur one sync.

So make use of that by exporting dev_close_many() and adjusting it's
interfaace so that the caller can fully manage the batch list.  Use
this in vlan_device_event() and all the overhead goes away.

Reported-by: Salam Noureddine <noureddine@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:52:56 -04:00
David Ahern
db24a9044e net: add support for phys_port_name
Similar to port id allow netdevices to specify port names and export
the name via sysfs. Drivers can implement the netdevice operation to
assist udev in having sane default names for the devices using the
rule:

$ cat /etc/udev/rules.d/80-net-setup-link.rules
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_port_name}!="",
NAME="$attr{phys_port_name}"

Use of phys_name versus phys_id was suggested-by Jiri Pirko.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Scott Feldman <sfeldma@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:30:35 -04:00
Marcelo Ricardo Leitner
54ff9ef36b ipv4, ipv6: kill ip_mc_{join, leave}_group and ipv6_sock_mc_{join, drop}
in favor of their inner __ ones, which doesn't grab rtnl.

As these functions need to operate on a locked socket, we can't be
grabbing rtnl by then. It's too late and doing so causes reversed
locking.

So this patch:
- move rtnl handling to callers instead while already fixing some
  reversed locking situations, like on vxlan and ipvs code.
- renames __ ones to not have the __ mark:
  __ip_mc_{join,leave}_group -> ip_mc_{join,leave}_group
  __ipv6_sock_mc_{join,drop} -> ipv6_sock_mc_{join,drop}

Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:05:09 -04:00
Eric Dumazet
08d2cc3b26 inet: request sock should init IPv6/IPv4 addresses
In order to be able to use sk_ehashfn() for request socks,
we need to initialize their IPv6/IPv4 addresses.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet
b4d6444ea3 inet: get rid of last __inet_hash_connect() argument
We now always call __inet_hash_nolisten(), no need to pass it
as an argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet
77a6a471bc ipv6: get rid of __inet6_hash()
We can now use inet_hash() and __inet_hash() instead of private
functions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:35 -04:00
Eric Dumazet
d1e559d0b1 inet: add IPv6 support to sk_ehashfn()
Intent is to converge IPv4 & IPv6 inet_hash functions to
factorize code.

IPv4 sockets initialize sk_rcv_saddr and sk_v6_daddr
in this patch, thanks to new sk_daddr_set() and sk_rcv_saddr_set()
helpers.

__inet6_hash can now use sk_ehashfn() instead of a private
inet6_sk_ehashfn() and will simply use __inet_hash() in a
following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Eric Dumazet
5b441f76f1 net: introduce sk_ehashfn() helper
Goal is to unify IPv4/IPv6 inet_hash handling, and use common helpers
for all kind of sockets (full sockets, timewait and request sockets)

inet_sk_ehashfn() becomes sk_ehashfn() but still only copes with IPv4

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Eric Dumazet
6eada0110c netns: constify net_hash_mix() and various callers
const qualifiers ease code review by making clear
which objects are not written in a function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 22:00:34 -04:00
Or Gerlitz
fc31e2560a net/mlx4_core: Add basic support for QP max-rate limiting
Add the low-level device commands and definitions used for QP max-rate limiting.

This is done through the following elements:

  - read rate-limit device caps in QUERY_DEV_CAP: number of different
    rates and the min/max rates in Kbs/Mbs/Gbs units

  - enhance the QP context struct to contain rate limit units and value

  - allow to do run time rate-limit setting to QPs through the
    update-qp firmware command

  - QP rate-limiting is disallowed for VFs

Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 14:55:19 -04:00
John Fastabend
822b3b2ebf net: Add max rate tx queue attribute
This adds a tx_maxrate attribute to the tx queue sysfs entry allowing
for max-rate limiting. Along with DCB-ETS and BQL this provides another
knob to tune queue performance. The limit units are Mbps.

By default it is disabled. To disable the rate limitation after it
has been set for a queue, it should be set to zero.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-03-18 14:55:18 -04:00