capability overrides apply only to the default case; if fs has ->permission()
that does _not_ call generic_permission(), we have no business doing them.
Moreover, if it has ->permission() that does call generic_permission(), we
have no need to recheck capabilities.
Besides, the capability overrides should apply only if we got EACCES from
acl_permission_check(); any other value (-EIO, etc.) should be returned
to caller, capabilities or not capabilities.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Btrfs (and I'd venture most other fs's) stores its indexes in nice disk order
for readdir, but unfortunately in the case of anything that stats the files in
order that readdir spits back (like oh say ls) that means we still have to do
the normal lookup of the file, which means looking up our other index and then
looking up the inode. What I want is a way to create dummy dentries when we
find them in readdir so that when ls or anything else subsequently does a
stat(), we already have the location information in the dentry and can go
straight to the inode itself. The lookup stuff just assumes that if it finds a
dentry it is done, it doesn't perform a lookup. So add a DCACHE_NEED_LOOKUP
flag so that the lookup code knows it still needs to run i_op->lookup() on the
parent to get the inode for the dentry. I have tested this with btrfs and I
went from something that looks like this
http://people.redhat.com/jwhiter/ls-noreada.png
To this
http://people.redhat.com/jwhiter/ls-good.png
Thats a savings of 1300 seconds, or 22 minutes. That is a significant savings.
Thanks,
Signed-off-by: Josef Bacik <josef@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't update *inode in __follow_mount_rcu() until we'd verified that
there is mountpoint there. Kudos to Hugh Dickins for catching that
one in the first place and eventually figuring out the solution (and
catching a braino in the earlier version of patch).
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Make sure that child is still a child of parent before nested locking
of child->d_lock in unlazy_walk(); otherwise we are risking a violation
of locking order and deadlocks.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
[Kudos to dhowells for tracking that crap down]
If two processes attempt to cause automounting on the same mountpoint at the
same time, the vfsmount holding the mountpoint will be left with one too few
references on it, causing a BUG when the kernel tries to clean up.
The problem is that lock_mount() drops the caller's reference to the
mountpoint's vfsmount in the case where it finds something already mounted on
the mountpoint as it transits to the mounted filesystem and replaces path->mnt
with the new mountpoint vfsmount.
During a pathwalk, however, we don't take a reference on the vfsmount if it is
the same as the one in the nameidata struct, but do_add_mount() doesn't know
this.
The fix is to make sure we have a ref on the vfsmount of the mountpoint before
calling do_add_mount(). However, if lock_mount() doesn't transit, we're then
left with an extra ref on the mountpoint vfsmount which needs releasing.
We can handle that in follow_managed() by not making assumptions about what
we can and what we cannot get from lookup_mnt() as the current code does.
The callers of follow_managed() expect that reference to path->mnt will be
grabbed iff path->mnt has been changed. follow_managed() and follow_automount()
keep track of whether such reference has been grabbed and assume that it'll
happen in those and only those cases that'll have us return with changed
path->mnt. That assumption is almost correct - it breaks in case of
racing automounts and in even harder to hit race between following a mountpoint
and a couple of mount --move. The thing is, we don't need to make that
assumption at all - after the end of loop in follow_manage() we can check
if path->mnt has ended up unchanged and do mntput() if needed.
The BUG can be reproduced with the following test program:
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <sys/wait.h>
int main(int argc, char **argv)
{
int pid, ws;
struct stat buf;
pid = fork();
stat(argv[1], &buf);
if (pid > 0) wait(&ws);
return 0;
}
and the following procedure:
(1) Mount an NFS volume that on the server has something else mounted on a
subdirectory. For instance, I can mount / from my server:
mount warthog:/ /mnt -t nfs4 -r
On the server /data has another filesystem mounted on it, so NFS will see
a change in FSID as it walks down the path, and will mark /mnt/data as
being a mountpoint. This will cause the automount code to be triggered.
!!! Do not look inside the mounted fs at this point !!!
(2) Run the above program on a file within the submount to generate two
simultaneous automount requests:
/tmp/forkstat /mnt/data/testfile
(3) Unmount the automounted submount:
umount /mnt/data
(4) Unmount the original mount:
umount /mnt
At this point the kernel should throw a BUG with something like the
following:
BUG: Dentry ffff880032e3c5c0{i=2,n=} still in use (1) [unmount of nfs4 0:12]
Note that the bug appears on the root dentry of the original mount, not the
mountpoint and not the submount because sys_umount() hasn't got to its final
mntput_no_expire() yet, but this isn't so obvious from the call trace:
[<ffffffff8117cd82>] shrink_dcache_for_umount+0x69/0x82
[<ffffffff8116160e>] generic_shutdown_super+0x37/0x15b
[<ffffffffa00fae56>] ? nfs_super_return_all_delegations+0x2e/0x1b1 [nfs]
[<ffffffff811617f3>] kill_anon_super+0x1d/0x7e
[<ffffffffa00d0be1>] nfs4_kill_super+0x60/0xb6 [nfs]
[<ffffffff81161c17>] deactivate_locked_super+0x34/0x83
[<ffffffff811629ff>] deactivate_super+0x6f/0x7b
[<ffffffff81186261>] mntput_no_expire+0x18d/0x199
[<ffffffff811862a8>] mntput+0x3b/0x44
[<ffffffff81186d87>] release_mounts+0xa2/0xbf
[<ffffffff811876af>] sys_umount+0x47a/0x4ba
[<ffffffff8109e1ca>] ? trace_hardirqs_on_caller+0x1fd/0x22f
[<ffffffff816ea86b>] system_call_fastpath+0x16/0x1b
as do_umount() is inlined. However, you can see release_mounts() in there.
Note also that it may be necessary to have multiple CPU cores to be able to
trigger this bug.
Tested-by: Jeff Layton <jlayton@redhat.com>
Tested-by: Ian Kent <raven@themaw.net>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Git bisection shows that commit e6bc45d65d causes
BUG_ONs under high I/O load:
kernel BUG at fs/inode.c:1368!
[ 2862.501007] Call Trace:
[ 2862.501007] [<ffffffff811691d8>] d_kill+0xf8/0x140
[ 2862.501007] [<ffffffff81169c19>] dput+0xc9/0x190
[ 2862.501007] [<ffffffff8115577f>] fput+0x15f/0x210
[ 2862.501007] [<ffffffff81152171>] filp_close+0x61/0x90
[ 2862.501007] [<ffffffff81152251>] sys_close+0xb1/0x110
[ 2862.501007] [<ffffffff814c14fb>] system_call_fastpath+0x16/0x1b
A reliable way to reproduce this bug is:
Login to KDE, run 'rsnapshot sync', and apt-get install openjdk-6-jdk,
and apt-get remove openjdk-6-jdk.
The buggy part of the patch is this:
struct inode *inode = NULL;
.....
- if (nd.last.name[nd.last.len])
- goto slashes;
inode = dentry->d_inode;
- if (inode)
- ihold(inode);
+ if (nd.last.name[nd.last.len] || !inode)
+ goto slashes;
+ ihold(inode)
...
if (inode)
iput(inode); /* truncate the inode here */
If nd.last.name[nd.last.len] is nonzero (and thus goto slashes branch is taken),
and dentry->d_inode is non-NULL, then this code now does an additional iput on
the inode, which is wrong.
Fix this by only setting the inode variable if nd.last.name[nd.last.len] is 0.
Reference: https://lkml.org/lkml/2011/6/15/50
Reported-by: Norbert Preining <preining@logic.at>
Reported-by: Török Edwin <edwintorok@gmail.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Török Edwin <edwintorok@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If user space attempts to remove a non-existent file or directory, and
the file system is mounted read-only, return ENOENT instead of EROFS.
Either error code is arguably valid/correct, but ENOENT is a more
specific error message.
Reported-by: Michael Tokarev <mjt@tls.msk.ru>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The dentry_unhash push-down series missed that shink_dcache_parent needs to
be called prior to rmdir or dir rename to clear DCACHE_REFERENCED and
allow efficient dentry reclaim.
Reported-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and kill a useless local variable in follow_dotdot_rcu(), while
we are at it - follow_mount_rcu(nd, path, inode) *always* assigned
value to *inode, and always it had been path->dentry->d_inode (aka
nd->path.dentry->d_inode, since it always got &nd->path as the second
argument).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (25 commits)
cifs: remove unnecessary dentry_unhash on rmdir/rename_dir
ocfs2: remove unnecessary dentry_unhash on rmdir/rename_dir
exofs: remove unnecessary dentry_unhash on rmdir/rename_dir
nfs: remove unnecessary dentry_unhash on rmdir/rename_dir
ext2: remove unnecessary dentry_unhash on rmdir/rename_dir
ext3: remove unnecessary dentry_unhash on rmdir/rename_dir
ext4: remove unnecessary dentry_unhash on rmdir/rename_dir
btrfs: remove unnecessary dentry_unhash in rmdir/rename_dir
ceph: remove unnecessary dentry_unhash calls
vfs: clean up vfs_rename_other
vfs: clean up vfs_rename_dir
vfs: clean up vfs_rmdir
vfs: fix vfs_rename_dir for FS_RENAME_DOES_D_MOVE filesystems
libfs: drop unneeded dentry_unhash
vfs: update dentry_unhash() comment
vfs: push dentry_unhash on rename_dir into file systems
vfs: push dentry_unhash on rmdir into file systems
vfs: remove dget() from dentry_unhash()
vfs: dentry_unhash immediately prior to rmdir
vfs: Block mmapped writes while the fs is frozen
...
vfs_rename_dir() doesn't properly account for filesystems with
FS_RENAME_DOES_D_MOVE. If new_dentry has a target inode attached, it
unhashes the new_dentry prior to the rename() iop and rehashes it after,
but doesn't account for the possibility that rename() may have swapped
{old,new}_dentry. For FS_RENAME_DOES_D_MOVE filesystems, it rehashes
new_dentry (now the old renamed-from name, which d_move() expected to go
away), such that a subsequent lookup will find it. Currently all
FS_RENAME_DOES_D_MOVE filesystems compensate for this by failing in
d_revalidate.
The bug was introduced by: commit 349457ccf2
"[PATCH] Allow file systems to manually d_move() inside of ->rename()"
Fix by not rehashing the new dentry. Rehashing used to be needed by
d_move() but isn't anymore.
Reported-by: Sage Weil <sage@newdream.net>
Signed-off-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The helper is now only called by file systems, not the VFS.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only a few file systems need this. Start by pushing it down into each
rename method (except gfs2 and xfs) so that it can be dealt with on a
per-fs basis.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Only a few file systems need this. Start by pushing it down into each
fs rmdir method (except gfs2 and xfs) so it can be dealt with on a per-fs
basis.
This does not change behavior for any in-tree file systems.
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This serves no useful purpose that I can discern. All callers (rename,
rmdir) hold their own reference to the dentry.
A quick audit of all file systems showed no relevant checks on the value
of d_count in vfs_rmdir/vfs_rename_dir paths.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This presumes that there is no reason to unhash a dentry if we fail because
it is a mountpoint or the LSM check fails, and that the LSM checks do not
depend on the dentry being unhashed.
Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: complete_walk(). Done on successful completion
of walk, drops out of RCU mode, does d_revalidate of final
result if that hadn't been done already.
handle_reval_dot() and nameidata_drop_rcu_last() subsumed into
that one; callers converted to use of complete_walk().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
This solves a serious VFS-level bug in nested_symlink (which was
rewritten from do_follow_link), and follows the order of depth tests
that existed before.
The bug triggers a BUG_ON in fs/namei.c:1381, when running racer with
symlink and rename ops.
Signed-off-by: Erez Zadok <ezk@cs.sunysb.edu>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It's a hot function, and we're better off not mixing types in the mask
calculations. The compiler just ends up mixing 16-bit and 32-bit
operations, for no good reason.
So do everything in 'unsigned int' rather than mixing 'unsigned int'
masking with a 'umode_t' (16-bit) mode variable.
This, together with the parent commit (47a150edc2: "Cache user_ns in
struct cred") makes acl_permission_check() much nicer.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During RCU walk in path_lookupat and path_openat, the rcu lookup
frequently failed if looking up an absolute path, because when root
directory was looked up, seq number was not properly set in nameidata.
We dropped out of RCU walk in nameidata_drop_rcu due to mismatch in
directory entry's seq number. We reverted to slow path walk that need
to take references.
With the following patch, I saw a 50% increase in an exim mail server
benchmark throughput on a 4-socket Nehalem-EX system.
Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
Reviewed-by: Andi Kleen <ak@linux.intel.com>
Cc: stable@kernel.org (v2.6.38)
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When following a mount in rcu-walk mode we must check if the incoming dentry
is telling us it may need to block, even if it isn't actually a mountpoint.
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
deal with races in /proc/*/{syscall,stack,personality}
proc: enable writing to /proc/pid/mem
proc: make check_mem_permission() return an mm_struct on success
proc: hold cred_guard_mutex in check_mem_permission()
proc: disable mem_write after exec
mm: implement access_remote_vm
mm: factor out main logic of access_process_vm
mm: use mm_struct to resolve gate vma's in __get_user_pages
mm: arch: rename in_gate_area_no_task to in_gate_area_no_mm
mm: arch: make in_gate_area take an mm_struct instead of a task_struct
mm: arch: make get_gate_vma take an mm_struct instead of a task_struct
x86: mark associated mm when running a task in 32 bit compatibility mode
x86: add context tag to mark mm when running a task in 32-bit compatibility mode
auxv: require the target to be tracable (or yourself)
close race in /proc/*/environ
report errors in /proc/*/*map* sanely
pagemap: close races with suid execve
make sessionid permissions in /proc/*/task/* match those in /proc/*
fix leaks in path_lookupat()
Fix up trivial conflicts in fs/proc/base.c
And give it a kernel-doc comment.
[akpm@linux-foundation.org: btrfs changed in linux-next]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Cheat for now and say all files belong to init_user_ns. Next step will be
to let superblocks belong to a user_ns, and derive inode_userns(inode)
from inode->i_sb->s_user_ns. Finally we'll introduce more flexible
arrangements.
Changelog:
Feb 15: make is_owner_or_cap take const struct inode
Feb 23: make is_owner_or_cap bool
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
Acked-by: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* pull the handling of current->total_link_count into
__do_follow_link()
* put the common "do ->put_link() if needed and path_put() the link"
stuff into a helper (put_link(nd, link, cookie))
* rename __do_follow_link() to follow_link(), while we are at it
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The last remaining place (resolution of nested symlink) converted
to the loop of the same kind we have in path_lookupat() and
path_openat().
Note that we still *do* have a recursion in pathname resolution;
can't avoid it, really. However, it's strictly for nested symlinks
now - i.e. ones in the middle of a pathname.
link_path_walk() has lost the tail now - it always walks everything
except the last component.
do_follow_link() renamed to nested_symlink() and moved down.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now that link_path_walk() is called without LOOKUP_PARENT
only from do_follow_link(), we can simplify the checks in
last component handling. First of all, checking if we'd
arrived to a directory is not needed - the caller will check
it anyway. And LOOKUP_FOLLOW is guaranteed to be there,
since we only get to that place with nd->depth > 0.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: walk_component(). Handles everything except symlinks;
returns negative on error, 0 on success and 1 on symlinks we decided
to follow. Drops out of RCU mode on such symlinks.
link_path_walk() and do_last() switched to using that.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We don't want to allow creation of private hardlinks by different application
using the fd passed to them via SCM_RIGHTS. So limit the null relative name
usage in linkat syscall to CAP_DAC_READ_SEARCH
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
At that point we can't do almost nothing with them. They can be opened
with O_PATH, we can manipulate such descriptors with dup(), etc. and
we can see them in /proc/*/{fd,fdinfo}/*.
We can't (and won't be able to) follow /proc/*/fd/* symlinks for those;
there's simply not enough information for pathname resolution to go on
from such point - to resolve a symlink we need to know which directory
does it live in.
We will be able to do useful things with them after the next commit, though -
readlinkat() and fchownat() will be possible to use with dfd being an
O_PATH-opened symlink and empty relative pathname. Combined with
open_by_handle() it'll give us a way to do realink-by-handle and
lchown-by-handle without messing with more redundant syscalls.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New flag for open(2) - O_PATH. Semantics:
* pathname is resolved, but the file itself is _NOT_ opened
as far as filesystem is concerned.
* almost all operations on the resulting descriptors shall
fail with -EBADF. Exceptions are:
1) operations on descriptors themselves (i.e.
close(), dup(), dup2(), dup3(), fcntl(fd, F_DUPFD),
fcntl(fd, F_DUPFD_CLOEXEC, ...), fcntl(fd, F_GETFD),
fcntl(fd, F_SETFD, ...))
2) fcntl(fd, F_GETFL), for a common non-destructive way to
check if descriptor is open
3) "dfd" arguments of ...at(2) syscalls, i.e. the starting
points of pathname resolution
* closing such descriptor does *NOT* affect dnotify or
posix locks.
* permissions are checked as usual along the way to file;
no permission checks are applied to the file itself. Of course,
giving such thing to syscall will result in permission checks (at
the moment it means checking that starting point of ....at() is
a directory and caller has exec permissions on it).
fget() and fget_light() return NULL on such descriptors; use of
fget_raw() and fget_raw_light() is needed to get them. That protects
existing code from dealing with those things.
There are two things still missing (they come in the next commits):
one is handling of symlinks (right now we refuse to open them that
way; see the next commit for semantics related to those) and another
is descriptor passing via SCM_RIGHTS datagrams.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add inode->i_nlink == 0 check in VFS. Some of the file systems
do this internally. A followup patch will remove those instance.
This is needed to ensure that with link by handle we don't allow
to create hardlink of an unlinked file. The check also prevent a race
between unlink and link
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
For name_to_handle_at(2) we'll want both ...at()-style syscall that
would be usable for non-directory descriptors (with empty relative
pathname). Introduce new flag (AT_EMPTY_PATH) to deal with that and
corresponding LOOKUP_EMPTY; teach user_path_at() and path_init() to
deal with the latter.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new function: file_open_root(dentry, mnt, name, flags) opens the file
vfs_path_lookup would arrive to.
Note that name can be empty; in that case the usual requirement that
dentry should be a directory is lifted.
open-coded equivalents switched to it, may_open() got down exactly
one caller and became static.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New lookup flag: LOOKUP_ROOT. nd->root is set (and held) by caller,
path_init() starts walking from that place and all pathname resolution
machinery never drops nd->root if that flag is set. That turns
vfs_path_lookup() into a special case of do_path_lookup() *and*
gets us down to 3 callers of link_path_walk(), making it finally
feasible to rip the handling of trailing symlink out of link_path_walk().
That will not only simply the living hell out of it, but make life
much simpler for unionfs merge. Trailing symlink handling will
become iterative, which is a good thing for stack footprint in
a lot of situations as well.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
That thing has devolved into rats nest of gotos; sane use of unlikely()
gets rid of that horror and gives much more readable structure:
* make a fast attempt to find a dentry; false negatives are OK.
In RCU mode if everything went fine, we are done, otherwise just drop
out of RCU. If we'd done (RCU) ->d_revalidate() and it had not refused
outright (i.e. didn't give us -ECHILD), remember its result.
* now we are not in RCU mode and hopefully have a dentry. If we
do not, lock parent, do full d_lookup() and if that has not found anything,
allocate and call ->lookup(). If we'd done that ->lookup(), remember that
dentry is good and we don't need to revalidate it.
* now we have a dentry. If it has ->d_revalidate() and we can't
skip it, call it.
* hopefully dentry is good; if not, either fail (in case of error)
or try to invalidate it. If d_invalidate() has succeeded, drop it and
retry everything as if original attempt had not found a dentry.
* now we can finish it up - deal with mountpoint crossing and
automount.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
There used to be time when ->d_revalidate() couldn't return an error.
So intents code had lookup_instantiate_filp() stash ERR_PTR(error)
in nd->intent.open.filp and had it checked after lookup_hash(), to
catch the otherwise silent failures. That had been introduced by
commit 4af4c52f34. These days
->d_revalidate() can and does propagate errors back to callers
explicitly, so this check isn't needed anymore.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We have a bunch of diverging codepaths in do_last(); some of
them converge, but the case of having to create a new file
duplicates large part of common tail of the rest and exits
separately. Massage them so that they could be merged.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Lift it to lookup_one_len() and link_path_walk() resp. into the
same place where we calculated default hash function of the same
name.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of path_lookupat() doing trailing symlink resolution,
use the same scheme as on the O_CREAT side. Walk with
LOOKUP_PARENT, then (in do_last()) look the final component
up, then either open it or return error or, if it's a symlink,
give the symlink back to path_openat() to be resolved there.
The really messy complication here is RCU. We don't want to drop
out of RCU mode before the final lookup, since we don't want to
bounce parent directory ->d_count without a good reason.
Result is _not_ pretty; later in the series we'll clean it up.
For now we are roughly back where we'd been before the revert
done by Nick's series - top-level logics of path_openat() is
cleaned up, do_last() does actual opening, symlink resolution is
done uniformly.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Don't stash the struct file * used as starting point of walk in nameidata;
pass file ** to path_init() instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper: terminate_walk(). An error has happened during pathname
resolution and we either drop nd->path or terminate RCU, depending
the mode we had been in. After that, nd is essentially empty.
Switch link_path_walk() to using that for cleanup.
Now the top-level logics in link_path_walk() is back to sanity. RCU
dependencies are in the lower-level functions.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Now we have do_follow_link() guaranteed to leave without dangling RCU
and the next step will get LOOKUP_RCU logics completely out of
link_path_walk().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
new helper: path_openat(). Does what do_filp_open() does, except
that it tries only the walk mode (RCU/normal/force revalidation)
it had been told to.
Both create and non-create branches are using path_lookupat() now.
Fixed the double audit_inode() in non-create branch.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
take calculation of open_flags by open(2) arguments into new helper
in fs/open.c, move filp_open() over there, have it and do_sys_open()
use that helper, switch exec.c callers of do_filp_open() to explicit
(and constant) struct open_flags.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
No point messing with passing shitloads of "operation mode" arguments
to do_open() one by one, especially since they are not going to change
during do_filp_open(). Collect them into a struct, fill it and pass
to do_last() by reference.
Make sure that lookup intent flags are correctly set and removed - we
want them for do_last(), but they make no sense for __do_follow_link().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
instead of ad-hackery around need_reval_dot(), do the following:
set a flag (LOOKUP_JUMPED) in the beginning of path, on absolute
symlink traversal, on ".." and on procfs-style symlinks. Clear on
normal components, leave unchanged on ".". Non-nested callers of
link_path_walk() call handle_reval_path(), which checks that flag
is set and that fs does want the final revalidate thing, then does
->d_revalidate(). In link_path_walk() all the return_reval stuff
is gone.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Actual dependency on whether we want RCU or not is in 3 small areas
(as it ought to be) and everything around those is the same in both
versions. Since each function has only one caller and those callers
are on two sides of if (flags & LOOKUP_RCU), it's easier and cleaner
to merge them and pull the checks inside.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
New helper: path_lookupat(). Basically, what do_path_lookup() boils to
modulo -ECHILD/-ESTALE handler. path_walk* family is gone; vfs_path_lookup()
is using link_path_walk() directly, do_path_lookup() and do_filp_open()
are using path_lookupat().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all remaining callers pass LOOKUP_PARENT to it, so
flags argument can die; renamed to kern_path_parent()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We leave it at whatever it had been pointing to after the
first link_path_walk() had failed with -ESTALE. Things
do not work well after that...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
failure exits on the no-O_CREAT side of do_filp_open() merge with
those of O_CREAT one; unfortunately, if do_path_lookup() returns
-ESTALE, we'll get out_filp:, notice that we are about to return
-ESTALE without having trying to create the sucker with LOOKUP_REVAL
and jump right into the O_CREAT side of code. And proceed to try
and create a file. Usually that'll fail with -ESTALE again, but
we can race and get that attempt of pathname resolution to succeed.
open() without O_CREAT really shouldn't end up creating files, races
or not. The real fix is to rearchitect the whole do_filp_open(),
but for now splitting the failure exits will do.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
When Al moved the nameidata_dentry_drop_rcu_maybe() call into the
do_follow_link function in commit 844a391799 ("nothing in
do_follow_link() is going to see RCU"), he mistakenly left the
BUG_ON(inode != path->dentry->d_inode);
behind. Which would otherwise be ok, but that BUG_ON() really needs to
be _after_ dropping RCU, since the dentry isn't necessarily stable
otherwise.
So complete the code movement in that commit, and move the BUG_ON() into
do_follow_link() too. This means that we need to pass in 'inode' as an
argument (just for this one use), but that's a small thing. And
eventually we may be confident enough in our path lookup that we can
just remove the BUG_ON() and the unnecessary inode argument.
Reported-and-tested-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In commit 31e6b01f41 ("fs: rcu-walk for path lookup") we started doing
path lookup using RCU, which then falls back to a careful non-RCU lookup
in case of problems (LOOKUP_REVAL). So do_filp_open() has this "re-do
the lookup carefully" looping case.
However, that means that we must not release the open-intent file data
if we are going to loop around and use it once more!
Fix this by moving the release of the open-intent data to the function
that allocates it (do_filp_open() itself) rather than the helper
functions that can get called multiple times (finish_open() and
do_last()). This makes the logic for the lifetime of that field much
more obvious, and avoids the possible double free.
Reported-by: J. R. Okajima <hooanon05@yahoo.co.jp>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There is a ref count problem in fs/namei.c:do_lookup().
When walking in ref-walk mode, if follow_managed() returns a fail we
need to drop dentry and possibly vfsmount. Clean up properly,
as we do in the other caller of follow_managed().
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Instead of splitting refcount between (per-cpu) mnt_count
and (SMP-only) mnt_longrefs, make all references contribute
to mnt_count again and keep track of how many are longterm
ones.
Accounting rules for longterm count:
* 1 for each fs_struct.root.mnt
* 1 for each fs_struct.pwd.mnt
* 1 for having non-NULL ->mnt_ns
* decrement to 0 happens only under vfsmount lock exclusive
That allows nice common case for mntput() - since we can't drop the
final reference until after mnt_longterm has reached 0 due to the rules
above, mntput() can grab vfsmount lock shared and check mnt_longterm.
If it turns out to be non-zero (which is the common case), we know
that this is not the final mntput() and can just blindly decrement
percpu mnt_count. Otherwise we grab vfsmount lock exclusive and
do usual decrement-and-check of percpu mnt_count.
For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm();
namespace.c uses the latter in places where we don't already hold
vfsmount lock exclusive and opencodes a few remaining spots where
we need to manipulate mnt_longterm.
Note that we mostly revert the code outside of fs/namespace.c back
to what we used to have; in particular, normal code doesn't need
to care about two kinds of references, etc. And we get to keep
the optimization Nick's variant had bought us...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Unexport do_add_mount() and make ->d_automount() return the vfsmount to be
added rather than calling do_add_mount() itself. follow_automount() will then
do the addition.
This slightly complicates things as ->d_automount() normally wants to add the
new vfsmount to an expiration list and start an expiration timer. The problem
with that is that the vfsmount will be deleted if it has a refcount of 1 and
the timer will not repeat if the expiration list is empty.
To this end, we require the vfsmount to be returned from d_automount() with a
refcount of (at least) 2. One of these refs will be dropped unconditionally.
In addition, follow_automount() must get a 3rd ref around the call to
do_add_mount() lest it eat a ref and return an error, leaving the mount we
have open to being expired as we would otherwise have only 1 ref on it.
d_automount() should also add the the vfsmount to the expiration list (by
calling mnt_set_expiry()) and start the expiration timer before returning, if
this mechanism is to be used. The vfsmount will be unlinked from the
expiration list by follow_automount() if do_add_mount() fails.
This patch also fixes the call to do_add_mount() for AFS to propagate the mount
flags from the parent vfsmount.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Allow d_manage() to be called from pathwalk when it is in RCU-walk mode as well
as when it is in Ref-walk mode. This permits __follow_mount_rcu() to call
d_manage() directly. d_manage() needs a parameter to indicate that it is in
RCU-walk mode as it isn't allowed to sleep if in that mode (but should return
-ECHILD instead).
autofs4_d_manage() can then be set to retain RCU-walk mode if the daemon
accesses it and otherwise request dropping back to ref-walk mode.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove a further kludge from __do_follow_link() as it's no longer required with
the automount code.
This reverts the non-helper-function parts of
051d381259, which breaks union mounts.
Reported-by: vaurora@redhat.com
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of automount
point directories. This can be used by fstatat() users to permit the
gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_manage) to permit a filesystem to hold a process and make it
sleep when it tries to transit away from one of that filesystem's directories
during a pathwalk. The operation is keyed off a new dentry flag
(DCACHE_MANAGE_TRANSIT).
The filesystem is allowed to be selective about which processes it holds and
which it permits to continue on or prohibits from transiting from each flagged
directory. This will allow autofs to hold up client processes whilst letting
its userspace daemon through to maintain the directory or the stuff behind it
or mounted upon it.
The ->d_manage() dentry operation:
int (*d_manage)(struct path *path, bool mounting_here);
takes a pointer to the directory about to be transited away from and a flag
indicating whether the transit is undertaken by do_add_mount() or
do_move_mount() skipping through a pile of filesystems mounted on a mountpoint.
It should return 0 if successful and to let the process continue on its way;
-EISDIR to prohibit the caller from skipping to overmounted filesystems or
automounting, and to use this directory; or some other error code to return to
the user.
->d_manage() is called with namespace_sem writelocked if mounting_here is true
and no other locks held, so it may sleep. However, if mounting_here is true,
it may not initiate or wait for a mount or unmount upon the parameter
directory, even if the act is actually performed by userspace.
Within fs/namei.c, follow_managed() is extended to check with d_manage() first
on each managed directory, before transiting away from it or attempting to
automount upon it.
follow_down() is renamed follow_down_one() and should only be used where the
filesystem deliberately intends to avoid management steps (e.g. autofs).
A new follow_down() is added that incorporates the loop done by all other
callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS
and CIFS do use it, their use is removed by converting them to use
d_automount()). The new follow_down() calls d_manage() as appropriate. It
also takes an extra parameter to indicate if it is being called from mount code
(with namespace_sem writelocked) which it passes to d_manage(). follow_down()
ignores automount points so that it can be used to mount on them.
__follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with
DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to
sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have
that determine whether to abort or not itself. That would allow the autofs
daemon to continue on in rcu-walk mode.
Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't
required as every tranist from that directory will cause d_manage() to be
invoked. It can always be set again when necessary.
==========================
WHAT THIS MEANS FOR AUTOFS
==========================
Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to
trigger the automounting of indirect mounts, and both of these can be called
with i_mutex held.
autofs knows that the i_mutex will be held by the caller in lookup(), and so
can drop it before invoking the daemon - but this isn't so for d_revalidate(),
since the lock is only held on _some_ of the code paths that call it. This
means that autofs can't risk dropping i_mutex from its d_revalidate() function
before it calls the daemon.
The bug could manifest itself as, for example, a process that's trying to
validate an automount dentry that gets made to wait because that dentry is
expired and needs cleaning up:
mkdir S ffffffff8014e05a 0 32580 24956
Call Trace:
[<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897
[<ffffffff80127f7d>] avc_has_perm+0x46/0x58
[<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e
[<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b
[<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149
[<ffffffff80036d96>] __lookup_hash+0xa0/0x12f
[<ffffffff80057a2f>] lookup_create+0x46/0x80
[<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4
versus the automount daemon which wants to remove that dentry, but can't
because the normal process is holding the i_mutex lock:
automount D ffffffff8014e05a 0 32581 1 32561
Call Trace:
[<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b
[<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1
[<ffffffff80063c89>] .text.lock.mutex+0xf/0x14
[<ffffffff800e6d55>] do_rmdir+0x77/0xde
[<ffffffff8005d229>] tracesys+0x71/0xe0
[<ffffffff8005d28d>] tracesys+0xd5/0xe0
which means that the system is deadlocked.
This patch allows autofs to hold up normal processes whilst the daemon goes
ahead and does things to the dentry tree behind the automouter point without
risking a deadlock as almost no locks are held in d_manage() and none in
d_automount().
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation. The operation is keyed off a new
dentry flag (DCACHE_NEED_AUTOMOUNT).
This also makes it easier to add an AT_ flag to suppress terminal segment
automount during pathwalk and removes the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.
The ->d_automount() dentry operation:
struct vfsmount *(*d_automount)(struct path *mountpoint);
takes a pointer to the directory to be mounted upon, which is expected to
provide sufficient data to determine what should be mounted. If successful, it
should return the vfsmount struct it creates (which it should also have added
to the namespace using do_add_mount() or similar). If there's a collision with
another automount attempt, NULL should be returned. If the directory specified
by the parameter should be used directly rather than being mounted upon,
-EISDIR should be returned. In any other case, an error code should be
returned.
The ->d_automount() operation is called with no locks held and may sleep. At
this point the pathwalk algorithm will be in ref-walk mode.
Within fs/namei.c itself, a new pathwalk subroutine (follow_automount()) is
added to handle mountpoints. It will return -EREMOTE if the automount flag was
set, but no d_automount() op was supplied, -ELOOP if we've encountered too many
symlinks or mountpoints, -EISDIR if the walk point should be used without
mounting and 0 if successful. The path will be updated to point to the mounted
filesystem if a successful automount took place.
__follow_mount() is replaced by follow_managed() which is more generic
(especially with the patch that adds ->d_manage()). This handles transits from
directories during pathwalk, including automounting and skipping over
mountpoints (and holding processes with the next patch).
__follow_mount_rcu() will jump out of RCU-walk mode if it encounters an
automount point with nothing mounted on it.
follow_dotdot*() does not handle automounts as you don't want to trigger them
whilst following "..".
I've also extracted the mount/don't-mount logic from autofs4 and included it
here. It makes the mount go ahead anyway if someone calls open() or creat(),
tries to traverse the directory, tries to chdir/chroot/etc. into the directory,
or sticks a '/' on the end of the pathname. If they do a stat(), however,
they'll only trigger the automount if they didn't also say O_NOFOLLOW.
I've also added an inode flag (S_AUTOMOUNT) so that filesystems can mark their
inodes as automount points. This flag is automatically propagated to the
dentry as DCACHE_NEED_AUTOMOUNT by __d_instantiate(). This saves NFS and could
save AFS a private flag bit apiece, but is not strictly necessary. It would be
preferable to do the propagation in d_set_d_op(), but that doesn't normally
have access to the inode.
[AV: fixed breakage in case if __follow_mount_rcu() fails and nameidata_drop_rcu()
succeeds in RCU case of do_lookup(); we need to fall through to non-RCU case after
that, rather than just returning with ungrabbed *path]
Signed-off-by: David Howells <dhowells@redhat.com>
Was-Acked-by: Ian Kent <raven@themaw.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_lookup() has a path leading from LOOKUP_RCU case to non-RCU
crossing of mountpoints, which breaks things badly. If we
hit need_revalidate: and do nothing in there, we need to come
back into LOOKUP_RCU half of things, not to done: in non-RCU
one.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
kernel: fix hlist_bl again
cgroups: Fix a lockdep warning at cgroup removal
fs: namei fix ->put_link on wrong inode in do_filp_open
J. R. Okajima noticed that ->put_link is being attempted on the
wrong inode, and suggested the way to fix it. I changed it a bit
according to Al's suggestion to keep an explicit link path around.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
* 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin:
fs: fix do_last error case when need_reval_dot
nfs: add missing rcu-walk check
fs: hlist UP debug fixup
fs: fix dropping of rcu-walk from force_reval_path
fs: force_reval_path drop rcu-walk before d_invalidate
fs: small rcu-walk documentation fixes
Fixed up trivial conflicts in Documentation/filesystems/porting
When open(2) without O_DIRECTORY opens an existing dir, it should return
EISDIR. In do_last(), the variable 'error' is initialized EISDIR, but it
is changed by d_revalidate() which returns any positive to represent
'the target dir is valid.'
Should we keep and return the initialized 'error' in this case.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
As J. R. Okajima noted, force_reval_path passes in the same dentry to
d_revalidate as the one in the nameidata structure (other callers pass in a
child), so the locking breaks. This can oops with a chrooted nfs mount, for
example. Similarly there can be other problems with revalidating a dentry
which is already in nameidata of the path walk.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
d_revalidate can return in rcu-walk mode even when it returns 0. We can't just
call any old dcache function on rcu-walk dentry (the dentry is unstable, so
even through d_lock can safely be taken, the result may no longer be what we
expect -- careful re-checks would be required). So just drop rcu in this case.
(I missed this conversion when switching to the rcu-walk convention that Linus
suggested)
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
When a file is opened with O_TRUNC, the truncate processing is handled
by handle_truncate(). This function however doesn't receive any info
about the newly instantiated filp, and therefore can't pass that info
along so that the setattr can use it.
This makes NFSv4 misbehave. The client does an open and gets a valid
stateid, and then doesn't use that stateid on the subsequent truncate.
It uses the zero-stateid instead. Most servers ignore this fact and
just do the truncate anyway, but some don't like it (notably, RHEL4).
It seems more correct that since we have a fully instantiated file at
the time that handle_truncate is called, that we pass that along so
that the truncate operation can properly use it.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fix new kernel-doc notation warnings in fs/namei.c and spell
ECHILD correctly.
Warning(fs/namei.c:218): No description found for parameter 'flags'
Warning(fs/namei.c:425): Excess function parameter 'Returns' description in 'nameidata_drop_rcu'
Warning(fs/namei.c:478): Excess function parameter 'Returns' description in 'nameidata_dentry_drop_rcu'
Warning(fs/namei.c:540): Excess function parameter 'Returns' description in 'nameidata_drop_rcu_last'
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The problem that this patch aims to fix is vfsmount refcounting scalability.
We need to take a reference on the vfsmount for every successful path lookup,
which often go to the same mount point.
The fundamental difficulty is that a "simple" reference count can never be made
scalable, because any time a reference is dropped, we must check whether that
was the last reference. To do that requires communication with all other CPUs
that may have taken a reference count.
We can make refcounts more scalable in a couple of ways, involving keeping
distributed counters, and checking for the global-zero condition less
frequently.
- check the global sum once every interval (this will delay zero detection
for some interval, so it's probably a showstopper for vfsmounts).
- keep a local count and only taking the global sum when local reaches 0 (this
is difficult for vfsmounts, because we can't hold preempt off for the life of
a reference, so a counter would need to be per-thread or tied strongly to a
particular CPU which requires more locking).
- keep a local difference of increments and decrements, which allows us to sum
the total difference and hence find the refcount when summing all CPUs. Then,
keep a single integer "long" refcount for slow and long lasting references,
and only take the global sum of local counters when the long refcount is 0.
This last scheme is what I implemented here. Attached mounts and process root
and working directory references are "long" references, and everything else is
a short reference.
This allows scalable vfsmount references during path walking over mounted
subtrees and unattached (lazy umounted) mounts with processes still running
in them.
This results in one fewer atomic op in the fastpath: mntget is now just a
per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock
and non-atomic decrement in the common case. However code is otherwise bigger
and heavier, so single threaded performance is basically a wash.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Require filesystems be aware of .d_revalidate being called in rcu-walk
mode (nd->flags & LOOKUP_RCU). For now do a simple push down, returning
-ECHILD from all implementations.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Reduce some branches and memory accesses in dcache lookup by adding dentry
flags to indicate common d_ops are set, rather than having to check them.
This saves a pointer memory access (dentry->d_op) in common path lookup
situations, and saves another pointer load and branch in cases where we
have d_op but not the particular operation.
Patched with:
git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Use a seqlock in the fs_struct to enable us to take an atomic copy of the
complete cwd and root paths. Use this in the RCU lookup path to avoid a
thread-shared spinlock in RCU lookup operations.
Multi-threaded apps may now perform path lookups with scalability matching
multi-process apps. Operations such as stat(2) become very scalable for
multi-threaded workload.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Perform common cases of path lookups without any stores or locking in the
ancestor dentry elements. This is called rcu-walk, as opposed to the current
algorithm which is a refcount based walk, or ref-walk.
This results in far fewer atomic operations on every path element,
significantly improving path lookup performance. It also avoids cacheline
bouncing on common dentries, significantly improving scalability.
The overall design is like this:
* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk.
* Take the RCU lock for the entire path walk, starting with the acquiring
of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are
not required for dentry persistence.
* synchronize_rcu is called when unregistering a filesystem, so we can
access d_ops and i_ops during rcu-walk.
* Similarly take the vfsmount lock for the entire path walk. So now mnt
refcounts are not required for persistence. Also we are free to perform mount
lookups, and to assume dentry mount points and mount roots are stable up and
down the path.
* Have a per-dentry seqlock to protect the dentry name, parent, and inode,
so we can load this tuple atomically, and also check whether any of its
members have changed.
* Dentry lookups (based on parent, candidate string tuple) recheck the parent
sequence after the child is found in case anything changed in the parent
during the path walk.
* inode is also RCU protected so we can load d_inode and use the inode for
limited things.
* i_mode, i_uid, i_gid can be tested for exec permissions during path walk.
* i_op can be loaded.
When we reach the destination dentry, we lock it, recheck lookup sequence,
and increment its refcount and mountpoint refcount. RCU and vfsmount locks
are dropped. This is termed "dropping rcu-walk". If the dentry refcount does
not match, we can not drop rcu-walk gracefully at the current point in the
lokup, so instead return -ECHILD (for want of a better errno). This signals the
path walking code to re-do the entire lookup with a ref-walk.
Aside from the final dentry, there are other situations that may be encounted
where we cannot continue rcu-walk. In that case, we drop rcu-walk (ie. take
a reference on the last good dentry) and continue with a ref-walk. Again, if
we can drop rcu-walk gracefully, we return -ECHILD and do the whole lookup
using ref-walk. But it is very important that we can continue with ref-walk
for most cases, particularly to avoid the overhead of double lookups, and to
gain the scalability advantages on common path elements (like cwd and root).
The cases where rcu-walk cannot continue are:
* NULL dentry (ie. any uncached path element)
* parent with d_inode->i_op->permission or ACLs
* dentries with d_revalidate
* Following links
In future patches, permission checks and d_revalidate become rcu-walk aware. It
may be possible eventually to make following links rcu-walk aware.
Uncached path elements will always require dropping to ref-walk mode, at the
very least because i_mutex needs to be grabbed, and objects allocated.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a
0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when
we start protecting many other dentry members with d_lock.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Change d_hash so it may be called from lock-free RCU lookups. See similar
patch for d_compare for details.
For in-tree filesystems, this is just a mechanical change.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Unsetting FMODE_NONOTIFY in fsnotify_open() is too late, since fsnotify_perm()
is called before. If FMODE_NONOTIFY is set fsnotify_perm() will skip permission
checks, so a user can still disable permission checks by setting this flag
in an open() call.
This patch corrects this by unsetting the flag before fsnotify_perm is called.
Signed-off-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
Signed-off-by: Eric Paris <eparis@redhat.com>
nameidata_to_filp() drops nd->path or transfers it to opened
file. In the former case it's a Bad Idea(tm) to do mnt_drop_write()
on nd->path.mnt, since we might race with umount and vfsmount in
question might be gone already.
Fix: don't drop it, then... IOW, have nameidata_to_filp() grab nd->path
in case it transfers it to file and do path_drop() in callers. After
they are through with accessing nd->path...
Reported-by: Miklos Szeredi <miklos@szeredi.hu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: brlock vfsmount_lock
Use a brlock for the vfsmount lock. It must be taken for write whenever
modifying the mount hash or associated fields, and may be taken for read when
performing mount hash lookups.
A new lock is added for the mnt-id allocator, so it doesn't need to take
the heavy vfsmount write-lock.
The number of atomics should remain the same for fastpath rlock cases, though
code would be slightly slower due to per-cpu access. Scalability is not not be
much improved in common cases yet, due to other locks (ie. dcache_lock) getting
in the way. However path lookups crossing mountpoints should be one case where
scalability is improved (currently requiring the global lock).
The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node
Altix system (high latency to remote nodes), a simple umount microbenchmark
(mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it
took 6.8s, afterwards took 7.1s, about 5% slower.
Cc: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: remove extra lookup in __lookup_hash
Optimize lookup for create operations, where no dentry should often be
common-case. In cases where it is not, such as unlink, the added overhead
is much smaller than the removed.
Also, move comments about __d_lookup racyness to the __d_lookup call site.
d_lookup is intuitive; __d_lookup is what needs commenting. So in that same
vein, add kerneldoc comments to __d_lookup and clean up some of the comments:
- We are interested in how the RCU lookup works here, particularly with
renames. Make that explicit, and point to the document where it is explained
in more detail.
- RCU is pretty standard now, and macros make implementations pretty mindless.
If we want to know about RCU barrier details, we look in RCU code.
- Delete some boring legacy comments because we don't care much about how the
code used to work, more about the interesting parts of how it works now. So
comments about lazy LRU may be interesting, but would better be done in the
LRU or refcount management code.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: dentry allocation consolidation
There are 2 duplicate copies of code in dentry allocation in path lookup.
Consolidate them into a single function.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
fs: fix do_lookup false negative
In do_lookup, if we initially find no dentry, we take the directory i_mutex and
re-check the lookup. If we find a dentry there, then we revalidate it if
needed. However if that revalidate asks for the dentry to be invalidated, we
return -ENOENT from do_lookup. What should happen instead is an attempt to
allocate and lookup a new dentry.
This is probably not noticed because it is rare. It is only reached if a
concurrent create races in first (in which case, the dentry probably won't be
invalidated anyway), or if the racy __d_lookup has failed due to a
false-negative (which is very rare).
Fix this by removing code and have it use the normal reval path.
Signed-off-by: Nick Piggin <npiggin@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Add three helpers that retrieve a refcounted copy of the root and cwd
from the supplied fs_struct.
get_fs_root()
get_fs_pwd()
get_fs_root_and_pwd()
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
fanotify: use both marks when possible
fsnotify: pass both the vfsmount mark and inode mark
fsnotify: walk the inode and vfsmount lists simultaneously
fsnotify: rework ignored mark flushing
fsnotify: remove global fsnotify groups lists
fsnotify: remove group->mask
fsnotify: remove the global masks
fsnotify: cleanup should_send_event
fanotify: use the mark in handler functions
audit: use the mark in handler functions
dnotify: use the mark in handler functions
inotify: use the mark in handler functions
fsnotify: send fsnotify_mark to groups in event handling functions
fsnotify: Exchange list heads instead of moving elements
fsnotify: srcu to protect read side of inode and vfsmount locks
fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
fsnotify: use _rcu functions for mark list traversal
fsnotify: place marks on object in order of group memory address
vfs/fsnotify: fsnotify_close can delay the final work in fput
fsnotify: store struct file not struct path
...
Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
SELinux needs to pass the MAY_ACCESS flag so it can handle auditting
correctly. Presently the masking of MAY_* flags is done in the VFS. In
order to allow LSMs to decide what flags they care about and what flags
they don't just pass them all and the each LSM mask off what they don't
need. This patch should contain no functional changes to either the VFS or
any LSM.
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Stephen D. Smalley <sds@tycho.nsa.gov>
Signed-off-by: James Morris <jmorris@namei.org>
When commit be6d3e56a6 "introduce new LSM hooks
where vfsmount is available." was proposed, regarding security_path_truncate(),
only "struct file *" argument (which AppArmor wanted to use) was removed.
But length and time_attrs arguments are not used by TOMOYO nor AppArmor.
Thus, let's remove these arguments.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: James Morris <jmorris@namei.org>
fsnotify was using char * when it passed around the d_name.name string
internally but it is actually an unsigned char *. This patch switches
fsnotify to use unsigned and should silence some pointer signess warnings
which have popped out of xfs. I do not add -Wpointer-sign to the fsnotify
code as there are still issues with kstrdup and strlen which would pop
out needless warnings.
Signed-off-by: Eric Paris <eparis@redhat.com>
Commit 1f36f774b2 broke FS_REVAL_DOT semantics.
In particular, before this patch, the command
ls -l
in an NFS mounted directory would always check if the directory on the server
had changed and if so would flush and refill the pagecache for the dir.
After this patch, the same "ls -l" will repeatedly return stale date until
the cached attributes for the directory time out.
The following patch fixes this by ensuring the d_revalidate is called by
do_last when "." is being looked-up.
link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
is not set so nfs_lookup_verify_inode chooses not to do any validation.
The following patch restores the original behaviour.
Cc: stable@kernel.org
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
update the mnt of the path when it is not equal to the new one.
Signed-off-by: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
1) i_flags simply doesn't work for mount/unlink race prevention;
we may have many links to file and rm on one of those obviously
shouldn't prevent bind on top of another later on. To fix it
right way we need to mark _dentry_ as unsuitable for mounting
upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and
i_mutex on the inode in question. Set it (with dont_mount(dentry))
in unlink/rmdir/etc., check (with cant_mount(dentry)) in places
in namespace.c that used to check for S_DEAD. Setting S_DEAD
is still needed in places where we used to set it (for directories
getting killed), since we rely on it for readdir/rmdir race
prevention.
2) rename()/mount() protection has another bogosity - we unhash
the target before we'd checked that it's not a mountpoint. Fixed.
3) ancient bogosity in pivot_root() - we locked i_mutex on the
right directory, but checked S_DEAD on the different (and wrong)
one. Noticed and fixed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
According to specification
mkdir d; ln -s d a; open("a/", O_NOFOLLOW | O_RDONLY)
should return success but currently it returns ELOOP. This is a
regression caused by path lookup cleanup patch series.
Fix the code to ignore O_NOFOLLOW in case the provided path has trailing
slashes.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: Marius Tolzmann <tolzmann@molgen.mpg.de>
Acked-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lose want_dir argument, while we are at it - since now
nd->flags & LOOKUP_DIRECTORY is equivalent to it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
We managed to lose O_DIRECTORY testing due to a stupid typo in commit
1f36f774b2 ("Switch !O_CREAT case to use of do_last()")
Reported-by: Walter Sheets <w41ter@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs-2.6: (33 commits)
quota: stop using QUOTA_OK / NO_QUOTA
dquot: cleanup dquot initialize routine
dquot: move dquot initialization responsibility into the filesystem
dquot: cleanup dquot drop routine
dquot: move dquot drop responsibility into the filesystem
dquot: cleanup dquot transfer routine
dquot: move dquot transfer responsibility into the filesystem
dquot: cleanup inode allocation / freeing routines
dquot: cleanup space allocation / freeing routines
ext3: add writepage sanity checks
ext3: Truncate allocated blocks if direct IO write fails to update i_size
quota: Properly invalidate caches even for filesystems with blocksize < pagesize
quota: generalize quota transfer interface
quota: sb_quota state flags cleanup
jbd: Delay discarding buffers in journal_unmap_buffer
ext3: quota_write cross block boundary behaviour
quota: drop permission checks from xfs_fs_set_xstate/xfs_fs_set_xquota
quota: split out compat_sys_quotactl support from quota.c
quota: split out netlink notification support from quota.c
quota: remove invalid optimization from quota_sync_all
...
Fixed trivial conflicts in fs/namei.c and fs/ufs/inode.c
Now that nd->last stays around until ->put_link() is called, we can
just postpone that ->put_link() in do_filp_open() a bit and don't
bother with copying.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
If we'd passed through 32 trailing symlinks already, there's
no sense following the 33rd - we'll bail out anyway. Better
bugger off earlier.
It *does* change behaviour, after a fashion - if the 33rd happens
to be a procfs-style symlink, original code *would* allow it.
This one will not. Cry me a river if that hurts you. Please, do.
And post a video of that, while you are at it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>