This reverts commit 55956b59df.
commit 55956b59df ("vfs: Allow userns root to call mknod on owned filesystems.")
enabled mknod() in user namespaces for userns root if CAP_MKNOD is
available. However, these device nodes are useless since any filesystem
mounted from a non-initial user namespace will set the SB_I_NODEV flag on
the filesystem. Now, when a device node s created in a non-initial user
namespace a call to open() on said device node will fail due to:
bool may_open_dev(const struct path *path)
{
return !(path->mnt->mnt_flags & MNT_NODEV) &&
!(path->mnt->mnt_sb->s_iflags & SB_I_NODEV);
}
The problem with this is that as of the aforementioned commit mknod()
creates partially functional device nodes in non-initial user namespaces.
In particular, it has the consequence that as of the aforementioned commit
open() will be more privileged with respect to device nodes than mknod().
Before it was the other way around. Specifically, if mknod() succeeded
then it was transparent for any userspace application that a fatal error
must have occured when open() failed.
All of this breaks multiple userspace workloads and a widespread assumption
about how to handle mknod(). Basically, all container runtimes and systemd
live by the slogan "ask for forgiveness not permission" when running user
namespace workloads. For mknod() the assumption is that if the syscall
succeeds the device nodes are useable irrespective of whether it succeeds
in a non-initial user namespace or not. This logic was chosen explicitly
to allow for the glorious day when mknod() will actually be able to create
fully functional device nodes in user namespaces.
A specific problem people are already running into when running 4.18 rc
kernels are failing systemd services. For any distro that is run in a
container systemd services started with the PrivateDevices= property set
will fail to start since the device nodes in question cannot be
opened (cf. the arguments in [1]).
Full disclosure, Seth made the very sound argument that it is already
possible to end up with partially functional device nodes. Any filesystem
mounted with MS_NODEV set will allow mknod() to succeed but will not allow
open() to succeed. The difference to the case here is that the MS_NODEV
case is transparent to userspace since it is an explicitly set mount option
while the SB_I_NODEV case is an implicit property enforced by the kernel
and hence opaque to userspace.
[1]: https://github.com/systemd/systemd/pull/9483
Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Cc: Serge Hallyn <serge@hallyn.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Disallows open of FIFOs or regular files not owned by the user in world
writable sticky directories, unless the owner is the same as that of the
directory or the file is opened without the O_CREAT flag. The purpose
is to make data spoofing attacks harder. This protection can be turned
on and off separately for FIFOs and regular files via sysctl, just like
the symlinks/hardlinks protection. This patch is based on Openwall's
"HARDEN_FIFO" feature by Solar Designer.
This is a brief list of old vulnerabilities that could have been prevented
by this feature, some of them even allow for privilege escalation:
CVE-2000-1134
CVE-2007-3852
CVE-2008-0525
CVE-2009-0416
CVE-2011-4834
CVE-2015-1838
CVE-2015-7442
CVE-2016-7489
This list is not meant to be complete. It's difficult to track down all
vulnerabilities of this kind because they were often reported without any
mention of this particular attack vector. In fact, before
hardlinks/symlinks restrictions, fifos/regular files weren't the favorite
vehicle to exploit them.
[s.mesoraca16@gmail.com: fix bug reported by Dan Carpenter]
Link: https://lkml.kernel.org/r/20180426081456.GA7060@mwanda
Link: http://lkml.kernel.org/r/1524829819-11275-1-git-send-email-s.mesoraca16@gmail.com
[keescook@chromium.org: drop pr_warn_ratelimited() in favor of audit changes in the future]
[keescook@chromium.org: adjust commit subjet]
Link: http://lkml.kernel.org/r/20180416175918.GA13494@beast
Signed-off-by: Salvatore Mesoraca <s.mesoraca16@gmail.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Suggested-by: Solar Designer <solar@openwall.com>
Suggested-by: Kees Cook <keescook@chromium.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This contains two new features:
1) Stack file operations: this allows removal of several hacks from the
VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
2) Metadata only copy-up: when file is on lower layer and only metadata is
modified (except size) then only copy up the metadata and continue to
use the data from the lower file.
-----BEGIN PGP SIGNATURE-----
iHUEABYIAB0WIQSQHSd0lITzzeNWNm3h3BK/laaZPAUCW3srhAAKCRDh3BK/laaZ
PC6tAQCP+KklcN+TvNp502f+O/kATahSpgnun4NY1/p4I8JV+AEAzdlkTN3+MiAO
fn9brN6mBK7h59DO3hqedPLJy2vrgwg=
=QDXH
-----END PGP SIGNATURE-----
Merge tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs
Pull overlayfs updates from Miklos Szeredi:
"This contains two new features:
- Stack file operations: this allows removal of several hacks from
the VFS, proper interaction of read-only open files with copy-up,
possibility to implement fs modifying ioctls properly, and others.
- Metadata only copy-up: when file is on lower layer and only
metadata is modified (except size) then only copy up the metadata
and continue to use the data from the lower file"
* tag 'ovl-update-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (66 commits)
ovl: Enable metadata only feature
ovl: Do not do metacopy only for ioctl modifying file attr
ovl: Do not do metadata only copy-up for truncate operation
ovl: add helper to force data copy-up
ovl: Check redirect on index as well
ovl: Set redirect on upper inode when it is linked
ovl: Set redirect on metacopy files upon rename
ovl: Do not set dentry type ORIGIN for broken hardlinks
ovl: Add an inode flag OVL_CONST_INO
ovl: Treat metacopy dentries as type OVL_PATH_MERGE
ovl: Check redirects for metacopy files
ovl: Move some dir related ovl_lookup_single() code in else block
ovl: Do not expose metacopy only dentry from d_real()
ovl: Open file with data except for the case of fsync
ovl: Add helper ovl_inode_realdata()
ovl: Store lower data inode in ovl_inode
ovl: Fix ovl_getattr() to get number of blocks from lower
ovl: Add helper ovl_dentry_lowerdata() to get lower data dentry
ovl: Copy up meta inode data from lowest data inode
ovl: Modify ovl_lookup() and friends to lookup metacopy dentry
...
Pull misc vfs updates from Al Viro:
"Misc cleanups from various folks all over the place
I expected more fs/dcache.c cleanups this cycle, so that went into a
separate branch. Said cleanups have missed the window, so in the
hindsight it could've gone into work.misc instead. Decided not to
cherry-pick, thus the 'work.dcache' branch"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: dcache: Use true and false for boolean values
fold generic_readlink() into its only caller
fs: shave 8 bytes off of struct inode
fs: Add more kernel-doc to the produced documentation
fs: Fix attr.c kernel-doc
removed extra extern file_fdatawait_range
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
kill dentry_update_name_case()
There is a check for IS_ERR(name) immediately upstream of each call
of link_path_walk(name, nd), with positives treated as if link_path_walk()
failed with PTR_ERR(name). Taking that check into link_path_walk() itself
simplifies things nicely.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
caller can tell "opened" from "open it yourself" by looking at ->f_mode.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
FMODE_OPENED can be used to distingusish "successful open" from the
"called finish_no_open(), do it yourself" cases. Since finish_no_open()
has been adjusted, no changes in the instances were actually needed.
The caller has been adjusted.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just check ->f_mode in ima_appraise_measurement()
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Parallel to FILE_CREATED, goes into ->f_mode instead of *opened.
NFS is a bit of a wart here - it doesn't have file at the point
where FILE_CREATED used to be set, so we need to propagate it
there (for now). IMA is another one (here and everywhere)...
Note that this needs do_dentry_open() to leave old bits in ->f_mode
alone - we want it to preserve FMODE_CREATED if it had been already
set (no other bit can be there).
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and don't bother with setting FILE_OPENED at all.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These checks are better off in do_dentry_open(); the reason we couldn't
put them there used to be that callers couldn't tell what kind of cleanup
would do_dentry_open() failure call for. Now that we have FMODE_OPENED,
cleanup is the same in all cases - it's simply fput(). So let's fold
that into do_dentry_open(), as Christoph's patch tried to.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Just check FMODE_OPENED in __fput() and be done with that...
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
... and rename get_empty_filp() to alloc_empty_file().
dentry_open() gets creds as argument, but the only thing that sees those is
security_file_open() - file->f_cred still ends up with current_cred(). For
almost all callers it's the same thing, but there are several broken cases.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull AFS updates from Al Viro:
"Assorted AFS stuff - ended up in vfs.git since most of that consists
of David's AFS-related followups to Christoph's procfs series"
* 'afs-proc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
afs: Optimise callback breaking by not repeating volume lookup
afs: Display manually added cells in dynamic root mount
afs: Enable IPv6 DNS lookups
afs: Show all of a server's addresses in /proc/fs/afs/servers
afs: Handle CONFIG_PROC_FS=n
proc: Make inline name size calculation automatic
afs: Implement network namespacing
afs: Mark afs_net::ws_cell as __rcu and set using rcu functions
afs: Fix a Sparse warning in xdr_decode_AFSFetchStatus()
proc: Add a way to make network proc files writable
afs: Rearrange fs/afs/proc.c to remove remaining predeclarations.
afs: Rearrange fs/afs/proc.c to move the show routines up
afs: Rearrange fs/afs/proc.c by moving fops and open functions down
afs: Move /proc management functions to the end of the file
Alter the dynroot mount so that cells created by manipulation of
/proc/fs/afs/cells and /proc/fs/afs/rootcell and by specification of a root
cell as a module parameter will cause directories for those cells to be
created in the dynamic root superblock for the network namespace[*].
To this end:
(1) Only one dynamic root superblock is now created per network namespace
and this is shared between all attempts to mount it. This makes it
easier to find the superblock to modify.
(2) When a dynamic root superblock is created, the list of cells is walked
and directories created for each cell already defined.
(3) When a new cell is added, if a dynamic root superblock exists, a
directory is created for it.
(4) When a cell is destroyed, the directory is removed.
(5) These directories are created by calling lookup_one_len() on the root
dir which automatically creates them if they don't exist.
[*] Inasmuch as network namespaces are currently supported here.
Signed-off-by: David Howells <dhowells@redhat.com>
Pull userns updates from Eric Biederman:
"This is the last couple of vfs bits to enable root in a user namespace
to mount and manipulate a filesystem with backing store (AKA not a
virtual filesystem like proc, but a filesystem where the unprivileged
user controls the content). The target filesystem for this work is
fuse, and Miklos should be sending you the pull request for the fuse
bits this merge window.
The two key patches are "evm: Don't update hmacs in user ns mounts"
and "vfs: Don't allow changing the link count of an inode with an
invalid uid or gid". Those close small gaps in the vfs that would be a
problem if an unprivileged fuse filesystem is mounted.
The rest of the changes are things that are now safe to allow a root
user in a user namespace to do with a filesystem they have mounted.
The most interesting development is that remount is now safe"
* 'userns-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
fs: Allow CAP_SYS_ADMIN in s_user_ns to freeze and thaw filesystems
capabilities: Allow privileged user in s_user_ns to set security.* xattrs
fs: Allow superblock owner to access do_remount_sb()
fs: Allow superblock owner to replace invalid owners of inodes
vfs: Allow userns root to call mknod on owned filesystems.
vfs: Don't allow changing the link count of an inode with an invalid uid or gid
evm: Don't update hmacs in user ns mounts
Pull misc vfs updates from Al Viro:
"Misc bits and pieces not fitting into anything more specific"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
vfs: delete unnecessary assignment in vfs_listxattr
Documentation: filesystems: update filesystem locking documentation
vfs: namei: use path_equal() in follow_dotdot()
fs.h: fix outdated comment about file flags
__inode_security_revalidate() never gets NULL opt_dentry
make xattr_getsecurity() static
vfat: simplify checks in vfat_lookup()
get rid of dead code in d_find_alias()
it's SB_BORN, not MS_BORN...
msdos_rmdir(): kill BS comment
remove rpc_rmdir()
fs: avoid fdput() after failed fdget() in vfs_dedupe_file_range()
Pull rmdir update from Al Viro:
"More shrink_dcache_parent()-related stuff - killing the main source of
potentially contended calls of that on large subtrees"
* 'work.rmdir' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
rmdir(),rename(): do shrink_dcache_parent() only on success
This reverts commit cab64df194.
Having vfs_open() in some cases drop the reference to
struct file combined with
error = vfs_open(path, f, cred);
if (error) {
put_filp(f);
return ERR_PTR(error);
}
return f;
is flat-out wrong. It used to be
error = vfs_open(path, f, cred);
if (!error) {
/* from now on we need fput() to dispose of f */
error = open_check_o_direct(f);
if (error) {
fput(f);
f = ERR_PTR(error);
}
} else {
put_filp(f);
f = ERR_PTR(error);
}
and sure, having that open_check_o_direct() boilerplate gotten rid of is
nice, but not that way...
Worse, another call chain (via finish_open()) is FUBAR now wrt
FILE_OPENED handling - in that case we get error returned, with file
already hit by fput() *AND* FILE_OPENED not set. Guess what happens in
path_openat(), when it hits
if (!(opened & FILE_OPENED)) {
BUG_ON(!error);
put_filp(file);
}
The root cause of all that crap is that the callers of do_dentry_open()
have no way to tell which way did it fail; while that could be fixed up
(by passing something like int *opened to do_dentry_open() and have it
marked if we'd called ->open()), it's probably much too late in the
cycle to do so right now.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Once upon a time ->rmdir() instances used to check if victim inode
had more than one (in-core) reference and failed with -EBUSY if it
had. The reason was race avoidance - emptiness check is worthless
if somebody could just go and create new objects in the victim
directory afterwards.
With introduction of dcache the checks had been replaced with
checking the refcount of dentry. However, since a cached negative
lookup leaves a negative child dentry, such check had lead to false
positives - with empty foo/ doing stat foo/bar before rmdir foo
ended up with -EBUSY unless the negative dentry of foo/bar happened
to be evicted by the time of rmdir(2). That had been fixed by
doing shrink_dcache_parent() just before the refcount check.
At the same time, ext2_rmdir() has grown a private solution that
eliminated those -EBUSY - it did something (setting ->i_size to 0)
which made any subsequent ext2_add_entry() fail.
Unfortunately, even with shrink_dcache_parent() the check had been
racy - after all, the victim itself could be found by dcache lookup
just after we'd checked its refcount. That got fixed by a new
helper (dentry_unhash()) that did shrink_dcache_parent() and unhashed
the sucker if its refcount ended up equal to 1. That got called before
->rmdir(), turning the checks in ->rmdir() instances into "if not
unhashed fail with -EBUSY". Which reduced the boilerplate nicely, but
had an unpleasant side effect - now shrink_dcache_parent() had been
done before the emptiness checks, leading to easily triggerable calls
of shrink_dcache_parent() on arbitrary large subtrees, quite possibly
nested into each other.
Several years later the ext2-private trick had been generalized -
(in-core) inodes of dead directories are flagged and calls of
lookup, readdir and all directory-modifying methods were prevented
in so marked directories. Remaining boilerplate in ->rmdir() instances
became redundant and some instances got rid of it.
In 2011 the call of dentry_unhash() got shifted into ->rmdir() instances
and then killed off in all of them. That has lead to another problem,
though - in case of successful rmdir we *want* any (negative) child
dentries dropped and the victim itself made negative. There's no point
keeping cached negative lookups in foo when we can get the negative
lookup of foo itself cached. So shrink_dcache_parent() call had been
restored; unfortunately, it went into the place where dentry_unhash()
used to be, i.e. before the ->rmdir() call. Note that we don't unhash
anymore, so any "is it busy" checks would be racy; fortunately, all of
them are gone.
We should've done that call right *after* successful ->rmdir(). That
reduces contention caused by tree-walking in shrink_dcache_parent()
and, especially, contention caused by evictions in two nested subtrees
going on in parallel. The same goes for directory-overwriting rename() -
the story there had been parallel to that of rmdir().
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
These filesystems already always set SB_I_NODEV so mknod will not be
useful for gaining control of any devices no matter their permissions.
This will allow overlayfs and applications like to fakeroot to use
device nodes to represent things on disk.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Changing the link count of an inode via unlink or link will cause a
write back of that inode. If the uids or gids are invalid (aka not known
to the kernel) writing the inode back may change the uid or gid in the
filesystem. To prevent possible filesystem and to avoid the need for
filesystem maintainers to worry about it don't allow operations on
inodes with an invalid uid or gid.
Acked-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Use path_equal() to detect whether we're already in root.
Signed-off-by: Danilo Krummrich <danilokrummrich@dk-develop.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull vfs namei updates from Al Viro:
- make lookup_one_len() safe with parent locked only shared(incoming
afs series wants that)
- fix of getname_kernel() regression from 2015 (-stable fodder, that
one).
* 'work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
getname_kernel() needs to make sure that ->name != ->iname in long case
make lookup_one_len() safe to use with directory locked shared
new helper: __lookup_slow()
merge common parts of lookup_one_len{,_unlocked} into common helper
-----BEGIN PGP SIGNATURE-----
iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAlrD6T4UHHBhdWxAcGF1
bC1tb29yZS5jb20ACgkQVeRaWujKfIqGOg/9FPgs5cESBrocEOBAqqcmO3qjxaEy
NKQWGTPppZwI5f5pOStL5GT3oU8jQp3IMjzUM2yIElFUg+RM5cwb0bLmhAMCJFCd
vtrJmGDdQ0QEj5wqkprupaVEKENlSKaKePJq3NESFtcHs2cgfcIRsycj1LaOThNi
fUcltiocBDS/jxurCgi2s4O2JTGEXfZaI0GojKjWDddL3N6QcD5aZgPQd/67T0Pt
5dDgkXbGkd5pR97F+LovaTuLTaMXnUx5plMUd/LsueZbOxHjZL2O2E/h4aoXATMX
zKdtG03wEebb65cQyczeTXRIBURIQCka0U0fHx7ZhS8vK2HVgr6oGfsJfyZhSp+l
IIb/T1dSbgUURpMH0DiGs/pQrXO/9o7Rp7wIakycIHD0kcw503hbauqJEc6pwlx6
/WQQTo6GKwHWW67OQ7AbIt4Gh9P/s96s6kEZGRH2NAjKY9xTZVM7+nnKL8hHk0xq
uDN20AZuD5i9cZpVqw+MYdmeuHRuNPglY9S33MyaBbFeWl48voFxiabVpV3ENfLB
Iyc5WzpxekJi9JLneEt6/r6XIissvHxsoIPL1lCYSAPIJQRmqg4sGHKAQ9o5NtFD
MrRZSbBQVwt3+YFKixUcU+nvnhroJsQExejZoFmAdQl8f0TiihwYl8E4lSmy7ntr
IzNm7li+y9VRJ54=
=n1dk
-----END PGP SIGNATURE-----
Merge tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit updates from Paul Moore:
"We didn't have anything to send for v4.16, but we're back with a
little more than usual for v4.17.
Eleven patches in total, most fall into the small fix category, but
there are three non-trivial changes worth calling out:
- the audit entry filter is being removed after deprecating it for
quite a while (years of no one really using it because it turns out
to be not very practical)
- created our own version of "__mutex_owner()" because the locking
folks were upset we were using theirs
- improved our handling of kernel command line parameters to make
them more forgiving
- we fixed auditing of symlink operations
Everything passes the audit-testsuite and as of a few minutes ago it
merges well with your tree"
* tag 'audit-pr-20180403' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit: add refused symlink to audit_names
audit: remove path param from link denied function
audit: link denied should not directly generate PATH record
audit: make ANOM_LINK obey audit_enabled and audit_dummy_context
audit: do not panic on invalid boot parameter
audit: track the owner of the command mutex ourselves
audit: return on memory error to avoid null pointer dereference
audit: bail before bug check if audit disabled
audit: deprecate the AUDIT_FILTER_ENTRY filter
audit: session ID should not set arch quick field pointer
audit: update bugtracker and source URIs
Pull vfs dcache updates from Al Viro:
"Part of this is what the trylock loop elimination series has turned
into, part making d_move() preserve the parent (and thus the path) of
victim, plus some general cleanups"
* 'work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (22 commits)
d_genocide: move export to definition
fold dentry_lock_for_move() into its sole caller and clean it up
make non-exchanging __d_move() copy ->d_parent rather than swap them
oprofilefs: don't oops on allocation failure
lustre: get rid of pointless casts to struct dentry *
debugfs_lookup(): switch to lookup_one_len_unlocked()
fold lookup_real() into __lookup_hash()
take out orphan externs (empty_string/slash_string)
split d_path() and friends into a separate file
dcache.c: trim includes
fs/dcache: Avoid a try_lock loop in shrink_dentry_list()
get rid of trylock loop around dentry_kill()
handle move to LRU in retain_dentry()
dput(): consolidate the "do we need to retain it?" into an inlined helper
split the slow part of lock_parent() off
now lock_parent() can't run into killed dentry
get rid of trylock loop in locking dentries on shrink list
d_delete(): get rid of trylock loop
fs/dcache: Move dentry_kill() below lock_parent()
fs/dcache: Remove stale comment from dentry_kill()
...
Pull removal of in-kernel calls to syscalls from Dominik Brodowski:
"System calls are interaction points between userspace and the kernel.
Therefore, system call functions such as sys_xyzzy() or
compat_sys_xyzzy() should only be called from userspace via the
syscall table, but not from elsewhere in the kernel.
At least on 64-bit x86, it will likely be a hard requirement from
v4.17 onwards to not call system call functions in the kernel: It is
better to use use a different calling convention for system calls
there, where struct pt_regs is decoded on-the-fly in a syscall wrapper
which then hands processing over to the actual syscall function. This
means that only those parameters which are actually needed for a
specific syscall are passed on during syscall entry, instead of
filling in six CPU registers with random user space content all the
time (which may cause serious trouble down the call chain). Those
x86-specific patches will be pushed through the x86 tree in the near
future.
Moreover, rules on how data may be accessed may differ between kernel
data and user data. This is another reason why calling sys_xyzzy() is
generally a bad idea, and -- at most -- acceptable in arch-specific
code.
This patchset removes all in-kernel calls to syscall functions in the
kernel with the exception of arch/. On top of this, it cleans up the
three places where many syscalls are referenced or prototyped, namely
kernel/sys_ni.c, include/linux/syscalls.h and include/linux/compat.h"
* 'syscalls-next' of git://git.kernel.org/pub/scm/linux/kernel/git/brodo/linux: (109 commits)
bpf: whitelist all syscalls for error injection
kernel/sys_ni: remove {sys_,sys_compat} from cond_syscall definitions
kernel/sys_ni: sort cond_syscall() entries
syscalls/x86: auto-create compat_sys_*() prototypes
syscalls: sort syscall prototypes in include/linux/compat.h
net: remove compat_sys_*() prototypes from net/compat.h
syscalls: sort syscall prototypes in include/linux/syscalls.h
kexec: move sys_kexec_load() prototype to syscalls.h
x86/sigreturn: use SYSCALL_DEFINE0
x86: fix sys_sigreturn() return type to be long, not unsigned long
x86/ioport: add ksys_ioperm() helper; remove in-kernel calls to sys_ioperm()
mm: add ksys_readahead() helper; remove in-kernel calls to sys_readahead()
mm: add ksys_mmap_pgoff() helper; remove in-kernel calls to sys_mmap_pgoff()
mm: add ksys_fadvise64_64() helper; remove in-kernel call to sys_fadvise64_64()
fs: add ksys_fallocate() wrapper; remove in-kernel calls to sys_fallocate()
fs: add ksys_p{read,write}64() helpers; remove in-kernel calls to syscalls
fs: add ksys_truncate() wrapper; remove in-kernel calls to sys_truncate()
fs: add ksys_sync_file_range helper(); remove in-kernel calls to syscall
kernel: add ksys_setsid() helper; remove in-kernel call to sys_setsid()
kernel: add ksys_unshare() helper; remove in-kernel calls to sys_unshare()
...
Using the fs-internal do_linkat() helper allows us to get rid of
fs-internal calls to the sys_linkat() syscall.
Introducing the ksys_link() wrapper allows us to avoid the in-kernel
calls to sys_link() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_link().
In the near future, the only fs-external user of ksys_link() should be
converted to use vfs_link() instead.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_mknodat() helper allows us to get rid of
fs-internal calls to the sys_mknodat() syscall.
Introducing the ksys_mknod() wrapper allows us to avoid the in-kernel
calls to sys_mknod() syscall. The ksys_ prefix denotes that this function
is meant as a drop-in replacement for the syscall. In particular, it uses
the same calling convention as sys_mknod().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_symlinkat() helper allows us to get rid of
fs-internal calls to the sys_symlinkat() syscall.
Introducing the ksys_symlink() wrapper allows us to avoid the in-kernel
calls to the sys_symlink() syscall. The ksys_ prefix denotes that this
function is meant as a drop-in replacement for the syscall. In particular,
it uses the same calling convention as sys_symlink().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using the fs-internal do_mkdirat() helper allows us to get rid of
fs-internal calls to the sys_mkdirat() syscall.
Introducing the ksys_mkdir() wrapper allows us to avoid the in-kernel calls
to the sys_mkdir() syscall. The ksys_ prefix denotes that this function is
meant as a drop-in replacement for the syscall. In particular, it uses the
same calling convention as sys_mkdir().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this wrapper allows us to avoid the in-kernel calls to the
sys_rmdir() syscall. The ksys_ prefix denotes that this function is meant
as a drop-in replacement for the syscall. In particular, it uses the same
calling convention as sys_rmdir().
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
Using this helper removes in-kernel calls to the sys_renameat2() syscall.
This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>