Pull btrfs updates from David Sterba:
"The changes range through all types: cleanups, core chagnes, sanity
checks, fixes, other user visible changes, detailed list below:
- deprecated: user transaction ioctl
- mount option ssd does not change allocation alignments
- degraded read-write mount is allowed if all the raid profile
constraints are met, now based on more accurate check
- defrag: do not reset compression afterwards; the NOCOMPRESS flag
can be now overriden by defrag
- prep work for better extent reference tracking (related to the
qgroup slowness with balance)
- prep work for compression heuristics
- memory allocation reductions (may help latencies on a loaded
system)
- better accounting for io waiting states
- error handling improvements (removed BUGs)
- added more sanity checks for shared refs
- fix readdir vs pagefault deadlock under some circumstances
- fix for 'no-hole' mode, certain combination of compressed and
inline extents
- send: fix emission of invalid clone operations
- fixup file mode if setting acls fail
- more fixes from fuzzing
- oher cleanups"
* 'for-4.14' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (104 commits)
btrfs: submit superblock io with REQ_META and REQ_PRIO
btrfs: remove unnecessary memory barrier in btrfs_direct_IO
btrfs: remove superfluous chunk_tree argument from btrfs_alloc_dev_extent
btrfs: Remove chunk_objectid parameter of btrfs_alloc_dev_extent
btrfs: pass fs_info to btrfs_del_root instead of tree_root
Btrfs: add one more sanity check for shared ref type
Btrfs: remove BUG_ON in __add_tree_block
Btrfs: remove BUG() in add_data_reference
Btrfs: remove BUG() in print_extent_item
Btrfs: remove BUG() in btrfs_extent_inline_ref_size
Btrfs: convert to use btrfs_get_extent_inline_ref_type
Btrfs: add a helper to retrive extent inline ref type
btrfs: scrub: simplify scrub worker initialization
btrfs: scrub: clean up division in scrub_find_csum
btrfs: scrub: clean up division in __scrub_mark_bitmap
btrfs: scrub: use bool for flush_all_writes
btrfs: preserve i_mode if __btrfs_set_acl() fails
btrfs: Remove extraneous chunk_objectid variable
btrfs: Remove chunk_objectid argument from btrfs_make_block_group
btrfs: Remove extra parentheses from condition in copy_items()
...
We may use hex2bin() instead of custom approach.
Link: http://lkml.kernel.org/r/87zibktpil.fsf@devron
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The standard types unsigned int and unsigned long should be used for
.compat_ioctl. autofs is the only fs using uing/ulong for this, and these
are even the only uint/ulong in the entire autofs code.
Drop unneeded long cast in return value of autofs_dev_ioctl_compat().
It's already long.
Link: http://lkml.kernel.org/r/150285069709.4670.3884827966280147529.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This comment was correct when it was added in 8d7b48e0 ("autofs4: add
miscellaneous device for ioctls") in 2008, but not after 4e44b685 "Get rid
of path_lookup in autofs4" in 2009 which introduced find_autofs_mount().
Link: http://lkml.kernel.org/r/150285069148.4670.17959501481201077445.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use a macro which defines misc-dev ioctl parameter size (excluding a path
beyond &path[0]) since it's been used to initialize and copy this
structure ever since it first appeared in 8d7b48e0 in 2008.
(or simply get rid of this if this is just unnecessary abstraction when
all it needs is sizeof(struct autofs_dev_ioctl))
Edit: raven@themaw.net
That's a good point but I'd prefer to keep the macro define.
End edit: raven@themaw.net
Link: http://lkml.kernel.org/r/150285068577.4670.2599968823770600622.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Having header includes before any macro (without any dependency) simply
looks normal. No reason to have these macros in between.
Link: http://lkml.kernel.org/r/150285068011.4670.10271483982093996996.stgit@pluto.themaw.net
Signed-off-by: Tomohiro Kusumi <tkusumi@tuxera.com>
Signed-off-by: Ian Kent <raven@themaw.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some of the autofs miscellaneous device ioctls need to be accessable to
user space applications without CAP_SYS_ADMIN to get information about
autofs mounts.
Link: http://lkml.kernel.org/r/150216642517.11652.2338933266137331637.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The autofs miscellanous device ioctls that shouldn't require
CAP_SYS_ADMIN need to be accessible to user space applications in order
to be able to get information about autofs mounts.
The module checks capabilities so the miscelaneous device should be fine
with broad permissions.
Link: http://lkml.kernel.org/r/150216641928.11652.7388977863125547969.stgit@pluto.themaw.net
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The fstatat(2) and statx() calls can pass the flag AT_NO_AUTOMOUNT which
is meant to clear the LOOKUP_AUTOMOUNT flag and prevent triggering of an
automount by the call. But this flag is unconditionally cleared for all
stat family system calls except statx().
stat family system calls have always triggered mount requests for the
negative dentry case in follow_automount() which is intended but prevents
the fstatat(2) and statx() AT_NO_AUTOMOUNT case from being handled.
In order to handle the AT_NO_AUTOMOUNT for both system calls the negative
dentry case in follow_automount() needs to be changed to return ENOENT
when the LOOKUP_AUTOMOUNT flag is clear (and the other required flags are
clear).
AFAICT this change doesn't have any noticable side effects and may, in
some use cases (although I didn't see it in testing) prevent unnecessary
callbacks to the automount daemon.
It's also possible that a stat family call has been made with a path that
is in the process of being mounted by some other process. But stat family
calls should return the automount state of the path as it is "now" so it
shouldn't wait for mount completion.
This is the same semantic as the positive dentry case already handled.
Link: http://lkml.kernel.org/r/150216641255.11652.4204561328197919771.stgit@pluto.themaw.net
Fixes: deccf497d8 ("Make stat/lstat/fstatat pass AT_NO_AUTOMOUNT to vfs_statx()")
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: David Howells <dhowells@redhat.com>
Cc: Colin Walters <walters@redhat.com>
Cc: Ondrej Holy <oholy@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Omit extra messages for a memory allocation failure in this function.
This issue was detected by using the Coccinelle software.
Link: http://lkml.kernel.org/r/f92aac79-b05e-321a-1a19-d38c7159ee9c@users.sourceforge.net
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for epoll.
Link: http://lkml.kernel.org/r/20170719014603.19029-15-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
... such that we can avoid the tree walks to get the node with the
smallest key. Semantically the same, as the previously used rb_first(),
but O(1). The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for procfs.
Link: http://lkml.kernel.org/r/20170719014603.19029-14-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Allow interval trees to quickly check for overlaps to avoid unnecesary
tree lookups in interval_tree_iter_first().
As of this patch, all interval tree flavors will require using a
'rb_root_cached' such that we can have the leftmost node easily
available. While most users will make use of this feature, those with
special functions (in addition to the generic insert, delete, search
calls) will avoid using the cached option as they can do funky things
with insertions -- for example, vma_interval_tree_insert_after().
[jglisse@redhat.com: fix deadlock from typo vm_lock_anon_vma()]
Link: http://lkml.kernel.org/r/20170808225719.20723-1-jglisse@redhat.com
Link: http://lkml.kernel.org/r/20170719014603.19029-12-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Doug Ledford <dledford@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Cc: David Airlie <airlied@linux.ie>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Christian Benvenuti <benve@cisco.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If there are large numbers of hugepages to iterate while reading
/proc/pid/smaps, the page walk never does cond_resched(). On archs
without split pmd locks, there can be significant and observable
contention on mm->page_table_lock which cause lengthy delays without
rescheduling.
Always reschedule in smaps_pte_range() if necessary since the pagewalk
iteration can be expensive.
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708211405520.131071@chino.kir.corp.google.com
Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit b18cb64ead ("fs/proc: Stop trying to report thread stacks")
removed the priv parameter user in is_stack so the argument is
redundant. Drop it.
[arnd@arndb.de: remove unused variable]
Link: http://lkml.kernel.org/r/20170801120150.1520051-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170728075833.7241-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Cc: Andy Lutomirski <luto@amacapital.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is an enhancement to avoid a non cooperative userfaultfd manager
having to unregister all regions before it can close the uffd after all
userfaultfd activity completed.
The UFFDIO_UNREGISTER would serialize against the handle_userfault by
taking the mmap_sem for writing, but we can simply repeat the page fault
if we detect the uffd was closed and so the regular page fault paths
should takeover.
Link: http://lkml.kernel.org/r/20170823181227.19926-1-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Platform with advance system bus (like CAPI or CCIX) allow device memory
to be accessible from CPU in a cache coherent fashion. Add a new type of
ZONE_DEVICE to represent such memory. The use case are the same as for
the un-addressable device memory but without all the corners cases.
Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Introduce a new migration mode that allow to offload the copy to a device
DMA engine. This changes the workflow of migration and not all
address_space migratepage callback can support this.
This is intended to be use by migrate_vma() which itself is use for thing
like HMM (see include/linux/hmm.h).
No additional per-filesystem migratepage testing is needed. I disables
MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
added comment in those to explain why (part of this patch). The commit
message is unclear it should say that any callback that wish to support
this new mode need to be aware of the difference in the migration flow
from other mode.
Some of these callbacks do extra locking while copying (aio, zsmalloc,
balloon, ...) and for DMA to be effective you want to copy multiple
pages in one DMA operations. But in the problematic case you can not
easily hold the extra lock accross multiple call to this callback.
Usual flow is:
For each page {
1 - lock page
2 - call migratepage() callback
3 - (extra locking in some migratepage() callback)
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
5 - copy page
6 - (unlock any extra lock of migratepage() callback)
7 - return from migratepage() callback
8 - unlock page
}
The new mode MIGRATE_SYNC_NO_COPY:
1 - lock multiple pages
For each page {
2 - call migratepage() callback
3 - abort in all problematic migratepage() callback
4 - migrate page state (freeze refcount, update page cache, buffer
head, ...)
} // finished all calls to migratepage() callback
5 - DMA copy multiple pages
6 - unlock all the pages
To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
new callback migratepages() (for instance) that deals with multiple
pages in one transaction.
Because the problematic cases are not important for current usage I did
not wanted to complexify this patchset even more for no good reason.
Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
HMM (heterogeneous memory management) need struct page to support
migration from system main memory to device memory. Reasons for HMM and
migration to device memory is explained with HMM core patch.
This patch deals with device memory that is un-addressable memory (ie CPU
can not access it). Hence we do not want those struct page to be manage
like regular memory. That is why we extend ZONE_DEVICE to support
different types of memory.
A persistent memory type is define for existing user of ZONE_DEVICE and a
new device un-addressable type is added for the un-addressable memory
type. There is a clear separation between what is expected from each
memory type and existing user of ZONE_DEVICE are un-affected by new
requirement and new use of the un-addressable type. All specific code
path are protect with test against the memory type.
Because memory is un-addressable we use a new special swap type for when a
page is migrated to device memory (this reduces the number of maximum swap
file).
The main two additions beside memory type to ZONE_DEVICE is two callbacks.
First one, page_free() is call whenever page refcount reach 1 (which
means the page is free as ZONE_DEVICE page never reach a refcount of 0).
This allow device driver to manage its memory and associated struct page.
The second callback page_fault() happens when there is a CPU access to an
address that is back by a device page (which are un-addressable by the
CPU). This callback is responsible to migrate the page back to system
main memory. Device driver can not block migration back to system memory,
HMM make sure that such page can not be pin into device memory.
If device is in some error condition and can not migrate memory back then
a CPU page fault to device memory should end with SIGBUS.
[arnd@arndb.de: fix warning]
Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mark Hairgrove <mhairgrove@nvidia.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Sherry Cheung <SCheung@nvidia.com>
Cc: Subhash Gutti <sgutti@nvidia.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Bob Liu <liubo95@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Soft dirty bit is designed to keep tracked over page migration. This
patch makes it work in the same manner for thp migration too.
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When THP migration is being used, memory management code needs to handle
pmd migration entries properly. This patch uses !pmd_present() or
is_swap_pmd() (depending on whether pmd_none() needs separate code or
not) to check pmd migration entries at the places where a pmd entry is
present.
Since pmd-related code uses split_huge_page(), split_huge_pmd(),
pmd_trans_huge(), pmd_trans_unstable(), or
pmd_none_or_trans_huge_or_clear_bad(), this patch:
1. adds pmd migration entry split code in split_huge_pmd(),
2. takes care of pmd migration entries whenever pmd_trans_huge() is present,
3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.
Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
them.
Until this commit, a pmd entry should be:
1. pointing to a pte page,
2. is_swap_pmd(),
3. pmd_trans_huge(),
4. pmd_devmap(), or
5. pmd_none().
Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Nellans <dnellans@nvidia.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
running set*id processes. To do this, the bprm_secureexec LSM hook is
collapsed into the bprm_set_creds hook so the secureexec-ness of an exec
can be determined early enough to make decisions about rlimits and the
resulting memory layouts. Other logic acting on the secureexec-ness of an
exec is similarly consolidated. Capabilities needed some special handling,
but the refactoring removed other special handling, so that was a wash.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJZrwRKAAoJEIly9N/cbcAmhboP/iwLbYfWngIJdu3pYKrW+CEg
uUVY6RNnsumJ5yEhD/yQKXSPmZ8PkC8vexPYvf8TcPOlMRQuhVvdiR0FfSUvkMWy
pB8ZVCyAV1uSnW4BH61FCxHInrahy8jlvQwnAujvw+FNxhcQjyEGKupOLIMGLioQ
8G5Ihf+hOjiXRhKbXueQi89n8i4jEI5YTH1RnC+Gsy8jG11EC9BhPddKSMaUKZA3
HYYqUyV0daYpGuxTOxaRdDO5wb6rlS+B46hqtOsSsIBOQkCjnLCRcdeMCqvXjQmv
kyZj03cPlUjEHqh3d3nB6utvVWReGf/p986//kQjT1OZPhATbySAu7wUHoLik3dU
zuexudNTBROf6YXahMxSJp348GS++xoBFARa78402E++U7C4/eoclbLCWAylBwVA
H+QAHFYRC2WFoskejSYBRPz6HLr1SIaSYMsKbkHqP07zi6p3ic2Uq3XvOP2zL/5p
l/mXa1Fs2vcDOWPER8a8b9mVkJDvuXj6J11lG+q80UWAWC3sd9GkSwOen80ps3Xo
/7dd+h2BAJSSVxZQFxd5YCx99mT0ntQZ797PhjxOY6SX/xUdOCAp9x1zDU5OUovP
q2ty3UTd7tq8h1RnHOnrn9cKmMmI7kpBvEfPGM507cEVjyfsMu2jJtUxN9dXOAkB
aebEsg3C8M6z5OdGVpWH
=Yva4
-----END PGP SIGNATURE-----
Merge tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull secureexec update from Kees Cook:
"This series has the ultimate goal of providing a sane stack rlimit
when running set*id processes.
To do this, the bprm_secureexec LSM hook is collapsed into the
bprm_set_creds hook so the secureexec-ness of an exec can be
determined early enough to make decisions about rlimits and the
resulting memory layouts. Other logic acting on the secureexec-ness of
an exec is similarly consolidated. Capabilities needed some special
handling, but the refactoring removed other special handling, so that
was a wash"
* tag 'secureexec-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
exec: Consolidate pdeath_signal clearing
exec: Use sane stack rlimit under secureexec
exec: Consolidate dumpability logic
smack: Remove redundant pdeath_signal clearing
exec: Use secureexec for clearing pdeath_signal
exec: Use secureexec for setting dumpability
LSM: drop bprm_secureexec hook
commoncap: Move cap_elevated calculation into bprm_set_creds
commoncap: Refactor to remove bprm_secureexec hook
smack: Refactor to remove bprm_secureexec hook
selinux: Refactor to remove bprm_secureexec hook
apparmor: Refactor to remove bprm_secureexec hook
binfmt: Introduce secureexec flag
exec: Correct comments about "point of no return"
exec: Rename bprm->cred_prepared to called_set_creds
and defining more restrictive root directory DAC permissions default
(0750, which can be adjust after boot unlike the CAP_SYSLOG check).
Suggested by Nick Kralevich.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
Comment: Kees Cook <kees@outflux.net>
iQIcBAABCgAGBQJZrv+iAAoJEIly9N/cbcAmZXcP/jZ7dW3zQiZ2q6YQDokaABT4
AZxGdDrogLQ6wWmV+ApHIYEOTcVvbswvBLwKIE7l9XpG41tIKUe4h9iCVvpBSARP
SpyeawztJ8KNw00EFZWP/hOxCXHeausilea/1zh/+Rt5VhU2YIw/fhew821bjLmh
3exBjoLcWSHHCUY/e9ByMB0mB0SYUmnqhFub77Z6zZMhaRw9/gvPibS1DdmjGPPI
Rq0zejFAqXy50rmbKVTT2QQPq/gQnUyb/Q216ytbSUntaAwfISDrwN74slupjG3S
Vrca+BxThJYZ+rnbqjMDoROgKAYNqyIlvFVCO3H6DUqnPnGROIAeGELAcGyncUo+
6Mdpumhy25K0+YbJkNYxm1cyH0w47EWpIqBqPTh1IhuedDB5cpdamR88dShmMzNA
XhvMhe9eNxI5ZzOg8X8qCEc/hRZoZj5F4m2R+Wh55YRH3rDtuaIzONPvGyJfYYVS
tY8ut/r8+qMID9I4qLtIAmVX2rzR/6BG7H3ofApY0OGFRmCt0nicUdN56JJ+GNRf
7XfpEXDL+sG3fkUk8oQSfSEhLuOseTazLuxrQAWJIZ3FZ4JnRW/a/izlbsI2+nvy
FcC1+tG43ISwir5jZzNznYNrGM01TdFwQ5izKE3E1U+xsBRbR7OT8Y0005Z+GUwW
6feSKts8UKq4tFNt1WY9
=+gsj
-----END PGP SIGNATURE-----
Merge tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux
Pull pstore update from Kees Cook:
"Make pstore permissions more versatile by removing CAP_SYSLOG
requirement and defining more restrictive root directory DAC
permissions default (0750, which can be adjust after boot unlike the
CAP_SYSLOG check).
Suggested by Nick Kralevich"
* tag 'pstore-v4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
Revert "pstore: Honor dmesg_restrict sysctl on dmesg dumps"
pstore: Make default pstorefs root dir perms 0750
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQGcBAABAgAGBQJZsb0OAAoJEIosvXAHck9RgOYL/0AAqRPEt1mvZTIm51LXxSYy
4Iyyz8AEHFp4ZW5jctblm49/4mh3rJ4O7Ig50de00y+XkoJIzACgf1lazaTTCcNQ
lspHnfqCgR1DvF+A7OxbfS4CkF43SXFnL3GcoiW5zpV8QndM3DWLVIj1AHpEtEOX
sa3mSdFVhdEP6ka5Q98vam4N6jRnvz1gapLQKHpRTuVAGYZAw44+8HJKxN5btTIP
F+9X4zCyNjDDsuKoxAkVZmo/k7cxbKQBRjq9fNLHPR/GPEy06I+j2AqTnd7Gc/o+
2hw0X4eF8N7HbEujnvEcqPLCqwNVtPNdlOelICNmNSo4OYhrhb3u93hEv7oFuVLH
/PlGrzRYi6TEf1aBiftfA9aNVNzCHtnqRIZiC9+LhCiI77H9ef4/Fl7i6xsoKcAi
XA0RyslWNbhlQBWMXeQN7k2lyZIx4+Kq7xnnW1FBPLIuLMCrZAS0/IAcakKf3Mud
zDzOKI7ZRWEIEaSsM/I/QFfagkMuKTSea6XovkogUw==
=+DYj
-----END PGP SIGNATURE-----
Merge tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs update from Steve French:
"Enable xattr support for smb3 and also a bugfix"
* tag '4.14-smb3-xattr-enable' of git://git.samba.org/sfrench/cifs-2.6:
cifs: Check for timeout on Negotiate stage
cifs: Add support for writing attributes on SMB2+
cifs: Add support for reading attributes on SMB2+
Pull aio fix from Ben LaHaise:
"Improve aio-nr counting on large SMP systems.
It has been in linux-next for quite some time"
* git://git.kvack.org/~bcrl/aio-next:
fs: aio: fix the increment of aio-nr and counting against aio-max-nr
Pull quota scaling updates from Jan Kara:
"This contains changes to make the quota subsystem more scalable.
Reportedly it improves number of files created per second on ext4
filesystem on fast storage by about a factor of 2x"
* 'quota_scaling' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: (28 commits)
quota: Add lock annotations to struct members
quota: Reduce contention on dq_data_lock
fs: Provide __inode_get_bytes()
quota: Inline dquot_[re]claim_reserved_space() into callsite
quota: Inline inode_{incr,decr}_space() into callsites
quota: Inline functions into their callsites
ext4: Disable dirty list tracking of dquots when journalling quotas
quota: Allow disabling tracking of dirty dquots in a list
quota: Remove dq_wait_unused from dquot
quota: Move locking into clear_dquot_dirty()
quota: Do not dirty bad dquots
quota: Fix possible corruption of dqi_flags
quota: Propagate ->quota_read errors from v2_read_file_info()
quota: Fix error codes in v2_read_file_info()
quota: Push dqio_sem down to ->read_file_info()
quota: Push dqio_sem down to ->write_file_info()
quota: Push dqio_sem down to ->get_next_id()
quota: Push dqio_sem down to ->release_dqblk()
quota: Remove locking for writing to the old quota format
quota: Do not acquire dqio_sem for dquot overwrites in v2 format
...
Pull UDF, reiserfs, quota, fsnotify cleanups from Jan Kara:
"Several UDF, reiserfs, quota and fsnotify cleanups.
Note that there is also a patch updating MAINTAINERS entry for
notification subsystem to point to me as a maintainer since current
entries are stale"
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
fsnotify: make dnotify_fsnotify_ops const
isofs: Delete an unnecessary variable initialisation in isofs_read_inode()
isofs: Adjust four checks for null pointers
isofs: Delete an error message for a failed memory allocation in isofs_read_inode()
quota_v2: Delete an error message for a failed memory allocation in v2_read_file_info()
fs-udf: Delete an error message for a failed memory allocation in two functions
fs-udf: Improve six size determinations
fs-udf: Adjust two checks for null pointers
reiserfs: fix spelling mistake: "tranasction" -> "transaction"
MAINTAINERS: Update entries for notification subsystem
uapi/linux/quota.h: Do not include linux/errno.h
-----BEGIN PGP SIGNATURE-----
iQIcBAABAgAGBQJZsSBoAAoJEAhfPr2O5OEVDc4QAJZSuVYmyLgvtmPxhyqgCvkz
I0DmWM4ZtK2VT/xJ/AA23z8IiLKi2+pDC0Xx6/aIiA665cyl3oPUdkKIaHW9Z6+A
fV8gSFkmGkluQb9mP/KdHYI2oSeEv2ivCa1kfaApYcoBa904z8uU++z15Iu5p/+m
fjpc2vnc9rax0Vuwmgv7p1CL4j4e/ja0siCSCGbu2ad50KqP4ytnBooNPQOQt89D
L+Av5MeGml/CTUUnAFjWfSmQ72Ht8GhoBBKc6wGoq9x3GTckDDTqy8BAqGt4UQnu
fR0mb71zuSVmTjxRe7tc/74m3ReaeSHzQeHJhjdQslvNmV3RVQgk/6CCsmqNEegr
rbC3glQCM+gp5YywCjRL6DCPsoqvjexLtPQjMZIGYxgSYQUyXGOxilgmj9+73761
6aOl0nqdgN+vlWzaSeDF9EQxRsc+cCq/Po8/xuPE/Pzs6zTQwU+6b+ADLf9jCyDP
LTC49wOj24SoWiTlG1FTct2ogZ3h5wNPWlurBtmyiFJn+43RpsH5IW9wLilCjeiE
6JeCWEIBglCCq/TVCzETKNSaixDL6/lMQ9uRdCpIO4VLyoS6S9pZASNPBmQ1h7h/
oTjYDeWirIthNOccstbBoJQYSX62CqAIW3wq5ME6PAgM+ioiLXLYk0fV3yBKoBNW
Z0SBeTcuPxWmfzuxMtik
=fNM2
-----END PGP SIGNATURE-----
Merge tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media updates from Mauro Carvalho Chehab:
"Brazil's Independence Day pull request :-)
This is one of the biggest media pull requests, with 625 patches
affecting almost all parts of media (RC, DVB, V4L2, CEC, docs).
This contains:
- A lot of new drivers:
* DVB frontends: mxl5xx, stv0910, stv6111;
* camera flash: as3645a led driver;
* HDMI receiver: adv748X;
* camera sensor: Omnivision 6650 5M driver (ov6650);
* HDMI CEC: ao-cec meson driver;
* V4L2: Qualcom camss driver;
* Remote controller: gpio-ir-tx, pwm-ir-tx and zx-irdec drivers.
- The DDbridge DVB driver got a massive update, with makes it in sync
with modern hardware from that vendor;
- There's an important milestone on this series: the DVB
documentation was written in 2003, but only started to be updated
in 2007. It also used to contain several gaps from the time it was
kept out of tree, mentioning error codes and device nodes that
never existed upstream. On this series, it received a massive
update: all non-deprecated digital TV APIs are now in sync with the
current implementation;
- Some DVB APIs that aren't used by any upstream driver got removed;
- Other parts of the media documentation algo got updated, fixing
some bugs on its PDF output and making it compatible with Sphinx
version 1.6.
As the number of hacks required to build PDF output reduced, I hope
we'll have less troubles as newer versions of our documentation
toolchain are released (famous last words);
- As usual, lots of driver cleanups and improvements"
* tag 'media/v4.14-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (624 commits)
media: leds: as3645a: add V4L2_FLASH_LED_CLASS dependency
media: get rid of removed DMX_GET_CAPS and DMX_SET_SOURCE leftovers
media: Revert "[media] v4l: async: make v4l2 coexist with devicetree nodes in a dt overlay"
media: staging: atomisp: sh_css_calloc shall return a pointer to the allocated space
media: Revert "[media] lirc_dev: remove superfluous get/put_device() calls"
media: add qcom_camss.rst to v4l-drivers rst file
media: dvb headers: make checkpatch happier
media: dvb uapi: move frontend legacy API to another part of the book
media: pixfmt-srggb12p.rst: better format the table for PDF output
media: docs-rst: media: Don't use \small for V4L2_PIX_FMT_SRGGB10 documentation
media: index.rst: don't write "Contents:" on PDF output
media: pixfmt*.rst: replace a two dots by a comma
media: vidioc-g-fmt.rst: adjust table format
media: vivid.rst: add a blank line to correct ReST format
media: v4l2 uapi book: get rid of driver programming's chapter
media: format.rst: use the right markup for important notes
media: docs-rst: cardlists: change their format to flat-tables
media: em28xx-cardlist.rst: update to reflect last changes
media: v4l2-event.rst: adjust table to fit on PDF output
media: docs: don't show ToC for each part on PDF output
...
Pull block layer updates from Jens Axboe:
"This is the first pull request for 4.14, containing most of the code
changes. It's a quiet series this round, which I think we needed after
the churn of the last few series. This contains:
- Fix for a registration race in loop, from Anton Volkov.
- Overflow complaint fix from Arnd for DAC960.
- Series of drbd changes from the usual suspects.
- Conversion of the stec/skd driver to blk-mq. From Bart.
- A few BFQ improvements/fixes from Paolo.
- CFQ improvement from Ritesh, allowing idling for group idle.
- A few fixes found by Dan's smatch, courtesy of Dan.
- A warning fixup for a race between changing the IO scheduler and
device remova. From David Jeffery.
- A few nbd fixes from Josef.
- Support for cgroup info in blktrace, from Shaohua.
- Also from Shaohua, new features in the null_blk driver to allow it
to actually hold data, among other things.
- Various corner cases and error handling fixes from Weiping Zhang.
- Improvements to the IO stats tracking for blk-mq from me. Can
drastically improve performance for fast devices and/or big
machines.
- Series from Christoph removing bi_bdev as being needed for IO
submission, in preparation for nvme multipathing code.
- Series from Bart, including various cleanups and fixes for switch
fall through case complaints"
* 'for-4.14/block' of git://git.kernel.dk/linux-block: (162 commits)
kernfs: checking for IS_ERR() instead of NULL
drbd: remove BIOSET_NEED_RESCUER flag from drbd_{md_,}io_bio_set
drbd: Fix allyesconfig build, fix recent commit
drbd: switch from kmalloc() to kmalloc_array()
drbd: abort drbd_start_resync if there is no connection
drbd: move global variables to drbd namespace and make some static
drbd: rename "usermode_helper" to "drbd_usermode_helper"
drbd: fix race between handshake and admin disconnect/down
drbd: fix potential deadlock when trying to detach during handshake
drbd: A single dot should be put into a sequence.
drbd: fix rmmod cleanup, remove _all_ debugfs entries
drbd: Use setup_timer() instead of init_timer() to simplify the code.
drbd: fix potential get_ldev/put_ldev refcount imbalance during attach
drbd: new disk-option disable-write-same
drbd: Fix resource role for newly created resources in events2
drbd: mark symbols static where possible
drbd: Send P_NEG_ACK upon write error in protocol != C
drbd: add explicit plugging when submitting batches
drbd: change list_for_each_safe to while(list_first_entry_or_null)
drbd: introduce drbd_recv_header_maybe_unplug
...
Currently, aio-nr is incremented in steps of 'num_possible_cpus() * 8'
for io_setup(nr_events, ..) with 'nr_events < num_possible_cpus() * 4':
ioctx_alloc()
...
nr_events = max(nr_events, num_possible_cpus() * 4);
nr_events *= 2;
...
ctx->max_reqs = nr_events;
...
aio_nr += ctx->max_reqs;
....
This limits the number of aio contexts actually available to much less
than aio-max-nr, and is increasingly worse with greater number of CPUs.
For example, with 64 CPUs, only 256 aio contexts are actually available
(with aio-max-nr = 65536) because the increment is 512 in that scenario.
Note: 65536 [max aio contexts] / (64*4*2) [increment per aio context]
is 128, but make it 256 (double) as counting against 'aio-max-nr * 2':
ioctx_alloc()
...
if (aio_nr + nr_events > (aio_max_nr * 2UL) ||
...
goto err_ctx;
...
This patch uses the original value of nr_events (from userspace) to
increment aio-nr and count against aio-max-nr, which resolves those.
Signed-off-by: Mauricio Faria de Oliveira <mauricfo@linux.vnet.ibm.com>
Reported-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Lekshmi C. Pillai <lekshmi.cpillai@in.ibm.com>
Tested-by: Paul Nguyen <nguyenp@us.ibm.com>
Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
Signed-off-by: Benjamin LaHaise <bcrl@kvack.org>
Merge updates from Andrew Morton:
- various misc bits
- DAX updates
- OCFS2
- most of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (119 commits)
mm,fork: introduce MADV_WIPEONFORK
x86,mpx: make mpx depend on x86-64 to free up VMA flag
mm: add /proc/pid/smaps_rollup
mm: hugetlb: clear target sub-page last when clearing huge page
mm: oom: let oom_reap_task and exit_mmap run concurrently
swap: choose swap device according to numa node
mm: replace TIF_MEMDIE checks by tsk_is_oom_victim
mm, oom: do not rely on TIF_MEMDIE for memory reserves access
z3fold: use per-cpu unbuddied lists
mm, swap: don't use VMA based swap readahead if HDD is used as swap
mm, swap: add sysfs interface for VMA based swap readahead
mm, swap: VMA based swap readahead
mm, swap: fix swap readahead marking
mm, swap: add swap readahead hit statistics
mm/vmalloc.c: don't reinvent the wheel but use existing llist API
mm/vmstat.c: fix wrong comment
selftests/memfd: add memfd_create hugetlbfs selftest
mm/shmem: add hugetlbfs support to memfd_create()
mm, devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups
mm/vmalloc.c: halve the number of comparisons performed in pcpu_get_vm_areas()
...
Introduce MADV_WIPEONFORK semantics, which result in a VMA being empty
in the child process after fork. This differs from MADV_DONTFORK in one
important way.
If a child process accesses memory that was MADV_WIPEONFORK, it will get
zeroes. The address ranges are still valid, they are just empty.
If a child process accesses memory that was MADV_DONTFORK, it will get a
segmentation fault, since those address ranges are no longer valid in
the child after fork.
Since MADV_DONTFORK also seems to be used to allow very large programs
to fork in systems with strict memory overcommit restrictions, changing
the semantics of MADV_DONTFORK might break existing programs.
MADV_WIPEONFORK only works on private, anonymous VMAs.
The use case is libraries that store or cache information, and want to
know that they need to regenerate it in the child process after fork.
Examples of this would be:
- systemd/pulseaudio API checks (fail after fork) (replacing a getpid
check, which is too slow without a PID cache)
- PKCS#11 API reinitialization check (mandated by specification)
- glibc's upcoming PRNG (reseed after fork)
- OpenSSL PRNG (reseed after fork)
The security benefits of a forking server having a re-inialized PRNG in
every child process are pretty obvious. However, due to libraries
having all kinds of internal state, and programs getting compiled with
many different versions of each library, it is unreasonable to expect
calling programs to re-initialize everything manually after fork.
A further complication is the proliferation of clone flags, programs
bypassing glibc's functions to call clone directly, and programs calling
unshare, causing the glibc pthread_atfork hook to not get called.
It would be better to have the kernel take care of this automatically.
The patch also adds MADV_KEEPONFORK, to undo the effects of a prior
MADV_WIPEONFORK.
This is similar to the OpenBSD minherit syscall with MAP_INHERIT_ZERO:
https://man.openbsd.org/minherit.2
[akpm@linux-foundation.org: numerically order arch/parisc/include/uapi/asm/mman.h #defines]
Link: http://lkml.kernel.org/r/20170811212829.29186-3-riel@redhat.com
Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Florian Weimer <fweimer@redhat.com>
Reported-by: Colm MacCártaigh <colm@allcosts.net>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Helge Deller <deller@gmx.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Drewry <wad@chromium.org>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
/proc/pid/smaps_rollup is a new proc file that improves the performance
of user programs that determine aggregate memory statistics (e.g., total
PSS) of a process.
Android regularly "samples" the memory usage of various processes in
order to balance its memory pool sizes. This sampling process involves
opening /proc/pid/smaps and summing certain fields. For very large
processes, sampling memory use this way can take several hundred
milliseconds, due mostly to the overhead of the seq_printf calls in
task_mmu.c.
smaps_rollup improves the situation. It contains most of the fields of
/proc/pid/smaps, but instead of a set of fields for each VMA,
smaps_rollup instead contains one synthetic smaps-format entry
representing the whole process. In the single smaps_rollup synthetic
entry, each field is the summation of the corresponding field in all of
the real-smaps VMAs. Using a common format for smaps_rollup and smaps
allows userspace parsers to repurpose parsers meant for use with
non-rollup smaps for smaps_rollup, and it allows userspace to switch
between smaps_rollup and smaps at runtime (say, based on the
availability of smaps_rollup in a given kernel) with minimal fuss.
By using smaps_rollup instead of smaps, a caller can avoid the
significant overhead of formatting, reading, and parsing each of a large
process's potentially very numerous memory mappings. For sampling
system_server's PSS in Android, we measured a 12x speedup, representing
a savings of several hundred milliseconds.
One alternative to a new per-process proc file would have been including
PSS information in /proc/pid/status. We considered this option but
thought that PSS would be too expensive (by a few orders of magnitude)
to collect relative to what's already emitted as part of
/proc/pid/status, and slowing every user of /proc/pid/status for the
sake of readers that happen to want PSS feels wrong.
The code itself works by reusing the existing VMA-walking framework we
use for regular smaps generation and keeping the mem_size_stats
structure around between VMA walks instead of using a fresh one for each
VMA. In this way, summation happens automatically. We let seq_file
walk over the VMAs just as it does for regular smaps and just emit
nothing to the seq_file until we hit the last VMA.
Benchmarks:
using smaps:
iterations:1000 pid:1163 pss:220023808
0m29.46s real 0m08.28s user 0m20.98s system
using smaps_rollup:
iterations:1000 pid:1163 pss:220702720
0m04.39s real 0m00.03s user 0m04.31s system
We're using the PSS samples we collect asynchronously for
system-management tasks like fine-tuning oom_adj_score, memory use
tracking for debugging, application-level memory-use attribution, and
deciding whether we want to kill large processes during system idle
maintenance windows. Android has been using PSS for these purposes for
a long time; as the average process VMA count has increased and and
devices become more efficiency-conscious, PSS-collection inefficiency
has started to matter more. IMHO, it'd be a lot safer to optimize the
existing PSS-collection model, which has been fine-tuned over the years,
instead of changing the memory tracking approach entirely to work around
smaps-generation inefficiency.
Tim said:
: There are two main reasons why Android gathers PSS information:
:
: 1. Android devices can show the user the amount of memory used per
: application via the settings app. This is a less important use case.
:
: 2. We log PSS to help identify leaks in applications. We have found
: an enormous number of bugs (in the Android platform, in Google's own
: apps, and in third-party applications) using this data.
:
: To do this, system_server (the main process in Android userspace) will
: sample the PSS of a process three seconds after it changes state (for
: example, app is launched and becomes the foreground application) and about
: every ten minutes after that. The net result is that PSS collection is
: regularly running on at least one process in the system (usually a few
: times a minute while the screen is on, less when screen is off due to
: suspend). PSS of a process is an incredibly useful stat to track, and we
: aren't going to get rid of it. We've looked at some very hacky approaches
: using RSS ("take the RSS of the target process, subtract the RSS of the
: zygote process that is the parent of all Android apps") to reduce the
: accounting time, but it regularly overestimated the memory used by 20+
: percent. Accordingly, I don't think that there's a good alternative to
: using PSS.
:
: We started looking into PSS collection performance after we noticed random
: frequency spikes while a phone's screen was off; occasionally, one of the
: CPU clusters would ramp to a high frequency because there was 200-300ms of
: constant CPU work from a single thread in the main Android userspace
: process. The work causing the spike (which is reasonable governor
: behavior given the amount of CPU time needed) was always PSS collection.
: As a result, Android is burning more power than we should be on PSS
: collection.
:
: The other issue (and why I'm less sure about improving smaps as a
: long-term solution) is that the number of VMAs per process has increased
: significantly from release to release. After trying to figure out why we
: were seeing these 200-300ms PSS collection times on Android O but had not
: noticed it in previous versions, we found that the number of VMAs in the
: main system process increased by 50% from Android N to Android O (from
: ~1800 to ~2700) and varying increases in every userspace process. Android
: M to N also had an increase in the number of VMAs, although not as much.
: I'm not sure why this is increasing so much over time, but thinking about
: ASLR and ways to make ASLR better, I expect that this will continue to
: increase going forward. I would not be surprised if we hit 5000 VMAs on
: the main Android process (system_server) by 2020.
:
: If we assume that the number of VMAs is going to increase over time, then
: doing anything we can do to reduce the overhead of each VMA during PSS
: collection seems like the right way to go, and that means outputting an
: aggregate statistic (to avoid whatever overhead there is per line in
: writing smaps and in reading each line from userspace).
Link: http://lkml.kernel.org/r/20170812022148.178293-1-dancol@google.com
Signed-off-by: Daniel Colascione <dancol@google.com>
Cc: Tim Murray <timmurray@google.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
No ABI change, but this will make it more explicit to software that ptid
is only available if requested by passing UFFD_FEATURE_THREAD_ID to
UFFDIO_API. The fact it's a union will also self document it shouldn't
be taken for granted there's a tpid there.
Link: http://lkml.kernel.org/r/20170802165145.22628-7-aarcange@redhat.com
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Alexey Perevalov <a.perevalov@samsung.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It could be useful for calculating downtime during postcopy live
migration per vCPU. Side observer or application itself will be
informed about proper task's sleep during userfaultfd processing.
Process's thread id is being provided when user requeste it by setting
UFFD_FEATURE_THREAD_ID bit into uffdio_api.features.
Link: http://lkml.kernel.org/r/20170802165145.22628-6-aarcange@redhat.com
Signed-off-by: Alexey Perevalov <a.perevalov@samsung.com>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: Maxime Coquelin <maxime.coquelin@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In some cases, userfaultfd mechanism should just deliver a SIGBUS signal
to the faulting process, instead of the page-fault event. Dealing with
page-fault event using a monitor thread can be an overhead in these
cases. For example applications like the database could use the
signaling mechanism for robustness purpose.
Database uses hugetlbfs for performance reason. Files on hugetlbfs
filesystem are created and huge pages allocated using fallocate() API.
Pages are deallocated/freed using fallocate() hole punching support.
These files are mmapped and accessed by many processes as shared memory.
The database keeps track of which offsets in the hugetlbfs file have
pages allocated.
Any access to mapped address over holes in the file, which can occur due
to bugs in the application, is considered invalid and expect the process
to simply receive a SIGBUS. However, currently when a hole in the file
is accessed via the mapped address, kernel/mm attempts to automatically
allocate a page at page fault time, resulting in implicitly filling the
hole in the file. This may not be the desired behavior for applications
like the database that want to explicitly manage page allocations of
hugetlbfs files.
Using userfaultfd mechanism with this support to get a signal, database
application can prevent pages from being allocated implicitly when
processes access mapped address over holes in the file.
This patch adds UFFD_FEATURE_SIGBUS feature to userfaultfd mechnism to
request for a SIGBUS signal.
See following for previous discussion about the database requirement
leading to this proposal as suggested by Andrea.
http://www.spinics.net/lists/linux-mm/msg129224.html
Link: http://lkml.kernel.org/r/1501552446-748335-2-git-send-email-prakash.sangappa@oracle.com
Signed-off-by: Prakash Sangappa <prakash.sangappa@oracle.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Shuah Khan <shuah@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
global_page_state is error prone as a recent bug report pointed out [1].
It only returns proper values for zone based counters as the enum it
gets suggests. We already have global_node_page_state so let's rename
global_page_state to global_zone_page_state to be more explicit here.
All existing users seems to be correct:
$ git grep "global_page_state(NR_" | sed 's@.*(\(NR_[A-Z_]*\)).*@\1@' | sort | uniq -c
2 NR_BOUNCE
2 NR_FREE_CMA_PAGES
11 NR_FREE_PAGES
1 NR_KERNEL_STACK_KB
1 NR_MLOCK
2 NR_PAGETABLE
This patch shouldn't introduce any functional change.
[1] http://lkml.kernel.org/r/201707260628.v6Q6SmaS030814@www262.sakura.ne.jp
Link: http://lkml.kernel.org/r/20170801134256.5400-2-hannes@cmpxchg.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fsync codepath assumes that f_mapping can never be NULL, but
sync_file_range has a check for that.
Remove the one from sync_file_range as I don't see how you'd ever get a
NULL pointer in here.
Link: http://lkml.kernel.org/r/20170525110509.9434-1-jlayton@redhat.com
Signed-off-by: Jeff Layton <jlayton@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now when shmem VMAs can be filled with zero page via userfaultfd we can
report that UFFDIO_ZEROPAGE is available for those VMAs
Link: http://lkml.kernel.org/r/1497939652-16528-7-git-send-email-rppt@linux.vnet.ibm.com
Signed-off-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pavel Emelyanov <xemul@virtuozzo.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All users of pagevec_lookup() and pagevec_lookup_range() now pass
PAGEVEC_SIZE as a desired number of pages.
Just drop the argument.
Link: http://lkml.kernel.org/r/20170726114704.7626-11-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in page_cache_seek_hole_data(). Use
pagevec_lookup_range() instead of pagevec_lookup() and remove
unnecessary code.
Note that the check for getting less pages than desired can be removed
because index gets updated by pagevec_lookup_range().
Link: http://lkml.kernel.org/r/20170726114704.7626-9-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We want only pages from given range in remove_inode_hugepages(). Use
pagevec_lookup_range() instead of pagevec_lookup().
Link: http://lkml.kernel.org/r/20170726114704.7626-8-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Nadia Yvette Chambers <nyc@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Both occurences of pagevec_lookup() actually want only pages from a
given range. Use pagevec_lookup_range() for the lookup.
Link: http://lkml.kernel.org/r/20170726114704.7626-7-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Use pagevec_lookup_range() in ext4_find_unwritten_pgoff() since we are
interested only in pages in the given range. Simplify the logic as a
result of not getting pages out of range and index getting automatically
advanced.
Link: http://lkml.kernel.org/r/20170726114704.7626-6-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit e64855c6cf ("fs: Add helper to clean bdev aliases under a bh
and use it") added a wrapper for clean_bdev_aliases() that invalidates
bdev aliases underlying a single buffer head.
However this has caused a performance regression for bonnie++ benchmark
on ext4 filesystem when delayed allocation is turned off (ext3 mode) -
average of 3 runs:
Hmean SeqOut Char 164787.55 ( 0.00%) 107189.06 (-34.95%)
Hmean SeqOut Block 219883.89 ( 0.00%) 168870.32 (-23.20%)
The reason for this regression is that clean_bdev_aliases() is slower
when called for a single block because pagevec_lookup() it uses will end
up iterating through the radix tree until it finds a page (which may
take a while) but we are only interested whether there's a page at a
particular index.
Fix the problem by using pagevec_lookup_range() instead which avoids the
needless iteration.
Fixes: e64855c6cf ("fs: Add helper to clean bdev aliases under a bh and use it")
Link: http://lkml.kernel.org/r/20170726114704.7626-5-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Jens Axboe <axboe@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make pagevec_lookup() (and underlying find_get_pages()) update index to
the next page where iteration should continue. Most callers want this
and also pagevec_lookup_tag() already does this.
Link: http://lkml.kernel.org/r/20170726114704.7626-3-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "Ranged pagevec lookup", v2.
In this series I make pagevec_lookup() update the index (to be
consistent with pagevec_lookup_tag() and also as a preparation for
ranged lookups), provide ranged variant of pagevec_lookup() and use it
in places where it makes sense. This not only removes some common code
but is also a measurable performance win for some use cases (see patch
4/10) where radix tree is sparse and searching & grabing of a page after
the end of the range has measurable overhead.
This patch (of 10):
The callback doesn't ever get called. Remove it.
Link: http://lkml.kernel.org/r/20170726114704.7626-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
clean up some unused functions and parameters.
Link: http://lkml.kernel.org/r/598A5E21.2080807@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Alex Chen <alex.chen@huawei.com>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The function is never called outside of fs/ocfs2/acl.c.
Link: http://lkml.kernel.org/r/20170801141252.19675-2-jack@suse.cz
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: Mark Fasheh <mfasheh@versity.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Joseph Qi <jiangqi903@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>