linux-hardened/mm
Johannes Weiner 1276ad68e2 mm: vmscan: scan dirty pages even in laptop mode
Patch series "mm: vmscan: fix kswapd writeback regression".

We noticed a regression on multiple Hadoop workloads when moving from
3.10 to 4.0 and 4.6: kswapd gets tangled up in page writeout, causing
direct reclaim herds that also fail to make progress.

I tracked it down to the thrash-avoidance work done after 3.10 that
makes the kernel better at keeping use-once cache and use-many cache
sorted onto the inactive and active lists, with more aggressive
protection of the active list as long as there is inactive cache.
Unfortunately, our workload's use-once cache is mostly from streaming
writes.  Waiting for those writes, in order to avoid potential reloads
in the future, is not a good tradeoff.

These patches do the following:

1. Wake the flushers when kswapd sees a lump of dirty pages. It's
   possible to be below the dirty background limit and still have cache
   velocity push them through the LRU. So start a-flushin' (a minimal
   sketch follows this list).

2. Let kswapd only write pages that have been rotated twice. This makes
   sure we really tried to get all the clean pages on the inactive list
   before resorting to horrible LRU-order writeback.

3. Move rotating dirty pages off the inactive list. Instead of churning
   or waiting on page writeback, we'll go after clean active cache. This
   might lead to thrashing, but in this state memory demand outstrips IO
   speed anyway, and reads are faster than writes.
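
A minimal sketch of what item 1 could look like inside mm/vmscan.c's
shrink_inactive_list(); the trigger condition and the counter names
(nr_unqueued_dirty, nr_taken) are illustrative assumptions rather than
the series' exact diff, though wakeup_flusher_threads() and
WB_REASON_VMSCAN are the 4.10-era writeback interfaces:

	/*
	 * Illustrative only: if every page isolated in this batch is
	 * dirty but not yet queued for I/O, the flushers are lagging
	 * behind the LRU, so nudge them instead of letting kswapd do
	 * LRU-order writeback itself.
	 */
	if (nr_unqueued_dirty == nr_taken)
		wakeup_flusher_threads(0, WB_REASON_VMSCAN);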

Mel backported the series to 4.10-rc5 with one minor conflict and ran a
couple of tests on it.  A mixed random read/write workload didn't show
anything interesting.  A write-only database workload didn't show much
difference in performance, but there were slight reductions in IO --
probably in the noise.

simoop did show big differences, although not as big as Mel expected.
This is Chris Mason's workload that simulates the VM activity of
Hadoop.  Mel won't go through the full details, but over the samples
measured during an hour it reported:

                                         4.10.0-rc5            4.10.0-rc5
                                            vanilla         johannes-v1r1
Amean    p50-Read             21346531.56 (  0.00%) 21697513.24 ( -1.64%)
Amean    p95-Read             24700518.40 (  0.00%) 25743268.98 ( -4.22%)
Amean    p99-Read             27959842.13 (  0.00%) 28963271.11 ( -3.59%)
Amean    p50-Write                1138.04 (  0.00%)      989.82 ( 13.02%)
Amean    p95-Write             1106643.48 (  0.00%)    12104.00 ( 98.91%)
Amean    p99-Write             1569213.22 (  0.00%)    36343.38 ( 97.68%)
Amean    p50-Allocation          85159.82 (  0.00%)    79120.70 (  7.09%)
Amean    p95-Allocation         204222.58 (  0.00%)   129018.43 ( 36.82%)
Amean    p99-Allocation         278070.04 (  0.00%)   183354.43 ( 34.06%)
Amean    final-p50-Read       21266432.00 (  0.00%) 21921792.00 ( -3.08%)
Amean    final-p95-Read       24870912.00 (  0.00%) 26116096.00 ( -5.01%)
Amean    final-p99-Read       28147712.00 (  0.00%) 29523968.00 ( -4.89%)
Amean    final-p50-Write          1130.00 (  0.00%)      977.00 ( 13.54%)
Amean    final-p95-Write       1033216.00 (  0.00%)     2980.00 ( 99.71%)
Amean    final-p99-Write       1517568.00 (  0.00%)    32672.00 ( 97.85%)
Amean    final-p50-Allocation    86656.00 (  0.00%)    78464.00 (  9.45%)
Amean    final-p95-Allocation   211712.00 (  0.00%)   116608.00 ( 44.92%)
Amean    final-p99-Allocation   287232.00 (  0.00%)   168704.00 ( 41.27%)

The latencies are actually completely horrific in comparison to 4.4 (and
4.10-rc5 is worse than 4.9 according to historical data for reasons Mel
hasn't analysed yet).

Still, the 95th-percentile write latency (p95-Write) is cut
dramatically by the series and allocation latency is way down.  Direct
reclaim activity is one fifth of what it was according to vmstat.
Kswapd activity is higher, but this is not necessarily surprising.
Kswapd efficiency is unchanged at 99% (99% of pages scanned were
reclaimed), but direct reclaim efficiency went from 77% to 99%.

In the vanilla kernel, 627MB of data was written back from reclaim
context.  With the series, no data was written back.  With or without
the patch, pages are being immediately reclaimed after writeback
completes.  However, with the patch, only 1/8th of the pages are
reclaimed like this.

This patch (of 5):

We have an elaborate dirty/writeback throttling mechanism inside the
reclaim scanner, but for that to work the pages have to go through
shrink_page_list() and get counted for what they are.  Otherwise, we
mess up the LRU order and don't match reclaim speed to writeback.

Especially during deactivation, there is never a reason to skip dirty
pages; nothing is even trying to write them out from there.  Don't mess
up the LRU order for nothing; shuffle these pages along.
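
As a rough sketch of the change (assuming the shape of the 4.10-era
mm/vmscan.c, not the verbatim diff): when reclaim was not allowed to
write -- laptop mode -- the isolation path used to be switched to
clean-pages-only, and removing that special case is what lets dirty
pages reach shrink_page_list() and get accounted for:

	/*
	 * Rough sketch, not the verbatim diff.  Pre-patch, callers of
	 * isolate_lru_pages() requested clean pages only when reclaim
	 * was not allowed to write (laptop mode) ...
	 */
	if (!sc->may_writepage)
		isolate_mode |= ISOLATE_CLEAN;

	/*
	 * ... which made __isolate_lru_page() pass over dirty and
	 * writeback pages:
	 */
	if ((mode & ISOLATE_CLEAN) &&
	    (PageDirty(page) || PageWriteback(page)))
		return ret;

	/*
	 * Dropping ISOLATE_CLEAN from this path means dirty pages are
	 * isolated like any other inactive page, get counted as dirty
	 * or writeback in shrink_page_list(), and are rotated there
	 * instead of being silently skipped in LRU order.
	 */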

Link: http://lkml.kernel.org/r/20170123181641.23938-2-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Minchan Kim <minchan@kernel.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:54 -08:00
kasan arm64 updates for 4.11: 2017-02-22 10:46:44 -08:00
backing-dev.c mm/backing-dev.c: use rb_entry() 2017-02-22 16:41:30 -08:00
balloon_compaction.c mm: balloon: use general non-lru movable page feature 2016-07-26 16:19:19 -07:00
bootmem.c mm/bootmem.c: cosmetic improvement of code readability 2017-02-22 16:41:29 -08:00
cleancache.c cleancache: constify cleancache_ops structure 2016-01-27 09:09:57 -05:00
cma.c mm/cma: Cleanup highmem check 2017-01-11 13:56:49 +00:00
cma.h mm: cma: mark cma_bitmap_maxno() inline in header 2015-08-14 15:56:32 -07:00
cma_debug.c
compaction.c mm,compaction: serialize waitqueue_active() checks 2017-02-22 16:41:29 -08:00
debug.c mm, debug: print raw struct page data in __dump_page() 2016-12-12 18:55:08 -08:00
debug_page_ref.c mm/page_ref: add tracepoint to track down page reference manipulation 2016-03-17 15:09:34 -07:00
dmapool.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
early_ioremap.c mm/early_ioremap: use offset_in_page macro 2015-11-05 19:34:48 -08:00
fadvise.c mm: fadvise: avoid expensive remote LRU cache draining after FADV_DONTNEED 2016-12-20 09:48:46 -08:00
failslab.c mm: fault-inject take over bootstrap kmem_cache check 2016-03-15 16:55:16 -07:00
filemap.c mm: fix filemap.c kernel-doc warnings 2017-02-22 16:41:29 -08:00
frame_vector.c mm: replace get_vaddr_frames() write/force parameters with gup_flags 2016-10-19 08:11:24 -07:00
frontswap.c mm, frontswap: convert frontswap_enabled to static key 2016-07-26 16:19:19 -07:00
gup.c userfaultfd: hugetlbfs: gup: support VM_FAULT_RETRY 2017-02-22 16:41:28 -08:00
highmem.c mm/highmem: make nr_free_highpages() handles all highmem zones by itself 2016-05-19 19:12:14 -07:00
huge_memory.c mm, thp: add new defer+madvise defrag option 2017-02-22 16:41:30 -08:00
hugetlb.c userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-22 16:41:28 -08:00
hugetlb_cgroup.c mm, hugetlb_cgroup: round limit_in_bytes down to hugepage size 2016-05-20 17:58:30 -07:00
hwpoison-inject.c hwpoison: use page_cgroup_ino for filtering by memcg 2015-09-10 13:29:01 -07:00
init-mm.c mm: Add a user_ns owner to mm_struct and fix ptrace permission checks 2016-11-22 11:49:48 -06:00
internal.h oom-reaper: use madvise_dontneed() logic to decide if unmap the VMA 2017-02-22 16:41:30 -08:00
interval_tree.c
Kconfig mm: THP page cache support for ppc64 2016-12-12 18:55:08 -08:00
Kconfig.debug PM / Hibernate: allow hibernation with PAGE_POISONING_ZERO 2016-09-13 02:35:27 +02:00
khugepaged.c mm: get rid of __GFP_OTHER_NODE 2017-01-10 18:31:55 -08:00
kmemcheck.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
kmemleak-test.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
kmemleak.c kmemleak: fix reference to Documentation 2016-12-12 18:55:07 -08:00
ksm.c mm/ksm: improve deduplication of zero pages with colouring 2017-02-24 17:46:53 -08:00
list_lru.c mm/list_lru.c: avoid error-path NULL pointer deref 2016-10-27 18:43:42 -07:00
maccess.c x86: remove more uaccess_32.h complexity 2016-05-22 17:21:27 -07:00
madvise.c userfaultfd: non-cooperative: add madvise() event for MADV_REMOVE request 2017-02-24 17:46:54 -08:00
Makefile mm/swap: add cache for swap slots allocation 2017-02-22 16:41:30 -08:00
memblock.c memblock: embed memblock type name within struct memblock_type 2017-02-24 17:46:54 -08:00
memcontrol.c slab: use memcg_kmem_cache_wq for slab destruction operations 2017-02-22 16:41:27 -08:00
memory-failure.c mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked 2016-12-25 11:54:48 -08:00
memory.c mm: drop unused argument of zap_page_range() 2017-02-22 16:41:30 -08:00
memory_hotplug.c mm/memory_hotplug.c: unexport __remove_pages() 2017-02-24 17:46:53 -08:00
mempolicy.c mm/mempolicy.c: do not put mempolicy before using its nodemask 2017-01-24 16:26:14 -08:00
mempool.c Revert "mm, mempool: only set __GFP_NOMEMALLOC if there are free elements" 2016-07-28 16:07:41 -07:00
memtest.c memtest: remove unused header files 2015-09-08 15:35:28 -07:00
migrate.c mm: Use owner_priv bit for PageSwapCache, valid when PageSwapBacked 2016-12-25 11:54:48 -08:00
mincore.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
mlock.c thp: fix corner case of munlock() of PTE-mapped THPs 2016-11-30 16:32:52 -08:00
mm_init.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
mmap.c powerpc: do not make the entire heap executable 2017-02-22 16:41:29 -08:00
mmu_context.c mm/mmu_context, sched/core: Fix mmu_context.h assumption 2016-04-28 11:44:19 +02:00
mmu_notifier.c fix Christoph's email addresses 2016-03-17 15:09:34 -07:00
mmzone.c mm/mmzone.c: swap likely to unlikely as code logic is different for next_zones_zonelist() 2017-02-22 16:41:29 -08:00
mprotect.c mm: mprotect: use pmd_trans_unstable instead of taking the pmd_lock 2017-02-22 16:41:29 -08:00
mremap.c userfaultfd: non-cooperative: optimize mremap_userfaultfd_complete() 2017-02-22 16:41:28 -08:00
msync.c mm/msync: use offset_in_page macro 2015-11-05 19:34:48 -08:00
nobootmem.c mm: kmemleak: avoid using __va() on addresses that don't have a lowmem mapping 2016-10-11 15:06:33 -07:00
nommu.c lib/show_mem.c: teach show_mem to work with the given nodemask 2017-02-22 16:41:30 -08:00
oom_kill.c mm, oom: header nodemask is NULL when cpusets are disabled 2017-02-24 17:46:53 -08:00
page-writeback.c block: Use pointer to backing_dev_info from request_queue 2017-02-02 08:20:48 -07:00
page_alloc.c mm, page_alloc: warn_alloc nodemask is NULL when cpusets are disabled 2017-02-22 16:41:30 -08:00
page_counter.c mm: page_counter: let page_counter_try_charge() return bool 2015-11-05 19:34:48 -08:00
page_ext.c mm/page_ext: support extra space allocation by page_ext user 2016-10-07 18:46:27 -07:00
page_idle.c mm, vmscan: move lru_lock to the node 2016-07-28 16:07:41 -07:00
page_io.c writeback: add wbc_to_write_flags() 2016-11-02 10:24:03 -06:00
page_isolation.c mm, page_alloc: avoid page_to_pfn() when merging buddies 2017-02-22 16:41:27 -08:00
page_owner.c mm/page_owner: don't define fields on struct page_ext by hard-coding 2016-10-07 18:46:27 -07:00
page_poison.c mm: check the return value of lookup_page_ext for all call sites 2016-06-03 15:06:22 -07:00
pagewalk.c thp: rename split_huge_page_pmd() to split_huge_pmd() 2016-01-15 17:56:32 -08:00
percpu-km.c mm: percpu: use pr_fmt to prefix output 2016-03-17 15:09:34 -07:00
percpu-vm.c
percpu.c Merge branch 'for-4.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu 2016-12-13 12:34:47 -08:00
pgtable-generic.c mm/thp/migration: switch from flush_tlb_range to flush_pmd_tlb_range 2016-03-17 15:09:34 -07:00
process_vm_access.c mm: unexport __get_user_pages_unlocked() 2016-12-14 16:04:09 -08:00
quicklist.c fix Christoph's email addresses 2016-03-17 15:09:34 -07:00
readahead.c mm: don't cap request size based on read-ahead setting 2016-12-12 18:55:08 -08:00
rmap.c mm, rmap: handle anon_vma_prepare() common case inline 2016-12-12 18:55:08 -08:00
shmem.c userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY 2017-02-22 16:41:29 -08:00
slab.c slab: introduce __kmemcg_cache_deactivate() 2017-02-22 16:41:27 -08:00
slab.h slab: remove synchronous synchronize_sched() from memcg cache deactivation path 2017-02-22 16:41:27 -08:00
slab_common.c slab: use memcg_kmem_cache_wq for slab destruction operations 2017-02-22 16:41:27 -08:00
slob.c slab: introduce __kmemcg_cache_deactivate() 2017-02-22 16:41:27 -08:00
slub.c slub: make sysfs directories for memcg sub-caches optional 2017-02-22 16:41:27 -08:00
sparse-vmemmap.c treewide: replace obsolete _refok by __ref 2016-08-02 17:31:41 -04:00
sparse.c mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next 2017-02-22 16:41:29 -08:00
swap.c mm/swap: split swap cache into 64MB trunks 2017-02-22 16:41:30 -08:00
swap_cgroup.c mm: convert printk(KERN_<LEVEL> to pr_<level> 2016-03-17 15:09:34 -07:00
swap_slots.c mm/swap: skip readahead only when swap slot cache is enabled 2017-02-22 16:41:30 -08:00
swap_state.c mm/swap: skip readahead only when swap slot cache is enabled 2017-02-22 16:41:30 -08:00
swapfile.c mm/swap: enable swap slots cache usage 2017-02-22 16:41:30 -08:00
truncate.c mm: Invalidate DAX radix tree entries only if appropriate 2016-12-26 20:29:24 -08:00
usercopy.c mm/usercopy: Switch to using lm_alias 2017-01-11 13:56:50 +00:00
userfaultfd.c userfaultfd: hugetlbfs: add UFFDIO_COPY support for shared mappings 2017-02-22 16:41:28 -08:00
util.c Replace <asm/uaccess.h> with <linux/uaccess.h> globally 2016-12-24 11:46:01 -08:00
vmacache.c mm: unrig VMA cache hit ratio 2016-10-07 18:46:27 -07:00
vmalloc.c mm, page_alloc: warn_alloc print nodemask 2017-02-22 16:41:30 -08:00
vmpressure.c mm/vmpressure.c: fix subtree pressure detection 2016-02-03 08:28:43 -08:00
vmscan.c mm: vmscan: scan dirty pages even in laptop mode 2017-02-24 17:46:54 -08:00
vmstat.c mm, compaction: add vmstats for kcompactd work 2017-02-22 16:41:29 -08:00
workingset.c mm, vmscan: cleanup lru size claculations 2017-02-22 16:41:30 -08:00
z3fold.c mm/z3fold.c: limit first_num to the actual range of possible buddy indexes 2017-02-22 16:41:31 -08:00
zbud.c mm/zbud.c: use list_last_entry() instead of list_tail_entry() 2016-01-15 11:40:52 -08:00
zpool.c mm: zsmalloc: constify struct zs_pool name 2015-11-06 17:50:42 -08:00
zsmalloc.c mm: fix some typos in mm/zsmalloc.c 2017-02-22 16:41:29 -08:00
zswap.c zswap: disable changing params if init fails 2017-02-03 14:13:19 -08:00