2005-04-17 00:20:36 +02:00
|
|
|
#ifndef _LINUX_RMAP_H
|
|
|
|
#define _LINUX_RMAP_H
|
|
|
|
/*
|
|
|
|
* Declarations for Reverse Mapping functions in mm/rmap.c
|
|
|
|
*/
|
|
|
|
|
|
|
|
#include <linux/list.h>
|
|
|
|
#include <linux/slab.h>
|
|
|
|
#include <linux/mm.h>
|
2011-05-25 02:12:11 +02:00
|
|
|
#include <linux/mutex.h>
|
2008-02-07 09:14:01 +01:00
|
|
|
#include <linux/memcontrol.h>
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The anon_vma heads a list of private "related" vmas, to scan if
|
|
|
|
* an anonymous page pointing to this anon_vma needs to be unmapped:
|
|
|
|
* the vmas on the list will be related by forking, or by splitting.
|
|
|
|
*
|
|
|
|
* Since vmas come and go as they are split and merged (particularly
|
|
|
|
* in mprotect), the mapping field of an anonymous page cannot point
|
|
|
|
* directly to a vma: instead it points to an anon_vma, on whose list
|
|
|
|
* the related vmas can be easily linked or unlinked.
|
|
|
|
*
|
|
|
|
* After unlinking the last vma on the list, we must garbage collect
|
|
|
|
* the anon_vma object itself: we're guaranteed no page can be
|
|
|
|
* pointing to this anon_vma once its vma list is empty.
|
|
|
|
*/
|
|
|
|
struct anon_vma {
|
2010-08-10 02:18:39 +02:00
|
|
|
struct anon_vma *root; /* Root of this anon_vma tree */
|
2011-05-25 02:12:11 +02:00
|
|
|
struct mutex mutex; /* Serialize access to vma list */
|
2010-05-24 23:32:18 +02:00
|
|
|
/*
|
2011-03-23 00:32:48 +01:00
|
|
|
* The refcount is taken on an anon_vma when there is no
|
2010-05-24 23:32:18 +02:00
|
|
|
* guarantee that the vma of page tables will exist for
|
|
|
|
* the duration of the operation. A caller that takes
|
|
|
|
* the reference is responsible for clearing up the
|
|
|
|
* anon_vma if they are the last user on release
|
|
|
|
*/
|
2011-03-23 00:32:48 +01:00
|
|
|
atomic_t refcount;
|
|
|
|
|
2008-07-29 00:46:26 +02:00
|
|
|
/*
|
|
|
|
* NOTE: the LSB of the head.next is set by
|
|
|
|
* mm_take_all_locks() _after_ taking the above lock. So the
|
|
|
|
* head must only be read/written after taking the above lock
|
|
|
|
* to be sure to see a valid next pointer. The LSB bit itself
|
|
|
|
* is serialized by a system wide lock only visible to
|
|
|
|
* mm_take_all_locks() (mm_all_locks_mutex).
|
|
|
|
*/
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 22:42:07 +01:00
|
|
|
struct list_head head; /* Chain of private "related" vmas */
|
|
|
|
};
|
|
|
|
|
|
|
|
/*
|
|
|
|
* The copy-on-write semantics of fork mean that an anon_vma
|
|
|
|
* can become associated with multiple processes. Furthermore,
|
|
|
|
* each child process will have its own anon_vma, where new
|
|
|
|
* pages for that process are instantiated.
|
|
|
|
*
|
|
|
|
* This structure allows us to find the anon_vmas associated
|
|
|
|
* with a VMA, or the VMAs associated with an anon_vma.
|
|
|
|
* The "same_vma" list contains the anon_vma_chains linking
|
|
|
|
* all the anon_vmas associated with this VMA.
|
|
|
|
* The "same_anon_vma" list contains the anon_vma_chains
|
|
|
|
* which link all the VMAs associated with this anon_vma.
|
|
|
|
*/
|
|
|
|
struct anon_vma_chain {
|
|
|
|
struct vm_area_struct *vma;
|
|
|
|
struct anon_vma *anon_vma;
|
|
|
|
struct list_head same_vma; /* locked by mmap_sem & page_table_lock */
|
2011-05-25 02:12:11 +02:00
|
|
|
struct list_head same_anon_vma; /* locked by anon_vma->mutex */
|
2005-04-17 00:20:36 +02:00
|
|
|
};
|
|
|
|
|
|
|
|
#ifdef CONFIG_MMU
|
2010-08-10 02:18:41 +02:00
|
|
|
static inline void get_anon_vma(struct anon_vma *anon_vma)
|
|
|
|
{
|
2011-03-23 00:32:48 +01:00
|
|
|
atomic_inc(&anon_vma->refcount);
|
2010-08-10 02:18:41 +02:00
|
|
|
}
|
|
|
|
|
2011-03-23 00:32:49 +01:00
|
|
|
void __put_anon_vma(struct anon_vma *anon_vma);
|
|
|
|
|
|
|
|
static inline void put_anon_vma(struct anon_vma *anon_vma)
|
|
|
|
{
|
|
|
|
if (atomic_dec_and_test(&anon_vma->refcount))
|
|
|
|
__put_anon_vma(anon_vma);
|
|
|
|
}
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2009-12-15 02:58:57 +01:00
|
|
|
static inline struct anon_vma *page_anon_vma(struct page *page)
|
|
|
|
{
|
|
|
|
if (((unsigned long)page->mapping & PAGE_MAPPING_FLAGS) !=
|
|
|
|
PAGE_MAPPING_ANON)
|
|
|
|
return NULL;
|
|
|
|
return page_rmapping(page);
|
|
|
|
}
|
|
|
|
|
2010-08-10 02:18:37 +02:00
|
|
|
static inline void vma_lock_anon_vma(struct vm_area_struct *vma)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
struct anon_vma *anon_vma = vma->anon_vma;
|
|
|
|
if (anon_vma)
|
2011-05-25 02:12:11 +02:00
|
|
|
mutex_lock(&anon_vma->root->mutex);
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2010-08-10 02:18:37 +02:00
|
|
|
static inline void vma_unlock_anon_vma(struct vm_area_struct *vma)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
struct anon_vma *anon_vma = vma->anon_vma;
|
|
|
|
if (anon_vma)
|
2011-05-25 02:12:11 +02:00
|
|
|
mutex_unlock(&anon_vma->root->mutex);
|
2005-04-17 00:20:36 +02:00
|
|
|
}
|
|
|
|
|
2010-08-10 02:18:38 +02:00
|
|
|
static inline void anon_vma_lock(struct anon_vma *anon_vma)
|
|
|
|
{
|
2011-05-25 02:12:11 +02:00
|
|
|
mutex_lock(&anon_vma->root->mutex);
|
2010-08-10 02:18:38 +02:00
|
|
|
}
|
|
|
|
|
|
|
|
static inline void anon_vma_unlock(struct anon_vma *anon_vma)
|
|
|
|
{
|
2011-05-25 02:12:11 +02:00
|
|
|
mutex_unlock(&anon_vma->root->mutex);
|
2010-08-10 02:18:38 +02:00
|
|
|
}
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
/*
|
|
|
|
* anon_vma helper functions.
|
|
|
|
*/
|
|
|
|
void anon_vma_init(void); /* create anon_vma_cachep */
|
|
|
|
int anon_vma_prepare(struct vm_area_struct *);
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 22:42:07 +01:00
|
|
|
void unlink_anon_vmas(struct vm_area_struct *);
|
|
|
|
int anon_vma_clone(struct vm_area_struct *, struct vm_area_struct *);
|
mremap: enforce rmap src/dst vma ordering in case of vma_merge() succeeding in copy_vma()
migrate was doing an rmap_walk with speculative lock-less access on
pagetables. That could lead it to not serializing properly against mremap
PT locks. But a second problem remains in the order of vmas in the
same_anon_vma list used by the rmap_walk.
If vma_merge succeeds in copy_vma, the src vma could be placed after the
dst vma in the same_anon_vma list. That could still lead to migrate
missing some pte.
This patch adds an anon_vma_moveto_tail() function to force the dst vma at
the end of the list before mremap starts to solve the problem.
If the mremap is very large and there are a lots of parents or childs
sharing the anon_vma root lock, this should still scale better than taking
the anon_vma root lock around every pte copy practically for the whole
duration of mremap.
Update: Hugh noticed special care is needed in the error path where
move_page_tables goes in the reverse direction, a second
anon_vma_moveto_tail() call is needed in the error path.
This program exercises the anon_vma_moveto_tail:
===
int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
memset(p, 0xff, SIZE);
printf("%p\n", p);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
if (memcmp(p, p2, SIZE))
printf("error\n");
p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
if (p4 != p3)
perror("mremap"), exit(1);
p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
if (p4 != p+SIZE/2)
perror("mremap"), exit(1);
if (memcmp(p, p2, SIZE))
printf("error\n");
printf("ok\n");
return 0;
}
===
$ perf probe -a anon_vma_moveto_tail
Add new event:
probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)
You can now use it on all perf tools, such as:
perf record -e probe:anon_vma_moveto_tail -aR sleep 1
$ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
0x7f2ca2800000
ok
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
$ perf report --stdio
100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reported-by: Nai Xia <nai.xia@gmail.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Pawel Sikora <pluto@agmk.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-01-11 00:08:05 +01:00
|
|
|
void anon_vma_moveto_tail(struct vm_area_struct *);
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 22:42:07 +01:00
|
|
|
int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
mm: change anon_vma linking to fix multi-process server scalability issue
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.
In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.
This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor 10 less process intensive
than AIM7, but they are catching up.
This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parents' anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.
The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.
A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.
Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.
With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.
[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Larry Woodman <lwoodman@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-05 22:42:07 +01:00
|
|
|
static inline void anon_vma_merge(struct vm_area_struct *vma,
|
|
|
|
struct vm_area_struct *next)
|
|
|
|
{
|
|
|
|
VM_BUG_ON(vma->anon_vma != next->anon_vma);
|
|
|
|
unlink_anon_vmas(next);
|
|
|
|
}
|
|
|
|
|
2011-03-23 00:32:49 +01:00
|
|
|
struct anon_vma *page_get_anon_vma(struct page *page);
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
/*
|
|
|
|
* rmap interfaces called when adding or removing pte of page
|
|
|
|
*/
|
2010-03-05 22:42:09 +01:00
|
|
|
void page_move_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
|
2005-04-17 00:20:36 +02:00
|
|
|
void page_add_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
|
2010-08-10 02:19:48 +02:00
|
|
|
void do_page_add_anon_rmap(struct page *, struct vm_area_struct *,
|
|
|
|
unsigned long, int);
|
2006-01-06 09:11:12 +01:00
|
|
|
void page_add_new_anon_rmap(struct page *, struct vm_area_struct *, unsigned long);
|
2005-04-17 00:20:36 +02:00
|
|
|
void page_add_file_rmap(struct page *);
|
2009-01-06 23:40:11 +01:00
|
|
|
void page_remove_rmap(struct page *);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2010-05-28 02:29:16 +02:00
|
|
|
void hugepage_add_anon_rmap(struct page *, struct vm_area_struct *,
|
|
|
|
unsigned long);
|
|
|
|
void hugepage_add_new_anon_rmap(struct page *, struct vm_area_struct *,
|
|
|
|
unsigned long);
|
|
|
|
|
2009-09-22 02:01:59 +02:00
|
|
|
static inline void page_dup_rmap(struct page *page)
|
2005-04-17 00:20:36 +02:00
|
|
|
{
|
|
|
|
atomic_inc(&page->_mapcount);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called from mm/vmscan.c to handle paging out
|
|
|
|
*/
|
2009-06-17 00:33:05 +02:00
|
|
|
int page_referenced(struct page *, int is_locked,
|
2012-01-13 02:18:32 +01:00
|
|
|
struct mem_cgroup *memcg, unsigned long *vm_flags);
|
ksm: let shared pages be swappable
Initial implementation for swapping out KSM's shared pages: add
page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
faced with a PageKsm page.
Most of what's needed can be got from the rmap_items listed from the
stable_node of the ksm page, without discovering the actual vma: so in
this patch just fake up a struct vma for page_referenced_one() or
try_to_unmap_one(), then refine that in the next patch.
Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
implicit there (being only set with VM_SHARED, already excluded), but
let's make it explicit, to help justify the lack of nonlinear unmap.
Rely on the page lock to protect against concurrent modifications to that
page's node of the stable tree.
The awkward part is not swapout but swapin: do_swap_page() and
page_add_anon_rmap() now have to allow for new possibilities - perhaps a
ksm page still in swapcache, perhaps a swapcache page associated with one
location in one anon_vma now needed for another location or anon_vma.
(And the vma might even be no longer VM_MERGEABLE when that happens.)
ksm_might_need_to_copy() checks for that case, and supplies a duplicate
page when necessary, simply leaving it to a subsequent pass of ksmd to
rediscover the identity and merge them back into one ksm page.
Disappointingly primitive: but the alternative would have to accumulate
unswappable info about the swapped out ksm pages, limiting swappability.
Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
particular case it was handling, so just use it instead.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 02:59:24 +01:00
|
|
|
int page_referenced_one(struct page *, struct vm_area_struct *,
|
|
|
|
unsigned long address, unsigned int *mapcount, unsigned long *vm_flags);
|
|
|
|
|
2009-09-16 11:50:10 +02:00
|
|
|
enum ttu_flags {
|
|
|
|
TTU_UNMAP = 0, /* unmap mode */
|
|
|
|
TTU_MIGRATION = 1, /* migration mode */
|
|
|
|
TTU_MUNLOCK = 2, /* munlock mode */
|
|
|
|
TTU_ACTION_MASK = 0xff,
|
|
|
|
|
|
|
|
TTU_IGNORE_MLOCK = (1 << 8), /* ignore mlock */
|
|
|
|
TTU_IGNORE_ACCESS = (1 << 9), /* don't age */
|
2009-09-16 11:50:11 +02:00
|
|
|
TTU_IGNORE_HWPOISON = (1 << 10),/* corrupted page is recoverable */
|
2009-09-16 11:50:10 +02:00
|
|
|
};
|
|
|
|
#define TTU_ACTION(x) ((x) & TTU_ACTION_MASK)
|
|
|
|
|
thp: transparent hugepage core
Lately I've been working to make KVM use hugepages transparently without
the usual restrictions of hugetlbfs. Some of the restrictions I'd like to
see removed:
1) hugepages have to be swappable or the guest physical memory remains
locked in RAM and can't be paged out to swap
2) if a hugepage allocation fails, regular pages should be allocated
instead and mixed in the same vma without any failure and without
userland noticing
3) if some task quits and more hugepages become available in the
buddy, guest physical memory backed by regular pages should be
relocated on hugepages automatically in regions under
madvise(MADV_HUGEPAGE) (ideally event driven by waking up the
kernel deamon if the order=HPAGE_PMD_SHIFT-PAGE_SHIFT list becomes
not null)
4) avoidance of reservation and maximization of use of hugepages whenever
possible. Reservation (needed to avoid runtime fatal faliures) may be ok for
1 machine with 1 database with 1 database cache with 1 database cache size
known at boot time. It's definitely not feasible with a virtualization
hypervisor usage like RHEV-H that runs an unknown number of virtual machines
with an unknown size of each virtual machine with an unknown amount of
pagecache that could be potentially useful in the host for guest not using
O_DIRECT (aka cache=off).
hugepages in the virtualization hypervisor (and also in the guest!) are
much more important than in a regular host not using virtualization,
becasue with NPT/EPT they decrease the tlb-miss cacheline accesses from 24
to 19 in case only the hypervisor uses transparent hugepages, and they
decrease the tlb-miss cacheline accesses from 19 to 15 in case both the
linux hypervisor and the linux guest both uses this patch (though the
guest will limit the addition speedup to anonymous regions only for
now...). Even more important is that the tlb miss handler is much slower
on a NPT/EPT guest than for a regular shadow paging or no-virtualization
scenario. So maximizing the amount of virtual memory cached by the TLB
pays off significantly more with NPT/EPT than without (even if there would
be no significant speedup in the tlb-miss runtime).
The first (and more tedious) part of this work requires allowing the VM to
handle anonymous hugepages mixed with regular pages transparently on
regular anonymous vmas. This is what this patch tries to achieve in the
least intrusive possible way. We want hugepages and hugetlb to be used in
a way so that all applications can benefit without changes (as usual we
leverage the KVM virtualization design: by improving the Linux VM at
large, KVM gets the performance boost too).
The most important design choice is: always fallback to 4k allocation if
the hugepage allocation fails! This is the _very_ opposite of some large
pagecache patches that failed with -EIO back then if a 64k (or similar)
allocation failed...
Second important decision (to reduce the impact of the feature on the
existing pagetable handling code) is that at any time we can split an
hugepage into 512 regular pages and it has to be done with an operation
that can't fail. This way the reliability of the swapping isn't decreased
(no need to allocate memory when we are short on memory to swap) and it's
trivial to plug a split_huge_page* one-liner where needed without
polluting the VM. Over time we can teach mprotect, mremap and friends to
handle pmd_trans_huge natively without calling split_huge_page*. The fact
it can't fail isn't just for swap: if split_huge_page would return -ENOMEM
(instead of the current void) we'd need to rollback the mprotect from the
middle of it (ideally including undoing the split_vma) which would be a
big change and in the very wrong direction (it'd likely be simpler not to
call split_huge_page at all and to teach mprotect and friends to handle
hugepages instead of rolling them back from the middle). In short the
very value of split_huge_page is that it can't fail.
The collapsing and madvise(MADV_HUGEPAGE) part will remain separated and
incremental and it'll just be an "harmless" addition later if this initial
part is agreed upon. It also should be noted that locking-wise replacing
regular pages with hugepages is going to be very easy if compared to what
I'm doing below in split_huge_page, as it will only happen when
page_count(page) matches page_mapcount(page) if we can take the PG_lock
and mmap_sem in write mode. collapse_huge_page will be a "best effort"
that (unlike split_huge_page) can fail at the minimal sign of trouble and
we can try again later. collapse_huge_page will be similar to how KSM
works and the madvise(MADV_HUGEPAGE) will work similar to
madvise(MADV_MERGEABLE).
The default I like is that transparent hugepages are used at page fault
time. This can be changed with
/sys/kernel/mm/transparent_hugepage/enabled. The control knob can be set
to three values "always", "madvise", "never" which mean respectively that
hugepages are always used, or only inside madvise(MADV_HUGEPAGE) regions,
or never used. /sys/kernel/mm/transparent_hugepage/defrag instead
controls if the hugepage allocation should defrag memory aggressively
"always", only inside "madvise" regions, or "never".
The pmd_trans_splitting/pmd_trans_huge locking is very solid. The
put_page (from get_user_page users that can't use mmu notifier like
O_DIRECT) that runs against a __split_huge_page_refcount instead was a
pain to serialize in a way that would result always in a coherent page
count for both tail and head. I think my locking solution with a
compound_lock taken only after the page_first is valid and is still a
PageHead should be safe but it surely needs review from SMP race point of
view. In short there is no current existing way to serialize the O_DIRECT
final put_page against split_huge_page_refcount so I had to invent a new
one (O_DIRECT loses knowledge on the mapping status by the time gup_fast
returns so...). And I didn't want to impact all gup/gup_fast users for
now, maybe if we change the gup interface substantially we can avoid this
locking, I admit I didn't think too much about it because changing the gup
unpinning interface would be invasive.
If we ignored O_DIRECT we could stick to the existing compound refcounting
code, by simply adding a get_user_pages_fast_flags(foll_flags) where KVM
(and any other mmu notifier user) would call it without FOLL_GET (and if
FOLL_GET isn't set we'd just BUG_ON if nobody registered itself in the
current task mmu notifier list yet). But O_DIRECT is fundamental for
decent performance of virtualized I/O on fast storage so we can't avoid it
to solve the race of put_page against split_huge_page_refcount to achieve
a complete hugepage feature for KVM.
Swap and oom works fine (well just like with regular pages ;). MMU
notifier is handled transparently too, with the exception of the young bit
on the pmd, that didn't have a range check but I think KVM will be fine
because the whole point of hugepages is that EPT/NPT will also use a huge
pmd when they notice gup returns pages with PageCompound set, so they
won't care of a range and there's just the pmd young bit to check in that
case.
NOTE: in some cases if the L2 cache is small, this may slowdown and waste
memory during COWs because 4M of memory are accessed in a single fault
instead of 8k (the payoff is that after COW the program can run faster).
So we might want to switch the copy_huge_page (and clear_huge_page too) to
not temporal stores. I also extensively researched ways to avoid this
cache trashing with a full prefault logic that would cow in 8k/16k/32k/64k
up to 1M (I can send those patches that fully implemented prefault) but I
concluded they're not worth it and they add an huge additional complexity
and they remove all tlb benefits until the full hugepage has been faulted
in, to save a little bit of memory and some cache during app startup, but
they still don't improve substantially the cache-trashing during startup
if the prefault happens in >4k chunks. One reason is that those 4k pte
entries copied are still mapped on a perfectly cache-colored hugepage, so
the trashing is the worst one can generate in those copies (cow of 4k page
copies aren't so well colored so they trashes less, but again this results
in software running faster after the page fault). Those prefault patches
allowed things like a pte where post-cow pages were local 4k regular anon
pages and the not-yet-cowed pte entries were pointing in the middle of
some hugepage mapped read-only. If it doesn't payoff substantially with
todays hardware it will payoff even less in the future with larger l2
caches, and the prefault logic would blot the VM a lot. If one is
emebdded transparent_hugepage can be disabled during boot with sysfs or
with the boot commandline parameter transparent_hugepage=0 (or
transparent_hugepage=2 to restrict hugepages inside madvise regions) that
will ensure not a single hugepage is allocated at boot time. It is simple
enough to just disable transparent hugepage globally and let transparent
hugepages be allocated selectively by applications in the MADV_HUGEPAGE
region (both at page fault time, and if enabled with the
collapse_huge_page too through the kernel daemon).
This patch supports only hugepages mapped in the pmd, archs that have
smaller hugepages will not fit in this patch alone. Also some archs like
power have certain tlb limits that prevents mixing different page size in
the same regions so they will not fit in this framework that requires
"graceful fallback" to basic PAGE_SIZE in case of physical memory
fragmentation. hugetlbfs remains a perfect fit for those because its
software limits happen to match the hardware limits. hugetlbfs also
remains a perfect fit for hugepage sizes like 1GByte that cannot be hoped
to be found not fragmented after a certain system uptime and that would be
very expensive to defragment with relocation, so requiring reservation.
hugetlbfs is the "reservation way", the point of transparent hugepages is
not to have any reservation at all and maximizing the use of cache and
hugepages at all times automatically.
Some performance result:
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largep
ages3
memset page fault 1566023
memset tlb miss 453854
memset second tlb miss 453321
random access tlb miss 41635
random access second tlb miss 41658
vmx andrea # LD_PRELOAD=/usr/lib64/libhugetlbfs.so HUGETLB_MORECORE=yes HUGETLB_PATH=/mnt/huge/ ./largepages3
memset page fault 1566471
memset tlb miss 453375
memset second tlb miss 453320
random access tlb miss 41636
random access second tlb miss 41637
vmx andrea # ./largepages3
memset page fault 1566642
memset tlb miss 453417
memset second tlb miss 453313
random access tlb miss 41630
random access second tlb miss 41647
vmx andrea # ./largepages3
memset page fault 1566872
memset tlb miss 453418
memset second tlb miss 453315
random access tlb miss 41618
random access second tlb miss 41659
vmx andrea # echo 0 > /proc/sys/vm/transparent_hugepage
vmx andrea # ./largepages3
memset page fault 2182476
memset tlb miss 460305
memset second tlb miss 460179
random access tlb miss 44483
random access second tlb miss 44186
vmx andrea # ./largepages3
memset page fault 2182791
memset tlb miss 460742
memset second tlb miss 459962
random access tlb miss 43981
random access second tlb miss 43988
============
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#define SIZE (3UL*1024*1024*1024)
int main()
{
char *p = malloc(SIZE), *p2;
struct timeval before, after;
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset page fault %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
memset(p, 0, SIZE);
gettimeofday(&after, NULL);
printf("memset second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
gettimeofday(&before, NULL);
for (p2 = p; p2 < p+SIZE; p2 += 4096)
*p2 = 0;
gettimeofday(&after, NULL);
printf("random access second tlb miss %Lu\n",
(after.tv_sec-before.tv_sec)*1000000UL +
after.tv_usec-before.tv_usec);
return 0;
}
============
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14 00:46:52 +01:00
|
|
|
bool is_vma_temporary_stack(struct vm_area_struct *vma);
|
|
|
|
|
2009-09-16 11:50:10 +02:00
|
|
|
int try_to_unmap(struct page *, enum ttu_flags flags);
|
ksm: let shared pages be swappable
Initial implementation for swapping out KSM's shared pages: add
page_referenced_ksm() and try_to_unmap_ksm(), which rmap.c calls when
faced with a PageKsm page.
Most of what's needed can be got from the rmap_items listed from the
stable_node of the ksm page, without discovering the actual vma: so in
this patch just fake up a struct vma for page_referenced_one() or
try_to_unmap_one(), then refine that in the next patch.
Add VM_NONLINEAR to ksm_madvise()'s list of exclusions: it has always been
implicit there (being only set with VM_SHARED, already excluded), but
let's make it explicit, to help justify the lack of nonlinear unmap.
Rely on the page lock to protect against concurrent modifications to that
page's node of the stable tree.
The awkward part is not swapout but swapin: do_swap_page() and
page_add_anon_rmap() now have to allow for new possibilities - perhaps a
ksm page still in swapcache, perhaps a swapcache page associated with one
location in one anon_vma now needed for another location or anon_vma.
(And the vma might even be no longer VM_MERGEABLE when that happens.)
ksm_might_need_to_copy() checks for that case, and supplies a duplicate
page when necessary, simply leaving it to a subsequent pass of ksmd to
rediscover the identity and merge them back into one ksm page.
Disappointingly primitive: but the alternative would have to accumulate
unswappable info about the swapped out ksm pages, limiting swappability.
Remove page_add_ksm_rmap(): page_add_anon_rmap() now has to allow for the
particular case it was handling, so just use it instead.
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 02:59:24 +01:00
|
|
|
int try_to_unmap_one(struct page *, struct vm_area_struct *,
|
|
|
|
unsigned long address, enum ttu_flags flags);
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2005-06-24 07:05:25 +02:00
|
|
|
/*
|
|
|
|
* Called from mm/filemap_xip.c to unmap empty zero page
|
|
|
|
*/
|
2010-10-26 23:22:01 +02:00
|
|
|
pte_t *__page_check_address(struct page *, struct mm_struct *,
|
2008-08-20 23:09:18 +02:00
|
|
|
unsigned long, spinlock_t **, int);
|
2005-06-24 07:05:25 +02:00
|
|
|
|
2010-10-26 23:22:01 +02:00
|
|
|
static inline pte_t *page_check_address(struct page *page, struct mm_struct *mm,
|
|
|
|
unsigned long address,
|
|
|
|
spinlock_t **ptlp, int sync)
|
|
|
|
{
|
|
|
|
pte_t *ptep;
|
|
|
|
|
|
|
|
__cond_lock(*ptlp, ptep = __page_check_address(page, mm, address,
|
|
|
|
ptlp, sync));
|
|
|
|
return ptep;
|
|
|
|
}
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
/*
|
|
|
|
* Used by swapoff to help locate where page is expected in vma.
|
|
|
|
*/
|
|
|
|
unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
|
|
|
|
|
2006-09-26 08:30:57 +02:00
|
|
|
/*
|
|
|
|
* Cleans the PTEs of shared mappings.
|
|
|
|
* (and since clean PTEs should also be readonly, write protects them too)
|
|
|
|
*
|
|
|
|
* returns the number of cleaned PTEs.
|
|
|
|
*/
|
|
|
|
int page_mkclean(struct page *);
|
|
|
|
|
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 05:26:44 +02:00
|
|
|
/*
|
|
|
|
* called in munlock()/munmap() path to check for other vmas holding
|
|
|
|
* the page mlocked.
|
|
|
|
*/
|
|
|
|
int try_to_munlock(struct page *);
|
|
|
|
|
2009-09-16 11:50:04 +02:00
|
|
|
/*
|
|
|
|
* Called by memory-failure.c to kill processes.
|
|
|
|
*/
|
2011-05-25 02:12:07 +02:00
|
|
|
struct anon_vma *page_lock_anon_vma(struct page *page);
|
2009-09-16 11:50:04 +02:00
|
|
|
void page_unlock_anon_vma(struct anon_vma *anon_vma);
|
2009-09-16 11:50:15 +02:00
|
|
|
int page_mapped_in_vma(struct page *page, struct vm_area_struct *vma);
|
2009-09-16 11:50:04 +02:00
|
|
|
|
ksm: rmap_walk to remove_migation_ptes
A side-effect of making ksm pages swappable is that they have to be placed
on the LRUs: which then exposes them to isolate_lru_page() and hence to
page migration.
Add rmap_walk() for remove_migration_ptes() to use: rmap_walk_anon() and
rmap_walk_file() in rmap.c, but rmap_walk_ksm() in ksm.c. Perhaps some
consolidation with existing code is possible, but don't attempt that yet
(try_to_unmap needs to handle nonlinears, but migration pte removal does
not).
rmap_walk() is sadly less general than it appears: rmap_walk_anon(), like
remove_anon_migration_ptes() which it replaces, avoids calling
page_lock_anon_vma(), because that includes a page_mapped() test which
fails when all migration ptes are in place. That was valid when NUMA page
migration was introduced (holding mmap_sem provided the missing guarantee
that anon_vma's slab had not already been destroyed), but I believe not
valid in the memory hotremove case added since.
For now do the same as before, and consider the best way to fix that
unlikely race later on. When fixed, we can probably use rmap_walk() on
hwpoisoned ksm pages too: for now, they remain among hwpoison's various
exceptions (its PageKsm test comes before the page is locked, but its
page_lock_anon_vma fails safely if an anon gets upgraded).
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Izik Eidus <ieidus@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Chris Wright <chrisw@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-12-15 02:59:31 +01:00
|
|
|
/*
|
|
|
|
* Called by migrate.c to remove migration ptes, but might be used more later.
|
|
|
|
*/
|
|
|
|
int rmap_walk(struct page *page, int (*rmap_one)(struct page *,
|
|
|
|
struct vm_area_struct *, unsigned long, void *), void *arg);
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
#else /* !CONFIG_MMU */
|
|
|
|
|
|
|
|
#define anon_vma_init() do {} while (0)
|
|
|
|
#define anon_vma_prepare(vma) (0)
|
|
|
|
#define anon_vma_link(vma) do {} while (0)
|
|
|
|
|
2009-06-23 21:37:01 +02:00
|
|
|
static inline int page_referenced(struct page *page, int is_locked,
|
2012-01-13 02:18:32 +01:00
|
|
|
struct mem_cgroup *memcg,
|
2009-06-23 21:37:01 +02:00
|
|
|
unsigned long *vm_flags)
|
|
|
|
{
|
|
|
|
*vm_flags = 0;
|
2010-03-05 22:42:22 +01:00
|
|
|
return 0;
|
2009-06-23 21:37:01 +02:00
|
|
|
}
|
|
|
|
|
2006-02-01 12:05:38 +01:00
|
|
|
#define try_to_unmap(page, refs) SWAP_FAIL
|
2005-04-17 00:20:36 +02:00
|
|
|
|
2006-09-26 08:30:57 +02:00
|
|
|
static inline int page_mkclean(struct page *page)
|
|
|
|
{
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
|
2005-04-17 00:20:36 +02:00
|
|
|
#endif /* CONFIG_MMU */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return values of try_to_unmap
|
|
|
|
*/
|
|
|
|
#define SWAP_SUCCESS 0
|
|
|
|
#define SWAP_AGAIN 1
|
|
|
|
#define SWAP_FAIL 2
|
mlock: mlocked pages are unevictable
Make sure that mlocked pages also live on the unevictable LRU, so kswapd
will not scan them over and over again.
This is achieved through various strategies:
1) add yet another page flag--PG_mlocked--to indicate that
the page is locked for efficient testing in vmscan and,
optionally, fault path. This allows early culling of
unevictable pages, preventing them from getting to
page_referenced()/try_to_unmap(). Also allows separate
accounting of mlock'd pages, as Nick's original patch
did.
Note: Nick's original mlock patch used a PG_mlocked
flag. I had removed this in favor of the PG_unevictable
flag + an mlock_count [new page struct member]. I
restored the PG_mlocked flag to eliminate the new
count field.
2) add the mlock/unevictable infrastructure to mm/mlock.c,
with internal APIs in mm/internal.h. This is a rework
of Nick's original patch to these files, taking into
account that mlocked pages are now kept on unevictable
LRU list.
3) update vmscan.c:page_evictable() to check PageMlocked()
and, if vma passed in, the vm_flags. Note that the vma
will only be passed in for new pages in the fault path;
and then only if the "cull unevictable pages in fault
path" patch is included.
4) add try_to_unlock() to rmap.c to walk a page's rmap and
ClearPageMlocked() if no other vmas have it mlocked.
Reuses as much of try_to_unmap() as possible. This
effectively replaces the use of one of the lru list links
as an mlock count. If this mechanism let's pages in mlocked
vmas leak through w/o PG_mlocked set [I don't know that it
does], we should catch them later in try_to_unmap(). One
hopes this will be rare, as it will be relatively expensive.
Original mm/internal.h, mm/rmap.c and mm/mlock.c changes:
Signed-off-by: Nick Piggin <npiggin@suse.de>
splitlru: introduce __get_user_pages():
New munlock processing need to GUP_FLAGS_IGNORE_VMA_PERMISSIONS.
because current get_user_pages() can't grab PROT_NONE pages theresore it
cause PROT_NONE pages can't munlock.
[akpm@linux-foundation.org: fix this for pagemap-pass-mm-into-pagewalkers.patch]
[akpm@linux-foundation.org: untangle patch interdependencies]
[akpm@linux-foundation.org: fix things after out-of-order merging]
[hugh@veritas.com: fix page-flags mess]
[lee.schermerhorn@hp.com: fix munlock page table walk - now requires 'mm']
[kosaki.motohiro@jp.fujitsu.com: build fix]
[kosaki.motohiro@jp.fujitsu.com: fix truncate race and sevaral comments]
[kosaki.motohiro@jp.fujitsu.com: splitlru: introduce __get_user_pages()]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-10-19 05:26:44 +02:00
|
|
|
#define SWAP_MLOCK 3
|
2005-04-17 00:20:36 +02:00
|
|
|
|
|
|
|
#endif /* _LINUX_RMAP_H */
|