Commit graph

2835 commits

Author SHA1 Message Date
Ingo Molnar
0edcf8d692 Merge branch 'tj-percpu' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc into core/percpu
Conflicts:
	arch/x86/include/asm/pgtable.h
2009-02-24 21:52:45 +01:00
Tejun Heo
40150d37be percpu: add __read_mostly to variables which are mostly read only
Most global variables in percpu allocator are initialized during boot
and read only from that point on.  Add __read_mostly as per Rusty's
suggestion.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
2009-02-24 14:24:30 +09:00
Tejun Heo
8d408b4be3 percpu: give more latitude to arch specific first chunk initialization
Impact: more latitude for first percpu chunk allocation

The first percpu chunk serves the kernel static percpu area and may or
may not contain extra room for further dynamic allocation.
Initialization of the first chunk needs to be done before normal
memory allocation service is up, so it has its own init path -
pcpu_setup_static().

It seems archs need more latitude while initializing the first chunk
for example to take advantage of large page mapping.  This patch makes
the following changes to allow this.

* Define PERCPU_DYNAMIC_RESERVE to give arch hint about how much space
  to reserve in the first chunk for further dynamic allocation.

* Rename pcpu_setup_static() to pcpu_setup_first_chunk().

* Make pcpu_setup_first_chunk() much more flexible by fetching page
  pointer by callback and adding optional @unit_size, @free_size and
  @base_addr arguments which allow archs to selectively part of chunk
  initialization to their likings.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24 11:57:21 +09:00
Tejun Heo
d9b55eeb1d percpu: remove unit_size power-of-2 restriction
Impact: allow unit_size to be arbitrary multiple of PAGE_SIZE

In dynamic percpu allocator, there is no reason the unit size should
be power of two.  Remove the restriction.

As non-power-of-two unit size means that empty chunks fall into the
same slot index as lightly occupied chunks which is bad for reclaming.
Reserve an extra slot for empty chunks.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24 11:57:21 +09:00
Tejun Heo
c0c0a29379 vmalloc: add @align to vm_area_register_early()
Impact: allow larger alignment for early vmalloc area allocation

Some early vmalloc users might want larger alignment, for example, for
custom large page mapping.  Add @align to vm_area_register_early().
While at it, drop docbook comment on non-existent @size.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
2009-02-24 11:57:21 +09:00
Tejun Heo
c132937556 bootmem: clean up arch-specific bootmem wrapping
Impact: cleaner and consistent bootmem wrapping

By setting CONFIG_HAVE_ARCH_BOOTMEM_NODE, archs can define
arch-specific wrappers for bootmem allocation.  However, this is done
a bit strangely in that only the high level convenience macros can be
changed while lower level, but still exported, interface functions
can't be wrapped.  This not only is messy but also leads to strange
situation where alloc_bootmem() does what the arch wants it to do but
the equivalent __alloc_bootmem() call doesn't although they should be
able to be used interchangeably.

This patch updates bootmem such that archs can override / wrap the
backend function - alloc_bootmem_core() instead of the highlevel
interface functions to allow simpler and consistent wrapping.  Also,
HAVE_ARCH_BOOTMEM_NODE is renamed to HAVE_ARCH_BOOTMEM.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Johannes Weiner <hannes@saeurebad.de>
2009-02-24 11:57:20 +09:00
Tejun Heo
cb83b42e23 percpu: fix pcpu_chunk_struct_size
Impact: fix short allocation leading to memory corruption

While dropping rvalue wrapping macros around global parameters,
pcpu_chunk_struct_size was set incorrectly resulting in shorter page
pointer array.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-24 11:57:20 +09:00
Linus Torvalds
adfafefd10 Merge branch 'hibernate'
* hibernate:
  PM: Fix suspend_console and resume_console to use only one semaphore
  PM: Wait for console in resume
  PM: Fix pm_notifiers during user mode hibernation
  swsusp: clean up shrink_all_zones()
  swsusp: dont fiddle with swappiness
  PM: fix build for CONFIG_PM unset
  PM/hibernate: fix "swap breaks after hibernation failures"
  PM/resume: wait for device probing to finish
  Consolidate driver_probe_done() loops into one place
2009-02-21 14:17:26 -08:00
Johannes Weiner
0cb57258fe swsusp: clean up shrink_all_zones()
Move local variables to innermost possible scopes and use local
variables to cache calculations/reads done more than once.

No change in functionality (intended).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Acked-by: Pavel Machek <pavel@ucw.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Johannes Weiner
3049103ddf swsusp: dont fiddle with swappiness
sc.swappiness is not used in the swsusp memory shrinking path, do not
set it.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Alan Jenkins
a1bb7d6123 PM/hibernate: fix "swap breaks after hibernation failures"
http://bugzilla.kernel.org/show_bug.cgi?id=12239

The image writing code dropped a reference to the current swap device.
This doesn't show up if the hibernation succeeds - because it doesn't
affect the image which gets resumed.  But it means multiple _failed_
hibernations end up freeing the swap device while it is still use!

swsusp_write() finds the block device for the swap file using swap_type_of().
It then uses blkdev_get() / blkdev_put() to open and close the block device.

Unfortunately, blkdev_get() assumes ownership of the inode of the block_device
passed to it.  So blkdev_put() calls iput() on the inode.  This is by design
and other callers expect this behaviour.  The fix is for swap_type_of() to take
a reference on the inode using bdget().

Signed-off-by: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Len Brown <lenb@kernel.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-21 14:17:17 -08:00
Tejun Heo
cae3aeb83f percpu: clean up size usage
Andrew was concerned about the unit of variables named or have suffix
size.  Every usage in percpu allocator is in bytes but make it super
clear by adding comments.

While at it, make pcpu_depopulate_chunk() take int @off and @size like
everyone else.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-21 16:56:23 +09:00
Tejun Heo
f6fcba7014 vmalloc: call flush_cache_vunmap() from unmap_kernel_range()
Impact: proper vcache flush on unmap_kernel_range()

flush_cache_vunmap() should be called before pages are unmapped.  Add
a call to it in unmap_kernel_range().

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-20 17:57:49 -08:00
Johannes Weiner
3ef0e5ba46 slab: introduce kzfree()
kzfree() is a wrapper for kfree() that additionally zeroes the underlying
memory before releasing it to the slab allocator.

Currently there is code which memset()s the memory region of an object
before releasing it back to the slab allocator to make sure
security-sensitive data are really zeroed out after use.

These callsites can then just use kzfree() which saves some code, makes
users greppable and allows for a stupid destructor that isn't necessarily
aware of the actual object size.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-20 17:57:48 -08:00
Tejun Heo
fbf59bc9d7 percpu: implement new dynamic percpu allocator
Impact: new scalable dynamic percpu allocator which allows dynamic
        percpu areas to be accessed the same way as static ones

Implement scalable dynamic percpu allocator which can be used for both
static and dynamic percpu areas.  This will allow static and dynamic
areas to share faster direct access methods.  This feature is optional
and enabled only when CONFIG_HAVE_DYNAMIC_PER_CPU_AREA is defined by
arch.  Please read comment on top of mm/percpu.c for details.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
2009-02-20 16:29:08 +09:00
Tejun Heo
8fc4898500 vmalloc: add un/map_kernel_range_noflush()
Impact: two more public map/unmap functions

Implement map_kernel_range_noflush() and unmap_kernel_range_noflush().
These functions respectively map and unmap address range in kernel VM
area but doesn't do any vcache or tlb flushing.  These will be used by
new percpu allocator.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
2009-02-20 16:29:08 +09:00
Tejun Heo
f0aa661790 vmalloc: implement vm_area_register_early()
Impact: allow multiple early vm areas

There are places where kernel VM area needs to be allocated before
vmalloc is initialized.  This is done by allocating static vm_struct,
initializing several fields and linking it to vmlist and later vmalloc
initialization picking up these from vmlist.  This is currently done
manually and if there's more than one such areas, there's no defined
way to arbitrate who gets which address.

This patch implements vm_area_register_early(), which takes vm_area
struct with flags and size initialized, assigns address to it and puts
it on the vmlist.  This way, multiple early vm areas can determine
which addresses they should use.  The only current user - alpha mm
init - is converted to use it.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:08 +09:00
Tejun Heo
f2a8205c4e percpu: kill percpu_alloc() and friends
Impact: kill unused functions

percpu_alloc() and its friends never saw much action.  It was supposed
to replace the cpu-mask unaware __alloc_percpu() but it never happened
and in fact __percpu_alloc_mask() itself never really grew proper
up/down handling interface either (no exported interface for
populate/depopulate).

percpu allocation is about to go through major reimplementation and
there's no reason to carry this unused interface around.  Replace it
with __alloc_percpu() and free_percpu().

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:08 +09:00
Tejun Heo
734269521e vmalloc: call flush_cache_vunmap() from unmap_kernel_range()
Impact: proper vcache flush on unmap_kernel_range()

flush_cache_vunmap() should be called before pages are unmapped.  Add
a call to it in unmap_kernel_range().

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-02-20 16:29:07 +09:00
Linus Torvalds
ba95fd47d1 Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-block
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
  block: fix deadlock in blk_abort_queue() for drivers that readd to timeout list
  block: fix booting from partitioned md array
  block: revert part of 18ce3751cc
  cciss: PCI power management reset for kexec
  paride/pg.c: xs(): &&/|| confusion
  fs/bio: bio_alloc_bioset: pass right object ptr to mempool_free
  block: fix bad definition of BIO_RW_SYNC
  bsg: Fix sense buffer bug in SG_IO
2009-02-18 18:33:04 -08:00
KAMEZAWA Hiroyuki
cc2559bccc mm: fix memmap init for handling memory hole
Now, early_pfn_in_nid(PFN, NID) may returns false if PFN is a hole.
and memmap initialization was not done. This was a trouble for
sparc boot.

To fix this, the PFN should be initialized and marked as PG_reserved.
This patch changes early_pfn_in_nid() return true if PFN is a hole.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:55 -08:00
KAMEZAWA Hiroyuki
f2dbcfa738 mm: clean up for early_pfn_to_nid()
What's happening is that the assertion in mm/page_alloc.c:move_freepages()
is triggering:

	BUG_ON(page_zone(start_page) != page_zone(end_page));

Once I knew this is what was happening, I added some annotations:

	if (unlikely(page_zone(start_page) != page_zone(end_page))) {
		printk(KERN_ERR "move_freepages: Bogus zones: "
		       "start_page[%p] end_page[%p] zone[%p]\n",
		       start_page, end_page, zone);
		printk(KERN_ERR "move_freepages: "
		       "start_zone[%p] end_zone[%p]\n",
		       page_zone(start_page), page_zone(end_page));
		printk(KERN_ERR "move_freepages: "
		       "start_pfn[0x%lx] end_pfn[0x%lx]\n",
		       page_to_pfn(start_page), page_to_pfn(end_page));
		printk(KERN_ERR "move_freepages: "
		       "start_nid[%d] end_nid[%d]\n",
		       page_to_nid(start_page), page_to_nid(end_page));
 ...

And here's what I got:

	move_freepages: Bogus zones: start_page[2207d0000] end_page[2207dffc0] zone[fffff8103effcb00]
	move_freepages: start_zone[fffff8103effcb00] end_zone[fffff8003fffeb00]
	move_freepages: start_pfn[0x81f600] end_pfn[0x81f7ff]
	move_freepages: start_nid[1] end_nid[0]

My memory layout on this box is:

[    0.000000] Zone PFN ranges:
[    0.000000]   Normal   0x00000000 -> 0x0081ff5d
[    0.000000] Movable zone start PFN for each node
[    0.000000] early_node_map[8] active PFN ranges
[    0.000000]     0: 0x00000000 -> 0x00020000
[    0.000000]     1: 0x00800000 -> 0x0081f7ff
[    0.000000]     1: 0x0081f800 -> 0x0081fe50
[    0.000000]     1: 0x0081fed1 -> 0x0081fed8
[    0.000000]     1: 0x0081feda -> 0x0081fedb
[    0.000000]     1: 0x0081fedd -> 0x0081fee5
[    0.000000]     1: 0x0081fee7 -> 0x0081ff51
[    0.000000]     1: 0x0081ff59 -> 0x0081ff5d

So it's a block move in that 0x81f600-->0x81f7ff region which triggers
the problem.

This patch:

Declaration of early_pfn_to_nid() is scattered over per-arch include
files, and it seems it's complicated to know when the declaration is used.
 I think it makes fix-for-memmap-init not easy.

This patch moves all declaration to include/linux/mm.h

After this,
  if !CONFIG_NODES_POPULATES_NODE_MAP && !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
     -> Use static definition in include/linux/mm.h
  else if !CONFIG_HAVE_ARCH_EARLY_PFN_TO_NID
     -> Use generic definition in mm/page_alloc.c
  else
     -> per-arch back end function will be called.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: David Miller <davem@davemlloft.net>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org>		[2.6.25.x, 2.6.26.x, 2.6.27.x, 2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:55 -08:00
Nick Piggin
1cf6e7d83b mm: task dirty accounting fix
YAMAMOTO-san noticed that task_dirty_inc doesn't seem to be called properly for
cases where set_page_dirty is not used to dirty a page (eg. mark_buffer_dirty).

Additionally, there is some inconsistency about when task_dirty_inc is
called.  It is used for dirty balancing, however it even gets called for
__set_page_dirty_no_writeback.

So rather than increment it in a set_page_dirty wrapper, move it down to
exactly where the dirty page accounting stats are incremented.

Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:54 -08:00
Benjamin Herrenschmidt
c296861291 vmalloc: add __get_vm_area_caller()
We have get_vm_area_caller() and __get_vm_area() but not
__get_vm_area_caller()

On powerpc, I use __get_vm_area() to separate the ranges of addresses
given to vmalloc vs.  ioremap (various good reasons for that) so in order
to be able to implement the new caller tracking in /proc/vmallocinfo, I
need a "_caller" variant of it.

(akpm: needed for ongoing powerpc development, so merge it early)

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-18 15:37:53 -08:00
Jens Axboe
93dbb39350 block: fix bad definition of BIO_RW_SYNC
We can't OR shift values, so get rid of BIO_RW_SYNC and use BIO_RW_SYNCIO
and BIO_RW_UNPLUG explicitly. This brings back the behaviour from before
213d9417fe.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-02-18 10:32:00 +01:00
Linus Torvalds
35010334aa Merge branch 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
  x86, vm86: fix preemption bug
  x86, olpc: fix model detection without OFW
  x86, hpet: fix for LS21 + HPET = boot hang
  x86: CPA avoid repeated lazy mmu flush
  x86: warn if arch_flush_lazy_mmu_cpu is called in preemptible context
  x86/paravirt: make arch_flush_lazy_mmu/cpu disable preemption
  x86, pat: fix warn_on_once() while mapping 0-1MB range with /dev/mem
  x86/cpa: make sure cpa is safe to call in lazy mmu mode
  x86, ptrace, mm: fix double-free on race
2009-02-17 14:27:39 -08:00
Linus Torvalds
071a0bc2ce Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  mm: Export symbol ksize()
2009-02-12 09:56:14 -08:00
Nick Piggin
3a4c6800f3 Fix page writeback thinko, causing Berkeley DB slowdown
A bug was introduced into write_cache_pages cyclic writeout by commit
31a12666d8 ("mm: write_cache_pages cyclic
fix").  The intention (and comments) is that we should cycle back and
look for more dirty pages at the beginning of the file if there is no
more work to be done.

But the !done condition was dropped from the test.  This means that any
time the page writeout loop breaks (eg.  due to nr_to_write == 0), we
will set index to 0, then goto again.  This will set done_index to
index, then find done is set, so will proceed to the end of the
function.  When updating mapping->writeback_index for cyclic writeout,
we now use done_index == 0, so we're always cycling back to 0.

This seemed to be causing random mmap writes (slapadd and iozone) to
start writing more pages from the LRU and writeout would slowdown, and
caused bugzilla entry

	http://bugzilla.kernel.org/show_bug.cgi?id=12604

about Berkeley DB slowing down dramatically.

With this patch, iozone random write performance is increased nearly
5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2).

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-and-tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-12 08:10:53 -08:00
Kirill A. Shutemov
b1aabecd55 mm: Export symbol ksize()
Commit 7b2cd92adc ("crypto: api - Fix
zeroing on free") added modular user of ksize(). Export that to fix
crypto.ko compilation.

Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-02-12 17:50:46 +02:00
Jeremy Fitzhardinge
9480c53e9b mm: rearrange exit_mmap() to unlock before arch_exit_mmap
Christophe Saout reported [in precursor to:
http://marc.info/?l=linux-kernel&m=123209902707347&w=4]:

> Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU.
> Seems like Xen tears down current->mm early on process termination, so
> that __get_user_pages in exit_mmap causes nasty messages when the
> process had any mlocked pages.  (in fact, it somehow manages to get into
> the swapping code and produces a null pointer dereference trying to get
> a swap token)

Jeremy explained:

Yes.  In the normal case under Xen, an in-use pagetable is "pinned",
meaning that it is RO to the kernel, and all updates must go via hypercall
(or writes are trapped and emulated, which is much the same thing).  An
unpinned pagetable is not currently in use by any process, and can be
directly accessed as normal RW pages.

As an optimisation at process exit time, we unpin the pagetable as early
as possible (switching the process to init_mm), so that all the normal
pagetable teardown can happen with direct memory accesses.

This happens in exit_mmap() -> arch_exit_mmap().  The munlocking happens
a few lines below.  The obvious thing to do would be to move
arch_exit_mmap() to below the munlock code, but I think we'd want to
call it even if mm->mmap is NULL, just to be on the safe side.

Thus, this patch:

exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap,
as the latter may switch the current mm to init_mm, which would cause the
former to fail.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Keir Fraser <keir.fraser@eu.citrix.com>
Cc: Christophe Saout <christophe@saout.de>
Cc: Alex Williamson <alex.williamson@hp.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:37 -08:00
Federico Cuello
89e1219004 writeback: fix break condition
Commit dcf6a79dda ("write-back: fix
nr_to_write counter") fixed nr_to_write counter, but didn't set the break
condition properly.

If nr_to_write == 0 after being decremented it will loop one more time
before setting done = 1 and breaking the loop.

[akpm@linux-foundation.org: coding-style fixes]
Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:37 -08:00
KAMEZAWA Hiroyuki
2e9c237243 memcg: use __GFP_NOWARN in page cgroup allocation
page_cgroup's page allocation at init/memory hotplug uses kmalloc() and
vmalloc(). If kmalloc() failes, vmalloc() is used.

This is because vmalloc() is very limited resource on 32bit systems.
We want to use kmalloc() first.

But in this kind of call, __GFP_NOWARN should be specified.

Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
MinChan Kim
508b9f8efd mm: fix mlocked page counter mismatch
When I tested following program, I found that the mlocked counter
is strange.  It cannot free some mlocked pages.

It is because try_to_unmap_file() doesn't check real
page mappings in vmas.

That is because the goal of an address_space for a file is to find all
processes into which the file's specific interval is mapped.  It is
related to the file's interval, not to pages.

Even if the page isn't really mapped by the vma, it returns SWAP_MLOCK
since the vma has VM_LOCKED, then calls try_to_mlock_page.  After this the
mlocked counter is increased again.

COWed anon page in a file-backed vma could be a such case.  This patch
resolves it.

-- my test program --

int main()
{
       mlockall(MCL_CURRENT);
       return 0;
}

-- before --

root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
Unevictable:           0 kB
Mlocked:               0 kB

-- after --

root@barrios-target-linux:~# cat /proc/meminfo | egrep 'Mlo|Unev'
Unevictable:           8 kB
Mlocked:               8 kB

Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Sven Wegener
fc3501d411 mm: fix dirty_bytes/dirty_background_bytes sysctls on 64bit arches
We need to pass an unsigned long as the minimum, because it gets casted
to an unsigned long in the sysctl handler. If we pass an int, we'll
access four more bytes on 64bit arches, resulting in a random minimum
value.

[rientjes@google.com: fix type of `old_bytes']
Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:35 -08:00
Daisuke Nishimura
1001c9fb87 migration: migrate_vmas should check "vma"
migrate_vmas() should check "vma" not "vma->vm_next" for for-loop condition.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 14:25:34 -08:00
Mel Gorman
17c9d12e12 Do not account for hugetlbfs quota at mmap() time if mapping [SHM|MAP]_NORESERVE
Commit 5a6fe12595 brought hugetlbfs more
in line with the core VM by obeying VM_NORESERVE and not reserving
hugepages for both shared and private mappings when [SHM|MAP]_NORESERVE
are specified.  However, it is still taking filesystem quota
unconditionally.

At fault time, if there are no reserves and attempt is made to allocate
the page and account for filesystem quota.  If either fail, the fault
fails.  The impact is that quota is getting accounted for twice.  This
patch partially reverts 5a6fe12595.  To
help prevent this mistake happening again, it improves the documentation
of hugetlb_reserve_pages()

Reported-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Andy Whitcroft <apw@canonical.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-11 12:38:09 -08:00
Markus Metzger
9f339e7028 x86, ptrace, mm: fix double-free on race
Ptrace_detach() races with __ptrace_unlink() if the traced task is
reaped while detaching. This might cause a double-free of the BTS
buffer.

Change the ptrace_detach() path to only do the memory accounting in
ptrace_bts_detach() and leave the buffer free to ptrace_bts_untrace()
which will be called from __ptrace_unlink().

The fix follows a proposal from Oleg Nesterov.

Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Markus Metzger <markus.t.metzger@intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-02-11 15:44:20 +01:00
Mel Gorman
5a6fe12595 Do not account for the address space used by hugetlbfs using VM_ACCOUNT
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.

Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.

As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.

With commit fc8744adc8, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-10 10:48:42 -08:00
Hugh Dickins
d5b562330e mm: fix error case in mlock downgrade reversion
Commit 27421e211a, Manually revert
"mlock: downgrade mmap sem while populating mlocked regions", has
introduced its own regression: __mlock_vma_pages_range() may report
an error (for example, -EFAULT from trying to lock down pages from
beyond EOF), but mlock_vma_pages_range() must hide that from its
callers as before.

Reported-by: Sami Farin <safari-kernel@safari.iki.fi>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-08 13:53:28 -08:00
Carsten Otte
ab92661d5d do_wp_page: fix regression with execute in place
Fix do_wp_page for VM_MIXEDMAP mappings.

In the case where pfn_valid returns 0 for a pfn at the beginning of
do_wp_page and the mapping is not shared writable, the code branches to
label `gotten:' with old_page == NULL.

In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page,
clear_page_mlock, and unlock_page try to access the old_page.

This patch checks whether old_page is valid before it is dereferenced.

The regression was introduced by "mlock: mlocked pages are unevictable"
(commit b291f00039).

Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-05 12:56:48 -08:00
Artem Bityutskiy
dcf6a79dda write-back: fix nr_to_write counter
Commit 05fe478dd0 introduced some
@wbc->nr_to_write breakage.

It made the following changes:
 1. Decrement wbc->nr_to_write instead of nr_to_write
 2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
 3. If synced nr_to_write pages, stop only if if wbc->sync_mode ==
    WB_SYNC_NONE, otherwise keep going.

However, according to the commit message, the intention was to only make
change 3.  Change 1 is a bug.  Change 2 does not seem to be necessary,
and it breaks UBIFS expectations, so if needed, it should be done
separately later.  And change 2 does not seem to be documented in the
commit message.

This patch does the following:
 1. Undo changes 1 and 2
 2. Add a comment explaining change 3 (it very useful to have comments
    in _code_, not only in the commit).

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-03 16:59:08 -08:00
Linus Torvalds
859281ff37 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slub: fix per cpu kmem_cache_cpu array memory leak
  kmalloc: return NULL instead of link failure
2009-02-02 19:27:00 -08:00
Linus Torvalds
27421e211a Manually revert "mlock: downgrade mmap sem while populating mlocked regions"
This essentially reverts commit 8edb08caf6.

It downgraded our mmap semaphore to a read-lock while mlocking pages, in
order to allow other threads (and external accesses like "ps" et al) to
walk the vma lists and take page faults etc.  Which is a nice idea, but
the implementation does not work.

Because we cannot upgrade the lock back to a write lock without
releasing the mmap semaphore, the code had to release the lock entirely
and then re-take it as a writelock.  However, that meant that the caller
possibly lost the vma chain that it was following, since now another
thread could come in and mmap/munmap the range.

The code tried to work around that by just looking up the vma again and
erroring out if that happened, but quite frankly, that was just a buggy
hack that doesn't actually protect against anything (the other thread
could just have replaced the vma with another one instead of totally
unmapping it).

The only way to downgrade to a read map _reliably_ is to do it at the
end, which is likely the right thing to do: do all the 'vma' operations
with the write-lock held, then downgrade to a read after completing them
all, and then do the "populate the newly mlocked regions" while holding
just the read lock.  And then just drop the read-lock and return to user
space.

The (perhaps somewhat simpler) alternative is to just make all the
callers of mlock_vma_pages_range() know that the mmap lock got dropped,
and just re-grab the mmap semaphore if it needs to mlock more than one
vma region.

So we can do this "downgrade mmap sem while populating mlocked regions"
thing right, but the way it was done here was absolutely not correct.
Thus the revert, in the expectation that we will do it all correctly
some day.

Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-01 11:00:16 -08:00
Linus Torvalds
fc8744adc8 Stop playing silly games with the VM_ACCOUNT flag
The mmap_region() code would temporarily set the VM_ACCOUNT flag for
anonymous shared mappings just to inform shmem_zero_setup() that it
should enable accounting for the resulting shm object.  It would then
clear the flag after calling ->mmap (for the /dev/zero case) or doing
shmem_zero_setup() (for the MAP_ANON case).

This just resulted in vma merge issues, but also made for just
unnecessary confusion.  Use the already-existing VM_NORESERVE flag for
this instead, and let shmem_{zero|file}_setup() just figure it out from
that.

This also happens to make it obvious that the new DRI2 GEM layer uses a
non-reserving backing store for its object allocation - which is quite
possibly not intentional.  But since I didn't want to change semantics
in this patch, I left it alone, and just updated the caller to use the
new flag semantics.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-31 15:08:56 -08:00
Linus Torvalds
33bfad54b5 Allow opportunistic merging of VM_CAN_NONLINEAR areas
Commit de33c8db59 ("Fix OOPS in
mmap_region() when merging adjacent VM_LOCKED file segments") unified
the vma merging of anonymous and file maps to just one place, which
simplified the code and fixed a use-after-free bug that could cause an
oops.

But by doing the merge opportunistically before even having called
->mmap() on the file method, it now compares two different 'vm_flags'
values: the pre-mmap() value of the new not-yet-formed vma, and previous
mappings of the same file around it.

And in doing so, it refused to merge the common file case, which adds a
marker to say "I can be made non-linear".

This fixes it by just adding a set of flags that don't have to match,
because we know they are ok to merge.  Currently it's only that single
VM_CAN_NONLINEAR flag, but at least conceptually there could be others
in the future.

Reported-and-acked-by: Hugh Dickins <hugh@veritas.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg KH <gregkh@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-30 11:37:22 -08:00
KAMEZAWA Hiroyuki
299b4eaa30 memcg: NULL pointer dereference at rmdir on some NUMA systems
N_POSSIBLE doesn't means there is memory...and force_empty can
visit invalid node which have no pgdat.

To visit all valid nodes, N_HIGH_MEMORY should be used.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:44 -08:00
KAMEZAWA Hiroyuki
85d9fc89fb memcg: fix refcnt handling at swapoff
Now, at swapoff, even while try_charge() fails, commit is executed.  This
is a bug which turns the refcnt of cgroup_subsys_state negative.

Reported-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Li Zefan <lizf@cn.fujitsu.com>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:43 -08:00
Daisuke Nishimura
7bcc1bb123 memcg: get/put parents at create/free
The lifetime of struct cgroup and struct mem_cgroup is different and
mem_cgroup has its own reference count for handling references from
swap_cgroup.

This causes strange problem that the parent mem_cgroup dies while child
mem_cgroup alive, and this problem causes a bug in case of
use_hierarchy==1 because res_counter_uncharge climbs up the tree.

This patch is for avoiding it by getting the parent at create, and putting
it at freeing.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by; KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 18:04:43 -08:00
Linus Torvalds
de33c8db59 Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments
As of commit ba470de431 ("map: handle
mlocked pages during map, remap, unmap") we now use the 'vma' variable
at the end of mmap_region() to handle the page-in of newly mapped
mlocked pages.

However, if we merged adjacent vma's together, the vma we're using may
be stale.  We historically consciously avoided using it after the merge
operation, but that got overlooked when redoing the locked page
handling.

This commit simplifies mmap_region() by doing any vma merges early,
avoiding the issue entirely, and 'vma' will always be valid.  As pointed
out by Hugh Dickins, this depends on any drivers that change the page
offset of flags to have set one of the VM_SPECIAL bits (so that they
cannot trigger the early merge logic), but that's true in general.

Reported-and-tested-by: Maksim Yevmenkin <maksim.yevmenkin@gmail.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-29 17:46:42 -08:00
David Rientjes
3718909448 slub: fix per cpu kmem_cache_cpu array memory leak
The per cpu array of kmem_cache_cpu structures accomodates
NR_KMEM_CACHE_CPU such structs.

When this array overflows and a struct is allocated by kmalloc(), it may
have an address at the upper bound of this array.  If this happens, it
does not get freed and the per cpu kmem_cache_cpu_free pointer will be out
of bounds after kmem_cache_destroy() or cpu offlining.

Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-01-28 10:43:42 +02:00