Commit graph

2903 commits

Author SHA1 Message Date
Cyrill Gorcunov
ef161a9863 mm: mminit_validate_memmodel_limits(): remove redundant test
In case if start_pfn overlap the upper bound no need to test end_pfn again
since we have it already trimmed.

Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-01 08:59:11 -07:00
Linus Torvalds
d17abcd541 Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask
* git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask:
  oprofile: Thou shalt not call __exit functions from __init functions
  cpumask: remove the now-obsoleted pcibus_to_cpumask(): generic
  cpumask: remove cpumask_t from core
  cpumask: convert rcutorture.c
  cpumask: use new cpumask_ functions in core code.
  cpumask: remove references to struct irqaction's mask field.
  cpumask: use mm_cpumask() wrapper: kernel/fork.c
  cpumask: use set_cpu_active in init/main.c
  cpumask: remove node_to_first_cpu
  cpumask: fix seq_bitmap_*() functions.
  cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL
2009-03-30 18:00:26 -07:00
Linus Torvalds
c4e1aa67ed Merge branch 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (33 commits)
  lockdep: fix deadlock in lockdep_trace_alloc
  lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
  lockdep: annotate reclaim context (__GFP_NOFS), fix
  lockdep: build fix for !PROVE_LOCKING
  lockstat: warn about disabled lock debugging
  lockdep: use stringify.h
  lockdep: simplify check_prev_add_irq()
  lockdep: get_user_chars() redo
  lockdep: simplify get_user_chars()
  lockdep: add comments to mark_lock_irq()
  lockdep: remove macro usage from mark_held_locks()
  lockdep: fully reduce mark_lock_irq()
  lockdep: merge the !_READ mark_lock_irq() helpers
  lockdep: merge the _READ mark_lock_irq() helpers
  lockdep: simplify mark_lock_irq() helpers #3
  lockdep: further simplify mark_lock_irq() helpers
  lockdep: simplify the mark_lock_irq() helpers
  lockdep: split up mark_lock_irq()
  lockdep: generate usage strings
  lockdep: generate the state bit definitions
  ...
2009-03-30 17:17:35 -07:00
Ingo Molnar
19cefdffbf lockdep: annotate reclaim context (__GFP_NOFS), fix SLOB
Impact: build fix

fix typo in mm/slob.c:

 mm/slob.c:469: error: ‘flags’ undeclared (first use in this function)
 mm/slob.c:469: error: (Each undeclared identifier is reported only once
 mm/slob.c:469: error: for each function it appears in.)

Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090128135457.350751756@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-30 22:59:44 +02:00
Linus Torvalds
019abbc870 Merge branch 'x86-stage-3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'x86-stage-3-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (190 commits)
  Revert "cpuacct: reduce one NULL check in fast-path"
  Revert "x86: don't compile vsmp_64 for 32bit"
  x86: Correct behaviour of irq affinity
  x86: early_ioremap_init(), use __fix_to_virt(), because we are sure it's safe
  x86: use default_cpu_mask_to_apicid for 64bit
  x86: fix set_extra_move_desc calling
  x86, PAT, PCI: Change vma prot in pci_mmap to reflect inherited prot
  x86/dmi: fix dmi_alloc() section mismatches
  x86: e820 fix various signedness issues in setup.c and e820.c
  x86: apic/io_apic.c define msi_ir_chip and ir_ioapic_chip all the time
  x86: irq.c keep CONFIG_X86_LOCAL_APIC interrupts together
  x86: irq.c use same path for show_interrupts
  x86: cpu/cpu.h cleanup
  x86: Fix a couple of sparse warnings in arch/x86/kernel/apic/io_apic.c
  Revert "x86: create a non-zero sized bm_pte only when needed"
  x86: pci-nommu.c cleanup
  x86: io_delay.c cleanup
  x86: rtc.c cleanup
  x86: i8253 cleanup
  x86: kdebugfs.c cleanup
  ...
2009-03-30 11:38:31 -07:00
Rusty Russell
aa85ea5b89 cpumask: use new cpumask_ functions in core code.
Impact: cleanup

Time to clean up remaining laggards using the old cpu_ functions.

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Cc: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Trond.Myklebust@netapp.com
2009-03-30 22:05:16 +10:30
Rusty Russell
1a2142afa5 cpumask: remove dangerous CPU_MASK_ALL_PTR, &CPU_MASK_ALL
Impact: cleanup

(Thanks to Al Viro for reminding me of this, via Ingo)

CPU_MASK_ALL is the (deprecated) "all bits set" cpumask, defined as so:

	#define CPU_MASK_ALL (cpumask_t) { { ... } }

Taking the address of such a temporary is questionable at best,
unfortunately 321a8e9d (cpumask: add CPU_MASK_ALL_PTR macro) added
CPU_MASK_ALL_PTR:

	#define CPU_MASK_ALL_PTR (&CPU_MASK_ALL)

Which formalizes this practice.  One day gcc could bite us over this
usage (though we seem to have gotten away with it so far).

So replace everywhere which used &CPU_MASK_ALL or CPU_MASK_ALL_PTR
with the modern "cpu_all_mask" (a real const struct cpumask *).

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
Acked-by: Ingo Molnar <mingo@elte.hu>
Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Mike Travis <travis@sgi.com>
2009-03-30 22:05:11 +10:30
Ingo Molnar
3fab191002 Merge branch 'linus' into x86/core 2009-03-28 22:27:45 +01:00
Linus Torvalds
0fe41b8982 Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
* 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm: (422 commits)
  [ARM] 5435/1: fix compile warning in sanity_check_meminfo()
  [ARM] 5434/1: ARM: OMAP: Fix mailbox compile for 24xx
  [ARM] pxa: fix the bad assumption that PCMCIA sockets always start with 0
  [ARM] pxa: fix Colibri PXA300 and PXA320 LCD backlight pins
  imxfb: Fix TFT mode
  i.MX21/27: remove ifdef CONFIG_FB_IMX
  imxfb: add clock support
  mxc: add arch_reset() function
  clkdev: add possibility to get a clock based on the device name
  i.MX1: remove fb support from mach-imx
  [ARM] pxa: build arch/arm/plat-pxa/mfp.c only when PXA3xx or ARCH_MMP defined
  Gemini: Add support for Teltonika RUT100
  Gemini: gpiolib based GPIO support v2
  MAINTAINERS: add myself as Gemini architecture maintainer
  ARM: Add Gemini architecture v3
  [ARM] OMAP: Fix compile for omap2_init_common_hw()
  MAINTAINERS: Add myself as Faraday ARM core variant maintainer
  ARM: Add support for FA526 v2
  [ARM] acorn,ebsa110,footbridge,integrator,sa1100: Convert asm/io.h to linux/io.h
  [ARM] collie: fix two minor formatting nits
  ...
2009-03-28 14:03:14 -07:00
Russell King
ed40d0c472 Merge branch 'origin' into devel
Conflicts:
	sound/soc/pxa/pxa2xx-i2s.c
2009-03-28 20:29:51 +00:00
Ingo Molnar
6e15cf0486 Merge branch 'core/percpu' into percpu-cpumask-x86-for-linus-2
Conflicts:
	arch/parisc/kernel/irq.c
	arch/x86/include/asm/fixmap_64.h
	arch/x86/include/asm/setup.h
	kernel/irq/handle.c

Semantic merge:
        arch/x86/include/asm/fixmap.h

Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-27 17:28:43 +01:00
Linus Torvalds
be0ea69674 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
  slob: fix lockup in slob_free()
  slub: use get_track()
  slub: rename calculate_min_partial() to set_min_partial()
  slub: add min_partial sysfs tunable
  slub: move min_partial to struct kmem_cache
  SLUB: Fix default slab order for big object sizes
  SLUB: Do not pass 8k objects through to the page allocator
  SLUB: Introduce and use SLUB_MAX_SIZE and SLUB_PAGE_SHIFT constants
  slob: clean up the code
  SLUB: Use ->objsize from struct kmem_cache_cpu in slab_free()
2009-03-26 16:18:17 -07:00
Linus Torvalds
86d9c07017 Merge branch 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block
* 'for-2.6.30' of git://git.kernel.dk/linux-2.6-block:
  Get rid of pdflush_operation() in emergency sync and remount
  btrfs: get rid of current_is_pdflush() in btrfs_btree_balance_dirty
  Move the default_backing_dev_info out of readahead.c and into backing-dev.c
  block: Repeated lines in switching-sched.txt
  bsg: Remove bogus check against request_queue->max_sectors
  block: WARN in __blk_put_request() for potential bio leak
  loop: fix circular locking in loop_clr_fd()
  loop: support barrier writes
  bsg: add support for tail queuing
  cpqarray: enable bus mastering
  block: genhd.h cleanup patch
  block: add private bio_set for bio integrity allocations
  block: genhd.h comment needs updating
  block: get rid of unused blkdev_free_rq() define
  block: remove various blk_queue_*() setting functions in blk_init_queue_node()
  cciss: add BUILD_BUG_ON() for catching bad CommandList_struct alignment
  block: don't create bio_vec slabs of less than the inline number
  block: cleanup bio_alloc_bioset()
2009-03-26 16:03:04 -07:00
Linus Torvalds
8d80ce80e1 Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (71 commits)
  SELinux: inode_doinit_with_dentry drop no dentry printk
  SELinux: new permission between tty audit and audit socket
  SELinux: open perm for sock files
  smack: fixes for unlabeled host support
  keys: make procfiles per-user-namespace
  keys: skip keys from another user namespace
  keys: consider user namespace in key_permission
  keys: distinguish per-uid keys in different namespaces
  integrity: ima iint radix_tree_lookup locking fix
  TOMOYO: Do not call tomoyo_realpath_init unless registered.
  integrity: ima scatterlist bug fix
  smack: fix lots of kernel-doc notation
  TOMOYO: Don't create securityfs entries unless registered.
  TOMOYO: Fix exception policy read failure.
  SELinux: convert the avc cache hash list to an hlist
  SELinux: code readability with avc_cache
  SELinux: remove unused av.decided field
  SELinux: more careful use of avd in avc_has_perm_noaudit
  SELinux: remove the unused ae.used
  SELinux: check seqno when updating an avc_node
  ...
2009-03-26 11:03:39 -07:00
Wu Fengguang
1b5e62b42b writeback: double the dirty thresholds
Enlarge default dirty ratios from 5/10 to 10/20.  This fixes [Bug
#12809] iozone regression with 2.6.29-rc6.

The iozone benchmarks are performed on a 1200M file, with 8GB ram.

  iozone -i 0 -i 1 -i 2 -i 3 -i 4 -r 4k -s 64k -s 512m -s 1200m -b tmp.xls
  iozone -B -r 4k -s 64k -s 512m -s 1200m -b tmp.xls

The performance regression is triggered by commit 1cf6e7d83bf3(mm: task
dirty accounting fix), which makes more correct/thorough dirty
accounting.

The default 5/10 dirty ratios were picked (a) with the old dirty logic
and (b) largely at random and (c) designed to be aggressive.  In
particular, that (a) means that having fixed some of the dirty
accounting, maybe the real bug is now that it was always too aggressive,
just hidden by an accounting issue.

The enlarged 10/20 dirty ratios are just about enough to fix the regression.

[ We will have to look at how this affects the old fsync() latency issue,
  but that probably will need independent work.  - Linus ]

Cc: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reported-by: "Lin, Ming M" <ming.m.lin@intel.com>
Tested-by: "Lin, Ming M" <ming.m.lin@intel.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-26 11:01:11 -07:00
Jens Axboe
26160158d3 Move the default_backing_dev_info out of readahead.c and into backing-dev.c
It really makes no sense to have it in readahead.c, so move it where
it belongs.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-03-26 11:01:33 +01:00
Russell King
8937b7349c Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tmlind/linux-omap-2.6 into devel 2009-03-25 18:31:35 +00:00
Pekka Enberg
15a5b0a491 Merge branches 'topic/slob/cleanups', 'topic/slob/fixes', 'topic/slub/core', 'topic/slub/cleanups' and 'topic/slub/perf' into for-linus 2009-03-24 10:25:21 +02:00
James Morris
703a3cd728 Merge branch 'master' into next 2009-03-24 10:52:46 +11:00
Nick Piggin
6fb8f42439 slob: fix lockup in slob_free()
Don't hold SLOB lock when freeing the page. Reduces lock hold width. See
the following thread for discussion of the bug:

  http://marc.info/?l=linux-kernel&m=123709983214143&w=2

Reported-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Matt Mackall <mpm@selenic.com>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-03-23 10:40:45 +02:00
Akinobu Mita
1a00df4a2c slub: use get_track()
Use get_track() in set_track()

Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2009-03-23 09:43:52 +02:00
Ingo Molnar
705bb9dc72 Merge branches 'x86/cleanups', 'x86/cpu', 'x86/debug', 'x86/mce2', 'x86/mm', 'x86/mtrr', 'x86/setup', 'x86/setup-memory', 'x86/urgent', 'x86/uv', 'x86/x2apic' and 'linus' into x86/core
Conflicts:
	arch/parisc/kernel/irq.c
2009-03-18 13:19:49 +01:00
Nicolas Pitre
3297e76077 highmem: atomic highmem kmap page pinning
Most ARM machines have a non IO coherent cache, meaning that the
dma_map_*() set of functions must clean and/or invalidate the affected
memory manually before DMA occurs.  And because the majority of those
machines have a VIVT cache, the cache maintenance operations must be
performed using virtual
addresses.

When a highmem page is kunmap'd, its mapping (and cache) remains in place
in case it is kmap'd again. However if dma_map_page() is then called with
such a page, some cache maintenance on the remaining mapping must be
performed. In that case, page_address(page) is non null and we can use
that to synchronize the cache.

It is unlikely but still possible for kmap() to race and recycle the
virtual address obtained above, and use it for another page before some
on-going cache invalidation loop in dma_map_page() is done. In that case,
the new mapping could end up with dirty cache lines for another page,
and the unsuspecting cache invalidation loop in dma_map_page() might
simply discard those dirty cache lines resulting in data loss.

For example, let's consider this sequence of events:

	- dma_map_page(..., DMA_FROM_DEVICE) is called on a highmem page.

	-->	- vaddr = page_address(page) is non null. In this case
		it is likely that the page has valid cache lines
		associated with vaddr. Remember that the cache is VIVT.

		-->	for (i = vaddr; i < vaddr + PAGE_SIZE; i += 32)
				invalidate_cache_line(i);

	*** preemption occurs in the middle of the loop above ***

	- kmap_high() is called for a different page.

	-->	- last_pkmap_nr wraps to zero and flush_all_zero_pkmaps()
		  is called.  The pkmap_count value for the page passed
		  to dma_map_page() above happens to be 1, so the page
		  is unmapped.  But prior to that, flush_cache_kmaps()
		  cleared the cache for it.  So far so good.

		- A fresh pkmap entry is assigned for this kmap request.
		  The Murphy law says this pkmap entry will eventually
		  happen to use the same vaddr as the one which used to
		  belong to the other page being processed by
		  dma_map_page() in the preempted thread above.

	- The kmap_high() caller start dirtying the cache using the
	  just assigned virtual mapping for its page.

	*** the first thread is rescheduled ***

			- The for(...) loop is resumed, but now cached
			  data belonging to a different physical page is
			  being discarded !

And this is not only a preemption issue as ARM can be SMP as well,
making the above scenario just as likely. Hence the need for some kind
of pkmap page pinning which can be used in any context, primarily for
the benefit of dma_map_page() on ARM.

This provides the necessary interface to cope with the above issue if
ARCH_NEEDS_KMAP_HIGH_GET is defined, otherwise the resulting code is
unchanged.

Signed-off-by: Nicolas Pitre <nico@marvell.com>
Reviewed-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
2009-03-15 21:01:21 -04:00
Daisuke Nishimura
1d885526f2 vmscan: pgmoved should be cleared after updating recent_rotated
pgmoved should be cleared after updating recent_rotated.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-14 11:57:22 -07:00
Ingo Molnar
0ca0f16fd1 Merge branches 'x86/apic', 'x86/asm', 'x86/cleanups', 'x86/debug', 'x86/kconfig', 'x86/mm', 'x86/ptrace', 'x86/setup' and 'x86/urgent'; commit 'v2.6.29-rc8' into x86/core 2009-03-14 16:25:40 +01:00
Pallipadi, Venkatesh
895791dac6 VM, x86, PAT: add a new vm flag to track full pfnmap at mmap
Impact: cleanup

Add a new vm flag VM_PFN_AT_MMAP to identify a PFNMAP that is
fully mapped with remap_pfn_range. Patch removes the overloading
of VM_INSERTPAGE from the earlier patch.

Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Acked-by: Nick Piggin <npiggin@suse.de>
LKML-Reference: <20090313233543.GA19909@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-14 09:47:44 +01:00
Ingo Molnar
0634023562 Merge branch 'x86/core' into x86/kconfig 2009-03-13 17:08:30 +01:00
Pallipadi, Venkatesh
4bb9c5c021 VM, x86, PAT: Change is_linear_pfn_mapping to not use vm_pgoff
Impact: fix false positive PAT warnings - also fix VirtalBox hang

Use of vma->vm_pgoff to identify the pfnmaps that are fully
mapped at mmap time is broken. vm_pgoff is set by generic mmap
code even for cases where drivers are setting up the mappings
at the fault time.

The problem was originally reported here:

 http://marc.info/?l=linux-kernel&m=123383810628583&w=2

Change is_linear_pfn_mapping logic to overload VM_INSERTPAGE
flag along with VM_PFNMAP to mean full PFNMAP setup at mmap
time.

Problem also tracked at:

 http://bugzilla.kernel.org/show_bug.cgi?id=12800

Reported-by: Thomas Hellstrom <thellstrom@vmware.com>
Tested-by: Frans Pop <elendil@planet.nl>
Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Signed-off-by: Suresh Siddha <suresh.b.siddha>@intel.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: "ebiederm@xmission.com" <ebiederm@xmission.com>
Cc: <stable@kernel.org> # only for 2.6.29.1, not .28
LKML-Reference: <20090313004527.GA7176@linux-os.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-13 04:28:50 +01:00
KOSAKI Motohiro
f272b7bc44 memcg: use correct scan number at reclaim
Even when page reclaim is under mem_cgroup, # of scan page is determined by
status of global LRU. Fix that.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-03-12 16:20:24 -07:00
Tejun Heo
60db564220 percpu: fix spurious alignment WARN in legacy SMP percpu allocator
Impact: remove spurious WARN on legacy SMP percpu allocator

Commit f2a8205c4e incorrectly added too
tight WARN_ON_ONCE() on alignments for UP and legacy SMP percpu
allocator.  Commit e317603694 fixed it
for UP but legacy SMP allocator was forgotten.  Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Sachin P. Sant <sachinp@in.ibm.com>
2009-03-11 14:36:54 +09:00
Tejun Heo
66c3a75772 percpu: generalize embedding first chunk setup helper
Impact: code reorganization

Separate out embedding first chunk setup helper from x86 embedding
first chunk allocator and put it in mm/percpu.c.  This will be used by
the default percpu first chunk allocator and possibly by other archs.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10 16:27:48 +09:00
Tejun Heo
6074d5b0a3 percpu: more flexibility for @dyn_size of pcpu_setup_first_chunk()
Impact: cleanup, more flexibility for first chunk init

Non-negative @dyn_size used to be allowed iff @unit_size wasn't auto.
This restriction stemmed from implementation detail and made things a
bit less intuitive.  This patch allows @dyn_size to be specified
regardless of @unit_size and swaps the positions of @dyn_size and
@unit_size so that the parameter order makes more sense (static,
reserved and dyn sizes followed by enclosing unit_size).

While at it, add @unit_size >= PCPU_MIN_UNIT_SIZE sanity check.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10 16:27:48 +09:00
Tejun Heo
e01009833e percpu: make x86 addr <-> pcpu ptr conversion macros generic
Impact: generic addr <-> pcpu ptr conversion macros

There's nothing arch specific about x86 __addr_to_pcpu_ptr() and
__pcpu_ptr_to_addr().  With proper __per_cpu_load and __per_cpu_start
defined, they'll do the right thing regardless of actual layout.

Move these macros from arch/x86/include/asm/percpu.h to mm/percpu.c
and allow archs to override it as necessary.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-10 16:27:48 +09:00
Tejun Heo
ccea34b5d0 percpu: finer grained locking to break deadlock and allow atomic free
Impact: fix deadlock and allow atomic free

Percpu allocation always uses GFP_KERNEL and whole alloc/free paths
were protected by single mutex.  All percpu allocations have been from
GFP_KERNEL-safe context and the original allocator had this assumption
too.  However, by protecting both alloc and free paths with the same
mutex, the new allocator creates free -> alloc -> GFP_KERNEL
dependency which the original allocator didn't have.  This can lead to
deadlock if free is called from FS or IO paths.  Also, in general,
allocators are expected to allow free to be called from atomic
context.

This patch implements finer grained locking to break the deadlock and
allow atomic free.  For details, please read the "Synchronization
rules" comment.

While at it, also add CONTEXT: to function comments to describe which
context they expect to be called from and what they do to it.

This problem was reported by Thomas Gleixner and Peter Zijlstra.

  http://thread.gmane.org/gmane.linux.kernel/802384

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Thomas Gleixner <tglx@linutronix.de>
Reported-by: Peter Zijlstra <peterz@infradead.org>
2009-03-07 14:46:35 +09:00
Tejun Heo
a56dbddf06 percpu: move fully free chunk reclamation into a work
Impact: code reorganization for later changes

Do fully free chunk reclamation using a work.  This change is to
prepare for locking changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07 00:44:11 +09:00
Tejun Heo
9f7dcf224b percpu: move chunk area map extension out of area allocation
Impact: code reorganization for later changes

Separate out chunk area map extension into a separate function -
pcpu_extend_area_map() - and call it directly from pcpu_alloc() such
that pcpu_alloc_area() is guaranteed to have enough area map slots on
invocation.

With this change, pcpu_alloc_area() does only area allocation and the
only failure mode is when the chunk doens't have enough room, so
there's no need to distinguish it from memory allocation failures.
Make it return -1 on such cases instead of hacky -ENOSPC.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07 00:44:09 +09:00
Tejun Heo
1880d93b80 percpu: replace pcpu_realloc() with pcpu_mem_alloc() and pcpu_mem_free()
Impact: code reorganization for later changes

With static map handling moved to pcpu_split_block(), pcpu_realloc()
only clutters the code and it's also unsuitable for scheduled locking
changes.  Implement and use pcpu_mem_alloc/free() instead.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-07 00:44:09 +09:00
Tejun Heo
edcb463997 percpu, module: implement reserved allocation and use it for module percpu variables
Impact: add reserved allocation functionality and use it for module
	percpu variables

This patch implements reserved allocation from the first chunk.  When
setting up the first chunk, arch can ask to set aside certain number
of bytes right after the core static area which is available only
through a separate reserved allocator.  This will be used primarily
for module static percpu variables on architectures with limited
relocation range to ensure that the module perpcu symbols are inside
the relocatable range.

If reserved area is requested, the first chunk becomes reserved and
isn't available for regular allocation.  If the first chunk also
includes piggy-back dynamic allocation area, a separate chunk mapping
the same region is created to serve dynamic allocation.  The first one
is called static first chunk and the second dynamic first chunk.
Although they share the page map, their different area map
initializations guarantee they serve disjoint areas according to their
purposes.

If arch doesn't setup reserved area, reserved allocation is handled
like any other allocation.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Tejun Heo
3e24aa5890 percpu: add an indirection ptr for chunk page map access
Impact: allow sharing page map, no functional difference yet

Make chunk->page access indirect by adding a pointer and renaming the
actual array to page_ar.  This will be used by future changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Tejun Heo
cafe8816b2 percpu: use negative for auto for pcpu_setup_first_chunk() arguments
Impact: argument semantic cleanup

In pcpu_setup_first_chunk(), zero @unit_size and @dyn_size meant
auto-sizing.  It's okay for @unit_size as 0 doesn't make sense but 0
dynamic reserve size is valid.  Alos, if arch @dyn_size is calculated
from other parameters, it might end up passing in 0 @dyn_size and
malfunction when the size is automatically adjusted.

This patch makes both @unit_size and @dyn_size ssize_t and use -1 for
auto sizing.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Tejun Heo
61ace7fa2f percpu: improve first chunk initial area map handling
Impact: no functional change

When the first chunk is created, its initial area map is not allocated
because kmalloc isn't online yet.  The map is allocated and
initialized on the first allocation request on the chunk.  This works
fine but the scattering of initialization logic between the init
function and allocation path is a bit confusing.

This patch makes the first chunk initialize and use minimal statically
allocated map from pcpu_setpu_first_chunk().  The map resizing path
still needs to handle this specially but it's more straight-forward
and gives more latitude to the init path.  This will ease future
changes.

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Tejun Heo
2441d15c97 percpu: cosmetic renames in pcpu_setup_first_chunk()
Impact: cosmetic, preparation for future changes

Make the following renames in pcpur_setup_first_chunk() in preparation
for future changes.

* s/free_size/dyn_size/
* s/static_vm/first_vm/
* s/static_chunk/schunk/

Signed-off-by: Tejun Heo <tj@kernel.org>
2009-03-06 14:33:59 +09:00
Ingo Molnar
91d75e209b Merge branch 'x86/core' into core/percpu 2009-03-04 02:29:19 +01:00
Ingo Molnar
8b0e5860cb Merge branches 'x86/apic', 'x86/cpu', 'x86/fixmap', 'x86/mm', 'x86/sched', 'x86/setup-lzma', 'x86/signal' and 'x86/urgent' into x86/core 2009-03-04 02:22:31 +01:00
Ingo Molnar
f180053694 x86, mm: dont use non-temporal stores in pagecache accesses
Impact: standardize IO on cached ops

On modern CPUs it is almost always a bad idea to use non-temporal stores,
as the regression in this commit has shown it:

  30d697f: x86: fix performance regression in write() syscall

The kernel simply has no good information about whether using non-temporal
stores is a good idea or not - and trying to add heuristics only increases
complexity and inserts fragility.

The regression on cached write()s took very long to be found - over two
years. So dont take any chances and let the hardware decide how it makes
use of its caches.

The only exception is drivers/gpu/drm/i915/i915_gem.c: there were we are
absolutely sure that another entity (the GPU) will pick up the dirty
data immediately and that the CPU will not touch that data before the
GPU will.

Also, keep the _nocache() primitives to make it easier for people to
experiment with these details. There may be more clear-cut cases where
non-cached copies can be used, outside of filemap.c.

Cc: Salman Qazi <sqazi@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-03-02 11:06:49 +01:00
Ingo Molnar
55f2b78995 Merge branch 'x86/urgent' into x86/pat 2009-03-01 12:47:58 +01:00
Tejun Heo
d0c4f57027 bootmem, x86: further fixes for arch-specific bootmem wrapping
Impact: fix new breakages introduced by previous fix

Commit c132937556 tried to clean up
bootmem arch wrapper but it wasn't quite correct.  Before the commit,
the followings were broken.

* Low level interface functions prefixed with __ ignored arch
  preference.

* reserve_bootmem(...) can't be mapped into
  reserve_bootmem_node(NODE_DATA(0)->bdata, ...) because the node is
  not preference here.  The region specified MUST fall into the
  specified region; otherwise, it will panic.

After the commit,

* If allocation fails for the arch preferred node, it should fallback
  to whatever is available.  Instead, it simply failed allocation.

There are too many internal details to allow generic wrapping and
still keep things simple for archs.  Plus, all that arch wants is a
way to prefer certain node over another.

This patch drops the generic wrapping around alloc_bootmem_core() and
add alloc_bootmem_core() instead.  If necessary, arch can define
bootmem_arch_referred_node() macro or function which takes all
allocation information and returns the preferred node.  bootmem
generic code will always try the preferred node first and then
fallback to other nodes as usual.

Breakages noted and changes reviewed by Johannes Weiner.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
2009-03-01 16:06:56 +09:00
Tejun Heo
02d51fdfb2 percpu: kill compile warning in pcpu_populate_chunk()
Impact: remove compile warning

Mark local variable map_end in pcpu_populate_chunk() with
uninitialized_var().  The variable is always used in tandem with
map_start and guaranteed to be initialized before use but gcc doesn't
understand that.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Ingo Molnar <mingo@elte.hu>
2009-03-01 15:42:36 +09:00
Vegard Nossum
cbb766766f mm: fix lazy vmap purging (use-after-free error)
I just got this new warning from kmemcheck:

    WARNING: kmemcheck: Caught 32-bit read from freed memory (c7806a60)
    a06a80c7ecde70c1a04080c700000000a06709c1000000000000000000000000
     f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f f
     ^

    Pid: 0, comm: swapper Not tainted (2.6.29-rc4 #230)
    EIP: 0060:[<c1096df7>] EFLAGS: 00000286 CPU: 0
    EIP is at __purge_vmap_area_lazy+0x117/0x140
    EAX: 00070f43 EBX: c7806a40 ECX: c1677080 EDX: 00027b66
    ESI: 00002001 EDI: c170df0c EBP: c170df00 ESP: c178830c
     DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068
    CR0: 80050033 CR2: c7806b14 CR3: 01775000 CR4: 00000690
    DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    DR6: 00004000 DR7: 00000000
     [<c1096f3e>] free_unmap_vmap_area_noflush+0x6e/0x70
     [<c1096f6a>] remove_vm_area+0x2a/0x70
     [<c1097025>] __vunmap+0x45/0xe0
     [<c10970de>] vunmap+0x1e/0x30
     [<c1008ba5>] text_poke+0x95/0x150
     [<c1008ca9>] alternatives_smp_unlock+0x49/0x60
     [<c171ef47>] alternative_instructions+0x11b/0x124
     [<c171f991>] check_bugs+0xbd/0xdc
     [<c17148c5>] start_kernel+0x2ed/0x360
     [<c171409e>] __init_begin+0x9e/0xa9
     [<ffffffff>] 0xffffffff

It happened here:

    $ addr2line -e vmlinux -i c1096df7
    mm/vmalloc.c:540

Code:

	list_for_each_entry(va, &valist, purge_list)
		__free_vmap_area(va);

It's this instruction:

    mov    0x20(%ebx),%edx

Which corresponds to a dereference of va->purge_list.next:

    (gdb) p ((struct vmap_area *) 0)->purge_list.next
    Cannot access memory at address 0x20

It seems that we should use "safe" list traversal here, as the element
is freed inside the loop. Please verify that this is the right fix.

Acked-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27 16:26:21 -08:00
Nick Piggin
7766970cc1 mm: vmap fix overflow
The new vmap allocator can wrap the address and get confused in the case
of large allocations or VMALLOC_END near the end of address space.

Problem reported by Christoph Hellwig on a 32-bit XFS workload.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-by: Christoph Hellwig <hch@lst.de>
Cc: <stable@kernel.org>		[2.6.28.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-27 16:26:21 -08:00