linux-hardened/fs
KAMEZAWA Hiroyuki 08e552c69c memcg: synchronized LRU
A big patch for changing memcg's LRU semantics.

Now,
  - page_cgroup is linked to mem_cgroup's its own LRU (per zone).

  - LRU of page_cgroup is not synchronous with global LRU.

  - page and page_cgroup is one-to-one and statically allocated.

  - To find page_cgroup is on what LRU, you have to check pc->mem_cgroup as
    - lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

  - SwapCache is handled.

And, when we handle LRU list of page_cgroup, we do following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is spin_lock and we have to be afraid of dead-lock with zone->lru_lock.
So, trylock() is used at (1), now. Without (1), we can't trust "mz" is correct.

This is a trial to remove this dirty nesting of locks.
This patch changes mz->lru_lock to be zone->lru_lock.
Then, above sequence will be written as

        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
        spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.
(*) We're safe even if we don't take lock_page_cgroup(pc). Because..
    1. When pc->mem_cgroup can be modified.
       - at charge.
       - at account_move().
    2. at charge
       the PCG_USED bit is not set before pc->mem_cgroup is fixed.
    3. at account_move()
       the page is isolated and not on LRU.

Pros.
  - easy for maintenance.
  - memcg can make use of laziness of pagevec.
  - we don't have to duplicated LRU/Active/Unevictable bit in page_cgroup.
  - LRU status of memcg will be synchronized with global LRU's one.
  - # of locks are reduced.
  - account_move() is simplified very much.
Cons.
  - may increase cost of LRU rotation.
    (no impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-08 08:31:05 -08:00
..
9p Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
adfs vfs: Use const for kernel parser table 2008-10-13 10:10:37 -07:00
affs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-01-05 18:32:06 -08:00
afs fs: symlink write_begin allocation context fix 2009-01-04 13:33:20 -08:00
autofs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
autofs4 autofs4: fix string validation check order 2009-01-06 15:59:23 -08:00
befs befs: ensure fast symlinks are NUL-terminated 2008-12-31 18:07:40 -05:00
bfs bfs: check that filesystem fits on the blockdevice 2009-01-06 15:59:31 -08:00
cifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-01-05 18:32:06 -08:00
coda coda: fix fs/coda/sysctl.c build warnings when !CONFIG_SYSCTL 2009-01-08 08:31:01 -08:00
configfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
cramfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
debugfs debugfs: add helpers for exporting a size_t simple value 2009-01-07 10:00:16 -08:00
devpts zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
dlm Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/dlm 2009-01-05 19:02:09 -08:00
ecryptfs fs/ecryptfs/inode.c: cleanup kerneldoc 2009-01-06 15:59:22 -08:00
efs [PATCH] switch all filesystems over to d_obtain_alias 2008-10-23 05:13:01 -04:00
exportfs Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
ext2 ext2: tighten restrictions on inode flags 2009-01-08 08:31:00 -08:00
ext3 ext3: tighten restrictions on inode flags 2009-01-08 08:31:01 -08:00
ext4 percpu_counter: FBC_BATCH should be a variable 2009-01-06 15:59:13 -08:00
fat Merge git://git.kernel.org/pub/scm/linux/kernel/git/hirofumi/fatfs-2.6 2008-12-30 20:33:34 -08:00
freevxfs freevxfs: ensure fast symlinks are NUL-terminated 2008-12-31 18:07:40 -05:00
fuse Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse 2009-01-06 17:01:20 -08:00
gfs2 GFS2: Fix typo in gfs_page_mkwrite() 2009-01-07 08:58:28 +00:00
hfs CRED: Wrap task credential accesses in the HFS filesystem 2008-11-14 10:38:54 +11:00
hfsplus CRED: Wrap task credential accesses in the HFSplus filesystem 2008-11-14 10:38:54 +11:00
hostfs fs: symlink write_begin allocation context fix 2009-01-04 13:33:20 -08:00
hpfs CRED: Wrap task credential accesses in the HPFS filesystem 2008-11-14 10:38:55 +11:00
hppfs CRED: Use creds in file structs 2008-11-14 10:39:25 +11:00
hugetlbfs hugetlb: unsigned ret cannot be negative 2009-01-06 15:59:08 -08:00
isofs isofs check for NULL ->i_op in root directory is dead code 2009-01-05 11:53:38 -05:00
jbd jbd: remove excess kernel-doc notation 2009-01-08 08:31:01 -08:00
jbd2 jbd2: Add buffer triggers 2009-01-05 08:40:30 -08:00
jffs2 fs: symlink write_begin allocation context fix 2009-01-04 13:33:20 -08:00
jfs fix the treatment of jfs special inodes 2009-01-05 11:54:29 -05:00
lockd NLM: Clean up flow of control in make_socks() function 2009-01-07 15:40:44 -05:00
minix minix: fix add link's wrong position calculation 2009-01-06 15:59:27 -08:00
ncpfs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-01-07 11:31:52 -08:00
nfs fs: symlink write_begin allocation context fix 2009-01-04 13:33:20 -08:00
nfs_common SUNRPC: nfsacl_encode/nfsacl_decode should be exported as GPL-only 2008-12-23 15:21:32 -05:00
nfsd nfsd: last_byte_offset 2009-01-07 17:38:31 -05:00
nls remove CONFIG_KMOD from fs 2008-10-17 02:38:36 +11:00
notify inotify: fix type errors in interfaces 2009-01-05 11:54:29 -05:00
ntfs ntfs: don't NULL i_op 2009-01-05 11:54:27 -05:00
ocfs2 trivial: fix then -> than typos in comments and documentation 2009-01-06 11:28:06 +01:00
omfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
openpromfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
partitions block: struct device - replace bus_id with dev_name(), dev_set_name() 2009-01-06 10:44:43 -08:00
proc Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/adobriyan/proc 2009-01-07 12:01:06 -08:00
qnx4 SL*B: drop kmem cache argument from constructor 2008-07-26 12:00:07 -07:00
ramfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
reiserfs Merge branch 'upstream-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mfasheh/ocfs2 2009-01-05 18:32:43 -08:00
romfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
smbfs fs: symlink write_begin allocation context fix 2009-01-04 13:33:20 -08:00
sysfs zero i_uid/i_gid on inode allocation 2009-01-05 11:54:28 -05:00
sysv sysv: ensure fast symlinks are NUL-terminated 2008-12-31 18:07:39 -05:00
ubifs Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial 2009-01-07 11:31:52 -08:00
udf Merge branch 'master' into next 2008-12-04 17:16:36 +11:00
ufs CRED: Wrap task credential accesses in the UFS filesystem 2008-11-14 10:39:04 +11:00
xfs trivial: fix then -> than typos in comments and documentation 2009-01-06 11:28:06 +01:00
aio.c aio: make the lookup_ioctx() lockless 2008-12-29 08:29:50 +01:00
anon_inodes.c anon_inodes: use fops->owner for module refcount 2008-12-31 16:55:44 +02:00
attr.c CRED: Wrap task credential accesses in the filesystem subsystem 2008-11-14 10:39:05 +11:00
bad_inode.c kill ->dir_notify() 2008-12-31 18:07:43 -05:00
binfmt_aout.c sanitize ifdefs in binfmt_aout 2009-01-03 11:45:54 -08:00
binfmt_elf.c Merge branch 'for-linus' of git://git390.osdl.marist.edu/pub/scm/linux-2.6 2008-12-28 12:33:21 -08:00
binfmt_elf_fdpic.c CRED: Make execve() take advantage of copy-on-write credentials 2008-11-14 10:39:24 +11:00
binfmt_em86.c Allow recursion in binfmt_script and binfmt_misc 2008-10-16 11:21:38 -07:00
binfmt_flat.c CRED: Make execve() take advantage of copy-on-write credentials 2008-11-14 10:39:24 +11:00
binfmt_misc.c fs/binfmt_misc.c: add terminating newline to /proc/sys/fs/binfmt_misc/status 2009-01-06 15:59:19 -08:00
binfmt_script.c Allow recursion in binfmt_script and binfmt_misc 2008-10-16 11:21:38 -07:00
binfmt_som.c CRED: Make execve() take advantage of copy-on-write credentials 2008-11-14 10:39:24 +11:00
bio-integrity.c bio: allow individual slabs in the bio_set 2008-12-29 08:29:23 +01:00
bio.c bio: get rid of bio_vec clearing 2008-12-29 08:29:53 +01:00
block_dev.c fs: fix function param name in kernel-doc 2009-01-06 15:59:14 -08:00
buffer.c block_write_begin(): remove useless goto 2009-01-06 15:59:08 -08:00
char_dev.c fs: fix name overwrite in __register_chrdev_region() 2009-01-06 15:59:13 -08:00
compat.c add missing accounting calls to compat_sys_{readv,writev} 2009-01-06 15:59:13 -08:00
compat_binfmt_elf.c
compat_ioctl.c remove unused #include <linux/dirent.h>'s 2008-07-25 10:53:34 -07:00
dcache.c filp_cachep can be static in fs/file_table.c 2008-12-31 18:07:42 -05:00
dcookies.c shrink struct dentry 2008-12-31 18:07:38 -05:00
direct-io.c fs: truncate blocks outside i_size after O_DIRECT write error 2009-01-06 15:59:06 -08:00
dquot.c quota: don't set grace time when user isn't above softlimit 2009-01-08 08:31:01 -08:00
drop_caches.c
eventfd.c flag parameters: check magic constants 2008-07-24 10:47:29 -07:00
eventpoll.c epoll: introduce resource usage limits 2008-12-01 19:55:24 -08:00
exec.c fs/exec.c: make do_coredump() void 2009-01-06 15:59:29 -08:00
fcntl.c Merge branch 'next' into for-linus 2008-12-25 11:40:09 +11:00
fifo.c [PATCH] introduce fmode_t, do annotations 2008-10-21 07:47:06 -04:00
file.c [PATCH] merge locate_fd() and get_unused_fd() 2008-08-01 11:25:23 -04:00
file_table.c filp_cachep can be static in fs/file_table.c 2008-12-31 18:07:42 -05:00
filesystems.c vfs: remove duplicate code in get_fs_type() 2009-01-05 11:54:29 -05:00
fs-writeback.c fs: sys_sync fix 2009-01-06 15:59:09 -08:00
generic_acl.c
inode.c async: make the final inode deletion an asynchronous event 2009-01-07 08:47:24 -08:00
internal.h CRED: Make execve() take advantage of copy-on-write credentials 2008-11-14 10:39:24 +11:00
ioctl.c GFS2: Support for FIEMAP ioctl 2009-01-05 07:38:46 +00:00
ioprio.c CRED: Use RCU to access another task's creds and to release a task's own creds 2008-11-14 10:39:19 +11:00
Kconfig fs: use menuconfig to control the Misc. filesystems menu 2009-01-06 15:59:12 -08:00
Kconfig.binfmt add CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS 2008-10-20 08:52:39 -07:00
libfs.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-01-05 18:32:06 -08:00
locks.c CRED: Wrap task credential accesses in the filesystem subsystem 2008-11-14 10:39:05 +11:00
Makefile quota: Split off quota tree handling into a separate file 2009-01-05 08:40:21 -08:00
mbcache.c
mpage.c do_mpage_readpage(): remove useless clear_buffer_mapped() call 2009-01-06 15:59:01 -08:00
namei.c Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 2009-01-05 18:32:06 -08:00
namespace.c fs/namespace.c: drop code after return 2008-12-31 18:07:38 -05:00
nfsctl.c pass a struct path * to may_open 2008-12-31 18:07:41 -05:00
no-block.c
open.c inode->i_op is never NULL 2009-01-05 11:54:28 -05:00
pipe.c sanitize audit_fd_pair() 2009-01-04 15:14:41 -05:00
pnode.c
pnode.h
posix_acl.c CRED: Wrap task credential accesses in the filesystem subsystem 2008-11-14 10:39:05 +11:00
quota.c quota: Introduce DQUOT_QUOTA_SYS_FILE flag 2009-01-05 08:36:57 -08:00
quota_tree.c quota: Split off quota tree handling into a separate file 2009-01-05 08:40:21 -08:00
quota_tree.h quota: Split off quota tree handling into a separate file 2009-01-05 08:40:21 -08:00
quota_v1.c quota: Move quotaio_v[12].h from include/linux/ to fs/ 2009-01-05 08:36:58 -08:00
quota_v2.c quota: Convert union in mem_dqinfo to a pointer 2009-01-05 08:40:21 -08:00
quotaio_v1.h quota: Move quotaio_v[12].h from include/linux/ to fs/ 2009-01-05 08:36:58 -08:00
quotaio_v2.h quota: Split off quota tree handling into a separate file 2009-01-05 08:40:21 -08:00
read_write.c vfs: lseek(fd, 0, SEEK_CUR) race condition 2009-01-05 11:53:07 -05:00
read_write.h
readdir.c [PATCH] prepare vfs_readdir() callers to returning filldir result 2008-10-23 05:13:10 -04:00
select.c poll: allow f_op->poll to sleep 2009-01-06 15:59:12 -08:00
seq_file.c Merge branch 'cpus4096-for-linus-3' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip 2009-01-03 12:04:39 -08:00
signalfd.c flag parameters: check magic constants 2008-07-24 10:47:29 -07:00
splice.c memcg: synchronized LRU 2009-01-08 08:31:05 -08:00
stack.c
stat.c inode->i_op is never NULL 2009-01-05 11:54:28 -05:00
super.c async: Don't call async_synchronize_full_special() while holding sb_lock 2009-01-08 08:15:39 -08:00
sync.c mm: do_sync_mapping_range integrity fix 2009-01-06 15:59:00 -08:00
timerfd.c hrtimer: convert timerfd to the new hrtimer apis 2008-09-05 21:35:09 -07:00
utimes.c [PATCH] sanitize __user_walk_fd() et.al. 2008-07-26 20:53:34 -04:00
xattr.c inode->i_op is never NULL 2009-01-05 11:54:28 -05:00
xattr_acl.c