cgroups and css_sets are mapped M:N and this M:N mapping is
represented by struct cg_cgroup_link which forms linked lists on both
sides. The naming around this mapping is already confusing and struct
cg_cgroup_link exacerbates the situation quite a bit.
>From cgroup side, it starts off ->css_sets and runs through
->cgrp_link_list. From css_set side, it starts off ->cg_links and
runs through ->cg_link_list. This is rather reversed as
cgrp_link_list is used to iterate css_sets and cg_link_list cgroups.
Also, this is the only place which is still using the confusing "cg"
for css_sets. This patch cleans it up a bit.
* s/cgroup->css_sets/cgroup->cset_links/
s/css_set->cg_links/css_set->cgrp_links/
s/cgroup_iter->cg_link/cgroup_iter->cset_link/
* s/cg_cgroup_link/cgrp_cset_link/
* s/cgrp_cset_link->cg/cgrp_cset_link->cset/
s/cgrp_cset_link->cgrp_link_list/cgrp_cset_link->cset_link/
s/cgrp_cset_link->cg_link_list/cgrp_cset_link->cgrp_link/
* s/init_css_set_link/init_cgrp_cset_link/
s/free_cg_links/free_cgrp_cset_links/
s/allocate_cg_links/allocate_cgrp_cset_links/
* s/cgl[12]/link[12]/ in compare_css_sets()
* s/saved_link/tmp_link/ s/tmp/tmp_links/ and a couple similar
adustments.
* Comment and whiteline adjustments.
After the changes, we have
list_for_each_entry(link, &cont->cset_links, cset_link) {
struct css_set *cset = link->cset;
instead of
list_for_each_entry(link, &cont->css_sets, cgrp_link_list) {
struct css_set *cset = link->cg;
This patch is purely cosmetic.
v2: Fix broken sentences in the patch description.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup.c uses @cg for most struct css_set variables, which in itself
could be a bit confusing, but made much worse by the fact that there
are places which use @cg for struct cgroup variables.
compare_css_sets() epitomizes this confusion - @[old_]cg are struct
css_set while @cg[12] are struct cgroup.
It's not like the whole deal with cgroup, css_set and cg_cgroup_link
isn't already confusing enough. Let's give it some sanity by
uniformly using @cset for all struct css_set variables.
* s/cg/cset/ for all css_set variables.
* s/oldcg/old_cset/ s/oldcgrp/old_cgrp/. The same for the ones
prefixed with "new".
* s/cg/cgrp/ for cgroup variables in compare_css_sets().
* s/css/cset/ for the cgroup variable in task_cgroup_from_root().
* Whiteline adjustments.
This patch is purely cosmetic.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
* Rename it from files[] (really?) to cgroup_base_files[].
* Drop CGROUP_FILE_GENERIC_PREFIX which was defined as "cgroup." and
used inconsistently. Just use "cgroup." directly.
* Collect insane files at the end. Note that only the insane ones are
missing "cgroup." prefix.
This patch doesn't introduce any functional changes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
The empty cgroup notification mechanism currently implemented in
cgroup is tragically outdated. Forking and execing userland process
stopped being a viable notification mechanism more than a decade ago.
We're gonna have a saner mechanism. Let's make it clear that this
abomination is going away.
Mark "notify_on_release" and "release_agent" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Some resources controlled by cgroup aren't per-task and cgroup core
allowing threads of a single thread_group to be in different cgroups
forced memcg do explicitly find the group leader and use it. This is
gonna be nasty when transitioning to unified hierarchy and in general
we don't want and won't support granularity finer than processes.
Mark "tasks" with CFTYPE_INSANE.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.cz>
Cc: Balbir Singh <bsingharora@gmail.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: cgroups@vger.kernel.org
Cc: Vivek Goyal <vgoyal@redhat.com>
This patch converts cgroup_for_each_child(),
cgroup_next_descendant_pre/post() and thus
cgroup_for_each_descendant_pre/post() to use cgroup_next_sibling()
instead of manually dereferencing ->sibling.next.
The only reason the iterators couldn't allow dropping RCU read lock
while iteration is in progress was because they couldn't determine the
next sibling safely once RCU read lock is dropped. Using
cgroup_next_sibling() removes that problem and enables all iterators
to allow dropping RCU read lock in the middle. Comments are updated
accordingly.
This makes the iterators easier to use and will simplify controllers.
Note that @cgroup argument is renamed to @cgrp in
cgroup_for_each_child() because it conflicts with "struct cgroup" used
in the new macro body.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Currently, there's no easy way to find out the next sibling cgroup
unless it's known that the current cgroup is accessed from the
parent's children list in a single RCU critical section. This in turn
forces all iterators to require whole iteration to be enclosed in a
single RCU critical section, which sometimes is too restrictive. This
patch implements cgroup_next_sibling() which can reliably determine
the next sibling regardless of the state of the current cgroup as long
as it's accessible.
It currently is impossible to determine the next sibling after
dropping RCU read lock because the cgroup being iterated could be
removed anytime and if RCU read lock is dropped, nothing guarantess
its ->sibling.next pointer is accessible. A removed cgroup would
continue to point to its next sibling for RCU accesses but stop
receiving updates from the sibling. IOW, the next sibling could be
removed and then complete its grace period while RCU read lock is
dropped, making it unsafe to dereference ->sibling.next after dropping
and re-acquiring RCU read lock.
This can be solved by adding a way to traverse to the next sibling
without dereferencing ->sibling.next. This patch adds a monotonically
increasing cgroup serial number, cgroup->serial_nr, which guarantees
that all cgroup->children lists are kept in increasing serial_nr
order. A new function, cgroup_next_sibling(), is implemented, which,
if CGRP_REMOVED is not set on the current cgroup, follows
->sibling.next; otherwise, traverses the parent's ->children list
until it sees a sibling with higher ->serial_nr.
This allows the function to always return the next sibling regardless
of the state of the current cgroup without adding overhead in the fast
path.
Further patches will update the iterators to use cgroup_next_sibling()
so that they allow dropping RCU read lock and blocking while iteration
is in progress which in turn will be used to simplify controllers.
v2: Typo fix as per Serge.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
cgroup_is_removed() no longer has external users and it shouldn't grow
any - controllers should deal with cgroup_subsys_state on/offline
state instead of cgroup removal state. Make it static.
While at it, make it return bool.
Signed-off-by: Tejun Heo <tj@kernel.org>
Merging to receive 7805d000db ("cgroup: fix a subtle bug in descendant
pre-order walk") so that further iterator updates can build upon it.
Signed-off-by: Tejun Heo <tj@kernel.org>
When cgroup_next_descendant_pre() initiates a walk, it checks whether
the subtree root doesn't have any children and if not returns NULL.
Later code assumes that the subtree isn't empty. This is broken
because the subtree may become empty inbetween, which can lead to the
traversal escaping the subtree by walking to the sibling of the
subtree root.
There's no reason to have the early exit path. Remove it along with
the later assumption that the subtree isn't empty. This simplifies
the code a bit and fixes the subtle bug.
While at it, fix the comment of cgroup_for_each_descendant_pre() which
was incorrectly referring to ->css_offline() instead of
->css_online().
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Cc: stable@vger.kernel.org
kdbus folks want a sane way to determine the cgroup path that a given
task belongs to on a given hierarchy, which is a reasonble thing to
expect from cgroup core.
Implement task_cgroup_path_from_hierarchy().
v2: Dropped unnecessary NULL check on the return value of
task_cgroup_from_root() as suggested by Li Zefan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Greg Kroah-Hartman <greg@kroah.com>
Acked-by: Li Zefan <lizefan@huawei.com>
Cc: Kay Sievers <kay@vrfy.org>
Cc: Lennart Poettering <lennart@poettering.net>
Cc: Daniel Mack <daniel@zonque.org>
We want to be able to lookup a hierarchy from its id and cyclic
allocation is a whole lot simpler with idr. Convert to idr and use
idr_alloc_cyclc().
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
Now that hierarchy_id alloc / free are protected by the cgroup
mutexes, there's no need for this separate lock. Drop it.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
We're planning to converting hierarchy_ida to an idr and use it to
look up hierarchy from its id. As we want the mapping to happen
atomically with cgroupfs_root registration, this patch refactors
hierarchy_id init / exit so that ida operations happen inside
cgroup_[root_]mutex.
* s/init_root_id()/cgroup_init_root_id()/ and make it return 0 or
-errno like a normal function.
* Move hierarchy_id initialization from cgroup_root_from_opts() into
cgroup_mount() block where the root is confirmed to be used and
being registered while holding both mutexes.
* Split cgroup_drop_id() into cgroup_exit_root_id() and
cgroup_free_root(), so that ID release can happen before dropping
the mutexes in cgroup_kill_sb(). The latter expects hierarchy_id to
be exited before being invoked.
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Li Zefan <lizefan@huawei.com>
cgroup_create_file() calls d_instantiate(), which may decide to look
at the xattrs on the file. Smack always does this and SELinux can be
configured to do so.
But cgroup_add_file() didn't initialize xattrs before calling
cgroup_create_file(), which finally leads to dereferencing NULL
dentry->d_fsdata.
This bug has been there since cgroup xattr was introduced.
Cc: <stable@vger.kernel.org> # 3.8.x
Reported-by: Ivan Bulatovic <combuster@archlinux.us>
Reported-by: Casey Schaufler <casey@schaufler-ca.com>
Signed-off-by: Li Zefan <lizefan@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in his
patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring buffer
but not tracing.
I based this off of the last patch you took from me that fixed the merge
conflict error, as that was the commit that had all the changes I needed
for this set of changes.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQEcBAABAgAGBQJRjYnJAAoJEOdOSU1xswtMg9EH/iFs438FgrNMk2ZdQftmqcqA
cqcactHo1mmoHjAoLZT/oDBjEThhVUuqzMXrFRutSYcTh4PsQEC3arX0mpsC+T12
UEEV/tZS3TXH+GXEyrOit/O3kzntQcDHwJDV4+0n80IrJmw4IDZbnV3R8DWjS6wp
so+dq0A1pwehcG/upgpw1oTKsGv1G/p6vyf968B6W44icHEClLiph4JE2kzE6D3r
fzSpOLaQoBEvwIRf6xRKxi240VqIItXwfG7pwNpPpSC37gRLzm74zGr+Sj93/k1y
pARbZ/5XO7/pcVYQYupErRAoV5in+QMZ67k5G1vQIvyOS9r039catbQf/7PkzcI=
=EZCE
-----END PGP SIGNATURE-----
Merge tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing/kprobes update from Steven Rostedt:
"The majority of these changes are from Masami Hiramatsu bringing
kprobes up to par with the latest changes to ftrace (multi buffering
and the new function probes).
He also discovered and fixed some bugs in doing so. When pulling in
his patches, I also found a few minor bugs as well and fixed them.
This also includes a compile fix for some archs that select the ring
buffer but not tracing.
I based this off of the last patch you took from me that fixed the
merge conflict error, as that was the commit that had all the changes
I needed for this set of changes."
* tag 'trace-fixes-v3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing/kprobes: Support soft-mode disabling
tracing/kprobes: Support ftrace_event_file base multibuffer
tracing/kprobes: Pass trace_probe directly from dispatcher
tracing/kprobes: Increment probe hit-count even if it is used by perf
tracing/kprobes: Use bool for retprobe checker
ftrace: Fix function probe when more than one probe is added
ftrace: Fix the output of enabled_functions debug file
ftrace: Fix locking in register_ftrace_function_probe()
tracing: Add helper function trace_create_new_event() to remove duplicate code
tracing: Modify soft-mode only if there's no other referrer
tracing: Indicate enabled soft-mode in enable file
tracing/kprobes: Fix to increment return event probe hit-count
ftrace: Cleanup regex_lock and ftrace_lock around hash updating
ftrace, kprobes: Fix a deadlock on ftrace_regex_lock
ftrace: Have ftrace_regex_write() return either read or error
tracing: Return error if register_ftrace_function_probe() fails for event_enable_func()
tracing: Don't succeed if event_enable_func did not register anything
ring-buffer: Select IRQ_WORK
Pull audit changes from Eric Paris:
"Al used to send pull requests every couple of years but he told me to
just start pushing them to you directly.
Our touching outside of core audit code is pretty straight forward. A
couple of interface changes which hit net/. A simple argument bug
calling audit functions in namei.c and the removal of some assembly
branch prediction code on ppc"
* git://git.infradead.org/users/eparis/audit: (31 commits)
audit: fix message spacing printing auid
Revert "audit: move kaudit thread start from auditd registration to kaudit init"
audit: vfs: fix audit_inode call in O_CREAT case of do_last
audit: Make testing for a valid loginuid explicit.
audit: fix event coverage of AUDIT_ANOM_LINK
audit: use spin_lock in audit_receive_msg to process tty logging
audit: do not needlessly take a lock in tty_audit_exit
audit: do not needlessly take a spinlock in copy_signal
audit: add an option to control logging of passwords with pam_tty_audit
audit: use spin_lock_irqsave/restore in audit tty code
helper for some session id stuff
audit: use a consistent audit helper to log lsm information
audit: push loginuid and sessionid processing down
audit: stop pushing loginid, uid, sessionid as arguments
audit: remove the old depricated kernel interface
audit: make validity checking generic
audit: allow checking the type of audit message in the user filter
audit: fix build break when AUDIT_DEBUG == 2
audit: remove duplicate export of audit_enabled
Audit: do not print error when LSMs disabled
...
Pull stray syscall bits from Al Viro:
"Several syscall-related commits that were missing from the original"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal:
switch compat_sys_sysctl to COMPAT_SYSCALL_DEFINE
unicore32: just use mmap_pgoff()...
unify compat fanotify_mark(2), switch to COMPAT_SYSCALL_DEFINE
x86, vm86: fix VM86 syscalls: use SYSCALL_DEFINEx(...)
- Two krealloc() abuse fixes
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.13 (GNU/Linux)
iEYEABECAAYFAlGLydsACgkQdwG7hYl686O3ZwCfYIJUdE/VADo53j3SwDbCEa15
pMgAnA13t6dKgC74Xqk5dd65KKaw64WB
=nbL9
-----END PGP SIGNATURE-----
Merge tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6
Pull misc fixes from David Woodhouse:
"This is some miscellaneous cleanups that don't really belong anywhere
else (or were ignored), that have been sitting in linux-next for some
time. Two of them are fixes resulting from my audit of krealloc()
usage that don't seem to have elicited any response when I posted
them, and the other three are patches from Artem removing dead code."
* tag 'for-linus-20130509' of git://git.infradead.org/~dwmw2/random-2.6:
pcmcia: remove RPX board stuff
m68k: remove rpxlite stuff
pcmcia: remove Motorola MBX860 support
params: Fix potential memory leak in add_sysfs_param()
dell-laptop: Fix krealloc() misuse in parse_da_table()
Support soft-mode disabling on kprobe-based dynamic events.
Soft-disabling is just ignoring recording if the soft disabled
flag is set.
Link: http://lkml.kernel.org/r/20130509054454.30398.7237.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Increment probe hit-count for profiling even if it is used
by perf tool. Same thing has already done in trace_uprobe.
Link: http://lkml.kernel.org/r/20130509054436.30398.21133.stgit@mhiramat-M0-7522
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The enabled_functions debugfs file was created to be able to see
what functions have been modified from nops to calling a tracer.
The current method uses the counter in the function record.
As when a ftrace_ops is registered to a function, its count
increases. But that doesn't mean that the function is actively
being traced. /proc/sys/kernel/ftrace_enabled can be set to zero
which would disable it, as well as something can go wrong and
we can think its enabled when only the counter is set.
The record's FTRACE_FL_ENABLED flag is set or cleared when its
function is modified. That is a much more accurate way of knowing
what function is enabled or not.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
The iteration of the ftrace function list and the call to
ftrace_match_record() need to be protected by the ftrace_lock.
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Both __trace_add_new_event() and __trace_early_add_new_event() do
basically the same thing, except that __trace_add_new_event() does
a little more.
Instead of having duplicate code between the two functions, add
a helper function trace_create_new_event() that both can use.
This will help against having bugs fixed in one function but not
the other.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Modify soft-mode flag only if no other soft-mode referrer
(currently only the ftrace triggers) by using a reference
counter in each ftrace_event_file.
Without this fix, adding and removing several different
enable/disable_event triggers on the same event clear
soft-mode bit from the ftrace_event_file. This also
happens with a typo of glob on setting triggers.
e.g.
# echo vfs_symlink:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0*
# echo typo_func:enable_event:net:netif_rx > set_ftrace_filter
# cat events/net/netif_rx/enable
0
# cat set_ftrace_filter
#### all functions enabled ####
vfs_symlink:enable_event:net:netif_rx:unlimited
As above, we still have a trigger, but soft-mode is gone.
Link: http://lkml.kernel.org/r/20130509054429.30398.7464.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: David Sharp <dhsharp@google.com>
Cc: Hiraku Toyooka <hiraku.toyooka.gu@hitachi.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Indicate enabled soft-mode event as "1*" in "enable" file
for each event, because it can be soft-disabled when disable_event
trigger is hit.
Link: http://lkml.kernel.org/r/20130509054426.30398.28202.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cleanup regex_lock and ftrace_lock locking points around
ftrace_ops hash update code.
The new rule is that regex_lock protects ops->*_hash
read-update-write code for each ftrace_ops. Usually,
hash update is done by following sequence.
1. allocate a new local hash and copy the original hash.
2. update the local hash.
3. move(actually, copy) back the local hash to ftrace_ops.
4. update ftrace entries if needed.
5. release the local hash.
This makes regex_lock protect #1-#4, and ftrace_lock
to protect #3, #4 and adding and removing ftrace_ops from the
ftrace_ops_list. The ftrace_lock protects #3 as well because
the move functions update the entries too.
Link: http://lkml.kernel.org/r/20130509054421.30398.83411.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Fix a deadlock on ftrace_regex_lock which happens when setting
an enable_event trigger on dynamic kprobe event as below.
----
sh-2.05b# echo p vfs_symlink > kprobe_events
sh-2.05b# echo vfs_symlink:enable_event:kprobes:p_vfs_symlink_0 > set_ftrace_filter
=============================================
[ INFO: possible recursive locking detected ]
3.9.0+ #35 Not tainted
---------------------------------------------
sh/72 is trying to acquire lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810ba6c1>] ftrace_set_hash+0x81/0x1f0
but task is already holding lock:
(ftrace_regex_lock){+.+.+.}, at: [<ffffffff810b7cbd>] ftrace_regex_write.isra.29.part.30+0x3d/0x220
other info that might help us debug this:
Possible unsafe locking scenario:
CPU0
----
lock(ftrace_regex_lock);
lock(ftrace_regex_lock);
*** DEADLOCK ***
----
To fix that, this introduces a finer regex_lock for each ftrace_ops.
ftrace_regex_lock is too big of a lock which protects all
filter/notrace_hash operations, but it doesn't need to be a global
lock after supporting multiple ftrace_ops because each ftrace_ops
has its own filter/notrace_hash.
Link: http://lkml.kernel.org/r/20130509054417.30398.84254.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Added initialization flag and automate mutex initialization for
non ftrace.c ftrace_probes. ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
As ftrace_regex_write() reads the result of ftrace_process_regex()
which can sometimes return a positive number, only consider a
failure if the return is negative. Otherwise, it will skip possible
other registered probes and by returning a positive number that
wasn't read, it will confuse the user processes doing the writing.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
register_ftrace_function_probe() returns the number of functions
it registered, which can be zero, it can also return a negative number
if something went wrong. But event_enable_func() only checks for
the case that it didn't register anything, it needs to also check
for the case that something went wrong and return that error code
as well.
Added some comments about the code as well, to make it more
understandable.
Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Return 0 instead of the number of activated ftrace function probes if
event_enable_func succeeded and return an error code if it failed or
did not register any functions. But it currently returns the number
of registered functions and if it didn't register anything, it returns 0,
but that is considered success.
This also fixes the return value. As if it succeeds, it returns the
number of functions that were enabled, which is returned back to
the user in ftrace_regex_write (the write() return code). If only
one function is enabled, then the return code of the write is one,
and this can confuse the user program in thinking it only wrote 1
byte.
Link: http://lkml.kernel.org/r/20130509054413.30398.55650.stgit@mhiramat-M0-7522
Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Tom Zanussi <tom.zanussi@intel.com>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
[ Rewrote change log to reflect that this fixes two bugs - SR ]
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Pull block driver updates from Jens Axboe:
"It might look big in volume, but when categorized, not a lot of
drivers are touched. The pull request contains:
- mtip32xx fixes from Micron.
- A slew of drbd updates, this time in a nicer series.
- bcache, a flash/ssd caching framework from Kent.
- Fixes for cciss"
* 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
bcache: Use bd_link_disk_holder()
bcache: Allocator cleanup/fixes
cciss: bug fix to prevent cciss from loading in kdump crash kernel
cciss: add cciss_allow_hpsa module parameter
drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
mtip32xx: Workaround for unaligned writes
bcache: Make sure blocksize isn't smaller than device blocksize
bcache: Fix merge_bvec_fn usage for when it modifies the bvm
bcache: Correctly check against BIO_MAX_PAGES
bcache: Hack around stuff that clones up to bi_max_vecs
bcache: Set ra_pages based on backing device's ra_pages
bcache: Take data offset from the bdev superblock.
mtip32xx: mtip32xx: Disable TRIM support
mtip32xx: fix a smatch warning
bcache: Disable broken btree fuzz tester
bcache: Fix a format string overflow
bcache: Fix a minor memory leak on device teardown
bcache: Documentation updates
bcache: Use WARN_ONCE() instead of __WARN()
bcache: Add missing #include <linux/prefetch.h>
...
Pull block core updates from Jens Axboe:
- Major bit is Kents prep work for immutable bio vecs.
- Stable candidate fix for a scheduling-while-atomic in the queue
bypass operation.
- Fix for the hang on exceeded rq->datalen 32-bit unsigned when merging
discard bios.
- Tejuns changes to convert the writeback thread pool to the generic
workqueue mechanism.
- Runtime PM framework, SCSI patches exists on top of these in James'
tree.
- A few random fixes.
* 'for-3.10/core' of git://git.kernel.dk/linux-block: (40 commits)
relay: move remove_buf_file inside relay_close_buf
partitions/efi.c: replace useless kzalloc's by kmalloc's
fs/block_dev.c: fix iov_shorten() criteria in blkdev_aio_read()
block: fix max discard sectors limit
blkcg: fix "scheduling while atomic" in blk_queue_bypass_start
Documentation: cfq-iosched: update documentation help for cfq tunables
writeback: expose the bdi_wq workqueue
writeback: replace custom worker pool implementation with unbound workqueue
writeback: remove unused bdi_pending_list
aoe: Fix unitialized var usage
bio-integrity: Add explicit field for owner of bip_buf
block: Add an explicit bio flag for bios that own their bvec
block: Add bio_alloc_pages()
block: Convert some code to bio_for_each_segment_all()
block: Add bio_for_each_segment_all()
bounce: Refactor __blk_queue_bounce to not use bi_io_vec
raid1: use bio_copy_data()
pktcdvd: Use bio_reset() in disabled code to kill bi_idx usage
pktcdvd: use bio_copy_data()
block: Add bio_copy_data()
...
The helper function didn't include a leading space, so it was jammed
against the previous text in the audit record.
Signed-off-by: Eric Paris <eparis@redhat.com>
Merge more incoming from Andrew Morton:
- Various fixes which were stalled or which I picked up recently
- A large rotorooting of the AIO code. Allegedly to improve
performance but I don't really have good performance numbers (I might
have lost the email) and I can't raise Kent today. I held this out
of 3.9 and we could give it another cycle if it's all too late/scary.
I ended up taking only the first two thirds of the AIO rotorooting. I
left the percpu parts and the batch completion for later. - Linus
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits)
aio: don't include aio.h in sched.h
aio: kill ki_retry
aio: kill ki_key
aio: give shared kioctx fields their own cachelines
aio: kill struct aio_ring_info
aio: kill batch allocation
aio: change reqs_active to include unreaped completions
aio: use cancellation list lazily
aio: use flush_dcache_page()
aio: make aio_read_evt() more efficient, convert to hrtimers
wait: add wait_event_hrtimeout()
aio: refcounting cleanup
aio: make aio_put_req() lockless
aio: do fget() after aio_get_req()
aio: dprintk() -> pr_debug()
aio: move private stuff out of aio.h
aio: add kiocb_cancel()
aio: kill return value of aio_complete()
char: add aio_{read,write} to /dev/{null,zero}
aio: remove retry-based AIO
...
This reverts commit 6ff5e45985.
Conflicts:
kernel/audit.c
This patch was starting a kthread for all the time. Since the follow on
patches that required it didn't get finished in 3.10 time, we shouldn't
ship this change in 3.10.
Signed-off-by: Eric Paris <eparis@redhat.com>
audit rule additions containing "-F auid!=4294967295" were failing
with EINVAL because of a regression caused by e1760bd.
Apparently some userland audit rule sets want to know if loginuid uid
has been set and are using a test for auid != 4294967295 to determine
that.
In practice that is a horrible way to ask if a value has been set,
because it relies on subtle implementation details and will break
every time the uid implementation in the kernel changes.
So add a clean way to test if the audit loginuid has been set, and
silently convert the old idiom to the cleaner and more comprehensible
new idiom.
Cc: <stable@vger.kernel.org> # 3.7
Reported-By: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Tested-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Eric Paris <eparis@redhat.com>
Some interrupt controllers refuse to map interrupts marked as
"protected" by firwmare. Since we try to map everyting in the
device-tree on some platforms, we end up with a lot of nasty
WARN's in the boot log for what is a normal situation on those
machines.
This defines a specific return code (-EPERM) from the host map()
callback which cause irqdomain to fail silently.
MPIC is updated to return this when hitting a protected source
printing only a single line message for diagnostic purposes.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Pull 'full dynticks' support from Ingo Molnar:
"This tree from Frederic Weisbecker adds a new, (exciting! :-) core
kernel feature to the timer and scheduler subsystems: 'full dynticks',
or CONFIG_NO_HZ_FULL=y.
This feature extends the nohz variable-size timer tick feature from
idle to busy CPUs (running at most one task) as well, potentially
reducing the number of timer interrupts significantly.
This feature got motivated by real-time folks and the -rt tree, but
the general utility and motivation of full-dynticks runs wider than
that:
- HPC workloads get faster: CPUs running a single task should be able
to utilize a maximum amount of CPU power. A periodic timer tick at
HZ=1000 can cause a constant overhead of up to 1.0%. This feature
removes that overhead - and speeds up the system by 0.5%-1.0% on
typical distro configs even on modern systems.
- Real-time workload latency reduction: CPUs running critical tasks
should experience as little jitter as possible. The last remaining
source of kernel-related jitter was the periodic timer tick.
- A single task executing on a CPU is a pretty common situation,
especially with an increasing number of cores/CPUs, so this feature
helps desktop and mobile workloads as well.
The cost of the feature is mainly related to increased timer
reprogramming overhead when a CPU switches its tick period, and thus
slightly longer to-idle and from-idle latency.
Configuration-wise a third mode of operation is added to the existing
two NOHZ kconfig modes:
- CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
as a config option. This is the traditional Linux periodic tick
design: there's a HZ tick going on all the time, regardless of
whether a CPU is idle or not.
- CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
periodic tick when a CPU enters idle mode.
- CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
tick when a CPU is idle, also slows the tick down to 1 Hz (one
timer interrupt per second) when only a single task is running on a
CPU.
The .config behavior is compatible: existing !CONFIG_NO_HZ and
CONFIG_NO_HZ=y settings get translated to the new values, without the
user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
default.
This feature is based on a lot of infrastructure work that has been
steadily going upstream in the last 2-3 cycles: related RCU support
and non-periodic cputime support in particular is upstream already.
This tree adds the final pieces and activates the feature. The pull
request is marked RFC because:
- it's marked 64-bit only at the moment - the 32-bit support patch is
small but did not get ready in time.
- it has a number of fresh commits that came in after the merge
window. The overwhelming majority of commits are from before the
merge window, but still some aspects of the tree are fresh and so I
marked it RFC.
- it's a pretty wide-reaching feature with lots of effects - and
while the components have been in testing for some time, the full
combination is still not very widely used. That it's default-off
should reduce its regression abilities and obviously there are no
known regressions with CONFIG_NO_HZ_FULL=y enabled either.
- the feature is not completely idempotent: there is no 100%
equivalent replacement for a periodic scheduler/timer tick. In
particular there's ongoing work to map out and reduce its effects
on scheduler load-balancing and statistics. This should not impact
correctness though, there are no known regressions related to this
feature at this point.
- it's a pretty ambitious feature that with time will likely be
enabled by most Linux distros, and we'd like you to make input on
its design/implementation, if you dislike some aspect we missed.
Without flaming us to crisp! :-)
Future plans:
- there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
the periodic tick altogether when there's a single busy task on a
CPU. We'd first like 1 Hz to be exposed more widely before we go
for the 0 Hz target though.
- once we reach 0 Hz we can remove the periodic tick assumption from
nr_running>=2 as well, by essentially interrupting busy tasks only
as frequently as the sched_latency constraints require us to do -
once every 4-40 msecs, depending on nr_running.
I am personally leaning towards biting the bullet and doing this in
v3.10, like the -rt tree this effort has been going on for too long -
but the final word is up to you as usual.
More technical details can be found in Documentation/timers/NO_HZ.txt"
* 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
sched: Keep at least 1 tick per second for active dynticks tasks
rcu: Fix full dynticks' dependency on wide RCU nocb mode
nohz: Protect smp_processor_id() in tick_nohz_task_switch()
nohz_full: Add documentation.
cputime_nsecs: use math64.h for nsec resolution conversion helpers
nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
nohz: Reduce overhead under high-freq idling patterns
nohz: Remove full dynticks' superfluous dependency on RCU tree
nohz: Fix unavailable tick_stop tracepoint in dynticks idle
nohz: Add basic tracing
nohz: Select wide RCU nocb for full dynticks
nohz: Disable the tick when irq resume in full dynticks CPU
nohz: Re-evaluate the tick for the new task after a context switch
nohz: Prepare to stop the tick on irq exit
nohz: Implement full dynticks kick
nohz: Re-evaluate the tick from the scheduler IPI
sched: New helper to prevent from stopping the tick in full dynticks
sched: Kick full dynticks CPU that have more than one task enqueued.
perf: New helper to prevent full dynticks CPUs from stopping tick
perf: Kick full dynticks CPU if events rotation is needed
...
Pull perf fixes from Ingo Molnar:
"Misc fixes plus a small hw-enablement patch for Intel IB model 58
uncore events"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/intel/lbr: Demand proper privileges for PERF_SAMPLE_BRANCH_KERNEL
perf/x86/intel/lbr: Fix LBR filter
perf/x86: Blacklist all MEM_*_RETIRED events for Ivy Bridge
perf: Fix vmalloc ring buffer pages handling
perf/x86/intel: Fix unintended variable name reuse
perf/x86/intel: Add support for IvyBridge model 58 Uncore
perf/x86/intel: Fix typo in perf_event_intel_uncore.c
x86: Eliminate irq_mis_count counted in arch_irq_stat
fix a remove/insert race which Never Happens, and (my favorite) handle the
case when we have too many modules for a single commandline. Seriously,
the kernel is full, please go away!
Cheers,
Rusty.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.12 (GNU/Linux)
iQIcBAABAgAGBQJRgbGEAAoJENkgDmzRrbjx2+QP/jXs93K/sXw3rL0vBklwCFv6
IPZmqYZiGjrzqlB4coWkgYRwW1oOsREfAjF5MmfPdykS3fO5kXfdxN4FBdfKp+IZ
RdsycdGDuSxWomgYsivrrxLBDxDAX1VuBOjr6mu5Uuk/pCjFa61cfJDiErsu0jKz
2EMTc98A+E71XamJdvbtal5MUIu9yeluJWG2ux2+VbCul4MSpMc//0n2nrws/RCB
AoC96AT/Xf4U10a8zT8RfCJ29M5Vvx/KfTIcFiZvtCQxEaHNNmj831gDNiw/3jFI
ndRph+VLHBsMoBMxfzNRrM+evqkq8+AGEGRj3ycQy5Pa6DunPyzMafWOVGBGnmaS
tl9hATGx1438048i5tUn8ieAYG1YL1HM83hQovpCThfUKQMiq186iDt1SYYmlq3g
0thj3znQqZDYhboPtgWzOMUdqOG/iBIKjhGQjjHZs+MInFgxL2hmax0gBNkvEtQb
oLyfGbF6UjS7I/Md/HohnUQ4xr9kYa3MQeqPjKbRwgHRkdXhzTEZtI+MYDJBxOnW
QGVQ97aJ2WA7vC7sz/1VhTcZqmU5zfrSc8lF+Ea+H8dQGHHbz8HxKQacEvKcMrXl
OJyEkRUWDA0MTjeIHzn2fff9Q6/qqA1QejRiFofGJrpxopcJS84/7yA0repxvuMG
yaMPsLq53UW37/AXYsho
=MPiD
-----END PGP SIGNATURE-----
Merge tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
Pull mudule updates from Rusty Russell:
"We get rid of the general module prefix confusion with a binary config
option, fix a remove/insert race which Never Happens, and (my
favorite) handle the case when we have too many modules for a single
commandline. Seriously, the kernel is full, please go away!"
* tag 'modules-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux:
modpost: fix unwanted VMLINUX_SYMBOL_STR expansion
X.509: Support parse long form of length octets in Authority Key Identifier
module: don't unlink the module until we've removed all exposure.
kernel: kallsyms: memory override issue, need check destination buffer length
MODSIGN: do not send garbage to stderr when enabling modules signature
modpost: handle huge numbers of modules.
modpost: add -T option to read module names from file/stdin.
modpost: minor cleanup.
genksyms: pass symbol-prefix instead of arch
module: fix symbol versioning with symbol prefixes
CONFIG_SYMBOL_PREFIX: cleanup.