rcu: Update stall-warning documentation

Add documentation of CONFIG_RCU_CPU_STALL_VERBOSE, CONFIG_RCU_CPU_STALL_INFO,
and RCU_STALL_DELAY_DELTA.  Describe multiple stall-warning messages from
a single stall, and the timing of the subsequent messages.  Add headings.
Remove RCU_SECONDS_TILL_STALL_RECHECK because this value is now computed
at runtime from RCU_CPU_STALL_TIMEOUT, so that sysfs changes to the timeout
value now directly affect the RCU_SECONDS_TILL_STALL_RECHECK value.

Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
This commit is contained in:
Paul E. McKenney 2012-01-20 17:35:55 -08:00 committed by Paul E. McKenney
parent c13f3757d0
commit 24cd7fd0ea

View file

@ -12,14 +12,38 @@ CONFIG_RCU_CPU_STALL_TIMEOUT
This kernel configuration parameter defines the period of time
that RCU will wait from the beginning of a grace period until it
issues an RCU CPU stall warning. This time period is normally
ten seconds.
sixty seconds.
RCU_SECONDS_TILL_STALL_RECHECK
This configuration parameter may be changed at runtime via the
/sys/module/rcutree/parameters/rcu_cpu_stall_timeout, however
this parameter is checked only at the beginning of a cycle.
So if you are 30 seconds into a 70-second stall, setting this
sysfs parameter to (say) five will shorten the timeout for the
-next- stall, or the following warning for the current stall
(assuming the stall lasts long enough). It will not affect the
timing of the next warning for the current stall.
This macro defines the period of time that RCU will wait after
issuing a stall warning until it issues another stall warning
for the same stall. This time period is normally set to three
times the check interval plus thirty seconds.
Stall-warning messages may be enabled and disabled completely via
/sys/module/rcutree/parameters/rcu_cpu_stall_suppress.
CONFIG_RCU_CPU_STALL_VERBOSE
This kernel configuration parameter causes the stall warning to
also dump the stacks of any tasks that are blocking the current
RCU-preempt grace period.
RCU_CPU_STALL_INFO
This kernel configuration parameter causes the stall warning to
print out additional per-CPU diagnostic information, including
information on scheduling-clock ticks and RCU's idle-CPU tracking.
RCU_STALL_DELAY_DELTA
Although the lockdep facility is extremely useful, it does add
some overhead. Therefore, under CONFIG_PROVE_RCU, the
RCU_STALL_DELAY_DELTA macro allows five extra seconds before
giving an RCU CPU stall warning message.
RCU_STALL_RAT_DELAY
@ -64,6 +88,54 @@ INFO: rcu_bh_state detected stalls on CPUs/tasks: { } (detected by 4, 2502 jiffi
This is rare, but does happen from time to time in real life.
If the CONFIG_RCU_CPU_STALL_INFO kernel configuration parameter is set,
more information is printed with the stall-warning message, for example:
INFO: rcu_preempt detected stall on CPU
0: (63959 ticks this GP) idle=241/3fffffffffffffff/0
(t=65000 jiffies)
In kernels with CONFIG_RCU_FAST_NO_HZ, even more information is
printed:
INFO: rcu_preempt detected stall on CPU
0: (64628 ticks this GP) idle=dd5/3fffffffffffffff/0 drain=0 . timer=-1
(t=65000 jiffies)
The "(64628 ticks this GP)" indicates that this CPU has taken more
than 64,000 scheduling-clock interrupts during the current stalled
grace period. If the CPU was not yet aware of the current grace
period (for example, if it was offline), then this part of the message
indicates how many grace periods behind the CPU is.
The "idle=" portion of the message prints the dyntick-idle state.
The hex number before the first "/" is the low-order 12 bits of the
dynticks counter, which will have an even-numbered value if the CPU is
in dyntick-idle mode and an odd-numbered value otherwise. The hex
number between the two "/"s is the value of the nesting, which will
be a small positive number if in the idle loop and a very large positive
number (as shown above) otherwise.
For CONFIG_RCU_FAST_NO_HZ kernels, the "drain=0" indicates that the
CPU is not in the process of trying to force itself into dyntick-idle
state, the "." indicates that the CPU has not given up forcing RCU
into dyntick-idle mode (it would be "H" otherwise), and the "timer=-1"
indicates that the CPU has not recented forced RCU into dyntick-idle
mode (it would otherwise indicate the number of microseconds remaining
in this forced state).
Multiple Warnings From One Stall
If a stall lasts long enough, multiple stall-warning messages will be
printed for it. The second and subsequent messages are printed at
longer intervals, so that the time between (say) the first and second
message will be about three times the interval between the beginning
of the stall and the first message.
What Causes RCU CPU Stall Warnings?
So your kernel printed an RCU CPU stall warning. The next question is
"What caused it?" The following problems can result in RCU CPU stall
warnings:
@ -128,4 +200,5 @@ is occurring, which will usually be in the function nearest the top of
that portion of the stack which remains the same from trace to trace.
If you can reliably trigger the stall, ftrace can be quite helpful.
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE.
RCU bugs can often be debugged with the help of CONFIG_RCU_TRACE
and with RCU's event tracing.