linux-hardened/kernel/time
Youquan Song 69a37beabf cpuidle: Quickly notice prediction failure for repeat mode
The prediction for future is difficult and when the cpuidle governor prediction
fails and govenor possibly choose the shallower C-state than it should. How to
quickly notice and find the failure becomes important for power saving.

cpuidle menu governor has a method to predict the repeat pattern if there are 8
C-states residency which are continuous and the same or very close, so it will
predict the next C-states residency will keep same residency time.

There is a real case that turbostat utility (tools/power/x86/turbostat)
at kernel 3.3 or early. turbostat utility will read 10 registers one by one at
Sandybridge, so it will generate 10 IPIs to wake up idle CPUs. So cpuidle menu
 governor will predict it is repeat mode and there is another IPI wake up idle
 CPU soon, so it keeps idle CPU stay at C1 state even though CPU is totally
idle. However, in the turbostat, following 10 registers reading is sleep 5
seconds by default, so the idle CPU will keep at C1 for a long time though it is
 idle until break event occurs.
In a idle Sandybridge system, run "./turbostat -v", we will notice that deep
C-state dangles between "70% ~ 99%". After patched the kernel, we will notice
deep C-state stays at >99.98%.

In the patch, a timer is added when menu governor detects a repeat mode and
choose a shallow C-state. The timer is set to a time out value that greater
than predicted time, and we conclude repeat mode prediction failure if timer is
triggered. When repeat mode happens as expected, the timer is not triggered
and CPU waken up from C-states and it will cancel the timer initiatively.
When repeat mode does not happen, the timer will be time out and menu governor
will quickly notice that the repeat mode prediction fails and then re-evaluates
deeper C-states possibility.

Below is another case which will clearly show the patch much benefit:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/time.h>
#include <time.h>
#include <pthread.h>

volatile int * shutdown;
volatile long * count;
int delay = 20;
int loop = 8;

void usage(void)
{
	fprintf(stderr,
		"Usage: idle_predict [options]\n"
		"  --help	-h  Print this help\n"
		"  --thread	-n  Thread number\n"
		"  --loop     	-l  Loop times in shallow Cstate\n"
		"  --delay	-t  Sleep time (uS)in shallow Cstate\n");
}

void *simple_loop() {
	int idle_num = 1;
	while (!(*shutdown)) {
		*count = *count + 1;

		if (idle_num % loop)
			usleep(delay);
		else {
			/* sleep 1 second */
			usleep(1000000);
			idle_num = 0;
		}
		idle_num++;
	}

}

static void sighand(int sig)
{
	*shutdown = 1;
}

int main(int argc, char *argv[])
{
	sigset_t sigset;
	int signum = SIGALRM;
	int i, c, er = 0, thread_num = 8;
	pthread_t pt[1024];

	static char optstr[] = "n:l:t:h:";

	while ((c = getopt(argc, argv, optstr)) != EOF)
		switch (c) {
			case 'n':
				thread_num = atoi(optarg);
				break;
			case 'l':
				loop = atoi(optarg);
				break;
			case 't':
				delay = atoi(optarg);
				break;
			case 'h':
			default:
				usage();
				exit(1);
		}

	printf("thread=%d,loop=%d,delay=%d\n",thread_num,loop,delay);
	count = malloc(sizeof(long));
	shutdown = malloc(sizeof(int));
	*count = 0;
	*shutdown = 0;

	sigemptyset(&sigset);
	sigaddset(&sigset, signum);
	sigprocmask (SIG_BLOCK, &sigset, NULL);
	signal(SIGINT, sighand);
	signal(SIGTERM, sighand);

	for(i = 0; i < thread_num ; i++)
		pthread_create(&pt[i], NULL, simple_loop, NULL);

	for (i = 0; i < thread_num; i++)
		pthread_join(pt[i], NULL);

	exit(0);
}

Get powertop V2 from git://github.com/fenrus75/powertop, build powertop.
After build the above test application, then run it.
Test plaform can be Intel Sandybridge or other recent platforms.
#./idle_predict -l 10 &
#./powertop

We will find that deep C-state will dangle between 40%~100% and much time spent
on C1 state. It is because menu governor wrongly predict that repeat mode
is kept, so it will choose the C1 shallow C-state even though it has chance to
sleep 1 second in deep C-state.

While after patched the kernel, we find that deep C-state will keep >99.6%.

Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2012-11-15 00:34:19 +01:00
..
alarmtimer.c alarmtimer: Rename alarmtimer_remove to alarmtimer_dequeue 2012-09-24 12:38:03 -04:00
clockevents.c timekeeping: Add suspend and resume of clock event devices 2012-09-04 01:36:01 +02:00
clocksource.c kernel-time: fix s/then/than/ spelling errors 2012-03-23 16:49:21 -07:00
jiffies.c jiffies: Remove compile time assumptions about CLOCK_TICK_RATE 2012-09-24 12:38:05 -04:00
Kconfig time: Introduce new GENERIC_TIME_VSYSCALL 2012-09-24 12:38:08 -04:00
Makefile timers: Introduce in-kernel alarm-timer interface 2011-04-26 14:01:44 -07:00
ntp.c time/jiffies: Rename ACTHZ to SHIFTED_HZ 2012-07-31 17:09:12 +02:00
posix-clock.c kernel: Fix files explicitly needing EXPORT_SYMBOL infrastructure 2011-10-31 19:30:05 -04:00
tick-broadcast.c tick: Fix the spurious broadcast timer ticks after resume 2012-04-19 21:27:50 +02:00
tick-common.c clockevents: Make minimum delay adjustments configurable 2011-09-08 11:10:56 +02:00
tick-internal.h clockevents: Make minimum delay adjustments configurable 2011-09-08 11:10:56 +02:00
tick-oneshot.c clockevents: Make minimum delay adjustments configurable 2011-09-08 11:10:56 +02:00
tick-sched.c cpuidle: Quickly notice prediction failure for repeat mode 2012-11-15 00:34:19 +01:00
timecompare.c time: Use ARRAY_SIZE macro in timecompare.c 2010-10-21 17:30:06 +02:00
timeconv.c time: add function to convert between calendar time and broken-down time for universal use 2009-09-24 07:20:56 -07:00
timekeeping.c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip 2012-10-12 22:17:48 +09:00
timer_list.c nohz: Rename ts->idle_tick to ts->last_tick 2012-06-11 20:07:17 +02:00
timer_stats.c locking, timer_stats: Annotate table_lock as raw 2011-09-13 11:12:00 +02:00