Merge branch 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq

Pull workqueue updates from Tejun Heo:
 "A lot of activities on workqueue side this time.  The changes achieve
  the followings.

   - WQ_UNBOUND workqueues - the workqueues which are not per-cpu - are
     updated to be able to interface with multiple backend worker pools.
     This involved a lot of churning but the end result seems actually
     neater as unbound workqueues are now a lot closer to per-cpu ones.

   - The ability to interface with multiple backend worker pools is
     used to implement unbound workqueues with custom attributes.
     Currently the supported attributes are the nice level and CPU
     affinity.  This may be expanded to include cgroup association in
     the future.  The attributes can be specified either by calling
     apply_workqueue_attrs() or through /sys/bus/workqueue/WQ_NAME/* if
     the workqueue in question is exported through sysfs (a usage
     sketch follows below).

     The backend worker pools are keyed by the actual attributes and
     shared by any workqueues which share the same attributes.  When
     the attributes of a workqueue are changed, the workqueue binds to
     the worker pool with the specified attributes while leaving alone
     the work items already executing in its previous worker pools.

     This allows custom worker pool implementations which want worker
     attribute tuning to be converted to workqueues.  The writeback pool
     has already been converted in the block tree, and a couple of
     others are likely to follow, including the btrfs IO workers.

   - WQ_UNBOUND's ability to bind to multiple worker pools is also used
     to make it NUMA-aware.  Because there's no association between the
     work item issuer and the specific worker assigned to execute it,
     using an unbound workqueue used to lead to unnecessary cross-node
     bouncing; autonuma couldn't help either, as it relies on tasks
     having implicit node affinity while workers are assigned randomly.

     After these changes, an unbound workqueue now binds to multiple
     NUMA-affine worker pools so that queued work items are executed in
     the same node.  This is turned on by default but can be disabled
     system-wide or for individual workqueues.

     Crypto had been asking for NUMA affinity, as encrypting data
     across different nodes adds noticeable overhead, while doing it
     per-cpu was too limiting for certain cases where IO throughput
     could be bottlenecked by one CPU being fully occupied while others
     sat idle.

  While the new features required a lot of changes, including
  restructuring the locking, they didn't complicate the execution paths
  much.  Unbound workqueue handling is now closer to that of per-cpu
  ones, and the new features are implemented by simply associating a
  workqueue with different sets of backend worker pools without
  changing the queueing, execution or flush paths.

  As such, even though the amount of change is very high, I feel
  relatively safe in that it isn't likely to cause subtle issues with
  the basic correctness of work item execution and handling.  If
  something is wrong, it's likely to show up as work items being
  associated with worker pools that have the wrong attributes, or as an
  oops while workqueue attributes are being changed or during CPU
  hotplug.

  While this creates more backend worker pools, it doesn't add too many
  more workers unless, of course, there are many workqueues with unique
  combinations of attributes.  Assuming everything else is the same,
  NUMA awareness costs an extra worker pool per NUMA node with online
  CPUs.

  There are also a couple of things which are being routed outside the
  workqueue tree.

   - The block tree pulled in the workqueue for-3.10 branch so that the
     writeback worker pool can be converted to an unbound workqueue
     with sysfs control exposed.  This simplifies the code, makes
     writeback workers NUMA-aware and allows tuning the nice level and
     CPU affinity via sysfs.

   - The conversion to workqueue means that there's no longer a 1:1
     association between a writeback instance and a specific worker,
     which makes the writeback folks unhappy as they want to be able to
     tell which filesystem caused a problem from a backtrace on systems
     with many filesystems mounted.  This is resolved by allowing work
     items to set a debug info string which is printed when the task is
     dumped.  As this change involves unifying the implementations of
     dump_stack() and friends across architectures, it's being routed
     through Andrew's -mm tree."
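
As a rough illustration of the attribute interface described in the pull
message (a sketch only, not code from this series; the workqueue name, CPU
choices and nice value are made up), a user could create an unbound,
sysfs-visible workqueue and rebind it to custom attributes along these lines:

#include <linux/gfp.h>
#include <linux/cpumask.h>
#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;     /* hypothetical user */

static int example_wq_setup(void)
{
        struct workqueue_attrs *attrs;
        int ret;

        /* WQ_SYSFS exposes the workqueue under /sys/bus/workqueue/ */
        example_wq = alloc_workqueue("example_wq", WQ_UNBOUND | WQ_SYSFS, 0);
        if (!example_wq)
                return -ENOMEM;

        attrs = alloc_workqueue_attrs(GFP_KERNEL);
        if (!attrs) {
                ret = -ENOMEM;
                goto out_destroy;
        }

        attrs->nice = -5;                       /* run workers at nice -5 */
        cpumask_clear(attrs->cpumask);          /* allow only CPUs 0 and 1 */
        cpumask_set_cpu(0, attrs->cpumask);
        cpumask_set_cpu(1, attrs->cpumask);

        /* rebind the workqueue to the backend pool matching these attrs */
        ret = apply_workqueue_attrs(example_wq, attrs);
        free_workqueue_attrs(attrs);
        if (ret)
                goto out_destroy;
        return 0;

out_destroy:
        destroy_workqueue(example_wq);
        return ret;
}

apply_workqueue_attrs() copies the attributes, so the caller keeps ownership
and frees them afterwards; workqueues that end up with equal attributes share
the same backend worker pools.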

* 'for-3.10' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (84 commits)
  workqueue: use kmem_cache_free() instead of kfree()
  workqueue: avoid false negative WARN_ON() in destroy_workqueue()
  workqueue: update sysfs interface to reflect NUMA awareness and a kernel param to disable NUMA affinity
  workqueue: implement NUMA affinity for unbound workqueues
  workqueue: introduce put_pwq_unlocked()
  workqueue: introduce numa_pwq_tbl_install()
  workqueue: use NUMA-aware allocation for pool_workqueues
  workqueue: break init_and_link_pwq() into two functions and introduce alloc_unbound_pwq()
  workqueue: map an unbound workqueues to multiple per-node pool_workqueues
  workqueue: move hot fields of workqueue_struct to the end
  workqueue: make workqueue->name[] fixed len
  workqueue: add workqueue->unbound_attrs
  workqueue: determine NUMA node of workers accourding to the allowed cpumask
  workqueue: drop 'H' from kworker names of unbound worker pools
  workqueue: add wq_numa_tbl_len and wq_numa_possible_cpumask[]
  workqueue: move pwq_pool_locking outside of get/put_unbound_pool()
  workqueue: fix memory leak in apply_workqueue_attrs()
  workqueue: fix unbound workqueue attrs hashing / comparison
  workqueue: fix race condition in unbound workqueue free path
  workqueue: remove pwq_lock which is no longer used
  ...
Linus Torvalds 2013-04-29 19:07:40 -07:00
commit 46d9be3e5e
14 changed files with 2241 additions and 952 deletions

Documentation/kernel-parameters.txt

@ -3260,6 +3260,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
or other driver-specific files in the
Documentation/watchdog/ directory.
workqueue.disable_numa
By default, all work items queued to unbound
workqueues are affine to the NUMA nodes they're
issued on, which results in better behavior in
general. If NUMA affinity needs to be disabled for
whatever reason, this option can be used. Note
that this also can be controlled per-workqueue for
workqueues visible under /sys/bus/workqueue/.
x2apic_phys [X86-64,APIC] Use x2apic physical mode instead of
default x2apic cluster mode on platforms
supporting x2apic.
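
For reference only (not part of the patch): disabling NUMA affinity
system-wide amounts to booting with

        workqueue.disable_numa

on the kernel command line.  The per-workqueue control mentioned above is
provided by the sysfs commits in this series; the attribute is expected to
live under /sys/bus/workqueue/ for workqueues created with WQ_SYSFS (the
exact attribute name is not shown in this hunk, so treat any specific path
as an assumption).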

drivers/base/base.h

@ -101,6 +101,8 @@ static inline int hypervisor_init(void) { return 0; }
extern int platform_bus_init(void);
extern void cpu_dev_init(void);
struct kobject *virtual_device_parent(struct device *dev);
extern int bus_add_device(struct device *dev);
extern void bus_probe_device(struct device *dev);
extern void bus_remove_device(struct device *dev);

drivers/base/bus.c

@ -1205,6 +1205,49 @@ static void system_root_device_release(struct device *dev)
{
kfree(dev);
}
static int subsys_register(struct bus_type *subsys,
const struct attribute_group **groups,
struct kobject *parent_of_root)
{
struct device *dev;
int err;
err = bus_register(subsys);
if (err < 0)
return err;
dev = kzalloc(sizeof(struct device), GFP_KERNEL);
if (!dev) {
err = -ENOMEM;
goto err_dev;
}
err = dev_set_name(dev, "%s", subsys->name);
if (err < 0)
goto err_name;
dev->kobj.parent = parent_of_root;
dev->groups = groups;
dev->release = system_root_device_release;
err = device_register(dev);
if (err < 0)
goto err_dev_reg;
subsys->dev_root = dev;
return 0;
err_dev_reg:
put_device(dev);
dev = NULL;
err_name:
kfree(dev);
err_dev:
bus_unregister(subsys);
return err;
}
/**
* subsys_system_register - register a subsystem at /sys/devices/system/
* @subsys: system subsystem
@ -1226,45 +1269,33 @@ static void system_root_device_release(struct device *dev)
int subsys_system_register(struct bus_type *subsys,
const struct attribute_group **groups)
{
struct device *dev;
int err;
err = bus_register(subsys);
if (err < 0)
return err;
dev = kzalloc(sizeof(struct device), GFP_KERNEL);
if (!dev) {
err = -ENOMEM;
goto err_dev;
}
err = dev_set_name(dev, "%s", subsys->name);
if (err < 0)
goto err_name;
dev->kobj.parent = &system_kset->kobj;
dev->groups = groups;
dev->release = system_root_device_release;
err = device_register(dev);
if (err < 0)
goto err_dev_reg;
subsys->dev_root = dev;
return 0;
err_dev_reg:
put_device(dev);
dev = NULL;
err_name:
kfree(dev);
err_dev:
bus_unregister(subsys);
return err;
return subsys_register(subsys, groups, &system_kset->kobj);
}
EXPORT_SYMBOL_GPL(subsys_system_register);
/**
* subsys_virtual_register - register a subsystem at /sys/devices/virtual/
* @subsys: virtual subsystem
* @groups: default attributes for the root device
*
* All 'virtual' subsystems have a /sys/devices/system/<name> root device
* with the name of the subystem. The root device can carry subsystem-wide
* attributes. All registered devices are below this single root device.
* There's no restriction on device naming. This is for kernel software
* constructs which need sysfs interface.
*/
int subsys_virtual_register(struct bus_type *subsys,
const struct attribute_group **groups)
{
struct kobject *virtual_dir;
virtual_dir = virtual_device_parent(NULL);
if (!virtual_dir)
return -ENOMEM;
return subsys_register(subsys, groups, virtual_dir);
}
int __init buses_init(void)
{
bus_kset = kset_create_and_add("bus", &bus_uevent_ops, NULL);
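
A minimal sketch of how a kernel-internal subsystem might use the new call
(the "example" bus is made up; the workqueue sysfs interface in this series
is the intended first user):

#include <linux/device.h>

static struct bus_type example_subsys = {
        .name           = "example",
        .dev_name       = "example",
};

static int __init example_subsys_init(void)
{
        /* root device appears under /sys/devices/virtual/example */
        return subsys_virtual_register(&example_subsys, NULL);
}

Unlike subsys_system_register(), the root device is parented to the
/sys/devices/virtual/ directory rather than /sys/devices/system/.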

drivers/base/core.c

@ -703,7 +703,7 @@ void device_initialize(struct device *dev)
set_dev_node(dev, -1);
}
static struct kobject *virtual_device_parent(struct device *dev)
struct kobject *virtual_device_parent(struct device *dev)
{
static struct kobject *virtual_dir = NULL;

include/linux/cpumask.h

@ -590,6 +590,21 @@ static inline int cpulist_scnprintf(char *buf, int len,
nr_cpumask_bits);
}
/**
* cpumask_parse - extract a cpumask from from a string
* @buf: the buffer to extract from
* @dstp: the cpumask to set.
*
* Returns -errno, or 0 for success.
*/
static inline int cpumask_parse(const char *buf, struct cpumask *dstp)
{
char *nl = strchr(buf, '\n');
int len = nl ? nl - buf : strlen(buf);
return bitmap_parse(buf, len, cpumask_bits(dstp), nr_cpumask_bits);
}
/**
* cpulist_parse - extract a cpumask from a user string of ranges
* @buf: the buffer to extract from
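
A quick illustration of the new helper (a sketch, not code from this patch;
the device attribute and function names are hypothetical): a sysfs ->store()
method can hand its buffer straight to cpumask_parse(), since the helper
stops at a trailing newline.

#include <linux/gfp.h>
#include <linux/device.h>
#include <linux/cpumask.h>
#include <linux/printk.h>

static ssize_t example_cpus_store(struct device *dev,
                                  struct device_attribute *attr,
                                  const char *buf, size_t count)
{
        cpumask_var_t mask;
        int ret;

        if (!zalloc_cpumask_var(&mask, GFP_KERNEL))
                return -ENOMEM;

        ret = cpumask_parse(buf, mask);   /* 0 on success, -errno on error */
        if (!ret)
                pr_info("example: %u CPU(s) requested\n", cpumask_weight(mask));

        free_cpumask_var(mask);
        return ret ? ret : count;
}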

include/linux/device.h

@ -297,6 +297,8 @@ void subsys_interface_unregister(struct subsys_interface *sif);
int subsys_system_register(struct bus_type *subsys,
const struct attribute_group **groups);
int subsys_virtual_register(struct bus_type *subsys,
const struct attribute_group **groups);
/**
* struct class - device classes

include/linux/sched.h

@ -1793,7 +1793,7 @@ extern void thread_group_cputime_adjusted(struct task_struct *p, cputime_t *ut,
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
#define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */

include/linux/workqueue.h

@ -11,6 +11,7 @@
#include <linux/lockdep.h>
#include <linux/threads.h>
#include <linux/atomic.h>
#include <linux/cpumask.h>
struct workqueue_struct;
@ -68,7 +69,7 @@ enum {
WORK_STRUCT_COLOR_BITS,
/* data contains off-queue information when !WORK_STRUCT_PWQ */
WORK_OFFQ_FLAG_BASE = WORK_STRUCT_FLAG_BITS,
WORK_OFFQ_FLAG_BASE = WORK_STRUCT_COLOR_SHIFT,
WORK_OFFQ_CANCELING = (1 << WORK_OFFQ_FLAG_BASE),
@ -115,6 +116,20 @@ struct delayed_work {
int cpu;
};
/*
* A struct for workqueue attributes. This can be used to change
* attributes of an unbound workqueue.
*
* Unlike other fields, ->no_numa isn't a property of a worker_pool. It
* only modifies how apply_workqueue_attrs() select pools and thus doesn't
* participate in pool hash calculations or equality comparisons.
*/
struct workqueue_attrs {
int nice; /* nice level */
cpumask_var_t cpumask; /* allowed CPUs */
bool no_numa; /* disable NUMA affinity */
};
static inline struct delayed_work *to_delayed_work(struct work_struct *work)
{
return container_of(work, struct delayed_work, work);
@ -283,9 +298,10 @@ enum {
WQ_MEM_RECLAIM = 1 << 3, /* may be used for memory reclaim */
WQ_HIGHPRI = 1 << 4, /* high priority */
WQ_CPU_INTENSIVE = 1 << 5, /* cpu instensive workqueue */
WQ_SYSFS = 1 << 6, /* visible in sysfs, see wq_sysfs_register() */
WQ_DRAINING = 1 << 6, /* internal: workqueue is draining */
WQ_RESCUER = 1 << 7, /* internal: workqueue has rescuer */
__WQ_DRAINING = 1 << 16, /* internal: workqueue is draining */
__WQ_ORDERED = 1 << 17, /* internal: workqueue is ordered */
WQ_MAX_ACTIVE = 512, /* I like 512, better ideas? */
WQ_MAX_UNBOUND_PER_CPU = 4, /* 4 * #cpus for unbound wq */
@ -388,7 +404,7 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
* Pointer to the allocated workqueue on success, %NULL on failure.
*/
#define alloc_ordered_workqueue(fmt, flags, args...) \
alloc_workqueue(fmt, WQ_UNBOUND | (flags), 1, ##args)
alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
#define create_workqueue(name) \
alloc_workqueue((name), WQ_MEM_RECLAIM, 1)
@ -399,30 +415,23 @@ __alloc_workqueue_key(const char *fmt, unsigned int flags, int max_active,
extern void destroy_workqueue(struct workqueue_struct *wq);
struct workqueue_attrs *alloc_workqueue_attrs(gfp_t gfp_mask);
void free_workqueue_attrs(struct workqueue_attrs *attrs);
int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs);
extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work);
extern bool queue_work(struct workqueue_struct *wq, struct work_struct *work);
extern bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
extern bool queue_delayed_work(struct workqueue_struct *wq,
struct delayed_work *work, unsigned long delay);
extern bool mod_delayed_work_on(int cpu, struct workqueue_struct *wq,
struct delayed_work *dwork, unsigned long delay);
extern bool mod_delayed_work(struct workqueue_struct *wq,
struct delayed_work *dwork, unsigned long delay);
extern void flush_workqueue(struct workqueue_struct *wq);
extern void drain_workqueue(struct workqueue_struct *wq);
extern void flush_scheduled_work(void);
extern bool schedule_work_on(int cpu, struct work_struct *work);
extern bool schedule_work(struct work_struct *work);
extern bool schedule_delayed_work_on(int cpu, struct delayed_work *work,
unsigned long delay);
extern bool schedule_delayed_work(struct delayed_work *work,
unsigned long delay);
extern int schedule_on_each_cpu(work_func_t func);
extern int keventd_up(void);
int execute_in_process_context(work_func_t fn, struct execute_work *);
@ -435,9 +444,121 @@ extern bool cancel_delayed_work_sync(struct delayed_work *dwork);
extern void workqueue_set_max_active(struct workqueue_struct *wq,
int max_active);
extern bool workqueue_congested(unsigned int cpu, struct workqueue_struct *wq);
extern bool current_is_workqueue_rescuer(void);
extern bool workqueue_congested(int cpu, struct workqueue_struct *wq);
extern unsigned int work_busy(struct work_struct *work);
/**
* queue_work - queue work on a workqueue
* @wq: workqueue to use
* @work: work to queue
*
* Returns %false if @work was already on a queue, %true otherwise.
*
* We queue the work to the CPU on which it was submitted, but if the CPU dies
* it can be processed by another CPU.
*/
static inline bool queue_work(struct workqueue_struct *wq,
struct work_struct *work)
{
return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}
/**
* queue_delayed_work - queue work on a workqueue after delay
* @wq: workqueue to use
* @dwork: delayable work to queue
* @delay: number of jiffies to wait before queueing
*
* Equivalent to queue_delayed_work_on() but tries to use the local CPU.
*/
static inline bool queue_delayed_work(struct workqueue_struct *wq,
struct delayed_work *dwork,
unsigned long delay)
{
return queue_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
}
/**
* mod_delayed_work - modify delay of or queue a delayed work
* @wq: workqueue to use
* @dwork: work to queue
* @delay: number of jiffies to wait before queueing
*
* mod_delayed_work_on() on local CPU.
*/
static inline bool mod_delayed_work(struct workqueue_struct *wq,
struct delayed_work *dwork,
unsigned long delay)
{
return mod_delayed_work_on(WORK_CPU_UNBOUND, wq, dwork, delay);
}
/**
* schedule_work_on - put work task on a specific cpu
* @cpu: cpu to put the work task on
* @work: job to be done
*
* This puts a job on a specific cpu
*/
static inline bool schedule_work_on(int cpu, struct work_struct *work)
{
return queue_work_on(cpu, system_wq, work);
}
/**
* schedule_work - put work task in global workqueue
* @work: job to be done
*
* Returns %false if @work was already on the kernel-global workqueue and
* %true otherwise.
*
* This puts a job in the kernel-global workqueue if it was not already
* queued and leaves it in the same position on the kernel-global
* workqueue otherwise.
*/
static inline bool schedule_work(struct work_struct *work)
{
return queue_work(system_wq, work);
}
/**
* schedule_delayed_work_on - queue work in global workqueue on CPU after delay
* @cpu: cpu to use
* @dwork: job to be done
* @delay: number of jiffies to wait
*
* After waiting for a given time this puts a job in the kernel-global
* workqueue on the specified CPU.
*/
static inline bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
unsigned long delay)
{
return queue_delayed_work_on(cpu, system_wq, dwork, delay);
}
/**
* schedule_delayed_work - put work task in global workqueue after delay
* @dwork: job to be done
* @delay: number of jiffies to wait or 0 for immediate execution
*
* After waiting for a given time this puts a job in the kernel-global
* workqueue.
*/
static inline bool schedule_delayed_work(struct delayed_work *dwork,
unsigned long delay)
{
return queue_delayed_work(system_wq, dwork, delay);
}
/**
* keventd_up - is workqueue initialized yet?
*/
static inline bool keventd_up(void)
{
return system_wq != NULL;
}
/*
* Like above, but uses del_timer() instead of del_timer_sync(). This means,
* if it returns 0 the timer function may be running and the queueing is in
@ -466,12 +587,12 @@ static inline bool __deprecated flush_delayed_work_sync(struct delayed_work *dwo
}
#ifndef CONFIG_SMP
static inline long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg)
static inline long work_on_cpu(int cpu, long (*fn)(void *), void *arg)
{
return fn(arg);
}
#else
long work_on_cpu(unsigned int cpu, long (*fn)(void *), void *arg);
long work_on_cpu(int cpu, long (*fn)(void *), void *arg);
#endif /* CONFIG_SMP */
#ifdef CONFIG_FREEZER
@ -480,4 +601,11 @@ extern bool freeze_workqueues_busy(void);
extern void thaw_workqueues(void);
#endif /* CONFIG_FREEZER */
#ifdef CONFIG_SYSFS
int workqueue_sysfs_register(struct workqueue_struct *wq);
#else /* CONFIG_SYSFS */
static inline int workqueue_sysfs_register(struct workqueue_struct *wq)
{ return 0; }
#endif /* CONFIG_SYSFS */
#endif
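
For a workqueue that is created without WQ_SYSFS, the newly declared
workqueue_sysfs_register() can be used to expose it instead.  A minimal
sketch (assuming a hypothetical "example_wq"; not code from this series):

#include <linux/workqueue.h>

static struct workqueue_struct *example_wq;

static int example_expose(void)
{
        example_wq = alloc_workqueue("example_wq", WQ_UNBOUND, 0);
        if (!example_wq)
                return -ENOMEM;

        /* make the workqueue visible under /sys/bus/workqueue/ */
        return workqueue_sysfs_register(example_wq);
}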

kernel/cgroup.c

@ -2224,11 +2224,11 @@ retry_find_task:
tsk = tsk->group_leader;
/*
* Workqueue threads may acquire PF_THREAD_BOUND and become
* Workqueue threads may acquire PF_NO_SETAFFINITY and become
* trapped in a cpuset, or RT worker may be born in a cgroup
* with no rt_runtime allocated. Just say no.
*/
if (tsk == kthreadd_task || (tsk->flags & PF_THREAD_BOUND)) {
if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
ret = -EINVAL;
rcu_read_unlock();
goto out_unlock_cgroup;

kernel/cpuset.c

@ -1388,16 +1388,16 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
cgroup_taskset_for_each(task, cgrp, tset) {
/*
* Kthreads bound to specific cpus cannot be moved to a new
* cpuset; we cannot change their cpu affinity and
* isolating such threads by their set of allowed nodes is
* unnecessary. Thus, cpusets are not applicable for such
* threads. This prevents checking for success of
* set_cpus_allowed_ptr() on all attached tasks before
* cpus_allowed may be changed.
* Kthreads which disallow setaffinity shouldn't be moved
* to a new cpuset; we don't want to change their cpu
* affinity and isolating such threads by their set of
* allowed nodes is unnecessary. Thus, cpusets are not
* applicable for such threads. This prevents checking for
* success of set_cpus_allowed_ptr() on all attached tasks
* before cpus_allowed may be changed.
*/
ret = -EINVAL;
if (task->flags & PF_THREAD_BOUND)
if (task->flags & PF_NO_SETAFFINITY)
goto out_unlock;
ret = security_task_setscheduler(task);
if (ret)

kernel/kthread.c

@ -278,7 +278,7 @@ static void __kthread_bind(struct task_struct *p, unsigned int cpu, long state)
}
/* It's safe because the task is inactive. */
do_set_cpus_allowed(p, cpumask_of(cpu));
p->flags |= PF_THREAD_BOUND;
p->flags |= PF_NO_SETAFFINITY;
}
/**

kernel/sched/core.c

@ -4083,6 +4083,10 @@ long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
get_task_struct(p);
rcu_read_unlock();
if (p->flags & PF_NO_SETAFFINITY) {
retval = -EINVAL;
goto out_put_task;
}
if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
retval = -ENOMEM;
goto out_put_task;
@ -4730,11 +4734,6 @@ int set_cpus_allowed_ptr(struct task_struct *p, const struct cpumask *new_mask)
goto out;
}
if (unlikely((p->flags & PF_THREAD_BOUND) && p != current)) {
ret = -EINVAL;
goto out;
}
do_set_cpus_allowed(p, new_mask);
/* Can the task run on the task's current CPU? If so, we're done */

kernel/workqueue.c (diff suppressed because it is too large)

kernel/workqueue_internal.h

@ -32,14 +32,12 @@ struct worker {
struct list_head scheduled; /* L: scheduled works */
struct task_struct *task; /* I: worker task */
struct worker_pool *pool; /* I: the associated pool */
/* L: for rescuers */
/* 64 bytes boundary on 64bit, 32 on 32bit */
unsigned long last_active; /* L: last active timestamp */
unsigned int flags; /* X: flags */
int id; /* I: worker id */
/* for rebinding worker to CPU */
struct work_struct rebind_work; /* L: for busy worker */
/* used only by rescuers to point to the target workqueue */
struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */
};
@ -58,8 +56,7 @@ static inline struct worker *current_wq_worker(void)
* Scheduler hooks for concurrency managed workqueue. Only to be used from
* sched.c and workqueue.c.
*/
void wq_worker_waking_up(struct task_struct *task, unsigned int cpu);
struct task_struct *wq_worker_sleeping(struct task_struct *task,
unsigned int cpu);
void wq_worker_waking_up(struct task_struct *task, int cpu);
struct task_struct *wq_worker_sleeping(struct task_struct *task, int cpu);
#endif /* _KERNEL_WORKQUEUE_INTERNAL_H */