
Commit a35b646

Peter Zijlstra authored and KAGA-KOKO committed
sched, cgroup: Reduce rq->lock hold times for large cgroup hierarchies
Peter Portante reported that for large cgroup hierarchies (and or on large CPU counts) we get immense lock contention on rq->lock and stuff stops working properly. His workload was a ton of processes, each in their own cgroup, everybody idling except for a sporadic wakeup once every so often. It was found that: schedule() idle_balance() load_balance() local_irq_save() double_rq_lock() update_h_load() walk_tg_tree(tg_load_down) tg_load_down() Results in an entire cgroup hierarchy walk under rq->lock for every new-idle balance and since new-idle balance isn't throttled this results in a lot of work while holding the rq->lock. This patch does two things, it removes the work from under rq->lock based on the good principle of race and pray which is widely employed in the load-balancer as a whole. And secondly it throttles the update_h_load() calculation to max once per jiffy. I considered excluding update_h_load() for new-idle balance all-together, but purely relying on regular balance passes to update this data might not work out under some rare circumstances where the new-idle busiest isn't the regular busiest for a while (unlikely, but a nightmare to debug if someone hits it and suffers). Cc: [email protected] Cc: Larry Woodman <[email protected]> Cc: Mike Galbraith <[email protected]> Reported-by: Peter Portante <[email protected]> Signed-off-by: Peter Zijlstra <[email protected]> Link: http://lkml.kernel.org/n/[email protected] Signed-off-by: Thomas Gleixner <[email protected]>
1 parent b940313 · commit a35b646

File tree: 2 files changed (+14, -3 lines)


kernel/sched/fair.c

Lines changed: 9 additions & 2 deletions
@@ -3387,6 +3387,14 @@ static int tg_load_down(struct task_group *tg, void *data)
 
 static void update_h_load(long cpu)
 {
+	struct rq *rq = cpu_rq(cpu);
+	unsigned long now = jiffies;
+
+	if (rq->h_load_throttle == now)
+		return;
+
+	rq->h_load_throttle = now;
+
 	rcu_read_lock();
 	walk_tg_tree(tg_load_down, tg_nop, (void *)cpu);
 	rcu_read_unlock();
@@ -4293,11 +4301,10 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	env.src_rq    = busiest;
 	env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
+	update_h_load(env.src_cpu);
 more_balance:
 	local_irq_save(flags);
 	double_rq_lock(this_rq, busiest);
-	if (!env.loop)
-		update_h_load(env.src_cpu);
 
 	/*
 	 * cur_ld_moved - load moved in current iteration

kernel/sched/sched.h

Lines changed: 5 additions & 1 deletion
@@ -374,7 +374,11 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-#endif
+#ifdef CONFIG_SMP
+	unsigned long h_load_throttle;
+#endif /* CONFIG_SMP */
+#endif /* CONFIG_FAIR_GROUP_SCHED */
+
 #ifdef CONFIG_RT_GROUP_SCHED
 	struct list_head leaf_rt_rq_list;
 #endif
