Re: [BUG 2.6.25-rc3] scheduler/hotplug: some processes are deadlocked when cpu is set to offline

From: Suresh Siddha
Date: Friday, March 7, 2008 - 4:01 pm

On Fri, Mar 07, 2008 at 10:36:25PM +0100, Rafael J. Wysocki wrote:

No. It's not an issue with __migrate_task(). The appended patch fixes my issue.
Recent RT wakeup balance code changes exposed a bug in the migration_call() code
path.

Andrew, please check if the appended patch fixes your power-off problem as well.

thanks,
suresh
---

Handle the `CPU_DOWN_PREPARE_FROZEN' case in migration_call().

Without this, we don't clear the cpu going down in the root domain's
"online" mask. This was causing RT tasks to be woken up on already
dead cpus, causing system hangs during standby, shutdown, etc.
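
To make the missed case concrete, here is a minimal userspace sketch of the
switch in question. It is only a model: the MODEL_* constants and both helper
functions are hypothetical names made up for illustration, the numeric values
are assumptions (check include/linux/notifier.h for the real ones), and a plain
unsigned long stands in for rd->online.

```c
#include <assert.h>

/* Assumed values for this sketch only; the real ones live in notifier.h. */
#define MODEL_CPU_DOWN_PREPARE        0x0005UL
#define MODEL_CPU_TASKS_FROZEN        0x0010UL
#define MODEL_CPU_DOWN_PREPARE_FROZEN \
	(MODEL_CPU_DOWN_PREPARE | MODEL_CPU_TASKS_FROZEN)

/*
 * Before the patch: only the non-frozen action is handled, so the frozen
 * variant falls through to default and the cpu's bit stays set.
 */
static unsigned long down_prepare_buggy(unsigned long action,
					unsigned long rd_online, int cpu)
{
	assert(cpu >= 0 && cpu < 8 * (int)sizeof(unsigned long));

	switch (action) {
	case MODEL_CPU_DOWN_PREPARE:
		rd_online &= ~(1UL << cpu);	/* clear the cpu going down */
		break;
	default:
		break;
	}
	return rd_online;
}

/* After the patch: both variants share the same case body. */
static unsigned long down_prepare_fixed(unsigned long action,
					unsigned long rd_online, int cpu)
{
	assert(cpu >= 0 && cpu < 8 * (int)sizeof(unsigned long));

	switch (action) {
	case MODEL_CPU_DOWN_PREPARE:
	case MODEL_CPU_DOWN_PREPARE_FROZEN:
		rd_online &= ~(1UL << cpu);	/* clear the cpu going down */
		break;
	default:
		break;
	}
	return rd_online;
}
```

With four cpus online (mask 0xF) and cpu 3 going down, the buggy model leaves
the mask at 0xF for the frozen action, while the fixed model returns 0x7 for
both actions.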

For example, on my system, this is the failing sequence:

kthread_stop() // coming from the cpu_callback's
    wake_up_process()
        sched_class->select_task_rq(); // select_task_rq_rt
            find_lowest_rq
                find_lowest_cpus
                    cpus_and(*lowest_mask, task_rq(task)->rd->online, task->cpus_allowed);

In my case task->cpus_allowed is set to cpu_possible_map and, because of
this bug, rd->online still shows the dead cpu. As a result,
find_lowest_rq() returns an offlined cpu, so the RT task gets woken
up on a DEAD cpu, causing various hangs.
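
The effect of the stale mask on the cpus_and() step can be sketched in
userspace as follows. This is a simplified model, not the kernel code:
cpumask_model_t, mask_of(), lowest_mask_model() and mask_has_cpu() are
hypothetical names, one bit per cpu replaces cpumask_t, and the priority-based
choice made by find_lowest_rq() is left out entirely.

```c
#include <assert.h>

/* Hypothetical stand-in for cpumask_t: one bit per cpu. */
typedef unsigned long cpumask_model_t;

#define MODEL_NR_CPUS 4

/* Mask with only the given cpu set. */
static cpumask_model_t mask_of(int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	return 1UL << cpu;
}

/* Models cpus_and(*lowest_mask, rd->online, task->cpus_allowed). */
static cpumask_model_t lowest_mask_model(cpumask_model_t rd_online,
					 cpumask_model_t cpus_allowed)
{
	return rd_online & cpus_allowed;
}

/* Nonzero if the cpu is a candidate in the given mask. */
static int mask_has_cpu(cpumask_model_t mask, int cpu)
{
	return (mask & mask_of(cpu)) != 0;
}
```

If cpus_allowed covers all four possible cpus (0xF) and cpu 3 has gone down
without being cleared from the online mask, lowest_mask_model(0xF, 0xF) still
contains cpu 3. Once the bit is cleared (online mask 0x7), cpu 3 can no longer
be a candidate.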

This issue doesn't happen with normal tasks because select_task_rq_fair()
chooses between only two cpus (the cpu which is waking up the task or the last
run cpu (task_cpu())), and the kernel hotplug code makes sure that both of
these always represent online cpus.

Why didn't it show up in 2.6.24? Because the new wakeup code uses a
complex CPU selection logic in select_task_rq_rt(), which exposed this
bug in migration_call().

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
---

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..60550d8 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5882,6 +5882,7 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 
 	case CPU_DOWN_PREPARE:
+	case CPU_DOWN_PREPARE_FROZEN:
 		/* Update our root-domain */
 		rq = cpu_rq(cpu);
 		spin_lock_irqsave(&rq->lock, flags);
--