This is an updated version of my previous cpuset patch: "Make rebuild_sched_domains() usable from any context (take 2)" It's been merged with the latest cpuset changes in mainline. rebuild_sched_domains() is the only way to rebuild sched domains correctly based on the current cpuset settings. What this means is that we need to be able to call it from different contexts, like cpu hotplug for example. Also latest scheduler code in -tip now calls rebuild_sched_domains() directly from functions like arch_reinit_sched_domains(). In order to support that properly we need to rework cpuset locking rules to avoid circular depencies, which is what this patch does. New lock nesting rules are explained in the comments. We can now safely call rebuild_sched_domains() from virtually any context. The only requirement is that it needs to be called under get_online_cpus(). This allows cpu hotplug handlers and the scheduler to call rebuild_sched_domains() directly. The rest of the cpuset code now offloads sched domains rebuilds to the single threaded workqueue. As suggested by both Paul J. and Paul M. This version of the patch addresses comments from the previous review. I fixed all miss-formated comments and trailing spaces. __rebuild_sched_domains() has been renamed to async_rebuild_sched_domains(). I also factored out the code that builds domain masks and split up CPU and memory hotplug handling. This was needed to simplify locking, to avoid unsafe access to the cpu_online_map from mem hotplug handler, and in general to make things cleaner. The patch passes moderate testing (building kernel with -j 16, creating & removing domains and bring cpus off/online at the same time) on the quad-core2 based machine. It passes lockdep checks, even with preemptable RCU enabled. This time I also tested in with suspend/resume path and everything is working as expected. Signed-off-by: Max Krasnyansky <maxk@qualcomm.com> Cc: mingo@elte.hu Cc: pj@sgi.com Cc: menage@google.com Cc: ...
The Subject line suggests that this latest version of your patch should
apply on top of 2.6.27-rc1. That does not seem to be quite accurate.
As stated in your previous message, it seems you're working on top of
Linus's latest git version, which, at least for the source file
kernel/cpuset.c, is (I suppose) more or less the same as 2.6.27-rc1
plus the 'origin.patch' that is at the top of Andrew's broken out quilt
patch series 2.6.27-rc1-mm1.
That origin.patch includes Li Zefan's patch:
cpuset: clean up cpuset hierarchy traversal code
which Andrew recently sent along to Linus, and which patch is presumed
by this latest version of your patch.
I'll take a further look at your patch now.
We have lots of trees, including various from Ingo, linux-next, Andrew,
and Linus, each in many versions. It helps others if you can state
exactly what tree/version a patch applies to, in each patch version
that you submit.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
It's on top of the latest mainline I should not have put -rc1 in the subject sorry. Max --
"the latest mainline" ... I guess that means the same
as what I termed "Linus's latest git version" ?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
So far as I can tell, the logic and design is ok.
I found some of the comments, function names and code factoring
to be confusing or incomplete. And I suspect that the rebuilding
of sched domains in the case that sched_power_savings_store()
is called in kernel/sched.c, on systems using cpusets, is not
needed or desirable (I could easily be wrong here - beware!).
I'm attaching a patch that has the changes that I would like
to suggest for your consideration. I have only recompiled this
patch, for one configuration - x86_64 with CPUSETS. I am hoping,
Max, that you can complete the testing.
The patch below applies to 2.6.27-rc1, plus the first patch,
"origin.patch" in Andrew's 2.6.27-rc1-mm1 quilt patch stack,
plus your (Max's) latest:
[PATCH] cpuset: Rework sched domains and CPU hotplug handling (2.6.27-rc1)
Here's a description of most of what I noticed:
1) Unrelated spacing tweak, changing:
LIST_HEAD(q); /* queue of cpusets to be scanned*/
to:
LIST_HEAD(q); /* queue of cpusets to be scanned */
2) The phrasing:
Must be called with cgroup_lock held.
seems imprecise to me; "cgroup_lock" is the name of a wrapper
function, not of a lock. The underlying lock is cgroup_mutex,
which is still mentioned by name in various kernel/cpuset.c
comments, even though cgroup_mutex is static in kernel/cgroup.c
So I fiddled with the wording of this phrasing.
3) You removed the paragraph:
* ... May take callback_mutex during
* call due to the kfifo_alloc() and kmalloc() calls. May nest
* a call to the get_online_cpus()/put_online_cpus() pair.
* Must not be called holding callback_mutex, because we must not
* call get_online_cpus() while holding callback_mutex. Elsewhere
* the kernel nests callback_mutex inside get_online_cpus() calls.
* So the reverse nesting would risk an ABBA deadlock.
But you didn't replace it with an updated locking description.
I expanded and tweaked ...I think it it's covered by the top level comment that starts with
/*
* There are two global mutexes guarding cpuset structures. The first
* is the main control groups cgroup_mutex, accessed via
* cgroup_lock()/cgroup_unlock().
....
Actually it is appropriate, and there is one more user of the
arch_reinit_sched_domains() which is S390 topology updates.
Those things (mc_power and topology updates) have to update domain flags based
on the mc/smt power and current topology settings.
This is done in the
__rebuild_sched_domains()
...
SD_INIT(sd, ALLNODES);
...
SD_INIT(sd, MC);
...
SD_INIT(sd,X) uses one of SD initializers defined in the include/linux/topology.h
For example SD_CPU_INIT() includes BALANCE_FOR_PKG_POWER which expands to
#define BALANCE_FOR_PKG_POWER \
((sched_mc_power_savings || sched_smt_power_savings) ? \
SD_POWERSAVINGS_BALANCE : 0)
Yes it's kind convoluted :). Anyway, the point is that we need to rebuild the
domains when those settings change. We could probably write a simpler version
Actually it was not much of a rename. The only rename was
rebuild_sched_domains() -> async_rebuild_sched_domains() and yes looks like I
missed one or two comment.
The other stuff was basically factoring out the function that generates the
domain masks/attrs from the function that actually tells the scheduler to
rebuild them.
I'd be ok with some other name for generate_sched_domains() if you think it's
confusing.
btw With your current proposal we have
rebuild_sched_domains() - just generates domain masks/attrs
async_rebuild_sched_domains() - generates domain masks/attrs and actually
rebuilds the domains
That I think is more confusing. So I guess my preference would be to have the
generation part separate. And like I explained above we do need to be able to
Yes if we did not have to call rebuild_sched_domains() from
arch_reinit_sched_domains() I would've done ...Hmmm ... ok I suppose.
Could we have the kernel/sched.c code, in this case, call the
kernel/cpuset.c routine async_rebuild_sched_domains(), rather
than the synchronous rebuild_sched_domains() call (in your naming)
which required details of the get_online_cpus() and put_online_cpus()
wrapping to leak into kernel/sched.c:arch_reinit_sched_domains()
routine?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
It could I guess. But the questions is why ? I mean the only reason we've introduced workqueue is because lock nesting got too complicated. Otherwise in all those paths we're already in a process context and there is no need to schedule a workqueue. I see your point about get_online_cpus() thing. But it is similar to partition_sched_domains() which is an external (from the sched pov) interface and must be called within get_online_cpus() ... put_online_cpus(). Max --
My priorities are different than yours.
You look at a code path that, in some cases, is more involved than
necessary, and recommend providing an alternative code path, for
the cases that can get by executing (significantly) fewer CPU cycles.
I look at the number of code paths, lines of code, duplicated code
and number of tests and conditions in the source code, and ask how
to reduce them. I want the least amount of source code, the fewest
alternative paths through that code, the smallest number of logical
tests and code variants, the least amount of code duplication.
The key in this case is that I prefer having just one code path by
which all rebuilds of sched domains are done. Since that rebuild must
be asynchronous for some cases (to avoid ABBA lock nesting problems)
therefore let all sched domains be done by that async path.
The fewer the number of code path variants, the easier it is to
understand the code, and the harder it is to break the code.
Except for frequently executed code paths, which this surely is not,
minimizing software maintenance costs is far more important to me than
minimizing CPU cycles.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Actually I was not really trying to minimize cpu cycles or anything. But it does seem like an overkill to schedule a workqueue just because we do not want to expose the fact that rebuild_sched_domains() depends get_online_cpus(). More importantly though if you think about it I'm actually not introducing any new alternative paths (besides async_rebuild_sched_domains() itself). Code paths in question have been synchronous since day one, and have been tested that way. Now you're saying lets make them asynchronous. Peter already had a concern about async rebuilds from within the cpuset itself. I think those are fine but I definitely do not want to make domain rebuilds in the cpu hotplug path asynchronous. So I'd argue that async_rebuild_domains() is the new path which we have to make async due to lock nesting issues. The other paths (cpu hotplug, topology updates, SMT and MC power settings updates) are already synchronous and have no lock nesting issues and therefore do not need to change (async vs sync). Also IMO it makes sense to keep both CONFIG_CPUSETS and !CONFIG_CPUSETS handling as similar as possible to avoid surprises. With your proposal when cpusets are disabled arch_reinit_sched_domain() will be synchronous, when they are enabled it will asynchronous. If you feel strongly about this we can discuss this some more and try it out. But I really do not see those maintenance saving in this case. We'd actually be changing more things. Max --
"overkill" -- by what metric?
If something is overkill, it means something is excessive.
Excess (and deficiency) occur along some scale, some metric.
If not by CPU cycles, then by what metric is it overkill?
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Paul, I think you're focusing on the wrong part of my reply ;-). You are, of course, correct about the overkill metric. I was saying that saving cycles in that path was not my goal when I wrote the patch. I just added necessary functions to make existing paths work they way they currently do and only changed one path because it had to change due to lock nesting issues. Along the way simply pointed it out that scheduling work unnecessarily does seem less efficient. That was not the important part though :) Max --
I also don't care that much how much code is changed;
if it must be changed, then change it to what is best,
which may not necessarily be the minimum change.
If an asynchronous sched domain rebuild is needed in
some places, then consider just using it everywhere,
rather than providing two code paths to rebuild, one
sync and one async.
Ask not how we got here; ask where we should be.
In particular, and in this specific case, having the
dual code paths really did make it a little more difficult
for me to understand the code, as evidenced by the back
and forth discussion on how to name the confusingly
similar functions. Such naming controversies are usually
a sign of code duplication or improper factoring.
That additional difficulty in understanding this code
was a key factor in delaying my review of your code.
I'd look at it, my mind was glaze over, and I'd put
it aside for a few days. This happened repeatedly.
Fine code is like fine art ... spare, elegant, compelling
in its expression.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Sure. What I'm saying is that I do not think it's the best I still do not see a good reason why. IMO exceptions are acceptable. Only domain rebuilds triggered by cpuset fs writes need to be async. I do not see a good technical reason why the rest needs to be converted I'm not sure what you're referring to. There was no back an forth. You suggested reverting the rename and I pointed out that it was not a rename, I simply factored out the part that generates the To be fair the fact that you had trouble understanding my code does not automatically mean that it was not artistic ;-). Max --
I was referring to an earlier discussion, which resulted in a patch
which includes the line:
But you seemed to be saying that the reason it was not
best to do so was because it was best not to change more
than necessary.
Perhaps I still don't understand your metric. Apparently
Well, your metric seems clear enough there - minimize effort of
code conversion and testing.
How about this ... two routines quite identical and parallel,
even in their names, except that one is async and the other not:
==================================================================
/*
* Rebuild scheduler domains, asynchronously in a separate thread.
*
* If the flag 'sched_load_balance' of any cpuset with non-empty
* 'cpus' changes, or if the 'cpus' allowed changes in any cpuset
* which has that flag enabled, or if any cpuset with a non-empty
* 'cpus' is removed, then call this routine to rebuild the
* scheduler's dynamic sched domains.
*
* The rebuild_sched_domains() and partition_sched_domains()
* routines must nest cgroup_lock() inside get_online_cpus(),
* but such cpuset changes as these must nest that locking the
* other way, holding cgroup_lock() for much of the code.
*
* So in order to avoid an ABBA deadlock, the cpuset code handling
* these user changes delegates the actual sched domain rebuilding
* to a separate workqueue thread, which ends up processing the
* above rebuild_sched_domains_thread() function.
*/
static void async_rebuild_sched_domains(void)
{
queue_work(cpuset_wq, &rebuild_sched_domains_work);
}
/*
* Accomplishes the same scheduler domain rebuild as the above
* async_rebuild_sched_domains(), however it directly calls the
* rebuild routine inline, rather than calling it via a separate
* asynchronous work thread.
*
* This can only be called from code that is not holding
* cgroup_mutex (not nested in a cgroup_lock() call.)
*/
void inline_rebuild_sched_domains(void)
{
...Sure. That looks fine. Although inline_ will probably be a bit confusing since one may think that it has something to do with the C 'inline'. I'd suggest either sync_rebuild_sched_domains() or simply rebuild_sched_domains(). The later has the advantage that the patch will not have to touch the scheduler code. Let me know your preference and I'll respin the patch. Max --
Good point about "inline" confusions.
I am willing to agree to "sync_rebuild_sched_domains()".
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
I'd vote in favour of Max's suggestion of just rebuild_sched_domains(), to localize the changes better. Paul --
I'm outvoted, two to one.
Go for it, Max.
Thank, Paul M.
--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--
Thank you. I'll go ahead and respin the patch. Max --
