Re: init's children list is long and slows reaping children.

From: Robin Holt
Date: Thursday, April 5, 2007 - 12:51 pm

We have been testing a new larger configuration and we are seeing a very
large scan time of init's tsk->children list.  In the cases we are seeing,
there are numerous kernel processes created for each cpu (ie: events/0
... events/<big number>, xfslogd/0 ... xfslogd/<big number>).  These are
all on the list ahead of the processes we are currently trying to reap.

wait_task_zombie() is taking many seconds to get through the list.
For the case of a modprobe, stop_machine creates one thread per cpu
(remember big number). All are parented to init and their exit will
cause wait_task_zombie to scan multiple times most of the way through
this very long list looking for threads which need to be reaped.  As
a reference point, when we tried to mount the xfs root filesystem,
we ran out of pid space and had to recompile a kernel with a larger
default max pids.

For testing, Jack Steiner created the following patch.  All it does
is move tasks which are transitioning to the zombie state from where
they are in the children list to the head of the list.  In this way,
they will be the first found and reaping does speed up.  We will still
do a full scan of the list once the rearranged tasks are all removed.
This does not seem to be a significant problem.

This does, however, modify the order of reaping of children.  Is there a
guarantee of the order for reaping children which needs to be preserved
or can this simple patch be used to speed up the reaping?  If this
simple patch is not acceptable, are there any preferred methods for
linking together the tasks that have been zombied so they can be reaped
more quickly?  Maybe add a zombie list_head to the task_struct and chain
them together in the children list order?

In comparison, without this patch, following modprobe on that particular
machine init is still reaping zombied tasks more than 30 seconds
following command completion.  With this patch, all the zombied tasks
are removed within the first couple seconds.

Any suggestions would be ...
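
The move-to-head idea described in the message above can be illustrated with a small user-space sketch. This is not Jack Steiner's patch (which is not reproduced here); `struct task`, `task_becomes_zombie()`, `reap_first_zombie()` and `demo_steps()` are hypothetical names, and the list is a simplified imitation of the kernel's `list_head`. It only shows why a reaper that scans from the head finds a just-exited child faster once that child has been moved to the front.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Circular doubly linked list node, modelled loosely on the kernel's list_head. */
struct list_node {
    struct list_node *prev, *next;
};

/* Hypothetical stand-in for task_struct with only the fields this sketch needs. */
struct task {
    int pid;
    int zombie;
    struct list_node sibling;
};

#define task_of(node) \
    ((struct task *)((char *)(node) - offsetof(struct task, sibling)))

static void list_unlink(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Insert n directly after pos; pos == head puts n at the front of the list. */
static void list_insert_after(struct list_node *n, struct list_node *pos)
{
    n->next = pos->next;
    n->prev = pos;
    pos->next->prev = n;
    pos->next = n;
}

/* A child is turning into a zombie: mark it and move it to the list head. */
static void task_becomes_zombie(struct task *t, struct list_node *children)
{
    t->zombie = 1;
    list_unlink(&t->sibling);
    list_insert_after(&t->sibling, children);
}

/* Unlink and return the first zombie; *steps counts the entries examined. */
static struct task *reap_first_zombie(struct list_node *children, int *steps)
{
    struct list_node *n;

    *steps = 0;
    for (n = children->next; n != children; n = n->next) {
        ++*steps;
        if (task_of(n)->zombie) {
            list_unlink(n);
            return task_of(n);
        }
    }
    return NULL;
}

/*
 * Build nchildren children, let the youngest one exit, and return how many
 * list entries the reaper had to examine before finding it.
 */
int demo_steps(int nchildren, int move_to_head)
{
    struct list_node children = { &children, &children };
    struct task *tasks;
    int i, steps = 0;

    assert(nchildren > 0);
    tasks = calloc((size_t)nchildren, sizeof(*tasks));
    if (tasks == NULL)
        return -1;

    for (i = 0; i < nchildren; i++) {
        tasks[i].pid = i + 1;
        list_insert_after(&tasks[i].sibling, children.prev);  /* append */
    }

    if (move_to_head)
        task_becomes_zombie(&tasks[nchildren - 1], &children);
    else
        tasks[nchildren - 1].zombie = 1;   /* stays at the tail */

    reap_first_zombie(&children, &steps);
    free(tasks);
    return steps;
}
```

Calling `demo_steps(n, 0)` makes the reaper walk past every live sibling first, while `demo_steps(n, 1)` finds the zombie on the first entry. A scan that finds no zombie still visits the whole list, which is the remaining cost the message mentions.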
From: Linus Torvalds
Date: Thursday, April 5, 2007 - 1:57 pm

I'd almost prefer to just put the zombie children on a separate list. I 
wonder how painful that would be..

That would still make it expensive for people who use WUNTRACED to get 
stopped children (since they'd have to look at all lists), but maybe 
that's not a big deal.

Another thing we could do is to just make sure that kernel threads simply 
don't end up as children of init. That whole thing is silly, they're 
really not children of the user-space init anyway. Comments?

		Linus
-

From: Chris Snook
Date: Thursday, April 5, 2007 - 5:51 pm

Does anyone remember why we started doing this in the first place?  I'm sure 
there are some tools that expect a process tree, rather than a forest, and 
making it a forest could make them unhappy.

The support angel on my shoulder says we should just put all the kernel threads 
under a kthread subtree to shorten init's child list and minimize impact.  The 
hacker devil on my other shoulder says that with usermode helpers, containers, 
etc. it's about time we treat it as a tree, and any tools that have a problem 
with that need to be fixed.

-- Chris
-

From: Chris Snook
Date: Thursday, April 5, 2007 - 6:03 pm

Err, that should have been "about time we treat it as a forest".

-- Chris
-

From: Linus Torvalds
Date: Thursday, April 5, 2007 - 6:29 pm

I'm not sure anybody would really be unhappy with pptr pointing to some 
magic and special task that has pid 0 (which makes it clear to everybody 
that the parent is something special), and that has SIGCHLD set to SIG_IGN 
(which should make the exit case not even go through the zombie phase).

I can't even imagine *how* you'd make a tool unhappy with that, since even 
tools like "ps" (and even more "pstree") won't read all the process states 
atomically, so they invariably will see parent pointers that don't even 
exist any more, because by the time they get to the parent, it has exited 

A number are already there, of course, since they use the kthread 
infrastructure to get there. 

		Linus
-
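
The effect of a parent that ignores SIGCHLD can be observed from user space. The sketch below is only an illustration of the POSIX behaviour Linus refers to, not kernel code; `wait_errno_with_sigchld_ignored()` is a hypothetical helper name. With SIGCHLD set to SIG_IGN the exiting child is never left behind as a zombie, so the parent's `waitpid()` has nothing to collect and fails with ECHILD once the child is gone.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that exits at once while SIGCHLD is ignored, then try to
 * wait for it.  Returns the errno reported by waitpid(), 0 if waitpid()
 * unexpectedly collected the child, or -1 if setup failed.
 */
int wait_errno_with_sigchld_ignored(void)
{
    struct sigaction ign, old;
    pid_t pid;
    int result, rc;

    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);
    ign.sa_flags = 0;
    if (sigaction(SIGCHLD, &ign, &old) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        result = -1;
    } else if (pid == 0) {
        _exit(0);               /* child: exit without becoming a zombie */
    } else {
        /* Blocks until the child is gone, then fails: nothing to reap. */
        result = (waitpid(pid, NULL, 0) == -1) ? errno : 0;
    }

    /* Put back whatever disposition the caller had before. */
    rc = sigaction(SIGCHLD, &old, NULL);
    assert(rc == 0);
    (void)rc;
    return result;
}
```

On a Linux system the function is expected to return ECHILD, which is the user-visible counterpart of "the exit case not even going through the zombie phase".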

From: Eric W. Biederman
Date: Thursday, April 5, 2007 - 7:15 pm

Right.  pid == 1 being missing might cause some confusion,
but having ppid == 0 should be fine.  Heck pid == 1 already has 
ppid == 0, so it is a value user space has had to deal with for a
while.

In addition there was a period in 2.6 where most kernel threads
and init had a pgid == 0 and a session  == 0, and nothing seemed
to complain.

We should probably make all of the kernel threads children of
init_task.  The initial idle thread on the first cpu that is the
parent of pid == 1.   That will give the ppid == 0 naturally because

Almost everything should be using kthread by now.  I do admit that there
are a handful of kernel threads that still use kernel_thread but it
is a relatively short list.

Looking, we apparently have a couple of processes started by
kthread_create that are not under kthread.  They all have  pid numbers
lower than kthread so I'm guessing it is some startup ordering issue.

Currently it looks like daemonize is reparenting everything to init,
changing that to init_task and making the threads self reaping
should be trivial.

.....

I'm a little nervous that we exceeded our default pid max just booting
the kernel.  32768 is a lot of kernel threads.  That sounds like 32
kernel threads per cpu.  That seems to be more than I have on any
of my little machines.


There is no defined order for reaping of child processes and in fact
I can't even see anything in the kernel right now that would even
accidentally give user space the idea we had a defined order.

So I think we have some options once we get the kernel threads out
of the way.  Getting the kernel threads out of the way would seem
to be the first priority.

Eric
-

From: Robin Holt
Date: Friday, April 6, 2007 - 3:43 am

I think both avenues would probably be the right way to proceed.
Getting kthreads to not be parented by init would be an opportunity
for optimization.

I think organizing the zombie tasks to be easily reaped also has merit.
Rapidly forking/exiting processes like udevd during the boot of a
different machine were also shown to benefit significantly from this
patch.  On that machine, with 512 cpus and 4608 disk devices, device
discovery time under udevd dropped by 30%.  This, honestly, surprised us.
It makes some sense now that I think about it.  This would be a case
where improving the zombie handling would be beneficial to more than
just kthreads.

Thanks,
Robin
-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 8:38 am

How hard is tasklist_lock hit on these systems?

How hard is the pid hash hit on these systems?

My hunch is that if you are doing a lot of forking and exiting
zombie reaping isn't the only problem you are observing.

Thinking about it I do agree with Linus that two lists sounds like the
right solution because it ensures we always have O(1) time when
waiting for a zombie.  I'd like to place the list head for the zombie
list in the signal_struct and not in the task_struct so our
performance continues to be O(1) when we have a threaded process.

The big benefit of the zombie list over your proposed list reordering
is that waitpid can return immediately when we don't have zombies to
wait for, but we have lots of children.  So it looks like a universal
benefit and about as good as it is possible to make zombie handling
of waitpid.

Eric
-
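
A rough user-space sketch of the two-list arrangement discussed above: every child stays on the ordinary children list, and an exited child is additionally queued on a zombie list, so the reaper never scans live children. All names (`struct parent`, `child_exits()`, `reap_one()`, `demo_reap()`) are hypothetical, and the sketch deliberately ignores waiting for a specific pid, WUNTRACED, ptrace and locking, which the real code would have to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct list_node {
    struct list_node *prev, *next;
};

/* Hypothetical task: always on ->sibling, on ->zombie_link only after exit. */
struct task {
    int pid;
    struct list_node sibling;
    struct list_node zombie_link;
};

/* Hypothetical per-parent state holding both list heads. */
struct parent {
    struct list_node children;
    struct list_node zombies;
};

#define zombie_task(node) \
    ((struct task *)((char *)(node) - offsetof(struct task, zombie_link)))

static void list_init(struct list_node *h)
{
    h->prev = h->next = h;
}

static void list_add_tail(struct list_node *n, struct list_node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

static void list_del(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

/* Exit path: queue the child for reaping, leave it on the children list. */
static void child_exits(struct parent *p, struct task *t)
{
    list_add_tail(&t->zombie_link, &p->zombies);
}

/* Constant time: NULL at once if nothing has exited, else the oldest zombie. */
static struct task *reap_one(struct parent *p)
{
    struct task *t;

    if (p->zombies.next == &p->zombies)
        return NULL;
    t = zombie_task(p->zombies.next);
    list_del(&t->zombie_link);
    list_del(&t->sibling);
    return t;
}

/* Returns the pid reaped, 0 if there was no zombie, -1 on allocation failure. */
int demo_reap(int nchildren, int exiting_index)
{
    struct parent p;
    struct task *tasks, *t;
    int i, pid;

    assert(nchildren > 0 && exiting_index < nchildren);
    tasks = calloc((size_t)nchildren, sizeof(*tasks));
    if (tasks == NULL)
        return -1;

    list_init(&p.children);
    list_init(&p.zombies);
    for (i = 0; i < nchildren; i++) {
        tasks[i].pid = i + 1;
        list_add_tail(&tasks[i].sibling, &p.children);
    }
    if (exiting_index >= 0)
        child_exits(&p, &tasks[exiting_index]);

    t = reap_one(&p);
    pid = t ? t->pid : 0;
    free(tasks);
    return pid;
}
```

`reap_one()` costs the same whether the parent has one child or thousands, and it returns immediately when the zombie list is empty, which is the advantage over list reordering that the message points out.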

From: Robin Holt
Date: Friday, April 6, 2007 - 9:41 am

The major hold-off we are seeing is from tasks reaping children,

In the little bit of testing we got before the machine got taken away,
we never observed significant issues with the pid hash.  We only got
time to run a few benchmarks like aim7.


Is this something you are taking on as a task for yourself or do you
want me to pursue?  I am extremely swamped on other non-kernel issues
and Jack is off working on x86_64 issues so we would both _greatly_
appreciate any help you can give.  Of course, we understand this is an
issue that is affecting us and the responsibility ultimately lies here.

As for testing a proposed patch, we have a customer machine available for
the next few days which could be put into the same configuration we had
for this test, but it will be limited availability and then only assuming
the SGI site test personnel are certain the machine meets customer needs.

Thanks,
Robin
-

From: Oleg Nesterov
Date: Friday, April 6, 2007 - 9:31 am

Well. I bet this will be painful, and will uglify the code even more.

do_wait() has to iterate over 2 lists, __ptrace_unlink() should check

Sure. It would be nice to move ->children into signal_struct at first.

TASK_TRACED/TASK_STOPPED ?

Oleg.

-

From: Ingo Molnar
Date: Friday, April 6, 2007 - 10:32 am

no. Two _completely separate_ lists.

i.e. a to-be-reaped task will still be on the main list _too_. The main 
list is for all the PID semantics rules. The reap-list is just for 
wait4() processing. The two would be completely separate.

	Ingo
-

From: Roland Dreier
Date: Friday, April 6, 2007 - 10:39 am

> no. Two _completely separate_ lists.
>
> i.e. a to-be-reaped task will still be on the main list _too_. The main
> list is for all the PID semantics rules. The reap-list is just for
> wait4() processing. The two would be completely separate.

I guess this means we add another list head to struct task_struct.
Not that big a deal, but it does make me a little sad to think about
task_struct getting even bigger....

 - R.
-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 11:04 am

signal_struct please.  It isn't that much better but still...

Eric
-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 11:30 am

And what pray tell except for heuristics is the list of children used for?

I could find a use in the scheduler (oldest_child and younger/older_sibling).
I could find a use in mm/oom_kill.

I could find a use in irixsig where it rolls its own version of wait4.

Perhaps I was blind but that was about it.

I didn't see the child list implementing any semantics we really care
about to user space.

Eric
-

From: Ingo Molnar
Date: Friday, April 6, 2007 - 12:18 pm

yeah - by all means get rid of it, but first separate the data 
structures along uses. Then all the 'why should we iterate two lists in 

this can be zapped today. The patch below does it - the scheduler use 
was purely historic. oldest_child/older_sibling used to have a role but 



i think you are right.

	Ingo

Subject: [patch] sched: get rid of p->children use in show_task()
From: Ingo Molnar <mingo@elte.hu>

the p->parent PID printout gives us all the information about the
task tree that we need - the eldest_child()/older_sibling()/
younger_sibling() printouts are mostly historic and i do not
remember ever having used those fields. (IMO in fact they confuse
the SysRq-T output.) So remove them.

This code has sentimental value though, those fields and
printouts are one of the oldest ones still surviving from
Linux v0.95's kernel/sched.c:

        if (p->p_ysptr || p->p_osptr)
                printk("   Younger sib=%d, older sib=%d\n\r",
                        p->p_ysptr ? p->p_ysptr->pid : -1,
                        p->p_osptr ? p->p_osptr->pid : -1);
        else
                printk("\n\r");

written 15 years ago, in early 1992.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 kernel/sched.c |   35 +----------------------------------
 1 file changed, 1 insertion(+), 34 deletions(-)

Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -4687,27 +4687,6 @@ out_unlock:
 	return retval;
 }
 
-static inline struct task_struct *eldest_child(struct task_struct *p)
-{
-	if (list_empty(&p->children))
-		return NULL;
-	return list_entry(p->children.next,struct task_struct,sibling);
-}
-
-static inline struct task_struct *older_sibling(struct task_struct *p)
-{
-	if (p->sibling.prev==&p->parent->children)
-		return NULL;
-	return list_entry(p->sibling.prev,struct task_struct,sibling);
-}
-
-static inline struct task_struct *younger_sibling(struct ...
From: Ingo Molnar
Date: Friday, April 6, 2007 - 12:22 pm

and this way we get the best change as well: not only will kthreads be 
removed from that list, but all other tasks in the system too. I bet 
this will speed up wait4() _enormously_, on server workloads that 
involve many tasks.

	Ingo
-

From: Ingo Molnar
Date: Tuesday, April 10, 2007 - 6:48 am

on second thought: the p->children list is needed for the whole 
child/parent task tree, which is needed for sys_getppid(). The question 
is, does anything require us to reparent to within the same thread 
group?

	Ingo
-

From: Oleg Nesterov
Date: Tuesday, April 10, 2007 - 6:38 am

No! That is why I suggested (long ago, in fact) moving ->children into
->signal_struct. When a sub-thread forks, we set ->parent = group_leader.
We don't need forget_original_parent() until the last thread exits. This
also simplifies do_wait().

However, this breaks the current ->pdeath_signal behaviour. In fact (and
Eric thinks the same) this _fixes_ this behaviour, but may break things.

Oleg.

-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 8:00 am

Thinking about this.  As contingency planning if there is something in user
space that actually somehow cares about pdeath_signal, from a threaded
parent we can add a pdeath_signal list, to the task_struct and get
rid of the rest of the complexity.

I think I want to wait until someone screams first.

This does very much mean that we can remove the complexity of
a per thread ->children without fear, of having to revert everything.

Eric
-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 8:16 am

Currently each thread can request to be notified when its parent
terminates, and receive a thread specific signal when that occurs.
That we set this on a per thread granularity and then send it to the
whole thread group seems silly, but whatever.

Currently we send a signal when the results of getppid don't change if
our parent thread dies and we are reparented to a different thread.
This seems counterintuitive to what I would expect when programming in
user space and is a major maintenance issue to continue doing.

The only users I recall using this have non threaded parents and
pdeath_signal predates CLONE_THREAD so arguably this code has been
broken with threaded parents since the day CLONE_THREAD was introduced
and no one ever screamed loudly enough to get it fixed.

So this patch fixes the pdeath_signal behaviour only sending a signal
when the results of getppid would change.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---

This patch is against 2.6.21-rc6-mm1 (with utrace applied)
but except for context in the diff that should not matter.

 kernel/exit.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/exit.c b/kernel/exit.c
index 1d91de8..1ec0d1f 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -618,10 +618,6 @@ choose_new_parent(struct task_struct *p, struct task_struct *reaper)
 static void
 reparent_thread(struct task_struct *p, struct task_struct *father)
 {
-	if (p->pdeath_signal)
-		/* We already hold the tasklist_lock here.  */
-		group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
-
 	/* Move the child from its dying parent to the new one.  */
 	list_move_tail(&p->sibling, &p->parent->children);
 
@@ -635,6 +631,10 @@ reparent_thread(struct task_struct *p, struct task_struct *father)
 	if (p->exit_signal != -1)
 		p->exit_signal = SIGCHLD;
 
+	if (p->pdeath_signal)
+		/* We already hold the tasklist_lock here.  */
+		group_send_sig_info(p->pdeath_signal, SEND_SIG_NOINFO, p);
+
 	/* ...
From: Oleg Nesterov
Date: Tuesday, April 10, 2007 - 9:37 am

Don't get me wrong, I personally like this patch very much. However,


Oleg.

-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 10:41 am

Good point.

I guess we have a simple question.
Does a parent death signal make most sense between separately written programs?

   If so this patch is a bug fix.

Does a parent death signal make most sense between processes that are part of
a larger program.

   If so there is little point to this patch.

Eric
-
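
For readers who have not used the feature under discussion: user space requests a parent death signal with `prctl(PR_SET_PDEATHSIG, ...)`. The sketch below shows one common way to arm it; `arm_parent_death_signal()` and `current_parent_death_signal()` are hypothetical helper names, and the getppid() check is a usual precaution against the parent dying before the request takes effect, not something prescribed in this thread.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel to send `sig` to the caller when its parent dies
 * (sig == 0 clears the request).  `expected_parent` is the parent's pid as
 * the caller knew it beforehand.  Returns 0 on success, -1 if prctl()
 * failed or the parent had already gone away.
 */
int arm_parent_death_signal(int sig, pid_t expected_parent)
{
    assert(sig >= 0);
    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig) != 0)
        return -1;

    /*
     * If the parent died before the prctl() call, no signal will ever
     * arrive; a changed getppid() is the only hint the caller gets.
     */
    if (getppid() != expected_parent)
        return -1;
    return 0;
}

/* Read back the currently armed signal number, 0 if none, -1 on error. */
int current_parent_death_signal(void)
{
    int sig = 0;

    if (prctl(PR_GET_PDEATHSIG, &sig) != 0)
        return -1;
    return sig;
}
```

A child would typically record its parent's pid, call `arm_parent_death_signal(SIGTERM, parent_pid)` right after fork(), and treat a -1 return as "the parent is already gone".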

From: Roland McGrath
Date: Tuesday, April 10, 2007 - 10:48 am

> Does a parent death signal make most sense between separately written programs?

I don't think it does.  It has always seemed an utterly cockamamy feature

That is the only way I can really see it being used.  The only actual
example of use I know is what Albert Cahalan reported.  To my mind, the
only semantics that matter for pdeath_signal are what previous uses
expected in the past and still need for compatibility.  If we started with
a fresh rationale from the ground up on what the feature is good for, I am
rather skeptical it would pass muster to be added today.


Thanks,
Roland
-

From: Albert Cahalan
Date: Tuesday, April 10, 2007 - 8:17 pm

Until inotify and dnotify work on /proc/12345/task, there really isn't
an alternative for some of us. Polling is unusable.

Ideally one could pick any container, session, process group,
mm, task group, or task for notification of state change.
State change means various things like destruction, addition
of something new, exec, etc. (stuff one can see in /proc)
With appropriate privs, having the debug-related stuff would be
good as well.
-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 7:51 am

Yes, something Oleg said made me realize that.

As long as the reparent isn't too complex it isn't required that we

I think my head is finally on straight about this question.

Currently there is the silly linux specific parent death signal
(pdeath_signal).  Oleg's memory was better than mine on this score.

However there is no indication that the parent death signal being sent
when a thread leader dies is actually correct, or even interesting.
It probably should only be sent when getppid changes.

So with pdeath_signal fixed there is nothing that requires us to
reparent within the same thread group.

I'm trying to remember what the story is now.  There is a nasty
race somewhere with reparenting, a threaded parent setting SIGCHLD to
SIG_IGN, and non-default signals that results in a zombie that no one
can wait for and reap.  It requires being reparented twice to trigger.

Anyway it is a real mess and if we can remove the stupid multi-headed
child lists things would become much simpler and the problem could
not occur.

Plus the code would become much simpler...

utrace appears to have removed the ptrace_children list and the
special cases that entailed.

Eric

-

From: Ingo Molnar
Date: Tuesday, April 10, 2007 - 8:06 am

so ... is anyone pursuing this? This would allow us to make sys_wait4() 
faster and more scalable: no tasklist_lock bouncing for example.

	Ingo
-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 8:22 am

which part?

Eric
-

From: Ingo Molnar
Date: Tuesday, April 10, 2007 - 8:53 am

all of it :) Everything you mentioned makes sense quite a bit. The 
thread signal handling of do_wait was added in a pretty arbitrary 
fashion so i doubt there are strong requirements in that area. Apps 
might have grown to get used to it meanwhile though, so we've got to do 
it carefully.

	Ingo
-

From: Eric W. Biederman
Date: Tuesday, April 10, 2007 - 9:17 am

I'm looking at it.  If only because there is a reasonable chance doing this
will fix the races with a threaded init.

However I just found something nasty.  The wait __WNOTHREAD flag.

And my quick search seems to find at least one user space application
that uses it, and it is widely documented so I suspect there are
others :(

I played with moving the lists into signal_struct, and short of
architecture specific users of task->children all I had to touch
were:

 include/linux/init_task.h |    2 +-
 include/linux/sched.h     |    5 +-
 kernel/exit.c             |  159 +++++++++++++++++++++------------------------
 kernel/fork.c             |    2 +-
 mm/oom_kill.c             |    4 +-

So it should be relatively easy to change these child lists around...

Eric
-

From: Ingo Molnar
Date: Tuesday, April 10, 2007 - 11:20 pm

here's a (tested) patch i did that should simplify changes done to 
p->children and p->sibling handling.

	Ingo

---------------------->
Subject: [patch] uninline remove/add_parent() APIs
From: Ingo Molnar <mingo@elte.hu>

uninline/simplify remove/add_parent() APIs.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/linux/sched.h |    5 +++--
 kernel/exit.c         |   31 +++++++++++++++++++++----------
 kernel/fork.c         |    2 +-
 kernel/ptrace.c       |   11 ++++-------
 4 files changed, 29 insertions(+), 20 deletions(-)

Index: linux/include/linux/sched.h
===================================================================
--- linux.orig/include/linux/sched.h
+++ linux/include/linux/sched.h
@@ -1418,8 +1418,9 @@ extern void wait_task_inactive(struct ta
 #define wait_task_inactive(p)	do { } while (0)
 #endif
 
-#define remove_parent(p)	list_del_init(&(p)->sibling)
-#define add_parent(p)		list_add_tail(&(p)->sibling,&(p)->parent->children)
+extern void
+task_relink_parent(struct task_struct *p, struct task_struct *new_real_parent,
+		   struct task_struct *new_parent);
 
 #define next_task(p)	list_entry(rcu_dereference((p)->tasks.next), struct task_struct, tasks)
 
Index: linux/kernel/exit.c
===================================================================
--- linux.orig/kernel/exit.c
+++ linux/kernel/exit.c
@@ -52,6 +52,20 @@ extern void sem_exit (void);
 
 static void exit_mm(struct task_struct * tsk);
 
+void
+task_relink_parent(struct task_struct *p,
+		   struct task_struct *new_real_parent,
+		   struct task_struct *new_parent)
+{
+	/*
+	 * Move this task to a new parent's children list.
+	 * (if p->parent == new->parent this requeues from head to tail)
+	 */
+	list_move_tail(&p->sibling, &new_parent->children);
+	p->real_parent = new_real_parent;
+	p->parent = new_parent;
+}
+
 static void __unhash_process(struct task_struct *p)
 {
 	nr_threads--;
@@ -64,7 +78,7 @@ static void __unhash_process(struct task
 ...
From: Eric W. Biederman
Date: Wednesday, April 11, 2007 - 12:00 am

Looks reasonable.

Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>

Should we copy Andrew or is someone else going to collect up all of these patches?

Eric
-

From: Andrew Morton
Date: Wednesday, April 11, 2007 - 3:06 pm

On Wed, 11 Apr 2007 01:00:17 -0600

Andrew is cowering in terror, because utrace goes tromping through this
code.

It would be rather good to get utrace moving along a bit more quickly.  There
still seem to be rather a lot of issues.
-

From: Eric W. Biederman
Date: Thursday, April 12, 2007 - 3:45 am

Just on that note roughly where is utrace?

My first impression is that it appears to cleanup a little bit of
child list handling at the cost of 2000 lines of extra code.

Eric



-

From: Roland McGrath
Date: Thursday, April 12, 2007 - 3:50 pm

I'm travelling this week (through Monday) and can't be of much immediate
help on improving the situation or explaining it in great detail.  Last
week before I left home I was deep in some strange debugging and didn't get
a chance to look up.  There will be more of that, but I'll try to make some
timely progress on answering all the backlog of correspondence about utrace
too.


Thanks,
Roland
-

From: Oleg Nesterov
Date: Tuesday, April 10, 2007 - 9:44 am

reparent_thread:

	...

        /* If we'd notified the old parent about this child's death,
         * also notify the new parent.
         */
        if (!traced && p->exit_state == EXIT_ZOMBIE &&
            p->exit_signal != -1 && thread_group_empty(p))
                do_notify_parent(p, p->exit_signal);

We notified /sbin/init. If it ignores SIGCHLD, we should release the task.
We don't do this.

The best fix I believe is to cleanup the forget_original_parent/reparent_thread
interaction and factor out these "exit_state == EXIT_ZOMBIE && exit_signal == -1"
checks.

Oleg.

-

From: Bill Davidsen
Date: Wednesday, April 11, 2007 - 12:55 pm

As long as the original parent is preserved for getppid(). There are 
programs out there which communicate between the parent and child with 
signals, and if the original parent dies, it is undesirable to have the 
child getppid() and start sending signals to a program not expecting 
them. Invites undefined behavior.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-

From: Eric W. Biederman
Date: Wednesday, April 11, 2007 - 1:17 pm

Then the programs are broken.  getppid is defined to change if the process
is reparented to init.

Eric
-
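
The getppid() behaviour under debate is easy to observe from user space. In the sketch below (hypothetical helper `orphan_sees_new_parent()`), an intermediate process forks a grandchild and exits; the grandchild then polls getppid() until it differs from the pid of its original parent. The comparison is against the original pid rather than against 1, as an assumption-free check: on current systems the new parent may be a subreaper or a namespace's init instead of pid 1.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if an orphaned grandchild saw getppid() change after its
 * parent exited, 0 if it never did within about five seconds, and -1 on
 * any setup error.
 */
int orphan_sees_new_parent(void)
{
    int fds[2];
    pid_t child;
    char flag = 0;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;

    child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (child == 0) {
        /* Intermediate process: the grandchild's original parent. */
        pid_t original_parent = getpid();

        close(fds[0]);
        if (fork() == 0) {
            char changed = 0;
            int tries;

            /* Reparenting happens when the intermediate process exits. */
            for (tries = 0; tries < 500 && !changed; tries++) {
                if (getppid() != original_parent)
                    changed = 1;
                else
                    usleep(10000);
            }
            if (write(fds[1], &changed, 1) != 1)
                _exit(1);
            _exit(0);
        }
        _exit(0);   /* exit at once, orphaning the grandchild */
    }

    assert(child > 0);
    close(fds[1]);
    waitpid(child, NULL, 0);        /* reap the intermediate process */
    n = read(fds[0], &flag, 1);     /* wait for the grandchild's report */
    close(fds[0]);
    return n == 1 ? flag : -1;
}
```

Because the value can change between a getppid() call and a later kill(), a program that signals "its parent" this way has the race Bill describes further down; the sketch only demonstrates that the change is visible.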

From: Bill Davidsen
Date: Wednesday, April 11, 2007 - 2:24 pm

The short answer is that kthreads don't do this so it doesn't matter.

But user programs are NOT broken, currently getppid returns either the 
original parent or init, and a program can identify init. Reparenting to 
another pid would not be easily noted, and as SUS notes, no values are 
reserved to error. So there's no way to check, and no neeed for 
kthreads, I was prematurely paranoid.

Presumably user processes will still be reparented to init so that's not 
an issue. Since there's no atomic signal_parent() the ppid could change 
between getppid() and signal(), but that's never actually been a problem 
AFAIK.

Related: Is there a benefit from having separate queues for original 
children of init and reparented (to init) tasks? Even in a server would 
there be enough to save anything?

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-

From: Oleg Nesterov
Date: Wednesday, April 11, 2007 - 1:19 pm

Sorry, can't understand.

If p->exit_signal == -1 after do_notify_parent() above, the task is completely
dead. Nobody can release it, we should do this (if EXIT_ZOMBIE). At this point
"p" was already re-parented, but this (and getppid) doesn't matter at all.

OK. Most likely you meant something else. Could you clarify?

Oleg.

-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 11:02 am

Well I would prefer to iterate over 2 lists as opposed to N lists that
we have now.

The __ptrace_unlink issue sounds moderately valid although a ptraced

Actually this should be independent of the pdeath_signal issue.
As long as we record pdeath_signal per task_struct we can continue
to implement the existing semantics.

Although we really should fix pdeath_signal and push the patch to
-mm.  We only didn't do that because the original patch was a
bug fix for stable kernels, and we didn't have time to verify fixing

Well it doesn't fix everything yet but it does fix the common case.
I wonder if it would make sense to have other lists as well.

Regardless of how this fixes scaling, someone needs to dig in there, load
up all the messy state in their head, and clean up, simplify,
and optimize this mess.

We still have the stupid case where we can create unkillable
zombies from user space with a threaded init.

I think it is safe to say that this part of the code has reached the point
of fragility where it is hard to maintain.  A clear sign that it is time to
refactor something.

Eric
-

From: Davide Libenzi
Date: Friday, April 6, 2007 - 11:21 am

Ohhh, the "signal" struct! Funny name for something that nowadays has 
probably no more than a 5% affinity with signal-related tasks :/



- Davide


-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 11:56 am

Hmm.  I wonder if we should just rename it the struct thread_group,
or struct task_group.  Those seem slightly more accurate names.

I remember last time the conversation about renaming it came up no
one had a good name for it that wasn't already taken.

Eric
-

From: Davide Libenzi
Date: Friday, April 6, 2007 - 12:16 pm

Almost *anything* is better than "signal_struct" ;)
A task_group could be fine, or something along the lines of task_shared_ctx.



- Davide


-

From: Ingo Molnar
Date: Friday, April 6, 2007 - 12:19 pm

or lets just face it and name it what it is: process_struct ;-)

	Ingo
-

From: Davide Libenzi
Date: Friday, April 6, 2007 - 2:29 pm

That'd be fine too! Wonder if Linus would swallow a rename patch like 
that...


- Davide


-

From: Linus Torvalds
Date: Friday, April 6, 2007 - 2:51 pm

I don't really see the point. It's not even *true*. A "process" includes 
more than the shared signal-handling - it would include files and fs etc 
too.

So it's actually *more* correct to call it the shared signal state than it 
would be to call it "process" state.

		Linus
-

From: Davide Libenzi
Date: Friday, April 6, 2007 - 3:31 pm

But "signal" has *nothing* to do with what the structure stores nowadays, 
really. It's a pool of "things" that are not Linux task specific.
IMO something like "struct task_shared_ctx" or simply "struct task_shared" 
would better fit the nature of the thing.



- Davide


-

From: Linus Torvalds
Date: Friday, April 6, 2007 - 3:46 pm

You're ignoring reality. It has more to do with signals than with 
processes. Look at *all* the fields in the top half of the structure, up 
to (and including) the "tty" field. They're *all* about signal semantics 
in one form or another (whether it's directly about shared signal 
behaviour, or indirectly about *sources* of signals like process control 
or timers).

And renaming it really has no upsides, even *if* you had a point, which 
you don't.

			Linus
-

From: Davide Libenzi
Date: Friday, April 6, 2007 - 3:59 pm

OTOH, the other half of the fields has nothing to do with them (signals). 
Not only that: the more time passes, the more ppl (the reason why I posted this 
comment in the beginning) see the "struct signal_struct" as a boilerplate
where to store shared resources.
Choosing a name like "struct task_shared_ctx" fits it, because "signals" 
are *a* task_shared thing, whereas all the fields on the bottom of the 
"struct signal_struct" (on top of the ones that ppl will want to add 
every time there's something to be shared between task structs) are *not* a 
"signal".



- Davide


-

From: Ingo Molnar
Date: Monday, April 9, 2007 - 1:28 am

we could call it "structure for everything that we know to be ugly about 
POSIX process semantics" ;-) The rest, like files and fs we've 
abstracted out already.

	Ingo
-

From: Bill Davidsen
Date: Monday, April 9, 2007 - 11:09 am

So are you voting for ugly_struct? ;-)

I do think this is still waiting for a more descriptive name, like 
proc_misc_struct or some such. Kernel code should be treated as 
literature, intended to be both read and readable.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-

From: Kyle Moffett
Date: Monday, April 9, 2007 - 12:28 pm

Maybe "struct posix_process" is more descriptive?  "struct  
process_posix"?  "Ugly POSIX process semantics data" seems simple  
enough to stick in a struct name.  "struct uglyposix_process"?

Cheers,
Kyle Moffett

-

From: Linus Torvalds
Date: Monday, April 9, 2007 - 12:51 pm

Guys, you didn't read my message.

It's *not* about "process" stuff.  Anything that tries to call it a 
"process" is *by*definition* worse than what it is now. Processes have all 
the things that we've cleanly separated out for filesystem, VM, SysV 
semaphore state, namespaces etc.

The "struct signal_struct" is the random *leftovers* from all the other 
stuff. It's *not* about "processes". Never has been, and never will be. 

It's mainly about signal cruft, because that's the nastiest part of POSIX 
threads behaviour, and that can't be clearly separated as one clear 
structure. 

So
 - it really *is* mostly about signal handling and signal sources. 
 - it has some random *cruft* in it that isn't about signals, but even 
   that is mostly a matter of "it was random cruft in the original task 
   structure too, and it DID NOT MERIT a flag of its own"
 - if you wanted to clean things up, you'd actually make things like 
   the "rlimit" info structures of their own, and have pointers to them. 

So that cruft largely got put into "signal_struct" just because they were 
the last thing to be moved out, along with the signal stuff (which was the 
big and painful part). NOT because "struct signal_struct" is somehow about 
"process state".

So stop blathering about processes. It has *nothing* to do with processes. 
It's primarily about signals, but it has "cruft" in it.

So an accurate name is

	struct signal_struct_with_some_cruft_in_it_that_did_not_merit_a_struct_of_its_own

but that's actually fairly unwieldy to type, and so in the name of sanity 
and clear source code, it's called

	struct signal_struct

and that's it.

And people who have argued for renaming it don't even seem to understand 
what it's *about*, so the arguments for renaming it have been very weak 
indeed so far.

IT IS NOT ABOUT "PROCESSES".

To be a "posix process", you have to share *everything*. The signal-struct 
isn't even a very important part of that sharing. In fact, it's quite ...
From: Davide Libenzi
Date: Monday, April 9, 2007 - 1:03 pm

I proposed "struct task_shared_ctx" but you ducked :)


- Davide


-

From: Bill Davidsen
Date: Tuesday, April 10, 2007 - 8:12 am

Descriptive, correct, I like it!

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-

From: Davide Libenzi
Date: Tuesday, April 10, 2007 - 12:17 pm

He's stubborn, he'll never accept patches. Must be a seasonal thing ;)



- Davide


-

From: Eric W. Biederman
Date: Monday, April 9, 2007 - 1:00 pm

Nack.

Linux internally doesn't have processes; it has tasks with different
properties.  Anything that talks about processes will be even more
confusing.

The only thing really wrong with struct signal is that it is easy to
confuse with struct sighand.

Eric
-

From: Chris Snook
Date: Monday, April 9, 2007 - 10:37 am

Linus, Eric, thanks for the history lesson.  I think it's safe to say 
that anything that breaks because of this sort of change was already 
broken anyway.

If we're going to scale to an obscene number of CPUs (which I believe 
was the original motivation on this thread) then putting the dead 
children on their own list will probably scale better.
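
A minimal userspace sketch of the "separate list for dead children" idea,
assuming a hypothetical simplified model (struct child, struct parent and
every function name below are illustrative, not the kernel's task_struct
or its list helpers).  Exiting moves a child to the dead list in O(1), and
reaping pops the head of that list instead of scanning all live children:

```c
#include <stddef.h>

/* Hypothetical, simplified model; not the kernel's task_struct. */
struct child {
    int pid;
    struct child *prev, *next;   /* links within whichever list holds it */
};

struct child_list {
    struct child head;           /* sentinel node of a circular list */
};

struct parent {
    struct child_list live;      /* children that are still running */
    struct child_list dead;      /* exited children waiting to be reaped */
};

static void list_init(struct child_list *l)
{
    l->head.prev = l->head.next = &l->head;
}

static void list_del(struct child *c)
{
    c->prev->next = c->next;
    c->next->prev = c->prev;
    c->prev = c->next = c;
}

static void list_add_tail(struct child_list *l, struct child *c)
{
    c->prev = l->head.prev;
    c->next = &l->head;
    l->head.prev->next = c;
    l->head.prev = c;
}

static void parent_init(struct parent *p)
{
    list_init(&p->live);
    list_init(&p->dead);
}

static void add_child(struct parent *p, struct child *c, int pid)
{
    c->pid = pid;
    list_add_tail(&p->live, c);
}

/* Called when a child exits: O(1) move from the live to the dead list. */
static void child_exit(struct parent *p, struct child *c)
{
    list_del(c);
    list_add_tail(&p->dead, c);
}

/* O(1) reap: first dead child, or NULL if there is nothing to reap. */
static struct child *reap_one(struct parent *p)
{
    struct child *c = p->dead.head.next;

    if (c == &p->dead.head)
        return NULL;
    list_del(c);
    return c;
}
```

Note that in this sketch children are reaped in the order in which they
exited, not in their order on the original children list.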

	-- Chris
-

From: Christoph Hellwig
Date: Friday, April 6, 2007 - 11:05 am

As all kernel threads (1) should be converted to kthread anyway, for
proper containers support and general "let's get rid of a crappy API"
cleanups I think that's enough.  It would be nice to have SGI helping
to convert more drivers over to the proper API as conversions have stalled
a little bit.


(1) There are a few very core kernel threads that need to stick to the
    low-level API, but they're too few to make any difference.

-

From: Eric W. Biederman
Date: Friday, April 6, 2007 - 12:39 pm

Yes.  I had to step back and remind myself why I care as daemonize is
currently doing an effective job of removing the namespace information.

However there is nothing daemonize can do about the pid of the task
because it is returned by kernel_thread and used by the callers 
(to at the very least find the task_struct).

So not finishing that conversion is certainly one of the last blockers
for implementing the pid namespace.

Eric
-

From: Jeff Garzik
Date: Friday, April 6, 2007 - 3:38 pm

What about attacking the explosion of kernel threads?

As CPU counts increase, the number of per-CPU kernel threads gets really 
ridiculous.

I would rather change the implementation under the hood to start per-CPU 
threads on demand, similar to a thread-pool implementation.

Boxes with $BigNum CPUs probably won't ever use half of those threads.

	Jeff



-

From: Linus Torvalds
Date: Friday, April 6, 2007 - 3:51 pm

The counter-argument is that boxes with $BigNum CPU's really don't hurt 
from it either, and having per-process data structures is often simpler 
and more efficient than trying to have some thread pool.

IOW, once we get the processes off the global list, there just isn't any 
downside from them. Sure, they use some memory, but people who buy 
1024-cpu machines won't care about a few kB per CPU..

So the *only* downside is literally the process list, and one suggested 
patch already just removes kernel threads entirely from the parenthood 
lists.

The other potential downside could be "ps is slow", but on the other hand, 
having the things stick around and have things like CPU-time accumulate is 
probably worth it - if there are some issues, they'd show up properly 
accounted for in a way that process pools would have a hard time doing.

So I really don't think this is worth changing things over, apart from 
literally removing them from process lists, which I think everybody agrees 
we should just do - it just never even came up before!

			Linus
-

From: Jeff Garzik
Date: Friday, April 6, 2007 - 4:37 pm

Two points here:

* A lot of the users in the current kernel tree don't rely on the 
per-CPU qualities.  They just need multiple threads running.

* Even with per-CPU data structures and code, you don't necessarily have 
to keep a thread alive and running for each CPU.  Reap the ones that 
haven't been used in $TimeFrame, and add thread creation to the slow 
path that already exists in the bowels of schedule_work().
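
A rough userspace sketch of the second point, modelling only the
bookkeeping (no real threads are created).  struct pool, pool_submit() and
pool_reap_idle() are hypothetical names, and the timestamps are plain
logical ticks supplied by the caller:

```c
#include <stdbool.h>

#define NCPUS 8

/* Hypothetical per-CPU worker slot; "alive" stands in for a real thread. */
struct worker {
    bool alive;
    unsigned long last_used;     /* logical time of the last submitted work */
};

struct pool {
    struct worker w[NCPUS];
    int created;                 /* how many times a worker had to be started */
};

/*
 * Submit work for one CPU.  The worker is started only if it does not
 * exist yet (the slow path).  Returns true if a worker had to be created.
 */
static bool pool_submit(struct pool *p, int cpu, unsigned long now)
{
    struct worker *w = &p->w[cpu];
    bool started = false;

    if (!w->alive) {
        w->alive = true;
        p->created++;
        started = true;
    }
    w->last_used = now;
    return started;
}

/* Reap every worker that has been idle for at least "timeout" ticks. */
static int pool_reap_idle(struct pool *p, unsigned long now,
                          unsigned long timeout)
{
    int reaped = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++) {
        struct worker *w = &p->w[cpu];

        if (w->alive && now - w->last_used >= timeout) {
            w->alive = false;
            reaped++;
        }
    }
    return reaped;
}

/* Number of workers currently alive. */
static int pool_alive(const struct pool *p)
{
    int n = 0;

    for (int cpu = 0; cpu < NCPUS; cpu++)
        n += p->w[cpu].alive;
    return n;
}
```

The per-CPU slots stay allocated the whole time; only the thread behind a
slot comes and goes.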

Or if some kernel hacker is really motivated, all workqueue users in the 
kernel would benefit from a "thread audit", looking at working 

Regardless of how things are shuffled about internally, there will 
always be annoying overhead /somewhere/ when you have a metric ton of 
kernel threads.  I think that people should also be working on ways to 

I think there is a human downside.  For an admin you have to wade 
through a ton of processes on your machine, if you are attempting to 
evaluate the overall state of the machine.  Just google around for all 
the admins complaining about the explosion of kernel threads on 
production machines :)

	Jeff



-

From: Nick Piggin
Date: Wednesday, April 11, 2007 - 12:28 am

spawn on demand would require heuristics and complexity though. And

There are a few per CPU, but they should need no human management to
speak of.

Presumably if you have a 1024 CPU system, you'd generally want to be
running at least 1024 of your own processes there, so you already need

User tools should be improved. It shouldn't be too hard to be able to
aggregate kernel thread stats into a single top entry, for example.
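
As an illustration of such aggregation, here is a small sketch that folds
per-CPU thread names like "aio/0" and "aio/1" into one entry.  The
struct summary layout and the summary_add() helper are assumptions made
for this example and do not correspond to any existing tool; names that
number their instances without a slash (such as "scsi_eh_3") are not
merged by it:

```c
#include <string.h>

#define MAX_GROUPS 64
#define NAME_LEN   32

/* One aggregated line, e.g. all "aio/<cpu>" threads together. */
struct group {
    char name[NAME_LEN];
    int threads;
    unsigned long cpu_secs;
};

struct summary {
    struct group groups[MAX_GROUPS];
    int count;
};

/* Copy comm into out without a trailing "/<cpu>" part: "aio/1" -> "aio". */
static void base_name(const char *comm, char *out, size_t outlen)
{
    const char *slash = strchr(comm, '/');
    size_t len = slash ? (size_t)(slash - comm) : strlen(comm);

    if (len >= outlen)
        len = outlen - 1;
    memcpy(out, comm, len);
    out[len] = '\0';
}

/* Account one thread.  Returns 0 on success, -1 if the table is full. */
static int summary_add(struct summary *s, const char *comm,
                       unsigned long cpu_secs)
{
    char name[NAME_LEN];
    struct group *g;

    base_name(comm, name, sizeof(name));
    for (int i = 0; i < s->count; i++) {
        if (strcmp(s->groups[i].name, name) == 0) {
            s->groups[i].threads++;
            s->groups[i].cpu_secs += cpu_secs;
            return 0;
        }
    }
    if (s->count == MAX_GROUPS)
        return -1;
    g = &s->groups[s->count++];
    strcpy(g->name, name);       /* fits: base_name() truncated it */
    g->threads = 1;
    g->cpu_secs = cpu_secs;
    return 0;
}
```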

I'm not saying the number of threads couldn't be cut down, but there
would still be an order-of-magnitude problem there...

-- 
SUSE Labs, Novell Inc.
-

From: Andrew Morton
Date: Monday, April 9, 2007 - 5:23 pm

On Fri, 06 Apr 2007 18:38:40 -0400

I suspect there are quite a few kernel threads which don't really need to
be threads at all: the code would quite happily work if it was changed to
use keventd, via schedule_work() and friends.  But kernel threads are
somewhat easier to code for.

I also suspect that there are a number of workqueue threads which
could/should have used create_singlethread_workqueue().  Often this is
because the developer just didn't think to do it.

otoh, a lot of these inefficiencies are probably down in scruffy drivers
rather than in core or top-level code.

<I also wonder where all these parented-by-init,
presumably-not-using-kthread kernel threads are coming from>

-

From: Eric W. Biederman
Date: Monday, April 9, 2007 - 5:48 pm

So it looks like there were about 1500 kernel threads that started up before
kthread started.

So the reason the kernel threads appear to have init as their parent is,
for the most part, that they started before kthread.

At 10 kernel threads per cpu there may be a little bloat but it isn't
out of control.  It is mostly that we are observing the kernel as
NR_CPUS approaches infinity.  4096 isn't infinity yet but it's easily
1000-fold bigger than most people are used to :)

Eric
-

From: Andrew Morton
Date: Monday, April 9, 2007 - 6:15 pm

On Mon, 09 Apr 2007 18:48:54 -0600

Yeah, sorry.  Without mentioning any names, someone@tv-sign.ru broke the

Yes, I expect we could run init_workqueues() much earlier, from process 0
rather than from process 1.  Something _might_ blow up as it often does when
we change startup ordering, but in this case it would be somewhat surprising.

-

From: Jeff Garzik
Date: Monday, April 9, 2007 - 11:53 pm

I disagree there is only a little bloat:  the current mechanism in place 
does not scale as NR_CPUS increases, as this thread demonstrates.

Beyond a certain point, on an 8-CPU box, it gets silly.  You certainly 
don't need eight kblockd threads or eight ata threads.

	Jeff


-

From: Robin Holt
Date: Tuesday, April 10, 2007 - 2:42 am

I should have been more clear, this is from that 4096 broken down to a
512 cpu partition.  This is the configuration the customer will receive
-

From: Dave Jones
Date: Monday, April 9, 2007 - 6:59 pm

On Mon, Apr 09, 2007 at 05:23:39PM -0700, Andrew Morton wrote:

 > I suspect there are quite a few kernel threads which don't really need to
 > be threads at all: the code would quite happily work if it was changed to
 > use keventd, via schedule_work() and friends.  But kernel threads are
 > somewhat easier to code for.

Perhaps too easy.  We have a bunch of kthreads sitting around that afaict,
are there 'just in case', not because they're actually in use.

   10 ?        S<     0:00 [khelper]

Why doesn't this get spawned when it needs to?

  164 ?        S<     0:00 [cqueue/0]
  165 ?        S<     0:00 [cqueue/1]

I'm not even sure wth these are.

  166 ?        S<     0:00 [ksuspend_usbd]

Just in case I decide to suspend ? Sounds.. handy.
But why not spawn this just after we start a suspend?

  209 ?        S<     0:00 [aio/0]
  210 ?        S<     0:00 [aio/1]

I'm sure I'd appreciate these a lot more if I had any AIO using apps.

  364 ?        S<     0:00 [kpsmoused]

I never did figure out why this was a thread.

  417 ?        S<     0:00 [scsi_eh_1]
  418 ?        S<     0:00 [scsi_eh_2]
  419 ?        S<     5:28 [scsi_eh_3]
  426 ?        S<     0:00 [scsi_eh_4]
  427 ?        S<     0:00 [scsi_eh_5]
  428 ?        S<     0:00 [scsi_eh_6]
  429 ?        S<     0:00 [scsi_eh_7]

Just in case my 7-in-1 card reader gets an error.
(Which is unlikely on at least 6 of the slots, as evidenced by the runtime column.
 -- Though now I'm curious as to why the error handler was running so much, given
    I've not experienced any errors.)
This must be a fun one on huge disk farms.

  884 ?        S<     0:00 [kgameportd]

Just in case I ever decide to plug something into my soundcard.

 2179 ?        S<     0:00 [kmpathd/0]
 2180 ?        S<     0:00 [kmpathd/1]
 2189 ?        S<     0:00 [kmirrord]

Just loading the modules starts up the threads, regardless
of whether you use them. (Not sure why they're getting loaded,
something for me to look ...
From: Andrew Morton
Date: Monday, April 9, 2007 - 7:30 pm

That one's needed to parent the call_usermodehelper() apps.  I don't think
it does anything else.  We used to use keventd for this but that had some
problem which I forget.  (Who went and misnamed keventd to "events", too? 


That's the softlockup detector.  Confusingly named to look like a, err,

That's there to parent the kthread_create()d threads.  Could presumably use



This machine has one CPU, one sata disk and one DVD drive.  The above is







Sure.

I don't think it's completely silly to object to all this.  Sure, a kernel
thread is worth 4k in the best case, but I bet they have associated unused
resources and as we've seen, they can cause overhead.

-

From: Linus Torvalds
Date: Monday, April 9, 2007 - 7:46 pm

I think it was one of a long series of deadlocks. 

Using a "keventd" for many different things sounds clever and nice, but 
then sucks horribly when one event triggers another event, and they depend 
on each other. Solution: use independent threads for the events.

		Linus
-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 12:07 am

Nod.  That's the key problem with keventd.  Independent things must wait 
on each other.

That's why I feel thread creation -- cheap under Linux -- is quite 
appropriate for many of these situations.

	Jeff



-

From: Ingo Oeser
Date: Tuesday, April 10, 2007 - 3:20 pm

Maybe that (thread creation) can be done at open(), socket-creation, 
service request, syscall or whatever event triggers a driver/subsystem 
to actually queue work into a thread.

And when there is a close(), socket-destruction, service completion
or whatever these threads can be marked for destruction and destroyed
by a timer or even immediately.

Regards

Ingo Oeser

-- 
If something is getting cheap, it is getting wasted just because it is cheap.
-

From: Alexey Dobriyan
Date: Monday, April 9, 2007 - 10:07 pm

drivers/connector/connector.c:
455	dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);

-

From: Dave Jones
Date: Monday, April 9, 2007 - 10:21 pm

On Tue, Apr 10, 2007 at 09:07:54AM +0400, Alexey Dobriyan wrote:
 > On Mon, Apr 09, 2007 at 07:30:56PM -0700, Andrew Morton wrote:
 > > On Mon, 9 Apr 2007 21:59:12 -0400 Dave Jones <davej@redhat.com> wrote:
 > 
 > [possible topic for KS2007]
 > 
 > > >   164 ?        S<     0:00 [cqueue/0]
 > > >   165 ?        S<     0:00 [cqueue/1]
 > > >
 > > > I'm not even sure wth these are.
 > >
 > > Me either.
 > 
 > drivers/connector/connector.c:
 > 455	dev->cbdev = cn_queue_alloc_dev("cqueue", dev->nls);

Maybe I have apps relying on the connector stuff that I don't
even realise, but ttbomk, nothing actually *needs* this
for 99% of users if I'm not mistaken.

* wonders why he never built this modular..

config PROC_EVENTS
        boolean "Report process events to userspace"
        depends on CONNECTOR=y


Yay.

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Torsten Kaiser
Date: Monday, April 9, 2007 - 11:09 pm

One thread per port, not per device.

  796 ?        S      0:00  \_ [scsi_eh_0]
  797 ?        S      0:00  \_ [scsi_eh_1]
  798 ?        S      0:00  \_ [scsi_eh_2]
  819 ?        S      0:00  \_ [scsi_eh_3]
  820 ?        S      0:00  \_ [scsi_eh_4]
  824 ?        S      0:00  \_ [scsi_eh_5]
  825 ?        S      0:14  \_ [scsi_eh_6]

bardioc ~ # lsscsi -d
[0:0:0:0]    disk    ATA      ST3160827AS      3.42  /dev/sda[8:0]
[1:0:0:0]    disk    ATA      ST3160827AS      3.42  /dev/sdb[8:16]
[5:0:0:0]    disk    ATA      IBM-DHEA-36480   HE8O  /dev/sdc[8:32]
[5:0:1:0]    disk    ATA      Maxtor 6L160P0   BAH4  /dev/sdd[8:48]
[6:0:0:0]    cd/dvd  HL-DT-ST DVDRAM GSA-4081B A100  /dev/sr0[11:0]
bardioc ~ # lsscsi -H
[0]    sata_promise
[1]    sata_promise
[2]    sata_promise
[3]    sata_via
[4]    sata_via
[5]    pata_via
[6]    pata_via

The bad part is that there is always a thread, even if the hardware is not
even hotplug capable.

For me it's not the 4k that annoys me, but the clutter in ps or top.

Torsten
-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 12:08 am

Nope, it's not.  At least for SATA (your chosen examples), hotplug is 
handled by a libata-specific thread.

The SCSI EH threads are there purely for SCSI exception handling.  For 
the majority of SAS and SATA, we replace the entire SCSI EH handling 
code with our own, making those threads less useful than on the older 
(read: majority of) SCSI drivers.

	Jeff


-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 12:05 am

I would think this would run into the keventd "problem", where $N 
processes can lock out another?

IMO a lot of these could potentially be simply started as brand new 

No, it does not.

It is used for PIO data transfer, so it merely has to respond quickly, 
which rules out keventd.  You also don't want PIO data xfer for port A 
blocked, sitting around waiting for PIO data xfer to complete on port C.

So, we merely need fast-reacting threads that keventd will not block. 
We do not need per-CPU threads.

Again, I think a model where threads are created on demand, and reaped 
after inactivity, would fit here.  As I feel it would fit with many 

That is used by the libata exception handler, for hotplug and such.  My main 
worry with keventd is that we might get stuck behind an unrelated 
process for an undefined length of time.

IMO the best model would be to create ata_aux thread on demand, and kill 

Nod.  I've never thought we needed this many threads.  At least it 
doesn't scale out of control for $BigNum-CPU boxen.

As the name implies, this is the SCSI exception handling thread.  Although 
some synchronization is required, this could probably work with an 

This kernel thread is used as a "bottom half" handler for the PS2 mouse 

Agreed.

	Jeff


-

From: Andrew Morton
Date: Tuesday, April 10, 2007 - 12:37 am

I don't think it has ever been demonstrated that keventd latency is
excessive, or a problem.  I guess we could instrument it and fix stuff
easily enough.

The main problem with keventd has been flush_scheduled_work() deadlocks: the
caller of flush_scheduled_work() wants to flush work item A, but holds some
lock which is also needed by unrelated work item B.  Most of the time, it
works.  But if item B happens to be queued the flush_scheduled_work() will
deadlock.

The fix is to flush-and-cancel just item A: if it's not running yet, cancel
it.  If it is running, wait until it has finished.  Oleg's

	void cancel_work_sync(struct work_struct *work)

is queued for 2.6.22 and should permit some kthread->keventd conversions
which would previously have been deadlocky.
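
To make the difference concrete, here is a single-threaded userspace model
of the two approaches.  It is only a sketch: struct work, struct queue,
flush_all() and cancel_one() are hypothetical stand-ins and not the
kernel's workqueue API.  Because nothing runs concurrently, the "if it is
running, wait until it has finished" half is not modelled:

```c
#include <stdbool.h>
#include <stddef.h>

#define QLEN 8

/* Hypothetical work item; needs_lock marks a handler that takes the
 * lock which the flushing caller may already hold. */
struct work {
    bool needs_lock;
    bool pending;
    int runs;
};

struct queue {
    struct work *items[QLEN];
    size_t count;
};

/* Queue an item.  Returns false if it is already pending or no room. */
static bool queue_work(struct queue *q, struct work *w)
{
    if (w->pending || q->count == QLEN)
        return false;
    w->pending = true;
    q->items[q->count++] = w;
    return true;
}

/*
 * Flush everything that is queued.  Returns false in the situation where
 * the real thing would deadlock: some queued item needs the lock that
 * the caller is holding while it waits.
 */
static bool flush_all(struct queue *q, bool caller_holds_lock)
{
    for (size_t i = 0; i < q->count; i++)
        if (caller_holds_lock && q->items[i]->needs_lock)
            return false;
    for (size_t i = 0; i < q->count; i++) {
        q->items[i]->pending = false;
        q->items[i]->runs++;
    }
    q->count = 0;
    return true;
}

/*
 * Deal with one item only: if it is still pending, take it off the
 * queue.  Unrelated items are never touched, so their locking needs do
 * not matter.  Returns true if the item was pending.
 */
static bool cancel_one(struct queue *q, struct work *w)
{
    for (size_t i = 0; i < q->count; i++) {
        if (q->items[i] == w) {
            for (size_t j = i + 1; j < q->count; j++)
                q->items[j - 1] = q->items[j];
            q->count--;
            w->pending = false;
            return true;
        }
    }
    return false;
}
```

With item A (no lock needed) and item B (needs the caller's lock) both
queued, flush_all() reports the deadlock case while cancel_one() on A
succeeds and leaves B queued.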


The thing to concentrate on here is the per-cpu threads, which is where the
proliferation appears to be coming from.  Conversions to
schedule_work()+cancel_work_sync() and conversions to
create_singlethread_workqueue().


-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 1:33 am

It's simple math, combined with user expectations.

On a 1-CPU or 2-CPU box, if you have three or more tasks, all of which 
are doing hardware reset tasks that could take 30-60 seconds (realistic 
for libata, SCSI and network drivers, at least), then OBVIOUSLY you have 
other tasks blocked for that length of time.

Since the cause of the latency is msleep() -- the entire reason why the 
driver wanted to use a kernel thread in the first place -- it would seem 
to me that the simple fix is to start a new thread, possibly exceeding 

That's been a problem in the past, yes, but a minor one.

I'm talking about a key conceptual problem with keventd.

It is easy to see how an extra-long tg3 hardware reset might prevent a 
disk hotplug event from being processed for 30-60 seconds.  And as 
hardware gets more complex -- see the Intel IOP storage card which runs 
Linux -- the reset times get longer, too.


Strongly agreed.

	Jeff


-

From: Andrew Morton
Date: Tuesday, April 10, 2007 - 1:41 am

Well that obviously would be a dumb way to use keventd.  One would need
to do schedule_work(), kick off the reset then do schedule_delayed_work()
to wait (or poll) for its termination.
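
A sketch of what that style looks like, written as a plain userspace
state machine.  struct fake_dev, reset_work() and POLL_MS are made up for
illustration, and the "hardware" is just a counter of polls until it
reports ready.  Each call stands for one invocation of the work item and
returns the delay after which it wants to be run again:

```c
#define POLL_MS 100u

enum reset_state { RESET_IDLE, RESET_STARTED, RESET_DONE };

/* Hypothetical device: polls_left counts the polls until it is ready. */
struct fake_dev {
    enum reset_state state;
    int polls_left;
};

/*
 * One invocation of the work item.  Returns the delay in milliseconds
 * after which it should be rescheduled, or 0 when the reset is finished.
 * It never sleeps, so it never holds up other queued work.
 */
static unsigned reset_work(struct fake_dev *d)
{
    switch (d->state) {
    case RESET_IDLE:
        d->state = RESET_STARTED;    /* kick off the reset */
        return POLL_MS;
    case RESET_STARTED:
        if (d->polls_left > 0) {     /* not ready yet, poll again later */
            d->polls_left--;
            return POLL_MS;
        }
        d->state = RESET_DONE;
        return 0;
    case RESET_DONE:
    default:
        return 0;
    }
}

/* Drive the state machine to completion; returns the total delay in ms. */
static unsigned run_reset(struct fake_dev *d)
{
    unsigned elapsed = 0, delay;

    while ((delay = reset_work(d)) != 0)
        elapsed += delay;
    return elapsed;
}
```

The cost is that the progress of the reset has to be stored in the device
structure between invocations rather than in local variables.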
-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 1:48 am

Far too complex.  See what Russell wrote, for instance.

When you are in a kernel thread, you can write more simple, 
straightforward, easy-to-debug code that does

	blah
	msleep()
	blah

rather than creating an insanely complicated state machine for the same 
thing.

ESPECIALLY if you are already inside a state machine (the case with 
libata PIO data xfer), you do not want to add another state machine 
inside of that.

A great many uses of kernel threads are to simplify device reset and 
polling in this way.  I know; a year ago I audited every kernel thread, 
because I was so frustrated with the per-CPU thread explosion.

Thus, rather than forcing authors to make their code more complex, we 
should find another solution.

	Jeff


-

From: Ingo Oeser
Date: Tuesday, April 10, 2007 - 3:35 pm

What about something like the "pre-forking" concept? So just have a thread creator
thread, which checks the number of unused threads and keeps it within certain limits.

So that anything which needs a thread now simply queues up the work and
specifies that it wants a new thread, if possible.
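
Roughly, the bookkeeping of such a creator thread could look like the
following userspace sketch.  All names (struct spare_pool,
creator_balance(), take_thread(), release_thread()) are hypothetical, and
threads are only counted, not actually created:

```c
#include <stdbool.h>

/* Hypothetical pool of pre-created threads, tracked as counters only. */
struct spare_pool {
    int idle;        /* pre-created threads waiting for work */
    int busy;        /* threads currently running some work */
    int min_idle;    /* creator tops up below this */
    int max_idle;    /* creator trims above this */
};

/*
 * One pass of the creator thread.  Returns how many threads it created
 * (positive) or destroyed (negative) to get idle back within limits.
 */
static int creator_balance(struct spare_pool *p)
{
    int delta = 0;

    if (p->idle < p->min_idle)
        delta = p->min_idle - p->idle;
    else if (p->idle > p->max_idle)
        delta = p->max_idle - p->idle;
    p->idle += delta;
    return delta;
}

/* A user asks for a fresh thread.  Returns false if none is spare. */
static bool take_thread(struct spare_pool *p)
{
    if (p->idle == 0)
        return false;
    p->idle--;
    p->busy++;
    return true;
}

/* The work is done and the thread goes back to the spare set. */
static void release_thread(struct spare_pool *p)
{
    p->busy--;
    p->idle++;
}
```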

One problem seems to be that a thread is nothing else but a statement
on what other tasks I can wait for before doing my current one (e.g. I don't want to 
msleep() twice on the same reset timeout). 
But we usually use locking to order that.

Am I missing anything fundamental here?

Regards

Ingo Oeser
-

From: Matt Mackall
Date: Tuesday, April 10, 2007 - 9:35 am

There are some upsides to persistent kernel threads that we might want
to keep in mind:

- they can be reniced, made RT, etc. as needed
- they can be bound to CPUs
- they collect long-term CPU usage information

Most of the above doesn't matter for the average kernel thread that's
handling the occasional hardware reset, but for others it could.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Russell King
Date: Tuesday, April 10, 2007 - 12:44 am

One per PC card socket to avoid the sysfs locking crappiness that would
otherwise deadlock, and to convert from the old unreadable state machine
implementation to a much more readable linearly coded implementation.

Could probably be eliminated if we had some mechanism to spawn a helper
thread to do some task as required which didn't block other helper
threads until it completes.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:
-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 1:16 am

kthread_run() should do that for you.  Creates a new thread with 
kthread_create(), and wakes it up immediately.  Goes away when you're 
done with it.

	Jeff


-

From: Ingo Molnar
Date: Tuesday, April 10, 2007 - 1:59 am

looks like the perfect usecase for threadlets. (threadlets only use up a 
separate context if necessary and can be coded in the familiar 
sequential/linear model)

(btw., threadlets could in theory be executed in irq context too, and if 
we block on anything it gets bounced off to a real context - although 
this certainly pushes the limits and there would still be some deadlock 
potential for things like irq-unsafe non-sleeping locks (spinlocks, 
rwlocks).)

	Ingo
-

From: Jeff Garzik
Date: Tuesday, April 10, 2007 - 2:33 am

Same response as to Andrew:  AFAICS that just increases complexity.

The simple path for programmers is writing straightforward code that 
does something like

	blah
	msleep()
	blah

or in pccardd's case,

	mutex_lock()
	blah
	mutex_unlock()

to permit sleeping without having to write more-complex code that deals 
with context transitions.

For slow-path, infrequently executed code, it is best to keep it as 
simple as possible.

	Jeff


-
