[ Resending, hopefully with all pieces ]
Detached tasks are not seen by zap_pid_ns_processes()->sys_wait4(), so
that release_task()->proc_flush_task() of container init can be called
before it is for some detached tasks in the namespace.
Pin proc_mnt's in copy_process(), so that proc_flush_task() becomes safe
whatever the ordering of tasks.
Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com>
---
fs/proc/base.c | 18 ++++++++++++++++++
include/linux/proc_fs.h | 4 ++++
kernel/fork.c | 1 +
3 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index acb7ef8..d6cdd91 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2663,6 +2663,23 @@ static const struct inode_operations proc_tgid_base_inode_operations = {
.setattr = proc_setattr,
};
+/*
+ * Pin all proc_mnt so that detached tasks can safely call proc_flush_task()
+ * after container init calls itself proc_flush_task().
+ */
+void proc_new_task(struct task_struct *task)
+{
+ struct pid *pid;
+ int i;
+
+ if (!task->pid)
+ return;
+
+ pid = task_pid(task);
+ for (i = 0; i <= pid->level; i++)
+ mntget(pid->numbers[i].ns->proc_mnt);
+}
+
static void proc_flush_task_mnt(struct vfsmount *mnt, pid_t pid, pid_t tgid)
{
struct dentry *dentry, *leader, *dir;
@@ -2744,6 +2761,7 @@ void proc_flush_task(struct task_struct *task)
upid = &pid->numbers[i];
proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
tgid->numbers[i].nr);
+ mntput(upid->ns->proc_mnt);
}
upid = &pid->numbers[pid->level];
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 379eaed..f24faa1 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -104,6 +104,7 @@ struct vmcore {
extern void proc_root_init(void);
+void proc_new_task(struct task_struct *task);
void proc_flush_task(struct task_struct *task);
extern struct proc_dir_entry *create_proc_entry(const char *name, mode_t mode,
@@ -184,6 ...NAK. Pids live their own lives - task can change one, pid will become orphan and will be destroyed, so you'll leak. There's another big problem with proc mount - you can umount proc and the namespace will have a stale pointer. --
There are two ways we can go from here. - Completely asynchronous children exiting. - Waiting for all of our children to exit. Your patch appears to be a weird middle ground, that is hard to analyze, abusing the mount count as a thread count. I have been weighing the options between them, and to me properly waiting for all the processes to exit in zap_pid_ns_processes as we currently try to do is in our advantage. It is simple and it means one less cache line bounce for a write to global variable in the much more common fork/exit path that we have to deal with. The task->children isn't changed until __unhash_process() which runs after flush_proc_task(). So we should be able to come up with a variant of do_wait() that zap_pid_ns_processes can use that does what we need. There is that nasty case in exec isn't there. Why we ever made it part of the ABI that calling exec on a thread changes the pid of Not true. pid_ns->proc_mnt is an internal kernel mount. See I agree, and we mostly do. In my queue for the unsharing of the pid namespace I have a patch that makes that more explicit. Eric --
Give me a few days to look at that. IMHO my patch fixes the bug (see the comment below), which was an emergency at work. Coding an improved fix is You're right that I forgot about de_thread(). However, de_thread() does not replace a task pid with an arbitrary other pid. The new pid lives in the same pid namespaces, so that proc_flush_task() will call mntput() on the same proc_mnts as the ones on which proc_new_task() called mntget(). Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Yes. But this is only the current implementation detail. It would be nice to cleanup the code so that EXIT_DEAD tasks are See above... Even if we modify do_wait() or add the new variant, how the caller can wait for EXIT_DEAD tasks? I don't think we want to modify release_task() to do __wake_up_parent() or something similar. Oleg. --
Indeed, I was thinking about calling __wake_up_parent() from release_task() once parent->children becomes empty. Not sure about the performance impact though. Maybe some WAIT_NO_CHILDREN flag in parent->signal could limit it. But if EXIT_DEAD children are removed from ->children before release_task(), I'm afraid that this becomes impossible. Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Thinking more, even the current do_wait() from zap_pid_ns_processes() is not really good. Suppose that some none-init thread is ptraced, then zap_pid_ns_processes() will hange until the tracer does do_wait() or exits. Oleg. --
Is this really a bad thing? If somebody ptraces a task in a pid namespace, that sounds reasonable to have this namespace (and it's init task) pinned. Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Louis. Have you seen this problem hit without my setns patch? I'm pretty certain that this hits because there are processes do_wait does not wait for, in particular processes in a disjoint process tree. So at this point I am really favoring killing the do_wait and making this all asynchronous. Eric --
Yes. I hit it with Kerrighed patches. I also have an ugly reproducer on 2.6.35-rc3 (see attachments). Ugly because I introduced artifical delays in release_task(). I couldn't trigger the bug without it, probably because the scheduler is too kind :) I'm using memory poisoining (SLAB and DEBUG_SLAB) to make it easy to observe the bug. Example: Any idea about how to do it? Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Some variant of the patches Oleg just recently posted. I'm still not comfortable with the extending the kernel mount to the entire lifetime of the pid_namespace. But it certainly is better than a lot of the alternatives. Eric --
I must have missed something, but can't we just move mntput() ?
Oleg.
--- x/kernel/pid_namespace.c
+++ x/kernel/pid_namespace.c
@@ -110,6 +110,9 @@ static void destroy_pid_namespace(struct
{
int i;
+ if (ns->proc_mount)
+ mntput(ns->proc_mount);
+
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
kmem_cache_free(pid_ns_cachep, ns);
--- x/fs/proc/base.c
+++ x/fs/proc/base.c
@@ -2745,10 +2745,6 @@ void proc_flush_task(struct task_struct
proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
tgid->numbers[i].nr);
}
-
- upid = &pid->numbers[pid->level];
- if (upid->nr == 1)
- pid_ns_release_proc(upid->ns);
}
static struct dentry *proc_pid_instantiate(struct inode *dir,
--
See the log of the commit introducing pid_ns_release_proc() (6f4e6433):
Sice the namespace holds the vfsmnt, vfsmnt holds the superblock and the
superblock holds the namespace we must explicitly break this circle to destroy
all the stuff. This is done after the init of the namespace dies. Running a
few steps forward - when init exits it will kill all its children, so no
proc_mnt will be needed after its death.
Thanks,
--
Dr Louis Rilling Kerlabs
Skype: louis.rilling Batiment Germanium
Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes
http://www.kerlabs.com/ 35700 Rennes
I see thanks. Besides, put_pid_ns() can be called in any context... Oleg. --
Not sure I ever understood this code. Certainly I can't say I understand
it now. Still, do we really need this circle? I am almost sure the patch
below is not right (and it wasn't tested at all), but could you take a
look?
Note: afaics we have another problem. What if copy_process(CLONE_NEWPID)
fails after pid_ns_prepare_proc() ? Who will do mntput() ? This patch
should fix this too (again, ___if___ it correct).
Oleg.
include/linux/pid_namespace.h | 1 +
kernel/pid_namespace.c | 3 +++
fs/proc/base.c | 4 ----
fs/proc/root.c | 18 ++++++++++++++----
4 files changed, 18 insertions(+), 8 deletions(-)
--- 34-rc1/include/linux/pid_namespace.h~PID_NS 2010-06-18 17:48:56.000000000 +0200
+++ 34-rc1/include/linux/pid_namespace.h 2010-06-18 17:53:11.000000000 +0200
@@ -26,6 +26,7 @@ struct pid_namespace {
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
+ struct work_struct proc_put;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct bsd_acct_struct *bacct;
--- 34-rc1/kernel/pid_namespace.c~PID_NS 2010-06-18 17:48:56.000000000 +0200
+++ 34-rc1/kernel/pid_namespace.c 2010-06-18 17:53:44.000000000 +0200
@@ -10,6 +10,7 @@
#include <linux/pid.h>
#include <linux/pid_namespace.h>
+#include <linux/proc_fs.h>
#include <linux/syscalls.h>
#include <linux/err.h>
#include <linux/acct.h>
@@ -109,6 +110,8 @@ static void destroy_pid_namespace(struct
{
int i;
+ pid_ns_release_proc(ns);
+
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
kmem_cache_free(pid_ns_cachep, ns);
--- 34-rc1/fs/proc/base.c~PID_NS 2010-06-18 17:48:56.000000000 +0200
+++ 34-rc1/fs/proc/base.c 2010-06-18 17:49:45.000000000 +0200
@@ -2720,10 +2720,6 @@ void proc_flush_task(struct task_struct
proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
tgid->numbers[i].nr);
}
-
- upid = &pid->numbers[pid->level];
- if (upid->nr == 1)
- pid_ns_release_proc(upid->ns);
}
static struct ...I won't pretend understanding better than you do. Still I can try to give my 2 cents. Overall, I don't feel comfortable at being able to have a living proc_mnt with a dead pid_ns. On the contrary, proc_mnt is almost never used from pid_ns. proc_flush_task() excepted, the only users I saw use I don't think that this is correct. IIUC, proc_delete_inode() calls put_pid() on ei->pid. So either a special case is added in proc_delete_inode(), or we try to live with get_pid() here. I'm actually not sure that we can pretend that this pid remains valid if we don't get_pid() here. Thanks, -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Yes, this should be fixed, I already realized this. work->func(ns) is called when ns is already fixed. it clears ei->pid. We are called from free_pid_ns() path, this ->pid must not have any reference. Any get_pid() implies get_pid_ns(). What do you think? Oleg. --
^^^^^
(I meant freed)
We can move kmem_cache_free(ns) into work->func(), or we can optimize
the usage of schedule_work(), something like the patch below.
Once again, it is completely untested, I do not pretend I understand
this code today, and I am sure the patch is wrong. I only try to discuss
the idea to break the circular reference. In any case, I am not proud
of this patch ;)
Or we should do the more sophisticated change suggested by Pavel. But
I'd like to avoid the changes in do_wait/release_task, this doesn't
look right to me.
Oleg.
include/linux/pid_namespace.h | 1 +
kernel/pid_namespace.c | 35 ++++++++++++++++++++++++++++++++++-
fs/proc/base.c | 4 ----
fs/proc/root.c | 10 ++++++----
4 files changed, 41 insertions(+), 9 deletions(-)
--- 34-rc1/include/linux/pid_namespace.h~PID_NS 2010-06-18 17:48:56.000000000 +0200
+++ 34-rc1/include/linux/pid_namespace.h 2010-06-18 22:37:45.000000000 +0200
@@ -26,6 +26,7 @@ struct pid_namespace {
struct pid_namespace *parent;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
+ struct list_head dead_node;
#endif
#ifdef CONFIG_BSD_PROCESS_ACCT
struct bsd_acct_struct *bacct;
--- 34-rc1/kernel/pid_namespace.c~PID_NS 2010-06-18 17:48:56.000000000 +0200
+++ 34-rc1/kernel/pid_namespace.c 2010-06-18 22:52:41.000000000 +0200
@@ -10,6 +10,7 @@
#include <linux/pid.h>
#include <linux/pid_namespace.h>
+#include <linux/proc_fs.h>
#include <linux/syscalls.h>
#include <linux/err.h>
#include <linux/acct.h>
@@ -105,15 +106,47 @@ out:
return ERR_PTR(-ENOMEM);
}
-static void destroy_pid_namespace(struct pid_namespace *ns)
+static LIST_HEAD(dead_list);
+static DEFINE_SPINLOCK(dead_lock);
+
+static void do_destroy_pid_namespace(struct pid_namespace *ns)
{
int i;
+ pid_ns_release_proc(ns);
+
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
kmem_cache_free(pid_ns_cachep, ns);
}
+static void ...I don't know what I have missed, but this looks really right to me. Besides, we have yet another problem: proc_flush_task()->mntput() is just wrong. Consider the multithreaded execing init. I am going to simplify, test, and send the fix which moves mntput() into free_pid_ns() paths. But first of all I think we should cleanup the pid_ns_prepare_proc() logic. Imho, this code is really ugly. Please see the patches. The last one, 4/4, is orthogonal to other changes. Oleg. --
Depending on whether it is called by the init_pid_ns or not, proc_get_sb()
does different things to initialize PROC_I(s_root->d_inode)->pid, and both
look ugly, imho.
- init_pid_ns tries to initialize ->pid every time proc_get_sb() is
called, and we check the global proc_mnt != NULL to ensure proc_get_sb()
was already called in the past and thus it is safe to use sb->s_root.
- sub-namespaces set ->pid at MS_KERNMOUNT stage and thus proc_get_sb()
can'be called before the first alloc_pid() in copy_process(), ugly.
Consolidate this code, and initialize PROC_I(s_root->d_inode)->pid when
the proc fs is actually mounted (MS_KERNMOUNT is not set). This also
allows us to do more cleanups.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/root.c | 23 +++++++----------------
1 file changed, 7 insertions(+), 16 deletions(-)
--- 35-rc3/fs/proc/root.c~PNS_1_PROC_GET_SB_ROOT_PID 2010-06-19 19:20:23.000000000 +0200
+++ 35-rc3/fs/proc/root.c 2010-06-19 19:21:38.000000000 +0200
@@ -41,18 +41,6 @@ static int proc_get_sb(struct file_syste
int err;
struct super_block *sb;
struct pid_namespace *ns;
- struct proc_inode *ei;
-
- if (proc_mnt) {
- /* Seed the root directory with a pid so it doesn't need
- * to be special in base.c. I would do this earlier but
- * the only task alive when /proc is mounted the first time
- * is the init_task and it doesn't have any pids.
- */
- ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
- if (!ei->pid)
- ei->pid = find_get_pid(1);
- }
if (flags & MS_KERNMOUNT)
ns = (struct pid_namespace *)data;
@@ -71,15 +59,18 @@ static int proc_get_sb(struct file_syste
return err;
}
- ei = PROC_I(sb->s_root->d_inode);
+ sb->s_flags |= MS_ACTIVE;
+ ns->proc_mnt = mnt;
+ }
+
+ /* MS_KERNMOUNT is called before ns has any pids */
+ if (!(flags & MS_KERNMOUNT)) {
+ struct proc_inode *ei = PROC_I(sb->s_root->d_inode);
if (!ei->pid) {
rcu_read_lock();
ei->pid = get_pid(find_pid_ns(1, ...After the previous cleanup in proc_get_sb() the global proc_mnt has
no reasons to exists, kill it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
fs/proc/internal.h | 1 -
fs/proc/inode.c | 2 --
fs/proc/root.c | 5 +++--
3 files changed, 3 insertions(+), 5 deletions(-)
--- 35-rc3/fs/proc/internal.h~PNS_2_KILL_ROOT_MNT 2010-06-19 19:20:23.000000000 +0200
+++ 35-rc3/fs/proc/internal.h 2010-06-19 19:21:39.000000000 +0200
@@ -106,7 +106,6 @@ static inline struct proc_dir_entry *pde
}
void pde_put(struct proc_dir_entry *pde);
-extern struct vfsmount *proc_mnt;
int proc_fill_super(struct super_block *);
struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);
--- 35-rc3/fs/proc/inode.c~PNS_2_KILL_ROOT_MNT 2010-06-19 19:20:23.000000000 +0200
+++ 35-rc3/fs/proc/inode.c 2010-06-19 19:21:39.000000000 +0200
@@ -43,8 +43,6 @@ static void proc_delete_inode(struct ino
clear_inode(inode);
}
-struct vfsmount *proc_mnt;
-
static struct kmem_cache * proc_inode_cachep;
static struct inode *proc_alloc_inode(struct super_block *sb)
--- 35-rc3/fs/proc/root.c~PNS_2_KILL_ROOT_MNT 2010-06-19 19:21:38.000000000 +0200
+++ 35-rc3/fs/proc/root.c 2010-06-19 19:21:39.000000000 +0200
@@ -94,14 +94,15 @@ static struct file_system_type proc_fs_t
void __init proc_root_init(void)
{
+ struct vfsmount *mnt;
int err;
proc_init_inodecache();
err = register_filesystem(&proc_fs_type);
if (err)
return;
- proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
- if (IS_ERR(proc_mnt)) {
+ mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
+ if (IS_ERR(mnt)) {
unregister_filesystem(&proc_fs_type);
return;
}
--
Imho, pid_ns_prepare_proc() in copy_process() looks very ugly.
Now that pid_ns_prepare_proc()->proc_get_sb() does not depend on the
first alloc_pid(), we can move this code into create_pid_namespace().
This looks much more clean to me, and in theory more correct in case
we ever implement sys_unshare(CLONE_NEWPID).
Note: with or without this patch we have a bug. If CLONE_NEWPID fails
after pid_ns_prepare_proc(), nobody frees ns->proc_mnt. This will be
fixed by the next patches.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/fork.c | 7 -------
kernel/pid_namespace.c | 10 +++++++++-
2 files changed, 9 insertions(+), 8 deletions(-)
--- 35-rc3/kernel/fork.c~PNS_3_KILL_FORK_MOUNT 2010-06-19 19:20:22.000000000 +0200
+++ 35-rc3/kernel/fork.c 2010-06-19 20:17:35.000000000 +0200
@@ -58,7 +58,6 @@
#include <linux/taskstats_kern.h>
#include <linux/random.h>
#include <linux/tty.h>
-#include <linux/proc_fs.h>
#include <linux/blkdev.h>
#include <linux/fs_struct.h>
#include <linux/magic.h>
@@ -1154,12 +1153,6 @@ static struct task_struct *copy_process(
pid = alloc_pid(p->nsproxy->pid_ns);
if (!pid)
goto bad_fork_cleanup_io;
-
- if (clone_flags & CLONE_NEWPID) {
- retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
- if (retval < 0)
- goto bad_fork_free_pid;
- }
}
p->pid = pid_nr(pid);
--- 35-rc3/kernel/pid_namespace.c~PNS_3_KILL_FORK_MOUNT 2010-06-19 19:20:22.000000000 +0200
+++ 35-rc3/kernel/pid_namespace.c 2010-06-19 19:21:42.000000000 +0200
@@ -10,6 +10,7 @@
#include <linux/pid.h>
#include <linux/pid_namespace.h>
+#include <linux/proc_fs.h>
#include <linux/syscalls.h>
#include <linux/err.h>
#include <linux/acct.h>
@@ -72,6 +73,7 @@ static struct pid_namespace *create_pid_
{
struct pid_namespace *ns;
unsigned int level = parent_pid_ns->level + 1;
+ int err = -ENOMEM;
int i;
ns = kmem_cache_zalloc(pid_ns_cachep, GFP_KERNEL);
@@ -96,14 +98,20 @@ static struct pid_namespace ...Cleanup: kill the dead code which does nothing but complicates the code
and confuses the reader.
sys_unshare(CLONE_THREAD/SIGHAND/VM) is not really implemented, and I doubt
very much it will ever work. At least, nobody even tried since the original
"unshare system call -v5: system call handler function" commit
99d1419d96d7df9cfa56bc977810be831bd5ef64 was applied more than 4 years ago.
And the code is not consistent. unshare_thread() always fails unconditionally,
while unshare_sighand() and unshare_vm() pretend to work if there is nothing
to unshare.
Remove unshare_thread(), unshare_sighand(), unshare_vm() helpers and related
variables and add a simple CLONE_THREAD | CLONE_SIGHAND| CLONE_VM check into
check_unshare_flags().
Also, move the "CLONE_NEWNS needs CLONE_FS" check from check_unshare_flags()
to sys_unshare(). This looks more consistent and matches the similar
do_sysvsem check in sys_unshare().
Note: with or without this patch "atomic_read(mm->mm_users) > 1" can give
a false positive due to get_task_mm().
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Acked-by: Roland McGrath <roland@redhat.com>
---
kernel/fork.c | 119 ++++++++++++----------------------------------------------
1 file changed, 25 insertions(+), 94 deletions(-)
--- 35-rc3/kernel/fork.c~PNS_4_UNSHARE_IS_REALLY_REALLY_UGLY 2010-06-19 20:17:35.000000000 +0200
+++ 35-rc3/kernel/fork.c 2010-06-19 20:17:51.000000000 +0200
@@ -1500,38 +1500,24 @@ void __init proc_caches_init(void)
}
/*
- * Check constraints on flags passed to the unshare system call and
- * force unsharing of additional process context as appropriate.
+ * Check constraints on flags passed to the unshare system call.
*/
-static void check_unshare_flags(unsigned long *flags_ptr)
+static int check_unshare_flags(unsigned long unshare_flags)
{
+ if (unshare_flags & ~(CLONE_THREAD|CLONE_FS|CLONE_NEWNS|CLONE_SIGHAND|
+ CLONE_VM|CLONE_FILES|CLONE_SYSVSEM|
+ CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWNET))
+ return -EINVAL;
...Since I have a patchset that makes it possible to unshare the pid namespace about ready to send I figure we should combine the two efforts. This patchset is a prerequisite to my patches for giving namespaces file descriptors and allowing you to join and existing namespace. When I look over my old notes it appears there Daniel managed to hit this proc_mnt reference counting in that context. So that is definitely interesting. Oleg take a look I think I have combined the best of our two patchsets. Eric --
It turns out that the existing assignment in copy_process of
the child_reaper can handle the initial assignment of child_reaper
we just need to generalize the test in kernel/fork.c
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
init/main.c | 9 ---------
kernel/fork.c | 2 +-
2 files changed, 1 insertions(+), 10 deletions(-)
diff --git a/init/main.c b/init/main.c
index 3bdb152..38f7edc 100644
--- a/init/main.c
+++ b/init/main.c
@@ -865,15 +865,6 @@ static int __init kernel_init(void * unused)
* init can run on any cpu.
*/
set_cpus_allowed_ptr(current, cpu_all_mask);
- /*
- * Tell the world that we're going to be the grim
- * reaper of innocent orphaned children.
- *
- * We don't want people to have to make incorrect
- * assumptions about where in the task array this
- * can be found.
- */
- init_pid_ns.child_reaper = current;
cad_pid = task_pid(current);
diff --git a/kernel/fork.c b/kernel/fork.c
index b6cce14..8b85b17 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1263,7 +1263,7 @@ static struct task_struct *copy_process(unsigned long clone_flags,
tracehook_finish_clone(p, clone_flags, trace);
if (thread_group_leader(p)) {
- if (clone_flags & CLONE_NEWPID)
+ if (pid->numbers[pid->level].nr == 1)
p->nsproxy->pid_ns->child_reaper = p;
p->signal->leader_pid = pid;
--
1.6.5.2.143.g8cc62
--
I must admit, personally I dislike this change. If it is needed for the next changes, please explain the need? Yes, it removes the line from __init function, but it complicates copy_process(), this doesn't look fair to me ;) I agree, the complication is minor, but still. And, in fact, to me this change hides CLONE_NEWPID from grep. In fact, I was looking at this code when I did 1/4. And I think it is better to move it (and perhaps another CLONE_NEWPID check in copy_signal) into copy_pid_ns() path. Oleg. --
OK, this is needed for 6/6. I still can't say I like this change (and 6/6 too ;), and it is not enough. If we spawn the new init because we called sys_unshare(CLONE_NEWPID) in the past (Eric, imho this can't be the really nice idea) we should also set TASK_UNKILLABLE at least. IOW. Not only this hides CLONE_NEWPID from grep, unshare() also hides it from paths which should know about this flag. I'd rather prefer the straightforward implementation of unshare(NEWPID) which merely adds SIGNAL_THE_NEXT_FORK_SHOULD_USE_CLONE_NEWPID flag to current->signal->flags. Yes, this is very ugly too. Oleg. --
Reorganize proc_get_sb so it can be called before the struct pid
of the first process is allocated.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/proc/root.c | 26 ++++++++------------------
kernel/fork.c | 6 ------
kernel/pid_namespace.c | 4 ++++
3 files changed, 12 insertions(+), 24 deletions(-)
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 4258384..5f15353 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -43,17 +43,6 @@ static int proc_get_sb(struct file_system_type *fs_type,
struct pid_namespace *ns;
struct proc_inode *ei;
- if (proc_mnt) {
- /* Seed the root directory with a pid so it doesn't need
- * to be special in base.c. I would do this earlier but
- * the only task alive when /proc is mounted the first time
- * is the init_task and it doesn't have any pids.
- */
- ei = PROC_I(proc_mnt->mnt_sb->s_root->d_inode);
- if (!ei->pid)
- ei->pid = find_get_pid(1);
- }
-
if (flags & MS_KERNMOUNT)
ns = (struct pid_namespace *)data;
else
@@ -71,17 +60,18 @@ static int proc_get_sb(struct file_system_type *fs_type,
return err;
}
- ei = PROC_I(sb->s_root->d_inode);
- if (!ei->pid) {
- rcu_read_lock();
- ei->pid = get_pid(find_pid_ns(1, ns));
- rcu_read_unlock();
- }
-
sb->s_flags |= MS_ACTIVE;
ns->proc_mnt = mnt;
}
+ /* Ensure the root directory has it's associated pid. */
+ ei = PROC_I(sb->s_root->d_inode);
+ if (!ei->pid) {
+ rcu_read_lock();
+ ei->pid = get_pid(find_pid_ns(1, ns));
+ rcu_read_unlock();
+ }
+
simple_set_mnt(mnt, sb);
return 0;
}
diff --git a/kernel/fork.c b/kernel/fork.c
index 8b85b17..3c2501d 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1154,12 +1154,6 @@ static struct task_struct *copy_process(unsigned long clone_flags,
pid = alloc_pid(p->nsproxy->pid_ns);
if (!pid)
goto bad_fork_cleanup_io;
-
- if (clone_flags & CLONE_NEWPID) {
- retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
- if ...This is what [PATCH 1/4] procfs: proc_get_sb: consolidate/cleanup root_inode->pid logic [PATCH 2/4] procfs: kill the global proc_mnt variable Afaics, this is wrong. If pid_ns_prepare_proc() fails we shouldn't blindly return ENOMEM and, more importantly, we need put_pid_ns(parent_ns). Oleg. --
From: Oleg Nesterov <oleg@redhat.com>
After the previous cleanup in proc_get_sb() the global proc_mnt has
no reasons to exists, kill it.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/proc/inode.c | 2 --
fs/proc/internal.h | 1 -
fs/proc/root.c | 5 +++--
3 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index aea8502..e538886 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -43,8 +43,6 @@ static void proc_delete_inode(struct inode *inode)
clear_inode(inode);
}
-struct vfsmount *proc_mnt;
-
static struct kmem_cache * proc_inode_cachep;
static struct inode *proc_alloc_inode(struct super_block *sb)
diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 1f24a3e..ae01ca6 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -106,7 +106,6 @@ static inline struct proc_dir_entry *pde_get(struct proc_dir_entry *pde)
}
void pde_put(struct proc_dir_entry *pde);
-extern struct vfsmount *proc_mnt;
int proc_fill_super(struct super_block *);
struct inode *proc_get_inode(struct super_block *, unsigned int, struct proc_dir_entry *);
diff --git a/fs/proc/root.c b/fs/proc/root.c
index 5f15353..310ac20 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -93,14 +93,15 @@ static struct file_system_type proc_fs_type = {
void __init proc_root_init(void)
{
+ struct vfsmount *mnt;
int err;
proc_init_inodecache();
err = register_filesystem(&proc_fs_type);
if (err)
return;
- proc_mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
- if (IS_ERR(proc_mnt)) {
+ mnt = kern_mount_data(&proc_fs_type, &init_pid_ns);
+ if (IS_ERR(mnt)) {
unregister_filesystem(&proc_fs_type);
return;
}
--
1.6.5.2.143.g8cc62
--
In the case of unsharing or joining a pid namespace, it becomes
possible to attempt to allocate a pid after zap_pid_namespace has
killed everything in the namespace. Close the hole for now by simply
not allowing any of those pid allocations to succeed. At least for
now it is too strange to think about.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/pid_namespace.h | 1 +
kernel/pid.c | 4 ++++
kernel/pid_namespace.c | 2 ++
3 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..b447d37 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -20,6 +20,7 @@ struct pid_namespace {
struct kref kref;
struct pidmap pidmap[PIDMAP_ENTRIES];
int last_pid;
+ atomic_t dead;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..e1f26ec 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -248,6 +248,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct pid_namespace *tmp;
struct upid *upid;
+ pid = NULL;
+ if (atomic_read(&ns->dead))
+ goto out;
+
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
goto out;
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index b90e4ab..34a251d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -90,6 +90,7 @@ static struct pid_namespace *create_pid_namespace(struct pid_namespace *parent_p
kref_init(&ns->kref);
ns->level = level;
ns->parent = get_pid_ns(parent_pid_ns);
+ atomic_set(&ns->dead, 0);
set_bit(0, ns->pidmap[0].page);
atomic_set(&ns->pidmap[0].nr_free, BITS_PER_PAGE - 1);
@@ -161,6 +162,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
*
*/
read_lock(&tasklist_lock);
+ atomic_set(&pid_ns->dead, 1);
nr = next_pidmap(pid_ns, 1);
while (nr > 0) {
...Well, I didn't read the next patch, so I don't understand the changelog. Why it is atomic_t? It is used like a simple boolean, and the next The only caller of alloc_pid() is copy_process(). So, at first glance this patch tries to block the attempts to create the tasks in this namespace. But what if copy_process() has already called alloc_pid() using this ns, but didn't do attach_pid() yet? Oleg. --
The expressions tsk->nsproxy->pid_ns and task_active_pid_ns
aka ns_of_pid(task_pid(tsk)) should have the same number of
cache line misses with the practical difference that
ns_of_pid(task_pid(tsk)) is released later in a processes life.
Furthermore by using task_active_pid_ns it becomes trivial
to write an unshare implementation for the the pid namespace.
So I have used task_active_pid_ns everywhere I can.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
arch/powerpc/platforms/cell/spufs/sched.c | 2 +-
arch/um/drivers/mconsole_kern.c | 2 +-
fs/proc/root.c | 2 +-
kernel/cgroup.c | 3 +--
kernel/perf_event.c | 2 +-
kernel/pid.c | 8 ++++----
kernel/signal.c | 9 ++++-----
kernel/sysctl_binary.c | 2 +-
8 files changed, 14 insertions(+), 16 deletions(-)
diff --git a/arch/powerpc/platforms/cell/spufs/sched.c b/arch/powerpc/platforms/cell/spufs/sched.c
index 0b04662..82e26a0 100644
--- a/arch/powerpc/platforms/cell/spufs/sched.c
+++ b/arch/powerpc/platforms/cell/spufs/sched.c
@@ -1095,7 +1095,7 @@ static int show_spu_loadavg(struct seq_file *s, void *private)
LOAD_INT(c), LOAD_FRAC(c),
count_active_contexts(),
atomic_read(&nr_spu_contexts),
- current->nsproxy->pid_ns->last_pid);
+ task_active_pid_ns(current)->last_pid);
return 0;
}
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index de317d0..77787d2 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@ void mconsole_log(struct mc_request *req)
void mconsole_proc(struct mc_request *req)
{
struct nameidata nd;
- struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+ struct vfsmount *mnt = task_active_pid_ns(current)->proc_mnt;
struct file *file;
int n, err;
char *ptr = req->request.data, *buf;
diff --git ...- Allow CLONEW_NEWPID into unshare.
- Pass both nsproxy->pid_ns and task_active_pid_ns to copy_pid_ns
As they can now be different.
Unsharing of the pid namespace unlike unsharing of other namespaces
does not take affect immediately. Instead it affects the children
created with fork and clone. The first of these children becomes the init
process of the new pid namespace, the rest become oddball children
of pid 0. From the point of view of the new pid namespace the process
that created it is pid 0, as it's pid does not map.
A couple of different semantics were considered but this one was
settled on because it is easy to implement and it is usable from
pam modules. The core reasons for the existence of unshare.
I took a survey of the callers of pam modules and the following
appears to be a representative sample of their logic.
{
setup stuff include pam
child = fork();
if (!child) {
setuid()
exec /bin/bash
}
waitpid(child);
pam and other cleanup
}
As you can see there is a fork to create the unprivileged user
space process. Which means that the unprivileged user space
process will appear as pid 1 in the new pid namespace. Further
most login processes do not cope with extraneous children which
means shifting the duty of reaping extraneous child process to
the creator of those extraneous children makes the system more
comprehensible.
The practical reason for this set of pid namespace semantics is
that it is simple to implement and verify they work correctly.
Whereas an implementation that requres changing the struct
pid on a process comes with a lot more races and pain. Not
the least of which is that glibc caches getpid().
These semantics are implemented by having two notions
of the pid namespace of a proces. There is task_active_pid_ns
which is the pid namspace the process was created with
and the pid namespace that all pids are presented to
that process in. The task_active_pid_ns is stored
in the ...Cough. It is too late to me to even try to understand the changelog. Instead I tried to quickly read the patch. Most probably I missed somthing, but still I'd like to ask the quiestion. So. If I understand correctly, the patch is simple: - unshare(CLONE_NEWPID) changes current->proxy->pid_ns, but do not change current->pids[] and thus it doesn't change task_active_pid_ns(). - since copy_process() uses ->proxy->pid_ns for alloc_pid() the new children will fall into the new ns. IOW, the caller becomes the "swapper" for the new namespace. Correct? If yes, I'm afraid nobody except you will understand this magic ;) But what if the task T does unshare(CLONE_NEWPID) and then, say, pthread_create() ? Unless I missed something, the new thread won't be able to see T ? OK, suppose it does fork() after unshare(), then another fork(). In this case the second child lives in the same namespace with init created by the 1st fork, but it is not descendant ? This means in particular that if the new init exits, zap_pid_ns_processes()-> do_wait() can't work. I hope I missed something, this all is too subtle for me. And I still do not understand 4/6 which adds ns->dead. Oleg. --
and, in this case the exiting sub-namespace init also kills its And, forgot to mention. With this change proc_flush_task()->mntput() becomes even more wrong. Oleg. --
do_wait() can't work and I missed that dependency the first time around. Having looked at my earlier bug report from Daniel when I was playing with this patchset earlier it is clear that he was triggering the proc_mnt race with such a process. So except for ptrace I don't think the proc_mnt problem is possible ns->dead is just a flag to say no more processes in the pid namespace. Which means an unshare into the pid namespace after zap_pid_ns_processes has been called will fail(). Eric --
Why? Once again, it is very possible I am wrong. I forgot this code if ever I do not understand. Eric, why you can't do these changes on top of the cleanups I sent? OK, personally I certainly dislike 1/6, but perhaps it is needed for 6/6 which I didn't read yet. But, in any case, it is orthogonal to pid_ns_prepare_proc() cleanups? Now. You joined the first 2 patches I sent into 2/6. It is not that I care about the "From:" tag, but why? And (unless I missed something) you added the following changes compared to my patches: - remove the MS_KERNMOUNT check around ei->pid = find_pid(1). OK, I agree it was not strictly needed, but imho makes the code cleaner. Or I missed something and this check was wrong? - introduce the bug in create_pid_namespace(). If pid_ns_prepare_proc() fails, we return the wrong error code and leak parent_pid_ns(). So. Afaics - nack to 2/6 at least. Could you please do this on top of the cleanups I sent? Of course, unless you think they are wrong. And. I do not think these series can fix the discussed problems. ns->dead definitely can't, no? I think we should fix the bugs first. Oleg. --
So. Please the the patches. On top of [PATCH 1/4] procfs: proc_get_sb: consolidate/cleanup root_inode->pid logic [PATCH 2/4] procfs: kill the global proc_mnt variable [PATCH 3/4] procfs: move pid_ns_prepare_proc() from copy_process() to create_pid_namespace() Still untested, sorry. There was no electricity in office this weekend, and I can hardly work from home ;) I am sending this fix for review. And let me repeat, I do not think this approach is the best (even _if_ this change is correct), just my attempt to fix the bugs. What do you think? Oleg. --
A separate patch to simplify the review of the next change.
Move destroy_pid_namespace() into workqueue context. This allows us to do
mntput() from free_pid_ns() paths, see the next patch.
Add the new member, "struct work_struct destroy" into struct pid_namespace
and change free_pid_ns() to call destroy_pid_namespace() via schedule_work().
The patch looks a bit complicated because it also moves copy_pid_ns() up.
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
include/linux/pid_namespace.h | 1 +
kernel/pid_namespace.c | 25 +++++++++++++++++--------
2 files changed, 18 insertions(+), 8 deletions(-)
--- 35-rc3/include/linux/pid_namespace.h~PNS_5_DESTROY_FROM_WQ 2009-04-06 00:03:41.000000000 +0200
+++ 35-rc3/include/linux/pid_namespace.h 2010-06-20 17:47:05.000000000 +0200
@@ -24,6 +24,7 @@ struct pid_namespace {
struct kmem_cache *pid_cachep;
unsigned int level;
struct pid_namespace *parent;
+ struct work_struct destroy;
#ifdef CONFIG_PROC_FS
struct vfsmount *proc_mnt;
#endif
--- 35-rc3/kernel/pid_namespace.c~PNS_5_DESTROY_FROM_WQ 2010-06-19 19:21:42.000000000 +0200
+++ 35-rc3/kernel/pid_namespace.c 2010-06-20 18:36:00.000000000 +0200
@@ -114,6 +114,16 @@ out:
return ERR_PTR(err);
}
+struct pid_namespace *copy_pid_ns(unsigned long flags,
+ struct pid_namespace *old_ns)
+{
+ if (!(flags & CLONE_NEWPID))
+ return get_pid_ns(old_ns);
+ if (flags & (CLONE_THREAD|CLONE_PARENT))
+ return ERR_PTR(-EINVAL);
+ return create_pid_namespace(old_ns);
+}
+
static void destroy_pid_namespace(struct pid_namespace *ns)
{
int i;
@@ -123,13 +133,11 @@ static void destroy_pid_namespace(struct
kmem_cache_free(pid_ns_cachep, ns);
}
-struct pid_namespace *copy_pid_ns(unsigned long flags, struct pid_namespace *old_ns)
+static void destroy_pid_namespace_work(struct work_struct *work)
{
- if (!(flags & CLONE_NEWPID))
- return get_pid_ns(old_ns);
- if (flags & (CLONE_THREAD|CLONE_PARENT))
- return ERR_PTR(-EINVAL);
- return ...pid_namespace holds ns->proc_mnt, while this vfsmount has a referene to
the namespace via PROC_I(sb->s_root->d_inode)->pid. To break this circle
/sbin/init does mntput() in pid_ns_release_proc(). See 6f4e6433.
But we have the following problems:
- Nobody does mntput() if copy_process() fails after
pid_ns_prepare_proc().
- proc_flush_task() checks upid->nr == 1 to verify we are init,
this is wrong if a multi-threaded init does exec.
- As Louis pointed out, this namespace can have the detached
EXIT_DEAD tasks which can use ns->proc_mnt after this mntput().
With this patch only pid_namespace has a reference to ns->proc_mnt, and
mntput(ns->proc_mnt) is called by destroy_pid_namespace() paths when we
know that this ns must not have any references (in particular, there are
no pids in this namespace).
Changes:
- kill proc_flush_task()->pid_ns_release_proc()
- change fs/proc/root.c so that we don't create the "artificial"
references to the namespace or its pid==1.
- change destroy_pid_namespace() to call pid_ns_release_proc().
- change pid_ns_release_proc() to clear s_root->d_inode->pid.
The caller is destroy_pid_namespace(), this pid was already
freed.
Reported-by: Louis Rilling <louis.rilling@kerlabs.com>
Signed-off-by: Oleg Nesterov <oleg@redhat.com>
---
kernel/pid_namespace.c | 2 ++
fs/proc/base.c | 4 ----
fs/proc/root.c | 10 ++++++----
3 files changed, 8 insertions(+), 8 deletions(-)
--- 35-rc3/kernel/pid_namespace.c~PNS_6_BREAK_CIRCLE 2010-06-20 18:36:00.000000000 +0200
+++ 35-rc3/kernel/pid_namespace.c 2010-06-20 18:50:30.000000000 +0200
@@ -128,6 +128,8 @@ static void destroy_pid_namespace(struct
{
int i;
+ pid_ns_release_proc(ns);
+
for (i = 0; i < PIDMAP_ENTRIES; i++)
kfree(ns->pidmap[i].page);
kmem_cache_free(pid_ns_cachep, ns);
--- 35-rc3/fs/proc/base.c~PNS_6_BREAK_CIRCLE 2010-05-28 13:41:41.000000000 +0200
+++ 35-rc3/fs/proc/base.c 2010-06-20 18:51:14.000000000 +0200
@@ ...There are two kinds of dead for a pid namespace. There are: - no processes left. - no more references to struct pid_namespace. I just looked and I don't see any references to proc_mnt except from living processes. So while it isn't necessary that we kill the proc_mnt earlier it does Because there are conflicts, and if we are going to be going to 1/6 is. If you unshare a pid namespace. Your first child is pid one. Which means we can on longer count on CLONE_PID. I wrote that patch in March. So it is equally fair to say you split The MS_KERNMOUNT check was simply unnecessary, and it makes the code uglier to read and more brittle. Since I already had something that was only looking at the essential details I didn't see the Because I goofed, in March when I wrote it. Your patch got that right Well I think that entire MS_KERNMOUNT test is unnecessary and I'm am fairly confident that we have the signal sending races fixed so we can reasonably expect having sent SIGKILL to all processes in a pid namespace ns->dead certainly doesn't help in it's current form. I do think my series informs us of the direction the code is going, and that is Your patchset currently goes beyond the minimal that would make sense for 2.6.35. So we are talking about code for 2.6.36, and I think the unshare of the pid namespace code is certainly close enough that it can also be ready for 2.6.36. So what we get is more brains engaged and caring on the project so hopefully get better code review. Eric --
Yes, it can delay destroy_pid_namespace(), sure. OK, I am not going to argue. I never said this fix is perfect. I'll be wating for the I disagree here. Sure, it is unnecessary, and I already said this. I added it to simply document the fact that find_pid() can't work if MS_KERNMOUNT is set, to me this certainly makes the code more understandable to the reader. Oleg. --
Sorry, didn't notice this part... Which races? I am talking about the current problems with pid_ns_release_proc(), we have at least 3 bugs, from the 2/2 changelog: - Nobody does mntput() if copy_process() fails after pid_ns_prepare_proc(). - proc_flush_task() checks upid->nr == 1 to verify we are init, this is wrong if a multi-threaded init does exec. - As Louis pointed out, this namespace can have the detached EXIT_DEAD tasks which can use ns->proc_mnt after this mntput(). Oleg. --
And the last one on top of this series, before I go away from this thread ;) Since my further fixes were not approved, I think at least it makes sense to cleanup pid_ns_release_proc() logic a bit. Oleg. --
This is mostly cleanup and optimization, but also fixes the bug. proc_flush_task() checks upid->nr == 1 to detect the case when a sub-namespace exits. However, this doesn't work in case when a multithreaded init execs and calls release_task(old_leader), the old leader has the same pid 1. Move pid_ns_release_proc() to zap_pid_ns_processes(), it is called when we know for sure that init is exiting. Note: with or without this change this mntput() can happen before the EXIT_DEAD tasks not visible to do_wait() have passed proc_flush_task(). We need more fixes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> --- fs/proc/base.c | 4 ---- kernel/pid_namespace.c | 2 ++ 2 files changed, 2 insertions(+), 4 deletions(-) --- 35-rc3/fs/proc/base.c~PNS_5_MOVE_MNTPUT_TO_ZAP 2010-06-23 22:06:01.000000000 +0200 +++ 35-rc3/fs/proc/base.c 2010-06-23 22:10:26.000000000 +0200 @@ -2745,10 +2745,6 @@ void proc_flush_task(struct task_struct proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, tgid->numbers[i].nr); } - - upid = &pid->numbers[pid->level]; - if (upid->nr == 1) - pid_ns_release_proc(upid->ns); } static struct dentry *proc_pid_instantiate(struct inode *dir, --- 35-rc3/kernel/pid_namespace.c~PNS_5_MOVE_MNTPUT_TO_ZAP 2010-06-23 22:13:07.000000000 +0200 +++ 35-rc3/kernel/pid_namespace.c 2010-06-23 22:13:55.000000000 +0200 @@ -189,6 +189,8 @@ void zap_pid_ns_processes(struct pid_nam } while (rc != -ECHILD); acct_exit_ns(pid_ns); + pid_ns_release_proc(pid_ns); + return; } --
Oleg Nesterov [oleg@redhat.com] wrote: | This is mostly cleanup and optimization, but also fixes the bug. | | proc_flush_task() checks upid->nr == 1 to detect the case when | a sub-namespace exits. However, this doesn't work in case when | a multithreaded init execs and calls release_task(old_leader), | the old leader has the same pid 1. | | Move pid_ns_release_proc() to zap_pid_ns_processes(), it is called | when we know for sure that init is exiting. Hmm, I almost agreed, but have a question :-) Yes, we know that the container-init is exiting. But if its parent (in the parent ns) waits on it and calls release_task(), won't we call proc_flush_task_mnt() on this container-init ? This would happen after dropping the mnt in zap_pid_ns_processes() no ? At the time zap_pid_ns_processes() is called, the container-init is still not in EXIT_ZOMBIE state right ? (Or does your statement below include EXIT_DEAD and EXIT_ZOMBIE tasks ?) | | Note: with or without this change this mntput() can happen before the | EXIT_DEAD tasks not visible to do_wait() have passed proc_flush_task(). | We need more fixes. | Sukadev --
Indeed. Thanks! Somehow I forgot that init itself has not passed proc_flush_task(). Oleg. --
Oleg with respect to your other patches I think they are some of
This actually guarantees a use after free for the namespace init:
do_exit()
exit_notify()
forget_original_parent()
find_new_reaper()
zap_pid_ns_processes()
release_task()
I agree.
--
Yes, thanks. I am stupid. Please ignore the patch. Oleg. --
It's completely untested and could be split into three patches. But I think that it solves the issues found so far, and that it will work with Eric's unshare(CLONE_NEWPID) too. What do you think about this approach? Thanks, Louis This patch does four things: - defer pid_ns_release_proc()->mntput() to a worqueue context, so that pid_ns_release_proc() can be called in atomic context; - introduce pid_ns->nr_pids, so that we can count the number of pids allocated by alloc_pidmap(); - move the call to pid_ns_prepare_proc() to alloc_pid(), where we know when the first pid of a namespace is allocated; - move the call to pid_ns_release_proc() to free_pid(), where we are now able to know when the last pid of a namespace is detached. This solves the missing mntput() in copy_process() cleanup path, since free_pid() is called to cleanup alloc_pid(). This solves the multi-threaded init doing exec issue, since all sub-threads including former leader have called proc_flush_task() when the last pid is detached. This solves the EXIT_DEAD tasks issue for the same reason. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> --- fs/proc/base.c | 4 ---- fs/proc/root.c | 11 ++++++++++- include/linux/pid_namespace.h | 3 +++ kernel/fork.c | 6 ------ kernel/pid.c | 16 +++++++++++++--- kernel/pid_namespace.c | 1 + 6 files changed, 27 insertions(+), 14 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index acb7ef8..455b109 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2745,10 +2745,6 @@ void proc_flush_task(struct task_struct *task) proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, tgid->numbers[i].nr); } - - upid = &pid->numbers[pid->level]; - if (upid->nr == 1) - pid_ns_release_proc(upid->ns); } static struct dentry *proc_pid_instantiate(struct inode *dir, diff --git a/fs/proc/root.c b/fs/proc/root.c index 4258384..9876cd9 ...
[ Resending because of buggy quotes in Eric's address. Sorry for the noise. ] It's completely untested and could be split into three patches. But I think that it solves the issues found so far, and that it will work with Eric's unshare(CLONE_NEWPID) too. What do you think about this approach? Thanks, Louis This patch does four things: - defer pid_ns_release_proc()->mntput() to a worqueue context, so that pid_ns_release_proc() can be called in atomic context; - introduce pid_ns->nr_pids, so that we can count the number of pids allocated by alloc_pidmap(); - move the call to pid_ns_prepare_proc() to alloc_pid(), where we know when the first pid of a namespace is allocated; - move the call to pid_ns_release_proc() to free_pid(), where we are now able to know when the last pid of a namespace is detached. This solves the missing mntput() in copy_process() cleanup path, since free_pid() is called to cleanup alloc_pid(). This solves the multi-threaded init doing exec issue, since all sub-threads including former leader have called proc_flush_task() when the last pid is detached. This solves the EXIT_DEAD tasks issue for the same reason. Signed-off-by: Louis Rilling <louis.rilling@kerlabs.com> --- fs/proc/base.c | 4 ---- fs/proc/root.c | 11 ++++++++++- include/linux/pid_namespace.h | 3 +++ kernel/fork.c | 6 ------ kernel/pid.c | 16 +++++++++++++--- kernel/pid_namespace.c | 1 + 6 files changed, 27 insertions(+), 14 deletions(-) diff --git a/fs/proc/base.c b/fs/proc/base.c index acb7ef8..455b109 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -2745,10 +2745,6 @@ void proc_flush_task(struct task_struct *task) proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr, tgid->numbers[i].nr); } - - upid = &pid->numbers[pid->level]; - if (upid->nr == 1) - pid_ns_release_proc(upid->ns); } static struct dentry *proc_pid_instantiate(struct inode ...
Oh. I shouldn't continue to participate in this discussion... I don't have the time and my previous patch proves that my patches should be ignored ;) This is what I personally dislike. I do not think pid_ns_prepare_proc() should depend on the fact that the first pid was already allocated. And, this subjective, yes, but it looks a bit strange that upid->nr has a reference to proc_mnt. And of course, imho it would nice to not create the circular reference we currently have. Louis, Eric. I am attaching my 2 patches (on top of cleanups) again. Could you take a look? Changes: - pid_ns_release_proc() nullifies sb->s_fs_info - proc_pid_lookup() and proc_self_readlink() check ns != NULL (this is sb->s_fs_info) I even tried to test this finally, seems to work. I am not going to argue if you prefer Louis's approach. But I will appreciate if you explain why my fix is wrong. I am curious because I spent 3 hours doing grep fs/proc ;) Oleg.
Oleg, I hope that you realize that we all appreciate your efforts to solve
Hopefully this is not much in alloc_pidmap() since we can expect that nr_pids
This patch itself does not look wrong to me.
This last point is what made me worry about your approach so far, although I did
not take time to spot the precise issues. Unfortunately I don't see what the
checks you added in proc_self_readlink(), proc_self_follow_link() and
proc_pid_lookup() buy. What does prevent destroy_pid_namespace() from running
concurrently? Maybe RCU could help in those cases?
Moreover, I think that proc_pid_readdir() should get some check too.
So what make me really worry about this approach is that it looks fragile. We
should find all locations where procfs expects that sb->s_fs_info is a valid pid
namespace, even if no more pids live in this namespace. I presume that this is
what you did, but this needs double-check at least.
Grepping for s_fs_info in fs/proc gives me:
- proc_test_super(), proc_set_super():
Ok because tasks only mount procfs of their active pid namespace.
- proc_kill_sb():
Your patch makes it ok.
- proc_single_show():
Ok because a pid must be alive to get here.
- proc_self_readlink(), proc_self_follow_link(), proc_pid_lookup():
Your patch could make them ok, provided that we protect against
destroy_pid_namespace()->kmem_cache_free().
- proc_pid_readdir():
Needs similar check and protection to proc_pid_lookup(), but there is another
issue: next_tgid() can find a dying task:
next_tgid() finds a task
task dies
last reference to ns is dropped
destroy_pid_namespace()
proc_pid_readdir()
->proc_pid_fill_cache()
->proc_fill_cache()
->proc_pid_instantiate()
->proc_pid_make_inode()
->new_inode()
->alloc_inode()
->kmem_cache_alloc(, GFP_KERNEL) blocks
So RCU would not be ...This also adds atomic op. But I mostly dislike the pid-ns-specific complications itself (including the mount-after-the-first-alloc_pid dependancy), not the minor perfomance penalty. Well, it checks ->reaper != NULL, that is why I don't verify ns. Looks like you are right, but this doesn't help proc_self_readlink(). I think we can fix all these problems, but I no longer think this approach can pretend to simplify the code. No, it will make the code more complex/ugly and potentially more buggy. Louis, thank you very much. Oleg. --
Louis Rilling [Louis.Rilling@kerlabs.com] wrote: | - proc_pid_readdir(): | Needs similar check and protection to proc_pid_lookup(), but there is another | issue: next_tgid() can find a dying task: Hmm, I thought proc_pid_readdir() would be a problem too but convinced myself that it would not - since a process running proc_pid_readdir() would have a reference to the pid namespace, in which case destroy_pid_ns() would not be called. | | next_tgid() finds a task | task dies | last reference to ns is dropped | destroy_pid_namespace() caller of next_tgid() holds a ref to pid-ns right ? Sukadev --
Where does this reference comes from ? proc_pid_readdir() pins the task_struct (ns->child_reaper), not the pid/ns. But I won't be surprised if I am wrong again ;) Oleg. --
Oleg Nesterov [oleg@redhat.com] wrote: | On 06/25, Sukadev Bhattiprolu wrote: | > | > Louis Rilling [Louis.Rilling@kerlabs.com] wrote: | > | - proc_pid_readdir(): | > | Needs similar check and protection to proc_pid_lookup(), but there is another | > | issue: next_tgid() can find a dying task: | > | > Hmm, I thought proc_pid_readdir() would be a problem too but convinced myself | > that it would not - since a process running proc_pid_readdir() would have | > a reference to the pid namespace, | | Where does this reference comes from ? Caller of proc_pid_readdir() would be living in the same pid namespace right ? i.e the pid namespace is not empty. If not they would be accessing a different instance of /proc -no ? Hmm, but thinking some more, if pid ns is created but /proc not remounted, the process would be accessing the parent ns (which cannot go away). A process cannot access a descendant/unrelated pid namespaces's /proc right ? | | proc_pid_readdir() pins the task_struct (ns->child_reaper), not the pid/ns. | | But I won't be surprised if I am wrong again ;) | | Oleg. --
Afaics, in general not.
Suppose that we do something like
if (!clone(CLONE_NEWPID)) {
mount("none", "/SUB_PROC", "proc", 0, NULL);
exit();
}
After that /SUB_PROC/ still exists, one can do "ls /SUB_PROC/".
This particular case is fine, ns->child_reaper was already cleared.
Yes ;)
Oleg.
--
Oleg Nesterov [oleg@redhat.com] wrote:
| On 06/25, Sukadev Bhattiprolu wrote:
| >
| > Oleg Nesterov [oleg@redhat.com] wrote:
| > | On 06/25, Sukadev Bhattiprolu wrote:
| > | >
| > | > Louis Rilling [Louis.Rilling@kerlabs.com] wrote:
| > | > | - proc_pid_readdir():
| > | > | Needs similar check and protection to proc_pid_lookup(), but there is another
| > | > | issue: next_tgid() can find a dying task:
| > | >
| > | > Hmm, I thought proc_pid_readdir() would be a problem too but convinced myself
| > | > that it would not - since a process running proc_pid_readdir() would have
| > | > a reference to the pid namespace,
| > |
| > | Where does this reference comes from ?
| >
| > Caller of proc_pid_readdir() would be living in the same pid namespace right ?
|
| Afaics, in general not.
|
| Suppose that we do something like
|
| if (!clone(CLONE_NEWPID)) {
| mount("none", "/SUB_PROC", "proc", 0, NULL);
| exit();
| }
|
| After that /SUB_PROC/ still exists, one can do "ls /SUB_PROC/".
Yes, I see it now. Thanks,
Sukadev
--
Currently we have some subtle races when a pid namespace
exits and we need a simple way of close those races.
To close those races in a simple way I introduce an atomic
flag PIDNS_DEAD that we can teest to see if a pid namespace
has died.
When PIDNS_DEAD is set for a pid namespace all attempts to
lookup or add a pid to the pid namespace will fail.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
include/linux/pid_namespace.h | 7 +++++++
kernel/fork.c | 3 ++-
kernel/pid.c | 7 +++++++
kernel/pid_namespace.c | 1 +
4 files changed, 17 insertions(+), 1 deletions(-)
diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 38d1032..dcee0b3 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -16,10 +16,17 @@ struct pidmap {
struct bsd_acct_struct;
+enum pidns_flags {
+ PIDNS_DEAD, /* When set do not allow lookups of pids in the pid namespace,
+ * or adding new pids to the pid namespace.
+ */
+};
+
struct pid_namespace {
struct kref kref;
struct pidmap pidmap[PIDMAP_ENTRIES];
int last_pid;
+ unsigned long flags;
struct task_struct *child_reaper;
struct kmem_cache *pid_cachep;
unsigned int level;
diff --git a/kernel/fork.c b/kernel/fork.c
index f36585c..9818b20 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1237,7 +1237,8 @@ static struct task_struct *copy_process(unsigned long clone_flags,
* thread can't slip out of an OOM kill (or normal SIGKILL).
*/
recalc_sigpending();
- if (signal_pending(current)) {
+ if (signal_pending(current) ||
+ test_bit(PIDNS_DEAD, &p->nsproxy->pid_ns->flags)) {
spin_unlock(&current->sighand->siglock);
write_unlock_irq(&tasklist_lock);
retval = -ERESTARTNOINTR;
diff --git a/kernel/pid.c b/kernel/pid.c
index e9fd8c1..1a921c7 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -248,6 +248,10 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct ...Currently it is possible to put proc_mnt before we have flushed the
last process that will use the proc_mnt to flush it's proc entries.
This race is fixed by not flushing proc entries for dead pid
namespaces, and calling pid_ns_release_proc unconditionally from
zap_pid_ns_processes after the pid namespace has been declared dead.
To ensure we don't unnecessarily leak any dcache entries with skipped
flushes pid_ns_release_proc flushes the entire proc_mnt when it is
called.
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
fs/proc/base.c | 9 +++++----
fs/proc/root.c | 3 +++
kernel/pid_namespace.c | 1 +
3 files changed, 9 insertions(+), 4 deletions(-)
diff --git a/fs/proc/base.c b/fs/proc/base.c
index acb7ef8..e9d84e1 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -2742,13 +2742,14 @@ void proc_flush_task(struct task_struct *task)
for (i = 0; i <= pid->level; i++) {
upid = &pid->numbers[i];
+
+ /* Don't bother flushing dead pid namespaces */
+ if (test_bit(PIDNS_DEAD, &upid->ns->flags))
+ continue;
+
proc_flush_task_mnt(upid->ns->proc_mnt, upid->nr,
tgid->numbers[i].nr);
}
-
- upid = &pid->numbers[pid->level];
- if (upid->nr == 1)
- pid_ns_release_proc(upid->ns);
}
static struct dentry *proc_pid_instantiate(struct inode *dir,
diff --git a/fs/proc/root.c b/fs/proc/root.c
index cfdf032..2298fdd 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -209,5 +209,8 @@ int pid_ns_prepare_proc(struct pid_namespace *ns)
void pid_ns_release_proc(struct pid_namespace *ns)
{
+ /* Flush any cached proc dentries for this pid namespace */
+ shrink_dcache_parent(ns->proc_mnt->mnt_root);
+
mntput(ns->proc_mnt);
}
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index 92032d1..43dec5d 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -189,6 +189,7 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns)
rc = sys_wait4(-1, NULL, __WALL, NULL);
} while (rc ...IMHO, nothing prevents zap_pid_ns_processes() from setting PIDNS_DEAD and calling pid_ns_release_proc() right now. zap_pid_ns_processes() does not wait for EXIT_DEAD (self-reaping) children to be released. Thanks, -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Good point we need something probably a lock to prevent proc_mnt from going away here. We might do a little better if we were starting with a specific dentry, those at least have some rcu properties but that isn't a big help. Hmm. Perhaps there is a way to completely restructure this flushing of dentries. It is just an optimization after all so we don't get too many stale dentries building up. It might just be worth it simply kill proc_flush_mnt altogether. I know it is measurable when we don't do the flushing but perhaps there can be a work struct that periodically wakes up and smacks stale proc dentries. Right now I really don't think proc_flush_task is worth the hassle it causes. Grumble, Grumble more thinking to do. Eric --
Indeed, proc_flush_task() seems to be the only bad guy trying to access pid_ns->proc_mnt after the death of the init process. But I don't know enough about the performance impact of removing it. -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Having proc reference the pid_namespace and the pid_namespace
reference proc is a serious reference counting problem, which has
resulted in both leaks and use after free problems. Mount already
knows how to go from a pid_namespace to a mount of proc, so we don't
need to cache the proc mount.
To do this I introduce get_proc_mnt and replace pid_ns->proc_mnt users
with it. Additionally I remove pid_ns_(prepare|release)_proc as they
are now unneeded.
This is slightly less efficient but it is much easier to avoid the
races. If efficiency winds up being a problem we can revisit our data
structures.
Reported-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
arch/um/drivers/mconsole_kern.c | 4 +++-
fs/hppfs/hppfs.c | 2 +-
fs/proc/base.c | 13 +++++++------
fs/proc/root.c | 15 ++-------------
include/linux/pid_namespace.h | 3 ---
include/linux/proc_fs.h | 13 +------------
kernel/fork.c | 6 ------
kernel/sysctl_binary.c | 3 ++-
8 files changed, 16 insertions(+), 43 deletions(-)
diff --git a/arch/um/drivers/mconsole_kern.c b/arch/um/drivers/mconsole_kern.c
index de317d0..e89f72c 100644
--- a/arch/um/drivers/mconsole_kern.c
+++ b/arch/um/drivers/mconsole_kern.c
@@ -125,7 +125,7 @@ void mconsole_log(struct mc_request *req)
void mconsole_proc(struct mc_request *req)
{
struct nameidata nd;
- struct vfsmount *mnt = current->nsproxy->pid_ns->proc_mnt;
+ struct vfsmount *mnt;
struct file *file;
int n, err;
char *ptr = req->request.data, *buf;
@@ -134,7 +134,9 @@ void mconsole_proc(struct mc_request *req)
ptr += strlen("proc");
ptr = skip_spaces(ptr);
+ mnt = get_proc_mnt(task_active_pid_ns(current));
err = vfs_path_lookup(mnt->mnt_root, mnt, ptr, LOOKUP_FOLLOW, &nd);
+ mntput(mnt);
if (err) {
mconsole_reply(req, "Failed to look up file", 1, 0);
goto out;
diff --git ...Quoting Eric W. Biederman (ebiederm@xmission.com): Uh, that looks like it's got to be a *huge* hit. Each kern_mount_data() will at least alloc space for a vfsmnt and proc_fs_type->name. Once for each pid level of each exiting task. (Or am I misreading?) -serge --
IIUC, the difference between this solution and the first one I proposed is that instead of pinning proc_mnt with mntget() at copy_process()-time, proc_mnt is looked for and, if possible, mntget() at release_task()-time. Could you elaborate on the trade-off, that is accessing proc_mnt at copy_process()-time vs looking up proc_mnt at release_task()-time? Thanks, -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
A little code simplicity. But Serge was right there is cost a noticeable cost. About 5%-7% more on lat_proc from lmbench. The real benefit was simplicity. Eric --
Fix zap_pid_ns_processes so that it successfully waits for all of the tasks in the pid namespace to be reaped, even if called for a non-leader task of the init process. This guarantees that no task can escpae the pid namespace, and that it is safe for proc_flush_task to put the proc_mnt of the dead pid_namespace when pid 1 of the pid namespace calls proc_flush_task. Before zap_pid_ns_processes was fixed to wait for all zombies in the pid namespace to be reaped the easiest way to generate a zombie that would escape the pid namespace would be to attach a debugger to a process inside a pidnamespace from outside the pid namespace and then exit the pid namespace. In the process of trying to fix this bug I have looked at a lot of different options and a lot of different directions we can go. There are several limiting factors. - We need to guarantee that the init process of a pid namespace is not reaped before every other task in the pid namespace is reaped. Wait succeeding on the init process of a pid namespace gives the guarantee that all processes in the pid namespace are dead and gone. Or more succinctly it is not possible to escape from a pid namespace. The previous behaviour where some zombies could escape the pid namespace violates the assumption made by some reapers of a pid namespace init that all of the pid namespace cleanup has completed by the time that init is reaped. - proc_flush_task needs to be called after each task is reaped. Tasks are volatile and applications like top and ps frequently touch every thread group directory in /proc which triggers dcache entries to be created. If we don't remove those dcache entries when tasks are reaped we can get a large build up of useless inodes and dentries. Shrink_slab is designed to flush out useful cache entries not useless ones so while in the big picture it doesn't hurt if we leak a few if we leak a lot of dentries we put unnatural pressure on the kernels memory ...
Hi Eric, This patch looks like it is working (only a small RCU issue shown below). I couldn't try it yet though. I must admit that I am using a similar back-off solution in Kerrighed (not to solve the issue of proc_flush_task(), but for one of the reasons that you stated above: we want to be sure that all tasks of the namespace have been reaped), but I considered it too ugly to propose it for Linux ;) rcu_read_unlock() should not be called before here, or task may be freed before Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Yeah I haven't had a chance to look at it in detail, but this looks right. Thanks, Eric. -serge --
Well sometimes you have to go with what works. Thanks for spotting those issue with my patch. I guess it needs one more Thanks for the review. The bug where processes can escape a pid namespace is really the reason I did it this way. I also have the patches needed to cleanly fix the pid namespace ref counting. (Hopefully I can get them posted soon). So if there was just the ref counting bug I would drop this patch. Eric --
Eric W. Biederman [ebiederm@xmission.com] wrote: | Louis Rilling <Louis.Rilling@kerlabs.com> writes: | | > This patch looks like it is working (only a small RCU issue shown below). I | > couldn't try it yet though. | | It certainly worked in my testing. | | > I must admit that I am using a similar back-off solution in Kerrighed (not to | > solve the issue of proc_flush_task(), but for one of the reasons that you stated | > above: we want to be sure that all tasks of the namespace have been reaped), but | > I considered it too ugly to propose it for Linux ;) | | Well sometimes you have to go with what works. | | Thanks for spotting those issue with my patch. I guess it needs one more | pass before I can call it done. Eric, Do you plan to resend this patch ? FYI, we ran into the problem that Louis Rilling reported when testing with the checkpoint/restart code and this patch fixes the problem (after a minor tweak to the C/R code). Thanks, Sukadev --
Eric W. Biederman [ebiederm@xmission.com] wrote:
|
| Changing zap_pid_ns_processes to fix the problem instead of
| changing the code elsewhere is one of the few solutions I have
| seen that does not increase the cost of the lat_proc test from
| lmbench.
I think its a good fix for the problem. but I have a nit and a minor
comment below.
Thanks,
Sukadev
|
| Reported-by: Louis Rilling <Louis.Rilling@kerlabs.com>
Reviewed-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
| Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
| ---
| kernel/pid_namespace.c | 50 ++++++++++++++++++++++++++++++++++++-----------
| 1 files changed, 38 insertions(+), 12 deletions(-)
|
| diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
| index a5aff94..aaf2ab0 100644
| --- a/kernel/pid_namespace.c
| +++ b/kernel/pid_namespace.c
| @@ -139,16 +139,20 @@ void free_pid_ns(struct kref *kref)
|
| void zap_pid_ns_processes(struct pid_namespace *pid_ns)
| {
| + struct task_struct *me = current;
| int nr;
| int rc;
| struct task_struct *task;
|
| /*
| - * The last thread in the cgroup-init thread group is terminating.
| - * Find remaining pid_ts in the namespace, signal and wait for them
| - * to exit.
| + * The last task in the pid namespace-init threa group is terminating.
nit: thread
| + * Find remaining pids in the namespace, signal and wait for them
| + * to to be reaped.
| *
| - * Note: This signals each threads in the namespace - even those that
| + * By waiting for all of the tasks to be reaped before init is reaped
| + * we provide the invariant that no task can escape the pid namespace.
| + *
| + * Note: This signals each task in the namespace - even those that
| * belong to the same thread group, To avoid this, we would have
| * to walk the entire tasklist looking a processes in this
| * namespace, but that could be unnecessarily expensive if the
| ...Which means we exercise that code path, and ensure we have same_thread_group test working properly. Given how rare threaded inits are every little bit It is possible that other threads of a multi-threaded init are in the PF_EXITING state and still visible for sending signals to. I really don't want to send SIG_KILL to another thread of init. There is a chance of messing up the return code if I do that, and do not want to need to think about that case. Eric --
Hm, I didn't look close enough. Sorry about that. However, I'm still concerned with this since this pid can have been freed right after container init's release_task(), and I don't see how it is guaranteed that nobody still tries to access this proc_mnt. Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
I meant the opposite. proc_mnt can be kept mounted somewhere, and accesses to it will likely try to access (freed) pid_ns from it. Thanks, Louis -- Dr Louis Rilling Kerlabs Skype: louis.rilling Batiment Germanium Phone: (+33|0) 6 80 89 08 23 80 avenue des Buttes de Coesmes http://www.kerlabs.com/ 35700 Rennes
Not sure, there must be no tasks (and pids) in this ns. By anyway I agree. This needs more thinking and we should do something with sb->s_fs_info. But given that nobody (including me) seem to like this approach - let's forget this patch. Thanks, Oleg. --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
