[RFC][0/3] Virtual address space control for cgroups

Previous thread: [PATCH] LED: add support to leds with readable status by Henrique de Moraes Holschuh on Sunday, March 16, 2008 - 9:50 am. (3 messages)

Next thread: [git patches] parisc bug fixes for 2.6.25 by Kyle McMartin on Sunday, March 16, 2008 - 10:35 am. (1 message)
From: Balbir Singh
Date: Sunday, March 16, 2008 - 10:29 am

This is an early patchset for virtual address space control for cgroups.
The patches are against 2.6.25-rc5-mm1 and have been tested on top of
User Mode Linux.

The first patch adds the user interface. The second patch adds accounting
and control. The third patch updates documentation.

Review suggestions would be appreciated along the lines of

1. What is missing? Are all virtual address space expansion points covered?
2. Do we need to account and control address space at insert_special_mapping?
3. Address space accounting may contain duplications. Do we need to avoid it?

Comments?

series

memory-controller-virtual-address-space-control-user-interface
memory-controller-virtual-address-space-accounting-and-control
memory-controller-virtual-address-control-documentation



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 10:29 am

Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
control over the total address space that the processes combined together
in the cgroup can grow upto. This functionality is analogous to
the RLIMIT_AS function of the getrlimit(2) and setrlimit(2) calls.
A as_res resource counter is added to the mem_cgroup structure. The
as_res counter handles all the accounting associated with the virtual
address space accounting and control of cgroups.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 mm/memcontrol.c |   31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface	2008-03-16 22:57:38.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-16 22:57:38.000000000 +0530
@@ -128,6 +128,10 @@ struct mem_cgroup {
 	 */
 	struct res_counter res;
 	/*
+	 * Address space limits
+	 */
+	struct res_counter as_res;
+	/*
 	 * Per cgroup active and inactive list, similar to the
 	 * per zone LRU lists.
 	 */
@@ -870,6 +874,21 @@ static ssize_t mem_cgroup_write(struct c
 				mem_cgroup_write_strategy);
 }
 
+static u64 mem_cgroup_as_read(struct cgroup *cont, struct cftype *cft)
+{
+	return res_counter_read_u64(&mem_cgroup_from_cont(cont)->as_res,
+				    cft->private);
+}
+
+static ssize_t mem_cgroup_as_write(struct cgroup *cont, struct cftype *cft,
+				struct file *file, const char __user *userbuf,
+				size_t nbytes, loff_t *ppos)
+{
+	return res_counter_write(&mem_cgroup_from_cont(cont)->as_res,
+				cft->private, userbuf, nbytes, ppos,
+				mem_cgroup_write_strategy);
+}
+
 static ssize_t mem_force_empty_write(struct cgroup *cont,
 				struct cftype *cft, struct file *file,
 				const char __user *userbuf,
@@ -931,6 +950,17 @@ static struct cftype mem_cgroup_files[] 
 		.read_u64 = mem_cgroup_read,
 	},
 ...
From: Balbir Singh
Date: Sunday, March 16, 2008 - 10:30 am

This patch implements accounting and control of virtual address space.
Accounting is done when the virtual address space of any task/mm_struct
belonging to the cgroup is incremented or decremented. This patch
fails the expansion if the cgroup goes over its limit. A new function
mem_cgroup_update_as() is added to deal with the accounting of the virtual
address space usage of cgroups.

TODOs

1. IA64 has code in perfmon.c pfm_smpl_buffer_alloc(), which increments
   the total_vm of the mm_struct. This code has not yet been brought into
   virtual address space control
2. Only when CONFIG_MMU is enabled, is the virtual address space control
   enabled. Should we do this for nommu cases as well? My suspicion is
   that we don't have to.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/x86/kernel/ptrace.c   |   10 +++++++++-
 include/linux/memcontrol.h |    7 +++++++
 init/Kconfig               |    4 +++-
 kernel/fork.c              |    9 +++++++--
 mm/memcontrol.c            |   37 +++++++++++++++++++++++++++++++++++++
 mm/memory.c                |    5 +++++
 mm/mmap.c                  |   22 ++++++++++++++++++++--
 mm/mremap.c                |   21 ++++++++++++++++++---
 8 files changed, 106 insertions(+), 9 deletions(-)

diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control	2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c	2008-03-16 22:57:40.000000000 +0530
@@ -525,6 +525,32 @@ unsigned long mem_cgroup_isolate_pages(u
 }
 
 /*
+ * Check if the current cgroup exceeds its address space limit.
+ * Returns 0 on success and 1 on failure.
+ */
+int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
+{
+	int ret = 0;
+	struct mem_cgroup *mem;
+	if (mem_cgroup_subsys.disabled)
+		return ret;
+
+	rcu_read_lock();
+	mem = ...
From: Dave Hansen
Date: Monday, March 17, 2008 - 9:53 am

I think splattering these things all over is probably a bad idea.

If you're going to do this, I think you need a couple of phases.  

1. update the vm_(un)acct_memory() functions to take an mm
2. start using them (or some other abstracted functions in place)
3. update the new functions for cgroups

It's a bit non-obvious why you do the mem_cgroup_update_as() calls in
the places that you do from context.

Having some other vm-abstracted functions will also keep you from
splattering mem_cgroup_update_as() across the tree.  That's a pretty bad
name. :)  ...update_mapped() or ...update_vm() might be a wee bit
better. 

-- Dave

--

From: Balbir Singh
Date: Monday, March 17, 2008 - 6:14 pm

There are other problems

1. vm_(un)acct_memory is conditionally dependent on VM_ACCOUNT. Look at
shmem_(un)acct_size for example
2. These routines are not called from all contexts that we care about (look at

I am going to split mem_cgroup_update_as() to two routines with a better name. I
agree with you in principle about splattering, but please see my comments above

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Dave Hansen
Date: Tuesday, March 18, 2008 - 10:11 am

Yeah, but if VM_ACCOUNT isn't set, do you really want the controller
accounting for them?  It's there for a reason. :)

The shmem_acct_size() helpers look good.  I wonder if we should be using

Could you explain why "we" care about it and why it isn't accounted for
now?

-- Dave

--

From: Balbir Singh
Date: Tuesday, March 18, 2008 - 10:58 am

From: YAMAMOTO Takashi
Date: Monday, March 17, 2008 - 4:35 pm

i think you can sum and uncharge it with a single call.

YAMAMOTO Takashi
--

From: Balbir Singh
Date: Monday, March 17, 2008 - 6:10 pm

Like nr_accounted? I'll have to duplicate nr_accounted since that depends
conditionally on VM_ACCOUNT.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Pavel Emelyanov
Date: Monday, March 17, 2008 - 4:36 am

No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as.


Why not use existintg cap_vm_enough_memory and co?

[snip]
--

From: Balbir Singh
Date: Monday, March 17, 2008 - 5:29 am

I thought about it and almost used may_expand_vm(), but there is a slight catch
there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are
called after total_vm has been calculated. In our case we need to keep the
cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Pavel Emelyanov
Date: Monday, March 17, 2008 - 5:40 am

So? What prevents us from using these hooks? :)
--

From: Balbir Singh
Date: Monday, March 17, 2008 - 5:51 am

1. We need to account total_vm usage of the task anyway. So why have two places,
   one for accounting and second for control?
2. These hooks are activated for conditionally invoked for vma's with VM_ACCOUNT
   set.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Pavel Emelyanov
Date: Monday, March 17, 2008 - 6:01 am

We still have two of them even placing hooks in each place manually.

Besides, putting the mem_cgroup_(un)charge_as() in these vm hooks will
1. save the number of places to patch
2. help keeping memcgroup consistent in case someone adds more places
   that expand tasks vm (arches, drivers) - in case we have our hooks

This is a good point against. But, wrt my previous comment, can we handle 
this somehow?
--

From: Balbir Singh
Date: Monday, March 17, 2008 - 7:39 am

I am not sure I understand your proposal. Without manually placing these hooks
how do we track

1. When the vm size has increased/decreased
2. In case due to some reason, the call following these hooks fail, how do we

Not sure I understand

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 7:02 pm

How about if this function avoided charging the root cgroup? You'd
save 4 atomic operations on a global data structure on every
mmap/munmap when the virtual address limit cgroup wasn't in use, which
could be significant on a large system. And I don't see situations
where you really need to limit the address space of the root cgroup.

Paul
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 7:57 pm

4 atomic operations is very tempting, but we want to account for root usage due
to the following reasons:

1. We want to be able to support hierarchial accounting and control
2. We want to track usage of the root cgroup and report it back to the user
3. We don't want to treat the root cgroup as a special case.



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 8:03 pm

On Mon, Mar 17, 2008 at 10:57 AM, Balbir Singh


Why? It is a special case, in that in a lot of machines there's only
going to be the root cgroup, and the subsystem won't be mounted. So in
those cases, paying any overhead is a cost without a benefit.

Alternatively, how about you skip tracking virtual address space
changes if the virtual address cgroup isn't mounted on any hierarchy?
When you mount it, you can do a pass across all mms and set the root
cgroup usage to their total.

Paul
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 10:30 am

This patch adds documentation for virtual address space control.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memory.txt |   26 +++++++++++++++++++++++++-
 1 file changed, 25 insertions(+), 1 deletion(-)

diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt
--- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation	2008-03-16 22:57:44.000000000 +0530
+++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt	2008-03-16 22:57:44.000000000 +0530
@@ -237,7 +237,31 @@ cgroup might have some charge associated
 tasks have migrated away from it. Such charges are automatically dropped at
 rmdir() if there are no tasks.
 
-5. TODO
+5. Virtual address space accounting
+
+A new resource counter controls the address space expansion of the tasks in
+the cgroup. Address space control is provided along the same lines as
+RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
+The interface for controlling address space is provided through
+"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t the user
+interface. Please see section 3 for more details on how to use the user
+interface to get and set values.
+
+The "as_usage_in_bytes" file provides information about the total address
+space usage of the cgroup in bytes.
+
+5.1 Advantages of providing this feature
+
+1. Control over virtual address space allows for a cgroup to fail gracefully
+   i.e, via a malloc or mmap failure as compared to OOM kill when no
+   pages can be reclaimed
+2. It provides better control over how many pages can be swapped out when
+   the cgroup goes over it's limit. A badly setup cgroup can cause excessive
+   swapping. Providing control over the address space allocations ensures
+   that the system administrator has control over the total swapping that
+   can take place.
+
+6. TODO
 ...
From: Randy Dunlap
Date: Sunday, March 16, 2008 - 11:32 am

w.r.t.




---
~Randy
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 6:33 pm

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 4:26 pm

What's the performance hit of doing these accounting checks on every
mmap/munmap? If it's not totally lost in the noise, couldn't it be
made a separate control group, so that it could be just enabled (and
the performance hit taken) for users that actually want it?

Paul
--

From: Li Zefan
Date: Sunday, March 16, 2008 - 6:47 pm

It will be code duplication to make it a new subsystem, and it will be useful
to control both of them, am I right? :)

So could we just add a CONFIG to this patch series, like:
	CONFIG_CGROUP_MEM_RES_AS_CTLR
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 6:57 pm

Would it? Other than the basic cgroup boilerplate, the only real
duplication that I could see would be that there'd need to be an
additional per-mm pointer back to the cgroup. (Which could be avoided
if we added a single per-mm pointer back to the "owning" task, which
would generally be the mm's thread group leader, so that you could go
quickly from an mm to a set of cgroup subsystems).

And the advantage would that you'd be able to more easily pick/choose
which bits of control you use (and pay for).

Paul
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 10:08 pm

I understand the per-mm pointer overhead back to the cgroup. I don't understand
the part about adding a per-mm pointer back to the "owning" task. We already
have task->mm. BTW, the reason by we directly add the mm_struct to mem_cgroup
mapping is that there are contexts from where only the mm_struct is known (when
we charge/uncharge). Assuming that current->mm's mem_cgorup is the one we want

I am not sure I understand your proposal fully. But, if it can help provide the
flexibility you are referring to, I am all ears.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 10:22 pm

Yes, but we don't have mm->owner, which is what I was proposing -
mm->owner would be a pointer typically to the mm's thread group
leader. It would remove the need to have to have pointers for the
various different cgroup subsystems that need to act on an mm rather
than a task_struct, since then you could use
mm->owner->cgroups[subsys_id].

But this is kind of orthogonal to whether virtual address space limits
should be a separate cgroup subsystem.

Paul
--

From: Balbir Singh
Date: Monday, March 17, 2008 - 8:15 am

Aaahh.. Yes.. mm->owner might be a good idea. The only thing we'll need to
handle is when mm->owner dies (I think the thread group is still kept around).
The other disadvantage is the double dereferencing, which should not be all that

Yes, sure.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 6:50 pm

I am yet to measure the performance overhead of the accounting checks. I'll try
and get started on that today. I did not consider making it a separate system,
because I suspect that anybody wanting memory control would also want address
space control (for the advantages listed in the documentation). I am not against
the idea of making it a separate subsystem, but first let me get back with the
numbers.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Sunday, March 16, 2008 - 6:55 pm

I'm a counter-example to your suspicion :-)

Trying to control virtual address space is a complete nightmare in the
presence of anything that uses large sparsely-populated mappings
(mmaps of large files, or large sparse heaps such as the JVM uses.)

If we want to control the effect of swapping, the right way to do it
is to control disk I/O, and ensure that the swapping is accounted to
that. Or simply just not give apps much swap space.

Paul
--

From: Balbir Singh
Date: Sunday, March 16, 2008 - 8:12 pm

Not really. Virtual limits are more gentle than an OOM kill that can occur if
the cgroup runs out of memory. Please also see

Yes, a disk I/O and swap I/O controller are being developed (not by us, but
others in the community). How does one restrict swap space for a particular
application? I can think of RLIMIT_AS for a process and something similar to
what I've posted for cgroups. Not enabling swap is an option, but not very
practical IMHO.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

Previous thread: [PATCH] LED: add support to leds with readable status by Henrique de Moraes Holschuh on Sunday, March 16, 2008 - 9:50 am. (3 messages)

Next thread: [git patches] parisc bug fixes for 2.6.25 by Kyle McMartin on Sunday, March 16, 2008 - 10:35 am. (1 message)