This is an early patchset for virtual address space control for cgroups. The patches are against 2.6.25-rc5-mm1 and have been tested on top of User Mode Linux. The first patch adds the user interface. The second patch adds accounting and control. The third patch updates documentation. Review suggestions would be appreciated along the lines of 1. What is missing? Are all virtual address space expansion points covered? 2. Do we need to account and control address space at insert_special_mapping? 3. Address space accounting may contain duplications. Do we need to avoid it? Comments? series memory-controller-virtual-address-space-control-user-interface memory-controller-virtual-address-space-accounting-and-control memory-controller-virtual-address-control-documentation -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Add as_usage_in_bytes and as_limit_in_bytes interfaces. These provide
control over the total address space that the processes combined together
in the cgroup can grow upto. This functionality is analogous to
the RLIMIT_AS function of the getrlimit(2) and setrlimit(2) calls.
A as_res resource counter is added to the mem_cgroup structure. The
as_res counter handles all the accounting associated with the virtual
address space accounting and control of cgroups.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
mm/memcontrol.c | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-control-user-interface 2008-03-16 22:57:38.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c 2008-03-16 22:57:38.000000000 +0530
@@ -128,6 +128,10 @@ struct mem_cgroup {
*/
struct res_counter res;
/*
+ * Address space limits
+ */
+ struct res_counter as_res;
+ /*
* Per cgroup active and inactive list, similar to the
* per zone LRU lists.
*/
@@ -870,6 +874,21 @@ static ssize_t mem_cgroup_write(struct c
mem_cgroup_write_strategy);
}
+static u64 mem_cgroup_as_read(struct cgroup *cont, struct cftype *cft)
+{
+ return res_counter_read_u64(&mem_cgroup_from_cont(cont)->as_res,
+ cft->private);
+}
+
+static ssize_t mem_cgroup_as_write(struct cgroup *cont, struct cftype *cft,
+ struct file *file, const char __user *userbuf,
+ size_t nbytes, loff_t *ppos)
+{
+ return res_counter_write(&mem_cgroup_from_cont(cont)->as_res,
+ cft->private, userbuf, nbytes, ppos,
+ mem_cgroup_write_strategy);
+}
+
static ssize_t mem_force_empty_write(struct cgroup *cont,
struct cftype *cft, struct file *file,
const char __user *userbuf,
@@ -931,6 +950,17 @@ static struct cftype mem_cgroup_files[]
.read_u64 = mem_cgroup_read,
},
...This patch implements accounting and control of virtual address space.
Accounting is done when the virtual address space of any task/mm_struct
belonging to the cgroup is incremented or decremented. This patch
fails the expansion if the cgroup goes over its limit. A new function
mem_cgroup_update_as() is added to deal with the accounting of the virtual
address space usage of cgroups.
TODOs
1. IA64 has code in perfmon.c pfm_smpl_buffer_alloc(), which increments
the total_vm of the mm_struct. This code has not yet been brought into
virtual address space control
2. Only when CONFIG_MMU is enabled, is the virtual address space control
enabled. Should we do this for nommu cases as well? My suspicion is
that we don't have to.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
arch/x86/kernel/ptrace.c | 10 +++++++++-
include/linux/memcontrol.h | 7 +++++++
init/Kconfig | 4 +++-
kernel/fork.c | 9 +++++++--
mm/memcontrol.c | 37 +++++++++++++++++++++++++++++++++++++
mm/memory.c | 5 +++++
mm/mmap.c | 22 ++++++++++++++++++++--
mm/mremap.c | 21 ++++++++++++++++++---
8 files changed, 106 insertions(+), 9 deletions(-)
diff -puN mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control mm/memcontrol.c
--- linux-2.6.25-rc5/mm/memcontrol.c~memory-controller-virtual-address-space-accounting-and-control 2008-03-16 22:57:40.000000000 +0530
+++ linux-2.6.25-rc5-balbir/mm/memcontrol.c 2008-03-16 22:57:40.000000000 +0530
@@ -525,6 +525,32 @@ unsigned long mem_cgroup_isolate_pages(u
}
/*
+ * Check if the current cgroup exceeds its address space limit.
+ * Returns 0 on success and 1 on failure.
+ */
+int mem_cgroup_update_as(struct mm_struct *mm, long nr_pages)
+{
+ int ret = 0;
+ struct mem_cgroup *mem;
+ if (mem_cgroup_subsys.disabled)
+ return ret;
+
+ rcu_read_lock();
+ mem = ...I think splattering these things all over is probably a bad idea. If you're going to do this, I think you need a couple of phases. 1. update the vm_(un)acct_memory() functions to take an mm 2. start using them (or some other abstracted functions in place) 3. update the new functions for cgroups It's a bit non-obvious why you do the mem_cgroup_update_as() calls in the places that you do from context. Having some other vm-abstracted functions will also keep you from splattering mem_cgroup_update_as() across the tree. That's a pretty bad name. :) ...update_mapped() or ...update_vm() might be a wee bit better. -- Dave --
There are other problems 1. vm_(un)acct_memory is conditionally dependent on VM_ACCOUNT. Look at shmem_(un)acct_size for example 2. These routines are not called from all contexts that we care about (look at I am going to split mem_cgroup_update_as() to two routines with a better name. I agree with you in principle about splattering, but please see my comments above -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Yeah, but if VM_ACCOUNT isn't set, do you really want the controller accounting for them? It's there for a reason. :) The shmem_acct_size() helpers look good. I wonder if we should be using Could you explain why "we" care about it and why it isn't accounted for now? -- Dave --
i think you can sum and uncharge it with a single call. YAMAMOTO Takashi --
Like nr_accounted? I'll have to duplicate nr_accounted since that depends conditionally on VM_ACCOUNT. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
No, please, no. Let's make two calls - mem_cgroup_charge_as and mem_cgroup_uncharge_as. Why not use existintg cap_vm_enough_memory and co? [snip] --
I thought about it and almost used may_expand_vm(), but there is a slight catch there. With cap_vm_enough_memory() or security_vm_enough_memory(), they are called after total_vm has been calculated. In our case we need to keep the cgroups equivalent of total_vm up to date, and we do this in mem_cgorup_update_as. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
So? What prevents us from using these hooks? :) --
1. We need to account total_vm usage of the task anyway. So why have two places, one for accounting and second for control? 2. These hooks are activated for conditionally invoked for vma's with VM_ACCOUNT set. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
We still have two of them even placing hooks in each place manually. Besides, putting the mem_cgroup_(un)charge_as() in these vm hooks will 1. save the number of places to patch 2. help keeping memcgroup consistent in case someone adds more places that expand tasks vm (arches, drivers) - in case we have our hooks This is a good point against. But, wrt my previous comment, can we handle this somehow? --
I am not sure I understand your proposal. Without manually placing these hooks how do we track 1. When the vm size has increased/decreased 2. In case due to some reason, the call following these hooks fail, how do we Not sure I understand -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
How about if this function avoided charging the root cgroup? You'd save 4 atomic operations on a global data structure on every mmap/munmap when the virtual address limit cgroup wasn't in use, which could be significant on a large system. And I don't see situations where you really need to limit the address space of the root cgroup. Paul --
4 atomic operations is very tempting, but we want to account for root usage due to the following reasons: 1. We want to be able to support hierarchial accounting and control 2. We want to track usage of the root cgroup and report it back to the user 3. We don't want to treat the root cgroup as a special case. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
On Mon, Mar 17, 2008 at 10:57 AM, Balbir Singh Why? It is a special case, in that in a lot of machines there's only going to be the root cgroup, and the subsystem won't be mounted. So in those cases, paying any overhead is a cost without a benefit. Alternatively, how about you skip tracking virtual address space changes if the virtual address cgroup isn't mounted on any hierarchy? When you mount it, you can do a pass across all mms and set the root cgroup usage to their total. Paul --
This patch adds documentation for virtual address space control. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Documentation/controllers/memory.txt | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff -puN Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation Documentation/controllers/memory.txt --- linux-2.6.25-rc5/Documentation/controllers/memory.txt~memory-controller-virtual-address-control-documentation 2008-03-16 22:57:44.000000000 +0530 +++ linux-2.6.25-rc5-balbir/Documentation/controllers/memory.txt 2008-03-16 22:57:44.000000000 +0530 @@ -237,7 +237,31 @@ cgroup might have some charge associated tasks have migrated away from it. Such charges are automatically dropped at rmdir() if there are no tasks. -5. TODO +5. Virtual address space accounting + +A new resource counter controls the address space expansion of the tasks in +the cgroup. Address space control is provided along the same lines as +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). +The interface for controlling address space is provided through +"as_limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t the user +interface. Please see section 3 for more details on how to use the user +interface to get and set values. + +The "as_usage_in_bytes" file provides information about the total address +space usage of the cgroup in bytes. + +5.1 Advantages of providing this feature + +1. Control over virtual address space allows for a cgroup to fail gracefully + i.e, via a malloc or mmap failure as compared to OOM kill when no + pages can be reclaimed +2. It provides better control over how many pages can be swapped out when + the cgroup goes over it's limit. A badly setup cgroup can cause excessive + swapping. Providing control over the address space allocations ensures + that the system administrator has control over the total swapping that + can take place. + +6. TODO ...
-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
What's the performance hit of doing these accounting checks on every mmap/munmap? If it's not totally lost in the noise, couldn't it be made a separate control group, so that it could be just enabled (and the performance hit taken) for users that actually want it? Paul --
It will be code duplication to make it a new subsystem, and it will be useful to control both of them, am I right? :) So could we just add a CONFIG to this patch series, like: CONFIG_CGROUP_MEM_RES_AS_CTLR --
Would it? Other than the basic cgroup boilerplate, the only real duplication that I could see would be that there'd need to be an additional per-mm pointer back to the cgroup. (Which could be avoided if we added a single per-mm pointer back to the "owning" task, which would generally be the mm's thread group leader, so that you could go quickly from an mm to a set of cgroup subsystems). And the advantage would that you'd be able to more easily pick/choose which bits of control you use (and pay for). Paul --
I understand the per-mm pointer overhead back to the cgroup. I don't understand the part about adding a per-mm pointer back to the "owning" task. We already have task->mm. BTW, the reason by we directly add the mm_struct to mem_cgroup mapping is that there are contexts from where only the mm_struct is known (when we charge/uncharge). Assuming that current->mm's mem_cgorup is the one we want I am not sure I understand your proposal fully. But, if it can help provide the flexibility you are referring to, I am all ears. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Yes, but we don't have mm->owner, which is what I was proposing - mm->owner would be a pointer typically to the mm's thread group leader. It would remove the need to have to have pointers for the various different cgroup subsystems that need to act on an mm rather than a task_struct, since then you could use mm->owner->cgroups[subsys_id]. But this is kind of orthogonal to whether virtual address space limits should be a separate cgroup subsystem. Paul --
Aaahh.. Yes.. mm->owner might be a good idea. The only thing we'll need to handle is when mm->owner dies (I think the thread group is still kept around). The other disadvantage is the double dereferencing, which should not be all that Yes, sure. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I am yet to measure the performance overhead of the accounting checks. I'll try and get started on that today. I did not consider making it a separate system, because I suspect that anybody wanting memory control would also want address space control (for the advantages listed in the documentation). I am not against the idea of making it a separate subsystem, but first let me get back with the numbers. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I'm a counter-example to your suspicion :-) Trying to control virtual address space is a complete nightmare in the presence of anything that uses large sparsely-populated mappings (mmaps of large files, or large sparse heaps such as the JVM uses.) If we want to control the effect of swapping, the right way to do it is to control disk I/O, and ensure that the swapping is accounted to that. Or simply just not give apps much swap space. Paul --
Not really. Virtual limits are more gentle than an OOM kill that can occur if the cgroup runs out of memory. Please also see Yes, a disk I/O and swap I/O controller are being developed (not by us, but others in the community). How does one restrict swap space for a particular application? I can think of RLIMIT_AS for a process and something similar to what I've posted for cgroups. Not enabling swap is an option, but not very practical IMHO. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build wi |
