This patch set provides the ability for each cgroup to have independent dirty
page limits.
Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim)
page cache used by a cgroup. So, in case of multiple cgroup writers, they will
not be able to consume more than their designated share of dirty pages and will
be forced to perform write-out if they cross that limit.
These patches were developed and tested on mmotm 2010-09-28-16-13. The patches
are based on a series proposed by Andrea Righi in Mar 2010.
Overview:
- Add page_cgroup flags to record when pages are dirty, in writeback, or nfs
unstable.
- Extend mem_cgroup to record the total number of pages in each of the
interesting dirty states (dirty, writeback, unstable_nfs).
- Add dirty parameters similar to the system-wide /proc/sys/vm/dirty_*
limits to mem_cgroup. The mem_cgroup dirty parameters are accessible
via cgroupfs control files.
- Consider both system and per-memcg dirty limits in page writeback when
deciding to queue background writeback or block for foreground writeback.
Known shortcomings:
- When a cgroup dirty limit is exceeded, then bdi writeback is employed to
writeback dirty inodes. Bdi writeback considers inodes from any cgroup, not
just inodes contributing dirty pages to the cgroup exceeding its limit.
Performance measurements:
- kernel builds are unaffected unless run with a small dirty limit.
- all data collected with CONFIG_CGROUP_MEM_RES_CTLR=y.
- dd has three data points (in secs) for three data sizes (100M, 200M, and 1G).
As expected, dd slows when it exceed its cgroup dirty limit.
kernel_build dd
mmotm 2:37 0.18, 0.38, 1.65
root_memcg
mmotm 2:37 0.18, 0.35, 1.66
non-root_memcg
mmotm+patches 2:37 0.18, 0.35, 1.68
root_memcg
mmotm+patches 2:37 0.19, 0.35, 1.69
non-root_memcg
mmotm+patches 2:37 0.19, 2.34, 22.82
non-root_memcg
...Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
include/linux/page_cgroup.h | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 5bb13b3..b59c298 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,6 +40,9 @@ enum {
PCG_USED, /* this object is in use. */
PCG_ACCT_LRU, /* page has been accounted for */
PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+ PCG_FILE_DIRTY, /* page is dirty */
+ PCG_FILE_WRITEBACK, /* page is under writeback */
+ PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
PCG_MIGRATION, /* under page migration */
};
@@ -59,6 +62,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
static inline int TestClearPageCgroup##uname(struct page_cgroup *pc) \
{ return test_and_clear_bit(PCG_##lname, &pc->flags); }
+#define TESTSETPCGFLAG(uname, lname) \
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
TESTPCGFLAG(Locked, LOCK)
/* Cache flag is set only once (at allocation) */
@@ -80,6 +87,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
CLEARPCGFLAG(FileMapped, FILE_MAPPED)
TESTPCGFLAG(FileMapped, FILE_MAPPED)
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, ...On Sun, 3 Oct 2010 23:57:56 -0700 Ack...oh, but it seems I've signed. Thanks. --
On Sun, 3 Oct 2010 23:57:56 -0700 Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Thanks, --
Looks good to me Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Three Cheers, Balbir --
Document cgroup dirty memory interfaces and statistics. Signed-off-by: Andrea Righi <arighi@develer.com> Signed-off-by: Greg Thelen <gthelen@google.com> --- Documentation/cgroups/memory.txt | 37 +++++++++++++++++++++++++++++++++++++ 1 files changed, 37 insertions(+), 0 deletions(-) diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt index 7781857..eab65e2 100644 --- a/Documentation/cgroups/memory.txt +++ b/Documentation/cgroups/memory.txt @@ -385,6 +385,10 @@ mapped_file - # of bytes of mapped file (includes tmpfs/shmem) pgpgin - # of pages paged in (equivalent to # of charging events). pgpgout - # of pages paged out (equivalent to # of uncharging events). swap - # of bytes of swap usage +dirty - # of bytes that are waiting to get written back to the disk. +writeback - # of bytes that are actively being written back to the disk. +nfs - # of bytes sent to the NFS server, but not yet committed to + the actual storage. inactive_anon - # of bytes of anonymous memory and swap cache memory on LRU list. active_anon - # of bytes of anonymous and swap cache memory on active @@ -453,6 +457,39 @@ memory under it will be reclaimed. You can reset failcnt by writing 0 to failcnt file. # echo 0 > .../memory.failcnt +5.5 dirty memory + +Control the maximum amount of dirty pages a cgroup can have at any given time. + +Limiting dirty memory is like fixing the max amount of dirty (hard to reclaim) +page cache used by a cgroup. So, in case of multiple cgroup writers, they will +not be able to consume more than their designated share of dirty pages and will +be forced to perform write-out if they cross that limit. + +The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*. It +is possible to configure a limit to trigger both a direct writeback or a +background writeback performed by per-bdi flusher threads. The root cgroup +memory.dirty_* control files are read-only and match the contents of +the /proc/sys/vm/dirty_* ...
On Sun, 3 Oct 2010 23:57:57 -0700 Nice. --
On Sun, 3 Oct 2010 23:57:57 -0700 I think you will change "nfs" to "nfs_unstable", but anyway, Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Thanks --
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Three Cheers, Balbir --
Replace usage of the mem_cgroup_update_file_mapped() memcg
statistic update routine with two new routines:
* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()
As before, only the file_mapped statistic is managed. However,
these more general interfaces allow for new statistics to be
more easily added. New statistics are added with memcg dirty
page accounting.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
include/linux/memcontrol.h | 31 ++++++++++++++++++++++++++++---
mm/memcontrol.c | 17 ++++++++---------
mm/rmap.c | 4 ++--
3 files changed, 38 insertions(+), 14 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 159a076..7c7bec4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
struct page;
struct mm_struct;
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_write_page_stat_item {
+ MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
@@ -121,7 +126,22 @@ static inline bool mem_cgroup_disabled(void)
return false;
}
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+void mem_cgroup_update_page_stat(struct page *page,
+ enum mem_cgroup_write_page_stat_item idx,
+ int val);
+
+static inline void mem_cgroup_inc_page_stat(struct page *page,
+ enum mem_cgroup_write_page_stat_item idx)
+{
+ mem_cgroup_update_page_stat(page, idx, 1);
+}
+
+static inline void mem_cgroup_dec_page_stat(struct page *page,
+ enum mem_cgroup_write_page_stat_item idx)
+{
+ mem_cgroup_update_page_stat(page, idx, -1);
+}
+
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask);
u64 mem_cgroup_get_limit(struct mem_cgroup *mem);
@@ -293,8 +313,13 ...Not seeing this function in mmotm 28/09. So not able to apply this patch. --
How are you getting mmotm? I see the mem_cgroup_update_file_stat() routine added in mmotm (stamp-2010-09-28-16-13) using patch file: http://userweb.kernel.org/~akpm/mmotm/broken-out/memcg-generic-filestat-update-interfa... Author: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Date: Tue Sep 28 21:48:19 2010 -0700 This patch extracts the core logic from mem_cgroup_update_file_mapped() as mem_cgroup_update_file_stat() and adds a wrapper. As a planned future update, memory cgroup has to count dirty pages to implement dirty_ratio/limit. And more, the number of dirty pages is required to kick flusher thread to start writeback. (Now, no kick.) This patch is preparation for it and makes other statistics implementation clearer. Just a clean up. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: Greg Thelen <gthelen@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> If you are using the zen mmotm repository, git://zen-kernel.org/kernel/mmotm.git, the commit id of memcg-generic-filestat-update-interface.patch is --
Sorry for the noise Greg. It was a mistake at my end. Corrected now. --
On Sun, 3 Oct 2010 23:57:58 -0700 Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Why you move this_cpu_add() placement ? (This placement is ok but I just wonder..) Thanks, -Kame --
The reason this_cpu_add() is moved to after the switch is because the switch is needed to convert the input parameter from an enum mem_cgroup_write_page_stat_item (example: MEMCG_NR_FILE_MAPPED) to enum mem_cgroup_stat_index (example: MEM_CGROUP_STAT_FILE_MAPPED) before indexing into the count array. Also in subsequent patches (in this series) "val" is updated depending on page_cgroup flags before usage by this_cpu_add(). --
mem_cgrou_"write"_page_stat_item? Does "write" make sense to abstract page_state generally? -- Kind regards, Minchan Kim --
First I will summarize the portion of the design relevant to this
comment:
This patch series introduces two sets of memcg statistics.
a) the writable set of statistics the kernel updates when pages change
state (example: when a page becomes dirty) using:
mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_write_page_stat_item idx)
mem_cgroup_dec_page_stat(struct page *page,
enum mem_cgroup_write_page_stat_item idx)
b) the read-only set of statistics the kernel queries to measure the
amount of dirty memory used by the current cgroup using:
s64 mem_cgroup_page_stat(enum mem_cgroup_read_page_stat_item item)
This read-only set of statistics is set of a higher level conceptual
counters. For example, MEMCG_NR_DIRTYABLE_PAGES is the sum of the
counts of pages in various states (active + inactive). mem_cgroup
exports this value as a higher level counter rather than individual
counters (active & inactive) to minimize the number of calls into
mem_cgroup_page_stat(). This avoids extra calls to cgroup tree
iteration with for_each_mem_cgroup_tree().
Notice that each of the two sets of statistics are addressed by a
different type, mem_cgroup_{read vs write}_page_stat_item.
This particular patch (memcg: create extensible page stat update
routines) introduces part of this design. A later patch I emailed
(memcg: add dirty limits to mem_cgroup) added
mem_cgroup_read_page_stat_item.
I think the code would read better if I renamed
enum mem_cgroup_write_page_stat_item to
enum mem_cgroup_update_page_stat_item.
Would this address your concern?
--
Greg
--
Thanks for the kind explanation. I understand your concept. I think you makes update and query as completely different level abstraction but you could use similar terms. Even the terms(write VS read) make me more confusing. How about renaming following as? 1. mem_cgroup_write_page_stat_item -> mem_cgroup_page_stat_item 2. mem_cgroup_read_page_stat_item -> mem_cgroup_nr_pages_item At least it looks to be easy for me to understand the code. But it's just my preference. If others think your semantic is more desirable, I am not against it strongly. -- Kind regards, Minchan Kim --
I think your suggestion is good. I will include it in the next revision --
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> -- Three Cheers, Balbir --
If pages are being migrated from a memcg, then updates to that
memcg's page statistics are protected by grabbing a bit spin lock
using lock_page_cgroup(). In an upcoming commit memcg dirty page
accounting will be updating memcg page accounting (specifically:
num writeback pages) from softirq. Avoid a deadlocking nested
spin lock attempt by disabling interrupts on the local processor
when grabbing the page_cgroup bit_spin_lock in lock_page_cgroup().
This avoids the following deadlock:
statistic
CPU 0 CPU 1
inc_file_mapped
rcu_read_lock
start move
synchronize_rcu
lock_page_cgroup
softirq
test_clear_page_writeback
mem_cgroup_dec_page_stat(NR_WRITEBACK)
rcu_read_lock
lock_page_cgroup /* deadlock */
unlock_page_cgroup
rcu_read_unlock
unlock_page_cgroup
rcu_read_unlock
By disabling interrupts in lock_page_cgroup, nested calls
are avoided. The softirq would be delayed until after inc_file_mapped
enables interrupts when calling unlock_page_cgroup().
The normal, fast path, of memcg page stat updates typically
does not need to call lock_page_cgroup(), so this change does
not affect the performance of the common case page accounting.
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
include/linux/page_cgroup.h | 8 +++++-
mm/memcontrol.c | 51 +++++++++++++++++++++++++-----------------
2 files changed, 36 insertions(+), 23 deletions(-)
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index b59c298..872f6b1 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -117,14 +117,18 @@ static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
return page_zonenum(pc->page);
}
...On Sun, 3 Oct 2010 23:57:59 -0700 Nice Catch! But..hmm this wasn't necessary for FILE_MAPPED but necesary for new statistics, right ? (This affects the order of patches.) Anyway --
This patch (disabling interrupts) is not needed until later patches (in this series) update memcg statistics from softirq. If we only had FILE_MAPPED, then this patch would not be needed. I placed this patch before the following dependent patches that need it. The opposite order seemed wrong because it would introduce the possibility of the deadlock until this patch was applied. By having this patch come first there should be no way to apply the series in order and see the mentioned --
Hmm. Let me ask questions. 1. Why do you add new irq disable region in general function? I think __do_fault is a one of fast path. Could you disable softirq using _local_bh_disable_ not in general function but in your context? How do you expect that how many users need irq lock to update page state? If they don't need to disalbe irq? We can pass some argument which present to need irq lock or not. But it seems to make code very ugly. 2. So could you solve the problem in your design? I mean you could update page state out of softirq? (I didn't look at the your patches all. Sorry if I am missing something) 3. Normally, we have updated page state without disable irq. Why does memcg need it? I hope we don't add irq disable region as far as possbile. -- Kind regards, Minchan Kim --
This is true. I used pft to measure the cost of this extra locking code. This pft workload exercises this memcg call stack: lock_page_cgroup+0x39/0x5b __mem_cgroup_commit_charge+0x2c/0x98 mem_cgroup_charge_common+0x66/0x76 mem_cgroup_newpage_charge+0x40/0x4f handle_mm_fault+0x2e3/0x869 do_page_fault+0x286/0x29b page_fault+0x1f/0x30 I ran 100 iterations of "pft -m 8g -t 16 -a" and focused on the flt/cpu/s. First I established a performance baseline using upstream mmotm locking (not disabling interrupts). 100 samples: mean 51930.16383 stddev 2.032% (1055.40818272) Then I introduced this patch, which disabled interrupts in lock_page_cgroup(): 100 samples: mean 52174.17434 stddev 1.306% (681.14442646) Then I replaced this patch's usage of local_irq_save/restore() with local_bh_disable/enable(). 100 samples: mean 51810.58591 stddev 1.892% (980.340335322) The proposed patch (#2) actually improves allocation performance by 0.47% when compared to the baseline (#1). However, I believe that this is in the statistical noise. This particular workload does not seem to lock_page_cgroup() is only used by mem_cgroup in memcontrol.c. local_bh_disable() should also work instead of my proposed patch, which used local_irq_save/restore(). local_bh_disable() will not disable all interrupts so it should have less impact. But I think that usage of local_bh_disable() it still something that has to happen in the general lock_page_cgroup() function. The softirq can occur at an arbitrary time and processor with the possibility of interrupting anyone who does not have interrupts or softirq disabled. Therefore the softirq could interrupt code that has used lock_page_cgroup(), unless lock_page_cgroup() explicitly (as proposed) disables interrupts (or softirq). If (as you suggest) some calls to lock_page_cgroup() did not disable softirq, then a deadlock is possible because the softirq may interrupt the holder of the page cgroup spinlock and the softirq routine that ...
Yes. But irq disable has a interrupt latency problem as well as just overhead of instruction. I have a concern about interrupt latency. I have a experience that too many disable irq makes irq handler latency too slow in embedded system. For example, irq handler latency is a important factor in ARM perf to capture program counter. If many users need to update page stat in interrupt handler in future, local_irq_save would be good candidate. otherwise, local_bh_disable doesn't affect the system as you said. We could add the comment following as. /* * NOTE : * If some user want to update page stat in interrupt handler, * We should consider local_irq_save instead of local_bh_disable. First of all, we could add your patch as it is and I don't expect any regression report about interrupt latency. That's because many embedded guys doesn't use mmotm and have a tendency to not report regression of VM. Even they don't use memcg. Hmm... I pass the decision to MAINTAINER Kame and Balbir. -- Kind regards, Minchan Kim --
On Wed, 6 Oct 2010 09:15:34 +0900 Hmm. IRQ delay is a concern. So, my option is this. How do you think ? 1. remove local_irq_save()/restore() in lock/unlock_page_cgroup(). yes, I don't like it. 2. At moving charge, do this: a) lock_page()/ or trylock_page() b) wait_on_page_writeback() c) do move_account under lock_page_cgroup(). c) unlock_page() Then, Writeback updates will never come from IRQ context while lock/unlock_page_cgroup() is held by move_account(). There will be no race. Do I miss something ? Thanks, -Kame --
On Thu, 7 Oct 2010 09:35:45 +0900 hmm, if we'll do that, I think we need to do that under pte_lock in mem_cgroup_move_charge_pte_range(). But, we can't do wait_on_page_writeback() under pte_lock, right? Or, we need re-organize current move-charge implementation. Thanks, Daisuke Nishimura. --
On Thu, 7 Oct 2010 10:54:56 +0900 Nice catch. I think releaseing pte_lock() is okay. (and it should be released) IIUC, task's css_set() points to new cgroup when "move" is called. Then, it's not necessary to take pte_lock, I guess. (And taking pte_lock too long is not appreciated..) I'll write a sample patch today. Thanks, --
On Thu, 7 Oct 2010 11:17:43 +0900
Here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, at task migration among cgroup, memory cgroup scans page table and moving
account if flags are properly set.
The core code, mem_cgroup_move_charge_pte_range() does
pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();
for pte entries on a 3rd(2nd) level page table.
This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
pte_offset_map_lock()
for 32 pages:
find_and_get a page
record it
pte_offset_map_unlock()
for all recorded pages
isolate it from LRU.
move charge
putback to LRU
for all recorded pages
put_page()
Note: newly-charged pages while we move account are charged to the new group.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 92 ++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 60 insertions(+), 32 deletions(-)
Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -4475,17 +4475,22 @@ one_by_one:
*
* Called with pte lock held.
*/
-union mc_target {
- struct page *page;
- swp_entry_t ent;
-};
enum mc_target_type {
- MC_TARGET_NONE, /* not used */
+ MC_TARGET_NONE, /* used as failure code(0) */
MC_TARGET_PAGE,
MC_TARGET_SWAP,
};
+struct mc_target {
+ enum mc_target_type type;
+ union {
+ struct page *page;
+ swp_entry_t ent;
+ } val;
+};
+
+
static struct page *mc_handle_present_pte(struct vm_area_struct *vma,
unsigned long addr, pte_t ptent)
{
@@ -4561,7 +4566,7 @@ static struct page *mc_handle_file_pte(s
}
static int is_target_pte_for_mc(struct vm_area_struct *vma,
- unsigned long ...Greg, I think clear_page_writeback() will not require _any_ locks with this patch.
But set_page_writeback() requires it...
(Maybe adding a special function for clear_page_writeback() is better rather than
adding some complex to switch() in update_page_stat())
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, at page information accounting, we do lock_page_cgroup() if pc->mem_cgroup
points to a cgroup where someone is moving charges from.
At supporing dirty-page accounting, one of troubles is writeback bit.
In general, writeback can be cleared via IRQ context. To update writeback bit
with lock_page_cgroup() in safe way, we'll have to disable IRQ.
....or do something.
This patch waits for completion of writeback under lock_page() and do
lock_page_cgroup() in safe way. (We never got end_io via IRQ context.)
By this, writeback-accounting will never see race with account_move() and
it can trust pc->mem_cgroup always _without_ any lock.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -2183,17 +2183,35 @@ static void __mem_cgroup_move_account(st
/*
* check whether the @pc is valid for moving account and call
* __mem_cgroup_move_account()
+ * Don't call this under pte_lock etc...we'll do lock_page() and wait for
+ * the end of I/O.
*/
static int mem_cgroup_move_account(struct page_cgroup *pc,
struct mem_cgroup *from, struct mem_cgroup *to, bool uncharge)
{
int ret = -EINVAL;
+
+ /*
+ * We move severl flags and accounting information here. So we need to
+ * avoid the races with update_stat routines. For most of routines,
+ * lock_page_cgroup() is enough for avoiding race. But we need to take
+ * care of IRQ context. If flag updates comes from IRQ context, ...On Thu, 7 Oct 2010 15:24:22 +0900
I'm testing a code like this.
==
/* pc->mem_cgroup is unstable ? */
if (unlikely(mem_cgroup_stealed(mem))) {
/* take a lock against to access pc->mem_cgroup */
if (!in_interrupt()) {
lock_page_cgroup(pc);
need_unlock = true;
mem = pc->mem_cgroup;
if (!mem || !PageCgroupUsed(pc))
goto out;
} else if (idx == MEMCG_NR_FILE_WRITEBACK && (val < 0)) {
/* This is allowed */
} else
BUG();
}
==
Thanks,
-Kame
--
Hi Kame, On Thu, Oct 7, 2010 at 3:24 PM, KAMEZAWA Hiroyuki Looks good to me. But let me ask a question. Why do only move_account need this logic? Is deadlock candidate is only this place? How about mem_cgroup_prepare_migration? unmap_and_move lock_page mem_cgroup_prepare_migration lock_page_cgroup ... softirq happen lock_page_cgroup If race happens only where move_account and writeback, please describe it as comment. It would help to review the code in future. -- Kind regards, Minchan Kim --
On Fri, 8 Oct 2010 08:35:30 +0900 Because charge/uncharge (add/remove to radix-tree or swapcache) Sure, updates are necessary. Thanks, -Kame --
On Thu, 7 Oct 2010 15:21:11 +0900 hmm? I can't see it freed anywhere. Considering you reset target[]->type to MC_TARGET_NONE, do you intended to reuse targe[] while walking the page table ? If so, how about adding a new member(struct mc_target *targe) to move_charge_struct, and allocate/free it at mem_cgroup_move_charge() ? Thanks, --
On Thu, 7 Oct 2010 16:28:11 +0900 Hmm, sounds nice. I'll do. Thanks, --
On Thu, 7 Oct 2010 16:42:04 +0900
Here.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, at task migration among cgroup, memory cgroup scans page table and moving
account if flags are properly set.
The core code, mem_cgroup_move_charge_pte_range() does
pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();
for pte entries on a 3rd level? page table.
This pte_offset_map_lock seems a bit long. This patch modifies a rountine as
for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
for all recorded pages
isolate it from LRU.
move charge
putback to LRU
for all recorded pages
put_page()
Changelog: v1->v2
- removed kzalloc() of mc_target. preallocate it on "mc"
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 95 ++++++++++++++++++++++++++++++++++----------------------
1 file changed, 59 insertions(+), 36 deletions(-)
Index: mmotm-0928/mm/memcontrol.c
===================================================================
--- mmotm-0928.orig/mm/memcontrol.c
+++ mmotm-0928/mm/memcontrol.c
@@ -276,6 +276,21 @@ enum move_type {
NR_MOVE_TYPE,
};
+enum mc_target_type {
+ MC_TARGET_NONE, /* used as failure code(0) */
+ MC_TARGET_PAGE,
+ MC_TARGET_SWAP,
+};
+
+struct mc_target {
+ enum mc_target_type type;
+ union {
+ struct page *page;
+ swp_entry_t ent;
+ } val;
+};
+#define MC_MOVE_ONCE (32)
+
/* "mc" and its members are protected by cgroup_mutex */
static struct move_charge_struct {
spinlock_t lock; /* for from, to, moving_task */
@@ -284,6 +299,7 @@ static struct move_charge_struct {
unsigned long precharge;
unsigned long moved_charge;
unsigned long moved_swap;
+ struct mc_target target[MC_MOVE_ONCE];
struct task_struct ...On Thu, 7 Oct 2010 17:04:05 +0900 The patch makes the code larger, more complex and slower! I do think we're owed a more complete description of its benefits than "seems a bit long". Have problems been observed? Any measurements taken? --
On Thu, 7 Oct 2010 16:14:54 -0700 Before this patch: text data bss dec hex filename 27163 11782 4100 43045 a825 mm/memcontrol.o After this patch: text data bss dec hex filename 27307 12294 4100 43701 aab5 mm/memcontrol.o hmm, allocating mc.target[] statically might be bad, but I'm now wondering IIUC, this patch is necessary for "[PATCH] memcg: lock-free clear page writeback" later, but I agree we should describe it. Thanks, Daisuke Nishimura. --
On Thu, 7 Oct 2010 16:14:54 -0700 I'll rewrite the whole patch against today's mmotom. Thanks, --
Sure. It walks the same data three times, potentially causing thrashing in the L1 cache. It takes and releases locks at a higher frequency. It increases the text size. --
On Thu, 7 Oct 2010 21:55:56 -0700 But I don't think page_table_lock is a lock which someone can hold so long that 1. find_get_page 2. spin_lock(zone->lock) 3. remove it from LRU 4. lock_page_cgroup() 5. move charge (This means page 5. putback to LRU for 4096/8=1024 pages long. will try to make the routine smarter. But I want to get rid of page_table_lock -> lock_page_cgroup(). Thanks, -Kame --
On Fri, 8 Oct 2010 14:12:01 +0900
How about this ?
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Presently, at task migration among cgroups, memory cgroup scans page tables and
moves accounting if flags are properly set.
The core code, mem_cgroup_move_charge_pte_range() does
pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();
for pte entries on a 3rd level? page table.
As a planned updates, we'll support dirty-page accounting. Because move_charge()
is highly race, we need to add more check in move_charge.
For example, lock_page();-> wait_on_page_writeback();-> unlock_page();
is an candidate for new check.
This patch modifies a rountine as
for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
for all recorded pages
isolate it from LRU.
move charge
putback to LRU
put_page()
Code size change is:
(Before)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
text data bss dec hex filename
28247 7685 4100 40032 9c60 mm/memcontrol.o
(After)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
text data bss dec hex filename
28591 7685 4100 40376 9db8 mm/memcontrol.o
Easy Bencmark score.
Moving 2Gbytes anonymous memory task between cgroup/A and cgroup/B.
<===== shows a function under pte_lock.
Before Patch.
real 0m42.346s
user 0m0.002s
sys 0m39.668s
13.88% swap_task.sh [kernel.kallsyms] [k] put_page <=====
10.37% swap_task.sh [kernel.kallsyms] [k] isolate_lru_page <=====
10.25% swap_task.sh [kernel.kallsyms] [k] is_target_pte_for_mc <=====
7.85% swap_task.sh [kernel.kallsyms] [k] mem_cgroup_move_account <=====
7.63% swap_task.sh [kernel.kallsyms] [k] lookup_page_cgroup ...Is this a change to help dirty limits or is it a generic bug fix. -- Three Cheers, Balbir --
On Tue, 12 Oct 2010 09:09:15 +0530 Not a bug fix. This for adding lock_page() to moge_charge(). It helps us to remove "irq disable" in update_stat(). Thanks, -Kame --
Excellent! Thanks -- Three Cheers, Balbir --
Is this putback_lru_page() necessary ? is_target_pte_for_mc() doesn't isolate the page. Thanks, --
On Tue, 12 Oct 2010 12:56:13 +0900 Unnecessary, will post v2. I'm sorry for my low-quality patches :( Thanks, -Kame --
On Tue, 12 Oct 2010 12:56:13 +0900
Ok, v4 here. tested failure path and success path.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Presently, at task migration among cgroups, memory cgroup scans page tables and
moves accounting if flags are properly set.
The core code, mem_cgroup_move_charge_pte_range() does
pte_offset_map_lock();
for all ptes in a page table:
1. look into page table, find_and_get a page
2. remove it from LRU.
3. move charge.
4. putback to LRU. put_page()
pte_offset_map_unlock();
for pte entries on a 3rd level? page table.
As a planned updates, we'll support dirty-page accounting. Because move_charge()
is highly race, we need to add more check in move_charge.
For example, lock_page();-> wait_on_page_writeback();-> unlock_page();
is an candidate for new check.
This patch modifies a rountine as
for 32 pages: pte_offset_map_lock()
find_and_get a page
record it
pte_offset_map_unlock()
for all recorded pages
isolate it from LRU.
move charge
putback to LRU
put_page()
Code size change is:
(Before)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
text data bss dec hex filename
28247 7685 4100 40032 9c60 mm/memcontrol.o
(After)
[kamezawa@bluextal mmotm-1008]$ size mm/memcontrol.o
text data bss dec hex filename
28591 7685 4100 40376 9db8 mm/memcontrol.o
Easy Bencmark score.
Moving 2Gbytes anonymous memory task between cgroup/A and cgroup/B.
<===== shows a function under pte_lock.
Before Patch.
real 0m42.346s
user 0m0.002s
sys 0m39.668s
13.88% swap_task.sh [kernel.kallsyms] [k] put_page <=====
10.37% swap_task.sh [kernel.kallsyms] [k] isolate_lru_page <=====
10.25% swap_task.sh [kernel.kallsyms] [k] is_target_pte_for_mc <=====
7.85% swap_task.sh [kernel.kallsyms] [k] mem_cgroup_move_account <=====
7.63% swap_task.sh ...On Tue, 12 Oct 2010 14:48:01 +0900 Looks good to me. Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Thanks, Daisuke Nishimura. --
-- It will take more convincing, all important functions (charge/uncharge use lock_page_cgroup()). I'd like to see the page fault scalability test results. I am not against this patch, just want to see the scalability numbers. Three Cheers, Balbir --
Add cgroupfs interface to memcg dirty page limits:
Direct write-out is controlled with:
- memory.dirty_ratio
- memory.dirty_bytes
Background write-out is controlled with:
- memory.dirty_background_ratio
- memory.dirty_background_bytes
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
mm/memcontrol.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 89 insertions(+), 0 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 6ec2625..2d45a0a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -100,6 +100,13 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_NSTATS,
};
+enum {
+ MEM_CGROUP_DIRTY_RATIO,
+ MEM_CGROUP_DIRTY_BYTES,
+ MEM_CGROUP_DIRTY_BACKGROUND_RATIO,
+ MEM_CGROUP_DIRTY_BACKGROUND_BYTES,
+};
+
struct mem_cgroup_stat_cpu {
s64 count[MEM_CGROUP_STAT_NSTATS];
};
@@ -4292,6 +4299,64 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}
+static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
+ bool root;
+
+ root = mem_cgroup_is_root(mem);
+
+ switch (cft->private) {
+ case MEM_CGROUP_DIRTY_RATIO:
+ return root ? vm_dirty_ratio : mem->dirty_param.dirty_ratio;
+ case MEM_CGROUP_DIRTY_BYTES:
+ return root ? vm_dirty_bytes : mem->dirty_param.dirty_bytes;
+ case MEM_CGROUP_DIRTY_BACKGROUND_RATIO:
+ return root ? dirty_background_ratio :
+ mem->dirty_param.dirty_background_ratio;
+ case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+ return root ? dirty_background_bytes :
+ mem->dirty_param.dirty_background_bytes;
+ default:
+ BUG();
+ }
+}
+
+static int
+mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+
+ if (cgrp->parent == NULL)
+ return -EINVAL;
+ if ((type == MEM_CGROUP_DIRTY_RATIO ||
+ type == ...On Sun, 3 Oct 2010 23:58:03 -0700 Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Curious....is this same behavior as vm_dirty_ratio ? Thanks, -Kame --
I think this is same behavior as vm_dirty_ratio. When vm_dirty_ratio is changed then dirty_ratio_handler() will set vm_dirty_bytes=0. When vm_dirty_bytes is written dirty_bytes_handler() will set vm_dirty_ratio=0. So I think that the per-memcg dirty memory parameters mimic the behavior of vm_dirty_ratio, vm_dirty_bytes and the other global dirty parameters. --
On Tue, 05 Oct 2010 00:33:15 -0700 No. Thank you for clarification. -Kame --
mmh... looking at the code it seems the same behaviour, but in Documentation/sysctl/vm.txt we say a different thing (i.e., for dirty_bytes): "If dirty_bytes is written, dirty_ratio becomes a function of its value (dirty_bytes / the amount of dirtyable system memory)." However, in dirty_bytes_handler()/dirty_ratio_handler() we actually set the counterpart value as 0. I think we should clarify the documentation. Signed-off-by: Andrea Righi <arighi@develer.com> --- Documentation/sysctl/vm.txt | 12 ++++++++---- 1 files changed, 8 insertions(+), 4 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index b606c2c..30289fa 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -80,8 +80,10 @@ dirty_background_bytes Contains the amount of dirty memory at which the pdflush background writeback daemon will start writeback. -If dirty_background_bytes is written, dirty_background_ratio becomes a function -of its value (dirty_background_bytes / the amount of dirtyable system memory). +Note: dirty_background_bytes is the counterpart of dirty_background_ratio. Only +one of them may be specified at a time. When one sysctl is written it is +immediately taken into account to evaluate the dirty memory limits and the +other appears as 0 when read. ============================================================== @@ -97,8 +99,10 @@ dirty_bytes Contains the amount of dirty memory at which a process generating disk writes will itself start writeback. -If dirty_bytes is written, dirty_ratio becomes a function of its value -(dirty_bytes / the amount of dirtyable system memory). +Note: dirty_bytes is the counterpart of dirty_ratio. Only one of them may be +specified at a time. When one sysctl is written it is immediately taken into +account to evaluate the dirty memory limits and the other appears as 0 when +read. Note: the minimum value allowed for dirty_bytes is two pages (in bytes); any value lower than this ...
Acked-by: David Rientjes <rientjes@google.com> Thanks for cc'ing me on this, Andrea. --
Reviewed-by: Greg Thelen <gthelen@google.com> This documentation change is general cleanup that is independent of the --
Thanks Greg. I'll resend it as an independent patch. -Andrea --
The added interface is not uniform with the rest of our write
operations. Does the patch below help? I did a quick compile and run
test.
Make writes to memcg dirty tunables more uniform
From: Balbir Singh <balbir@linux.vnet.ibm.com>
We today support 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for
general memcg writes. This patch provides the same functionality
for dirty tunables.
---
mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++----------
1 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d45a0a..3c360e6 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4323,6 +4323,41 @@ static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
}
static int
+mem_cgroup_dirty_write_string(struct cgroup *cont, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (cgrp->parent == NULL)
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4338,18 +4373,10 @@ mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
memcg->dirty_param.dirty_ratio = val;
memcg->dirty_param.dirty_bytes = 0;
break;
- case ...here is a version with my signed-off-by
Make writes to memcg dirty tunables more uniform
From: Balbir Singh <balbir@linux.vnet.ibm.com>
We today support 'M', 'm', 'k', 'K', 'g' and 'G' suffixes for
general memcg writes. This patch provides the same functionality
for dirty tunables.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
mm/memcontrol.c | 47 +++++++++++++++++++++++++++++++++++++----------
1 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2d45a0a..116fecd 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4323,6 +4323,41 @@ static u64 mem_cgroup_dirty_read(struct cgroup *cgrp, struct cftype *cft)
}
static int
+mem_cgroup_dirty_write_string(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ int type = cft->private;
+ int ret = -EINVAL;
+ unsigned long long val;
+
+ if (cgrp->parent == NULL)
+ return ret;
+
+ switch (type) {
+ case MEM_CGROUP_DIRTY_BYTES:
+ /* This function does all necessary parse...reuse it */
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_bytes = val;
+ memcg->dirty_param.dirty_ratio = 0;
+ break;
+ case MEM_CGROUP_DIRTY_BACKGROUND_BYTES:
+ ret = res_counter_memparse_write_strategy(buffer, &val);
+ if (ret)
+ break;
+ memcg->dirty_param.dirty_background_bytes = val;
+ memcg->dirty_param.dirty_background_ratio = 0;
+ break;
+ default:
+ BUG();
+ break;
+ }
+ return ret;
+}
+
+static int
mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
{
struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
@@ -4338,18 +4373,10 @@ mem_cgroup_dirty_write(struct cgroup *cgrp, struct cftype *cft, u64 val)
memcg->dirty_param.dirty_ratio = val;
memcg->dirty_param.dirty_bytes = 0;
break;
- case MEM_CGROUP_DIRTY_BYTES:
- memcg->dirty_param.dirty_bytes = ...Looks good to me. I am currently gather performance data on the memcg series. It should be done in an hour or so. I'll then repost V2 of the memcg dirty limits series. I'll integrate this patch into the series, unless there's objection. --
Please go ahead and incorporate it. Thanks! -- Three Cheers, Balbir --
Is it a good idea to rename "dirty_bytes" to "dirty_limit_in_bytes" ? So that it can match with other memcg tunable naming convention. We already have memory.memsw.limit_in_bytes, memory.limit_in_bytes, --
I see your point in trying to be more internally consistent with other memcg counter. It's a trade-off, either use names consistent with /proc/sys/vm, or use names similar to other memory.* control files. I prefer your suggestion PS: I am collecting performance data on patch series (including Kame's lockless writeback stats). I should have some useful data today. --
Add memcg routines to track dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.
A later change adds kernel calls to these new routines.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
include/linux/memcontrol.h | 3 +
mm/memcontrol.c | 89 ++++++++++++++++++++++++++++++++++++++++----
2 files changed, 84 insertions(+), 8 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7c7bec4..6303da1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -28,6 +28,9 @@ struct mm_struct;
/* Stats that can be updated by kernel. */
enum mem_cgroup_write_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 267d774..f40839f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -85,10 +85,13 @@ enum mem_cgroup_stat_index {
*/
MEM_CGROUP_STAT_CACHE, /* # of pages charged as cache */
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
- MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_PGPGIN_COUNT, /* # of pages paged in */
MEM_CGROUP_STAT_PGPGOUT_COUNT, /* # of pages paged out */
MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
+ MEM_CGROUP_STAT_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
/* incremented at every pagein/pageout */
...On Sun, 3 Oct 2010 23:58:00 -0700 Could you make this as nfs_unstable as meminfo shows ? If I am a user, I think this is the number of NFS pages not NFS_UNSTABLE pages. Thanks, -Kame --
Good catch! In the next revision I will change this from "nfs"/"total_nfs" to "NFS_Unstable"/"total_NFS_Unstable" to match --
Nitpick. The comment doesn't give any useful information. -- Kind regards, Minchan Kim --
I agree. I removed the four redundant comments you referred to. Thanks --
Extend mem_cgroup to contain dirty page limits. Also add routines
allowing the kernel to query the dirty usage of a memcg.
These interfaces not used by the kernel yet. A subsequent commit
will add kernel calls to utilize these new routines.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
include/linux/memcontrol.h | 44 +++++++++++
mm/memcontrol.c | 180 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 223 insertions(+), 1 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6303da1..dc8952d 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,6 +19,7 @@
#ifndef _LINUX_MEMCONTROL_H
#define _LINUX_MEMCONTROL_H
+#include <linux/writeback.h>
#include <linux/cgroup.h>
struct mem_cgroup;
struct page_cgroup;
@@ -33,6 +34,30 @@ enum mem_cgroup_write_page_stat_item {
MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+ MEMCG_NR_DIRTYABLE_PAGES,
+ MEMCG_NR_RECLAIM_PAGES,
+ MEMCG_NR_WRITEBACK,
+ MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+ int dirty_ratio;
+ int dirty_background_ratio;
+ unsigned long dirty_bytes;
+ unsigned long dirty_background_bytes;
+};
+
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+ param->dirty_ratio = vm_dirty_ratio;
+ param->dirty_bytes = vm_dirty_bytes;
+ param->dirty_background_ratio = dirty_background_ratio;
+ param->dirty_background_bytes = dirty_background_bytes;
+}
+
extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
struct list_head *dst,
unsigned long *scanned, int order,
@@ -145,6 +170,10 @@ static inline void mem_cgroup_dec_page_stat(struct page *page,
mem_cgroup_update_page_stat(page, idx, -1);
}
+bool ...On Sun, 3 Oct 2010 23:58:02 -0700 Seems nice. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
We only check the pointer without dereferencing it, so this is probably
ok, but maybe this is safer:
bool mem_cgroup_has_dirty_limit(void)
{
struct mem_cgroup *mem;
bool ret;
if (mem_cgroup_disabled())
return false;
rcu_read_lock();
mem = mem_cgroup_from_task(current);
ret = mem && !mem_cgroup_is_root(mem);
rcu_read_unlock();
return ret;
}
rcu_read_lock() should be held in mem_cgroup_from_task(), otherwise
lockdep could detect this as an error.
Thanks,
--
Good suggestion. I agree that lockdep might catch this. There are some unrelated debug_locks failures (even without my patches) that I worked around to get lockdep to complain about this one. I applied your suggested fix and lockdep was happy. I will incorporate this fix into --
On Tue, 05 Oct 2010 12:00:17 -0700 Hmm, considering other parts, shouldn't we define mem_cgroup_from_task as macro ? Thanks, -Kame --
Is your motivation to increase performance with the same functionality? If so, then would a 'static inline' be performance equivalent to a preprocessor macro yet be safer to use? Maybe it makes more sense to find a way to perform this check in mem_cgroup_has_dirty_limit() without needing to grab the rcu lock. I think this lock grab is unneeded. I am still collecting performance data, but suspect that this may be making the code slower than it needs to be. -- Greg --
On Wed, 06 Oct 2010 17:27:13 -0700 Ah, if lockdep finds this as bug, I think other parts will hit this, too. mem_cgroup_from_task() is designed to be used as this. Hmm. css_set[] itself is freed by RCU..what idea to remove rcu_read_lock() do you have ? Adding some flags ? Ah...I noticed that you should do mem = mem_cgroup_from_task(current->mm->owner); to check has_dirty_limit... -Kame --
mem_cgroup_from_task() calls task_subsys_state() calls task_subsys_state_check(). task_subsys_state_check() will be happy if rcu_read_lock is held. I don't think that this will fail lockdep, because rcu_read_lock_held() is true when calling mem_cgroup_from_task() within I do not understand how making mem_cgroup_from_task() a macro will change its behavior wrt. to lockdep assertion checking. I assume that as a macro mem_cgroup_from_task() would still call task_subsys_state(), which requires either: a) rcu read lock held b) task->alloc_lock held It seems like a shame to need a lock to determine if current is in the root cgroup. Especially given that as soon as mem_cgroup_has_dirty_limit() returns, the task could be moved in-to/out-of the root cgroup thereby invaliding the answer. So the answer is just a sample that may be wrong. But I think you are correct. What are the cases where current->mm->owner->cgroups != current->cgroups? I was hoping to avoid having add even more logic into mem_cgroup_has_dirty_limit() to handle the case where current->mm is NULL. Presumably the newly proposed vm_dirty_param(), mem_cgroup_has_dirty_limit(), and mem_cgroup_page_stat() routines all need to use the same logic. I assume they should all be consistently using current->mm->owner or current. -- Greg --
On Mon, 11 Oct 2010 17:24:21 -0700 In that case, assume group A and B. thread(1) -> belongs to cgroup A (thread(1) is mm->owner) thread(2) -> belongs to cgroup B and a page -> charnged to cgroup A Then, thread(2) make the page dirty which is under cgroup A. In this case, if page's dirty_pages accounting is added to cgroup B, cgroup B' Blease check current->mm. We can't limit works of kernel-thread by this, let's please. Thanks, -Kame --
I agree that in this case the dirty_pages accounting should be added to cgroup A because that is where the page was charged. This will happen because pc->mem_cgroup was set to A when the page was charged. The mark-page-dirty code will check pc->mem_cgroup to determine which cgroup to add the dirty page to. I think that the current vs current->mm->owner decision is in areas of the code that is used to query the dirty limits. These routines do not use this data to determine which cgroup to charge for dirty pages. The usage of either mem_cgroup_from_task(current->mm->owner) or mem_cgroup_from_task(current) in mem_cgroup_has_dirty_limit() does not determine which cgroup is added for dirty_pages. mem_cgroup_has_dirty_limit() is only used to determine if the process has a dirty limit. As discussed, this is a momentary answer that may be wrong by the time decisions are made because the task may be migrated in-to/out-of root cgroup while mem_cgroup_has_dirty_limit() runs. If the process has a dirty limit, then the process's memcg is used to compute dirty limits. Using your example, I assume that thread(1) and thread(2) will git dirty limits from cgroup(A) and cgroup(B) respectively. Are you thinking that when accounting for a dirty page (by incrementing pc->mem_cgroup->stat->count[MEM_CGROUP_STAT_FILE_DIRTY]) that we should --
On Tue, 12 Oct 2010 00:32:33 -0700 Ok, thank you for clarification. Throttoling a thread based on its own cgroup not based on mm->owner makes sense. Could you add a brief comment on the code ? Thanks, -Kame --
Add calls into memcg dirty page accounting. Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.
This allows the memory controller to maintain an accurate view of
the amount of its memory that is dirty.
Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
fs/nfs/write.c | 4 ++++
mm/filemap.c | 1 +
mm/page-writeback.c | 4 ++++
mm/truncate.c | 1 +
4 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 48199fb..9e206bd 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -450,6 +450,7 @@ nfs_mark_request_commit(struct nfs_page *req)
NFS_PAGE_TAG_COMMIT);
nfsi->ncommit++;
spin_unlock(&inode->i_lock);
+ mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -461,6 +462,7 @@ nfs_clear_request_commit(struct nfs_page *req)
struct page *page = req->wb_page;
if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+ mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(page, NR_UNSTABLE_NFS);
dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
return 1;
@@ -1316,6 +1318,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
req = nfs_list_entry(head->next);
nfs_list_remove_request(req);
nfs_mark_request_commit(req);
+ mem_cgroup_dec_page_stat(req->wb_page,
+ MEMCG_NR_FILE_UNSTABLE_NFS);
dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
BDI_RECLAIMABLE);
diff --git a/mm/filemap.c b/mm/filemap.c
index 3d4df44..82e0870 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -135,6 +135,7 @@ void __remove_from_page_cache(struct page *page)
* having removed the page ...On Sun, 3 Oct 2010 23:58:01 -0700 Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
If the current process is in a non-root memcg, then
global_dirty_limits() will consider the memcg dirty limit.
This allows different cgroups to have distinct dirty limits
which trigger direct and background writeback at different
levels.
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
mm/page-writeback.c | 87 ++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 72 insertions(+), 15 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a0bb3e2..c1db336 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -180,7 +180,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
* Returns the numebr of pages that can currently be freed and used
* by the kernel for direct mappings.
*/
-static unsigned long determine_dirtyable_memory(void)
+static unsigned long get_global_dirtyable_memory(void)
{
unsigned long x;
@@ -192,6 +192,58 @@ static unsigned long determine_dirtyable_memory(void)
return x + 1; /* Ensure that we never return 0 */
}
+static unsigned long get_dirtyable_memory(void)
+{
+ unsigned long memory;
+ s64 memcg_memory;
+
+ memory = get_global_dirtyable_memory();
+ if (!mem_cgroup_has_dirty_limit())
+ return memory;
+ memcg_memory = mem_cgroup_page_stat(MEMCG_NR_DIRTYABLE_PAGES);
+ BUG_ON(memcg_memory < 0);
+
+ return min((unsigned long)memcg_memory, memory);
+}
+
+static long get_reclaimable_pages(void)
+{
+ s64 ret;
+
+ if (!mem_cgroup_has_dirty_limit())
+ return global_page_state(NR_FILE_DIRTY) +
+ global_page_state(NR_UNSTABLE_NFS);
+ ret = mem_cgroup_page_stat(MEMCG_NR_RECLAIM_PAGES);
+ BUG_ON(ret < 0);
+
+ return ret;
+}
+
+static long get_writeback_pages(void)
+{
+ s64 ret;
+
+ if (!mem_cgroup_has_dirty_limit())
+ return global_page_state(NR_WRITEBACK);
+ ret = mem_cgroup_page_stat(MEMCG_NR_WRITEBACK);
+ BUG_ON(ret < 0);
+
+ return ret;
+}
+
+static unsigned long ...Just a nitpick. You seem to like get_xxxx name. But I think it's a redundant and just makes function name longer without any benefit. In kernel, many place doesn't use get_xxx naming. -- Kind regards, Minchan Kim --
On Sun, 3 Oct 2010 23:58:05 -0700 This patch seems good because of straightforward implementation. I think worth to be tested in -mm tree. Thank you very much. Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
The determine_dirtyable_memory() function is not used outside of
page writeback. Make the routine static. No functional change.
Just a cleanup in preparation for a change that adds memcg dirty
limits consideration into global_dirty_limits().
Signed-off-by: Andrea Righi <arighi@develer.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
---
include/linux/writeback.h | 2 -
mm/page-writeback.c | 122 ++++++++++++++++++++++----------------------
2 files changed, 61 insertions(+), 63 deletions(-)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 72a5d64..9eacdca 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -105,8 +105,6 @@ extern int vm_highmem_is_dirtyable;
extern int block_dump;
extern int laptop_mode;
-extern unsigned long determine_dirtyable_memory(void);
-
extern int dirty_background_ratio_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp,
loff_t *ppos);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 820eb66..a0bb3e2 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -132,6 +132,67 @@ static struct prop_descriptor vm_completions;
static struct prop_descriptor vm_dirties;
/*
+ * Work out the current dirty-memory clamping and background writeout
+ * thresholds.
+ *
+ * The main aim here is to lower them aggressively if there is a lot of mapped
+ * memory around. To avoid stressing page reclaim with lots of unreclaimable
+ * pages. It is better to clamp down on writers than to start swapping, and
+ * performing lots of scanning.
+ *
+ * We only allow 1/2 of the currently-unmapped memory to be dirtied.
+ *
+ * We don't permit the clamping level to fall below 5% - that is getting rather
+ * excessive.
+ *
+ * We make sure that the background writeout level is below the adjusted
+ * clamping level.
+ */
+
+static unsigned long highmem_dirtyable_memory(unsigned long total)
+{
+#ifdef CONFIG_HIGHMEM
+ int node;
+ unsigned long ...On Sun, 3 Oct 2010 23:58:04 -0700 Hmm. Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> --
I suspect this means that we'll need a bdi controller in the I/O -- Three Cheers, Balbir --
Hi, Greg,
I see a problem with " memcg: add dirty page accounting infrastructure".
The reject is
enum mem_cgroup_write_page_stat_item {
MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+ MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+ MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
};
I don't see mem_cgroup_write_page_stat_item in memcontrol.h. Is this
based on top of Kame's cleanup.
I am working off of mmotm 28 sept 2010 16:13.
--
Three Cheers,
Balbir
--
Balbir,
All of the 10 memcg dirty limits patches should apply directly to mmotm
28 sept 2010 16:13 without any other patches. Any of Kame's cleanup
patches that are not in mmotm are not needed by this memcg dirty limit
series.
The patch you refer to, "[PATCH 05/10] memcg: add dirty page accounting
infrastructure" depends on a change from an earlier patch in the series.
Specifically, "[PATCH 03/10] memcg: create extensible page stat update
routines" contains the addition of mem_cgroup_write_page_stat_item:
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -25,6 +25,11 @@ struct page_cgroup;
struct page;
struct mm_struct;
+/* Stats that can be updated by kernel. */
+enum mem_cgroup_write_page_stat_item {
+ MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+};
+
Do you have trouble applying patch 5 after applying patches 1-4?
--
I could apply all the patches cleanly on mmotm 28/09/2010. Kernel build --
Hi Greg, the patchset seems to work fine on my box. I also ran a pretty simple test to directly verify the effectiveness of the dirty memory limit, using a dd running on a non-root memcg: dd if=/dev/zero of=tmpfile bs=1M count=512 and monitoring the max of the "dirty" value in cgroup/memory.stat: Here the results: dd in non-root memcg ( 4 MiB memcg dirty limit): dirty max=4227072 dd in non-root memcg ( 8 MiB memcg dirty limit): dirty max=8454144 dd in non-root memcg ( 16 MiB memcg dirty limit): dirty max=15179776 dd in non-root memcg ( 32 MiB memcg dirty limit): dirty max=32235520 dd in non-root memcg ( 64 MiB memcg dirty limit): dirty max=64245760 dd in non-root memcg (128 MiB memcg dirty limit): dirty max=121028608 dd in non-root memcg (256 MiB memcg dirty limit): dirty max=232865792 dd in non-root memcg (512 MiB memcg dirty limit): dirty max=445194240 -Andrea --
Greg, could you please try the parallel page fault test. Could you look at commit 0c3e73e84fe3f64cf1c2e8bb4e91e8901cbcdc38 and 569b846df54ffb2827b83ce3244c5f032394cba4 for examples. -- Three Cheers, Balbir --
On Sun, 3 Oct 2010 23:57:55 -0700
Greg, this is a patch on your set.
mmotm-1014
- memcg-reduce-lock-hold-time-during-charge-moving.patch
(I asked Andrew to drop this)
+ your 1,2,3,5,6,7,8,9,10 (dropped patch "4")
I'm grad if you merge this to your set as replacement of "4".
I'll prepare a performance improvement patch and post it if this dirty_limit
patches goes to -mm.
Thank you for your work.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Now, at supporing dirty limit, there is a deadlock problem in accounting.
1. If pages are being migrated from a memcg, then updates to that
memcg page statistics are protected by grabbing a bit spin lock
using lock_page_cgroup(). In recent changes of dirty page accounting
is updating memcg page accounting (specifically: num writeback pages)
from IRQ context (softirq). Avoid a deadlocking nested spin lock attempt
by irq on the local processor when grabbing the page_cgroup.
2. lock for update_stat is used only for avoiding race with move_account().
So, IRQ awareness of lock_page_cgroup() itself is not a problem. The problem
is in update_stat() and move_account().
Then, this reworks locking scheme of update_stat() and move_account() by
adding new lock bit PCG_MOVE_LOCK, which is always taken under IRQ disable.
Trade-off
* using lock_page_cgroup() + disable IRQ has some impacts on performance
and I think it's bad to disable IRQ when it's not necessary.
* adding a new lock makes move_account() slow. Score is here.
Peformance Impact: moving a 8G anon process.
Before:
real 0m0.792s
user 0m0.000s
sys 0m0.780s
After:
real 0m0.854s
user 0m0.000s
sys 0m0.842s
This score is bad but planned patches for optimization can reduce
this impact.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
include/linux/page_cgroup.h | 31 ++++++++++++++++++++++++++++---
mm/memcontrol.c | 9 +++++++--
2 files changed, 35 insertions(+), 5 ...Thanks for the patch. I will merge your patch (below) as a replacement --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
