On Tue, Apr 06, 2010 at 05:05:51PM -0700, Andrew Morton wrote:
I have a difficultly in that it's hard to give you fixes as it would
span two patches. It might be easiest on you overall if you so a
s/COMPACT_INCOMPLETE/COMPACT_CONTINUE/
on both this patch and the direct compaction patch. I'll then send a follow-on
patch documenting the four defines (later patch adds a fourth) as
/* Return values for compact_zone() and try_to_compact_pages() */
/* compaction didn't start as it was not possible or direct reclaim was more suitable */
#define COMPACT_SKIPPED 0
/* compaction should continue to another pageblock */
#define COMPACT_CONTINUE 1
/* direct compaction partially compacted a zone and there are suitable pages */
#define COMPACT_PARTIAL 2
/* The full zone was compacted */
#define COMPACT_COMPLETE 3
To reduce the amount of time zone locks are held.
Pro: Latencies are lower, fewer pages are isolated at any given time
Con: There is a wider window during which a parallel allocator can use a
page within the pageblock being compacted
It's somewhat arbitrary, only that reclaim works on similar units and
they share logic on what the correct number of pages to have isolated
from the LRU lists are.
The higher the value, the longer the latency is that the lock is held
during isolation but under very heavy memory pressure, there might be
higher success rates for allocation as the window during which parallel
allocators can allocate pages being compacted is reduced.
The lower the value, the lower the time the lock is held. Fewer pages
will be isolated at any given time.
The only advantage of either choice is increasing the value makes it
less likely a parallel allocator will interfere but it had to be
balanced against the lock hold latency time. As we appear to be ok with
the hold time for reclaim, it was reasonable to assume we'd also be ok
with the hold time for compaction.
Subsystems needing lists of free pages would be using mempools.
Included in the patch below. The corner-case is impossible. We're
isolating only COMPACT_CLUSTER_MAX and this must be less than
MAX_ORDER_NR_PAGES. However, the return value of the function is used with
an unsigned long. Technically, it could be unsigned int but page counts
are always in unsigned long so why be surprising.
It looks better. Done.
Typically, a MAX_ORDER_NR_PAGES naturally-aligned block of pages is
considered valid if any one of them return true for pfn_valid(). The
caller of this function has checked the block with pfn_valid so the
block of pages is "valid".
Some architectures insist on punching holes within a block of
MAX_ORDER_NR_PAGES. These are required to call pfn_valid_within() when
walking a range of PFNs. For architectures without these holes,
pfn_valid_within() is a no-op.
Ordinarily, a PFN walker is required to use pfn_to_page() in case it crosses
something like a sparsemem boundary (assuming no VMEMMAP) where there may
be no relationship between the PFN and the struct page location.
In this specific case though, we are within a MAX_ORDER_NR_PAGES block
so it's safe to cache the struct page assuming nothing crazy is
introduced by a memory model.
Done.
The page being broken up could be any size. It's not necessarily related
to pageblocks.
That's what the code marked with "If a page was split, advance to the
end of it" is for. It knows how to advance to the end of the buddy page
without accidentally skipping over a page.
It's easiest to consider migrating pages to and from in ranges of pageblocks
because that is the granularity anti-frag works on. There is very little gained
by considering a lower boundary. With direct compaction, compact_finished()
is checking on a regular basis whether it's ok to finish compaction early
because the caller is satisified.
At worst at the moment, more of a pageblock gets compacted than potentially
necessary for an order-4 allocation to succeed. Specifically, one pageblock
will get fully compacted even though only a small amount of it may have been
required. It'd be possible to do such an optimisation, but it'll be a
micro-optimisation and will obscure the logic somewhat.
Ok.
Correct.
This is walking in strides of pageblock_nr_pages. You only have to call
pfn_valid() once for MAX_ORDER_NR_PAGES but if walking the PFNs within
the block, pfn_valid_within() must be called for each one.
Granted, pageblock_nr_pages != MAX_ORDER_NR_PAGES, but it'd be little
more than a micro-optimisation to identify exactly when the boundary was
crossed and call pfn_valid() a few times less.
Absolute worst case, until it reaches the location of the migration
scanner. As we are isolating pages for migration in units of 32 pages,
it seems unlikely that the migration and free page scanner would be a
substantial difference apart without 32 free pages between them.
hint is in the name. It tells you if there are "too many pages
isolated". Included in the patch below.
Unlikely, but yes.
Done, in the first follow-on patch.
What indeed. Changed to "Only scan within a pageblock boundary"
/*
* Ensure that there are not too many pages isolated from the LRU
* list by either parallel reclaimers or compaction. If there are,
* delay for some time until fewer pages are isolated
*/
The expected cause of too many pages being isolated is parallel reclaimers. Too
many pages isolated implies pages are being cleaned so wait for a period of
time or until IO congestion clears to try again.
Done.
Done
No, although an interruption can be reason for a partial competion. In this
particular case, it's unfortunate because the caller is unlikely to get
the page requested but it also has received a fatal signal so it probably
doesn't care.
No reason why it can't happen but it's mitigated by only checking one PFN
per pageblock_nr_pages to see if it is valid in isolate_migratepages().
True.
Done.
Yes, compact_finished() has all the exit conditions.
Then the migrate scanner will eventually reach the free scanner and it
will exit.
Can you spot a corner case that is not covered by compact_finished() ?
It's a general comment on watermarks. Allocators shouldn't allow the
watermarks to be breached so that there are always pages for things like
TIF_MEMDIE. Changed the comment to
/* Obey watermarks as if the page was being allocated */
To do it would require changes to direct compaction as well. I'll do it
as a patch on top of the series as an incremental change to this patch
will be a mess.
You could do as you suggest, but it's would not reduce scanning. If anything,
it will increase it.
The objective is to move pages into the smallest number of pageblocks. For
that, we want all the free pages within a given range no matter what their
current order in the free lists are. Doing what you suggest would involve
scanning the buddy lists which is potentially more pages than a linear scan
of a range.
Here is a roll-up of the suggestions you made
==== CUT HERE ====
mm,compaction: Various fixes to the patch 'Memory compaction core'
o Have CONFIG_COMPACTION depend on HUGETLB_PAGE instead of HUGETLBFS
o Use unsigned long instead of int for page counters
o Simplify logic in isolate_freepages_block() and isolate_migratepages()
o Optimise isolate_freepages_block to use a cursor
o Use bool instead of int for true/false
o Clarify some comments
o Improve control flow in isolate_migratepages()
o Add newlines for clarity
o Simply loop in compact_zones
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/Kconfig | 2 +-
mm/compaction.c | 81 +++++++++++++++++++++++++++++++-----------------------
mm/page_alloc.c | 2 +-
3 files changed, 48 insertions(+), 37 deletions(-)
diff --git a/mm/Kconfig b/mm/Kconfig
index 4fd75a0..a275a7d 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -177,7 +177,7 @@ config COMPACTION
bool "Allow for memory compaction"
def_bool y
select MIGRATION
- depends on EXPERIMENTAL && HUGETLBFS && MMU
+ depends on EXPERIMENTAL && HUGETLB_PAGE && MMU
help
Allows the compaction of memory for the allocation of huge pages.
diff --git a/mm/compaction.c b/mm/compaction.c
index 3bb65d7..38b54e2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -40,10 +40,10 @@ struct compact_control {
struct zone *zone;
};
-static int release_freepages(struct list_head *freelist)
+static unsigned long release_freepages(struct list_head *freelist)
{
struct page *page, *next;
- int count = 0;
+ unsigned long count = 0;
list_for_each_entry_safe(page, next, freelist, lru) {
list_del(&page->lru);
@@ -55,28 +55,33 @@ static int release_freepages(struct list_head *freelist)
}
/* Isolate free pages onto a private freelist. Must hold zone->lock */
-static int isolate_freepages_block(struct zone *zone,
+static unsigned long isolate_freepages_block(struct zone *zone,
unsigned long blockpfn,
struct list_head *freelist)
{
unsigned long zone_end_pfn, end_pfn;
int total_isolated = 0;
+ struct page *cursor;
/* Get the last PFN we should scan for free pages at */
zone_end_pfn = zone->zone_start_pfn + zone->spanned_pages;
- end_pfn = blockpfn + pageblock_nr_pages;
- if (end_pfn > zone_end_pfn)
- end_pfn = zone_end_pfn;
+ end_pfn = min(blockpfn + pageblock_nr_pages, zone_end_pfn);
- /* Isolate free pages. This assumes the block is valid */
+ /* Find the first usable PFN in the block to initialse page cursor */
for (; blockpfn < end_pfn; blockpfn++) {
- struct page *page;
+ if (pfn_valid_within(blockpfn))
+ break;
+ }
+ cursor = pfn_to_page(blockpfn);
+
+ /* Isolate free pages. This assumes the block is valid */
+ for (; blockpfn < end_pfn; blockpfn++, cursor++) {
int isolated, i;
+ struct page *page = cursor;
if (!pfn_valid_within(blockpfn))
continue;
- page = pfn_to_page(blockpfn);
if (!PageBuddy(page))
continue;
@@ -89,38 +94,40 @@ static int isolate_freepages_block(struct zone *zone,
}
/* If a page was split, advance to the end of it */
- if (isolated)
+ if (isolated) {
blockpfn += isolated - 1;
+ cursor += isolated - 1;
+ }
}
return total_isolated;
}
-/* Returns 1 if the page is within a block suitable for migration to */
-static int suitable_migration_target(struct page *page)
+/* Returns true if the page is within a block suitable for migration to */
+static bool suitable_migration_target(struct page *page)
{
int migratetype = get_pageblock_migratetype(page);
/* Don't interfere with memory hot-remove or the min_free_kbytes blocks */
if (migratetype == MIGRATE_ISOLATE || migratetype == MIGRATE_RESERVE)
- return 0;
+ return false;
/* If the page is a large free page, then allow migration */
if (PageBuddy(page) && page_order(page) >= pageblock_order)
- return 1;
+ return true;
/* If the block is MIGRATE_MOVABLE, allow migration */
if (migratetype == MIGRATE_MOVABLE)
- return 1;
+ return true;
/* Otherwise skip the block */
- return 0;
+ return false;
}
/*
* Based on information in the current compact_control, find blocks
- * suitable for isolating free pages from
+ * suitable for isolating free pages from and then isolate them.
*/
static void isolate_freepages(struct zone *zone,
struct compact_control *cc)
@@ -143,7 +150,7 @@ static void isolate_freepages(struct zone *zone,
spin_lock_irqsave(&zone->lock, flags);
for (; pfn > low_pfn && cc->nr_migratepages > nr_freepages;
pfn -= pageblock_nr_pages) {
- int isolated;
+ unsigned long isolated;
if (!pfn_valid(pfn))
continue;
@@ -199,7 +206,7 @@ static void acct_isolated(struct zone *zone, struct compact_control *cc)
}
/* Similar to reclaim, but different enough that they don't share logic */
-static int too_many_isolated(struct zone *zone)
+static bool too_many_isolated(struct zone *zone)
{
unsigned long inactive, isolated;
@@ -220,16 +227,12 @@ static unsigned long isolate_migratepages(struct zone *zone,
struct compact_control *cc)
{
unsigned long low_pfn, end_pfn;
- struct list_head *migratelist;
-
- low_pfn = cc->migrate_pfn;
- migratelist = &cc->migratepages;
+ struct list_head *migratelist = &cc->migratepages;
/* Do not scan outside zone boundaries */
- if (low_pfn < zone->zone_start_pfn)
- low_pfn = zone->zone_start_pfn;
+ low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
- /* Setup to scan one block but not past where we are migrating to */
+ /* Only scan within a pageblock boundary */
end_pfn = ALIGN(low_pfn + pageblock_nr_pages, pageblock_nr_pages);
/* Do not cross the free scanner or scan within a memory hole */
@@ -238,7 +241,11 @@ static unsigned long isolate_migratepages(struct zone *zone,
return 0;
}
- /* Do not isolate the world */
+ /*
+ * Ensure that there are not too many pages isolated from the LRU
+ * list by either parallel reclaimers or compaction. If there are,
+ * delay for some time until fewer pages are isolated
+ */
while (unlikely(too_many_isolated(zone))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -261,12 +268,14 @@ static unsigned long isolate_migratepages(struct zone *zone,
}
/* Try isolate the page */
- if (__isolate_lru_page(page, ISOLATE_BOTH, 0) == 0) {
- del_page_from_lru_list(zone, page, page_lru(page));
- list_add(&page->lru, migratelist);
- mem_cgroup_del_lru(page);
- cc->nr_migratepages++;
- }
+ if (__isolate_lru_page(page, ISOLATE_BOTH, 0) != 0)
+ continue;
+
+ /* Successfully isolated */
+ del_page_from_lru_list(zone, page, page_lru(page));
+ list_add(&page->lru, migratelist);
+ mem_cgroup_del_lru(page);
+ cc->nr_migratepages++;
/* Avoid isolating too much */
if (cc->nr_migratepages == COMPACT_CLUSTER_MAX)
@@ -317,6 +326,7 @@ static void update_nr_listpages(struct compact_control *cc)
int nr_migratepages = 0;
int nr_freepages = 0;
struct page *page;
+
list_for_each_entry(page, &cc->migratepages, lru)
nr_migratepages++;
list_for_each_entry(page, &cc->freepages, lru)
@@ -362,7 +372,7 @@ static int compact_finished(struct zone *zone,
static int compact_zone(struct zone *zone, struct compact_control *cc)
{
- int ret = COMPACT_INCOMPLETE;
+ int ret;
/* Setup to move all movable pages to the end of the zone */
cc->migrate_pfn = zone->zone_start_pfn;
@@ -371,8 +381,9 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
migrate_prep();
- for (; ret == COMPACT_INCOMPLETE; ret = compact_finished(zone, cc)) {
+ while ((ret = compact_finished(zone, cc)) == COMPACT_INCOMPLETE) {
unsigned long nr_migrate, nr_remaining;
+
if (!isolate_migratepages(zone, cc))
continue;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 66823bd..08b6306 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1223,7 +1223,7 @@ int split_free_page(struct page *page)
zone = page_zone(page);
order = page_order(page);
- /* Obey watermarks or the system could deadlock */
+ /* Obey watermarks as if the page was being allocated */
watermark = low_wmark_pages(zone) + (1 << order);
if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
return 0;
--