Mel Gorman posted the seventh version of his Memory Compaction patches, asking, "are there any further obstacles to merging?" The patches, first posted in May of 2007, provide a mechanism for moving __GFP_MOVABLE pages into a smaller number of pageblocks, reducing external fragmentation. Mel explains that 'compaction' is only one method of defragmenting memory: "for example, lumpy reclaim is a form of defragmentation as was slub 'defragmentation' (really a form of targeted reclaim). Hence, this is called 'compaction' to distinguish it from other forms of defragmentation."
The core compaction patch explains that memory is compacted in a zone by relocating movable pages towards the end of the zone:
"A single compaction run involves a migration scanner and a free scanner. Both scanners operate on pageblock-sized areas in the zone. The migration scanner starts at the bottom of the zone and searches for all movable pages within each area, isolating them onto a private list called migratelist. The free scanner starts at the top of the zone and searches for suitable areas and consumes the free pages within making them available for the migration scanner. The pages isolated for migration are then migrated to the newly isolated free pages."
"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:
"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."
Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."
"Allocations of larger pages are not reliable in Linux. If larger pages have to be allocated then one faces various choices of allowing graceful fallback or using vmalloc with a performance penalty due to the use of a page table," began Christoph Lameter, describing the third version of his virtual compound page support patchset. He continued, "a virtual compound allocation means that there will be first of all an attempt to satisfy the request with physically contiguous memory. If that is not possible then a virtually contiguous memory will be created." Christopher proposed two advantages:
"1. Current uses of vmalloc can be converted to allocate virtual compounds instead. In most cases physically contiguous memory can be used which avoids the vmalloc performance penalty. 2. Uses of higher order allocations (stacks, buffers etc) can be converted to use virtual compounds instead. Physically contiguous memory will still be used for those higher order allocs in general but the system can degrade to the use of vmalloc should memory become heavily fragmented."
"You can play games in user space, but you're fooling yourself if you think you can do as well as doing it in the kernel. And you're *definitely* fooling yourself if you think mmap() solves performance issues. 'Zero-copy' does not equate to 'fast'. Memory speeds may be slower than core CPU speeds, but not infinitely so!"
"Memory is getting relatively cheap these days --- we're talking maybe US$30 to US$40 per megabyte if your machine can take SIMMS. Upgrading a machine from 2 meg to 4 meg doesn't cost *that* much money."
The former 2.4 Linux kernel maintainer, Marcelo Tosatti, resurrected a discussion on adding support for out-of-memory notifications to the Linux kernel. He explained, "AIX contains the SIGDANGER signal to notify applications to free up some unused cached memory," noting, "there have been a few discussions on implementing such an idea on Linux, but nothing concrete has been achieved." In a request for discussion, Marcelo added, "on the kernel side Rik suggested two notification points: 'about to swap' (for desktop scenarios) and 'about to OOM' (for embedded-like scenarios)." Rik van Riel explained:
"The first threshold - 'we are about to swap' - means the application frees memory that it can. Eg. free()d memory that glibc has not yet given back to the kernel, or JVM running the garbage collector, or ...
"The second threshold - 'we are out of memory' - means that the first approach has failed and the system needs to do something else. On an embedded system, I would expect some application to exit or maybe restart itself."
"Currently there is a strong tendency to avoid larger page allocations in the kernel because of past fragmentation issues and the current defragmentation methods are still evolving," Christoph Lameter began, posting the first version of his Virtual Compound Page Support patches, a followup to his earlier RFC. He explained, "we use vmalloc allocations in many locations to provide a safe way to allocate larger arrays. That is due to the danger of higher order allocations failing." Christoph continued, "this patch set provides a way for a higher page allocation to fall back. Instead of a physically contiguous page a virtually contiguous page is provided. The functionality of the vmalloc layer is used to provide the necessary page tables and control structures to establish a virtually contiguous area."
He then listed four advantages, including, "if higher order allocations are failing then virtual compound pages consisting of a series of order-0 pages can stand in for those allocations." He also listed three disadvantages, noting that new functions are used to access the virtually mapped memory, and adding, "virtual mappings are less efficient than physical mappings, [so] performance will drop once virtual fall back occurs." He also noted, "virtual mappings have more memory overhead; vm_area control structures, page tables, page arrays etc need to be allocated and managed to provide virtual mappings."
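One practical consequence of such a fallback: the free path has to work out which kind of memory it was handed. A minimal sketch, using the kernel's is_vmalloc_addr() helper (which Christoph introduced around the same time):

    #include <linux/gfp.h>
    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Sketch: free memory that may have fallen back to a virtual
     * mapping; 'order' is whatever the original allocation requested. */
    static void free_maybe_virtual(void *addr, int order)
    {
            if (is_vmalloc_addr(addr))
                    vfree(addr);            /* also tears down page tables */
            else
                    free_pages((unsigned long)addr, order);
    }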
"Applications with dynamic input and dynamic memory usage have some issues with the current overcommitting kernel," Daniel Spång explained looking for ideas on how to best manage out of memory (OOM) situations on embedded systems with little memory and without swap. He noted, "some kind of notification to the application that the available memory is scarce and let the application free up some memory (e.g., by flushing caches), could be used to improve the situation and avoid the OOM killer." Daniel then briefly described four possible solutions, looking for other ideas:
"1) Turn off overcommit. Results in a waste of memory. 2) Nokia uses a lowmem security module to signal on predetermined thresholds. Currently available in the -omap tree. But this requires manual tuning of the thresholds. 3) Using madvise() with MADV_FREE to get the kernel to free mmaped memory, typically application caches, when the kernel needs the memory. 4) A OOM handler that the application registers with the kernel, and that the kernel executes before the OOM-killer steps in."
Mel Gorman offered a first release of a patchset that compacts memory: "this is a prototype for compacting memory to reduce external fragmentation so that free memory exists as fewer, but larger contiguous blocks. Rather than being a full defragmentation solution, this focuses exclusively on pages that are movable via the page migration mechanism." He notes that the patchset is incomplete, and that at this time memory is only compacted manually, not automatically: "this version of the patchset is mainly concerned with getting the compaction mechanism correct." Mel goes on to describe how it works:
"A single compaction run involves two scanners operating within a zone - a migration and a free scanner. The migration scanner starts at the beginning of a zone and finds all movable pages within one pageblock_nr_pages-sized area and isolates them on a migratepages list. The free scanner begins at the end of the zone and searches on a per-area basis for enough free pages to migrate all the pages on the migratepages list. As each area is respectively migrated or exhausted of free pages, the scanners are advanced one area. A compaction run completes within a zone when the two scanners meet."
In a recent lkml thread, the concept of dumping an image of the kernel's memory to swap when the kernel hits a bug was discussed. Linus Torvalds pointed out that such a feature isn't useful to an operating system like Linux that runs on such a diverse assortment of computers: "yes, in a controlled environment, dumping the whole memory image to disk may be the right thing to do. BUT: in a controlled environment, you'll never get the kind of usage that Linux gets. Why do you think Linux (and Windows, for that matter) took away a lot of the market from traditional UNIX?" He went on to explain that there are systems where swap is not larger than the size of the core, so collecting a crash dump would not even be possible; that Linux instead tries to acknowledge bugs without crashing; and that quite often the bug is actually in the drivers: "writing to disk when the biggest problem is a driver to begin with is INSANE." Comparing Linux to Solaris, he added, "so the fact is, Solaris is crap, and to a large degree Solaris is crap exactly _because_ it assumes that it runs in a 'controlled environment'."
Alan Cox went on to point out that there are also privacy issues: "there is an additional factor - dumps contain data which variously is - copyright third parties, protected by privacy laws, just personally private, security sensitive (eg browser history) and so on. The only reasons you can get dumps back in the hands of vendors is because there are strong formal agreements controlling where they go and what is done with them." He also noted that dump utilities are not user-friendly: "diskdump (and even more so netdump) are useful in the hands of a developer crashing their own box just like kgdb, but not in the normal and rational end user response of 'its broken, hit reset'". Linus heartily agreed, suggesting that anyone willing to use kernel dumps would be better off debugging through a FireWire connection: "if you've ever picked through a kernel dump after-the-fact, I just bet you could have done equally well with firewire, and it would have had _zero_ impact on your kernel image. Now, contrast that with kdump, and ask yourself: which one do you think is worth concentrating effort on?"
As RAM increasingly becomes a commodity, prices drop and computer users are able to buy more of it. 32-bit architectures face certain limitations in accessing these growing amounts of RAM. To better understand the problem and the various solutions, we begin with an overview of Linux memory management; with the basics in hand, we can then define the problem precisely and review the various solutions.
This article was written by examining the Linux 2.6 kernel source code for the x86 architecture.