Daniel Phillips recently posted a patch to the Linux kernel mailing
list that implements a distributed mmap() API: "This function by itself
is enough to support a crude but useful form of distributed mmap where
a shared file is cached only on one cluster node at a time."
The mmap() system call maps a portion of a file or device into a process's address space, so its contents can be accessed as ordinary memory. A distributed mmap API extends this to files on remote machines that are visible in the local namespace via a distributed filesystem.
Daniel's patch is one of two that together implement the simple core API for distributed mmap(). With this simple API, cache invalidation works only on whole files, not on portions of a file, and multiple readers may not cache the same data simultaneously. An improved version of the kernel API, addressing both limitations, is planned for later.
From: Daniel Phillips [email blocked]
Subject: [RFC] Distributed mmap API
Date: Wed, 25 Feb 2004 22:20:11 +0100

This is the function formerly known as invalidate_mmap_range, with the
addition of a new code path in the zap_ call chain to handle MAP_PRIVATE
properly. This function by itself is enough to support a crude but useful
form of distributed mmap where a shared file is cached only on one cluster
node at a time. To use this, the distributed filesystem has to hook
do_no_page to intercept page faults and carry out the needed global
locking. The locking itself does not require any new kernel hooks.

In brief, the patch here and another patch to be presented for the
do_no_page hook, together provide the core kernel API for a simplified,
distributed mmap. (Note that there may be a workaround for the lack of a
do_no_page hook, but certainly not as simple and robust.)

To put this in perspective, I'll mention the two big limitations of the
simplified API:

  1) Invalidation is always a whole file at a time

  2) Multiple readers may not cache the same data simultaneously

To handle sub-file cache granularity, we also need to be able to flush
dirty data and evict cache pages with sub-file granularity, giving a trio
of cache management functions:

   unmap_mapping_range(mapping, start, length) /* this patch */
   write_mapping_range(mapping, start, length) /* start IO for dirty cache */
   evict_mapping_range(mapping, start, length) /* wait on IO and evict cache */

To handle (2) above, the distributed filesystem will need to hook and
modify the behaviour of do_wp_page so that it can intercept memory writes
to shared cache pages.
To summarize the current proposal, and where we need to go in the future:

Simple core kernel API for simplistic distributed memory map
------------------------------------------------------------

  - unmap_mapping_range export (this patch)
  - do_no_page hook

Improved core kernel API for optimal distributed memory map
-----------------------------------------------------------

  - unmap_mapping_range export (this patch)
  - write_mapping_range export
  - evict_mapping_range export
  - do_no_page hook
  - do_wp_page hook

There's no big rush to move on to the optimal version just now, since the
simplistic version is already a big step forward.

I'd like to take this opportunity to apologize to Paul for derailing his
more modest proposal, but unfortunately, the semantics that could be
obtained that way are fatally flawed: private mmaps just won't work. What
I've written here is about the minimum that supports acceptable mmap
semantics.

And finally, the EXPORT_SYMBOL_GPL issue: after much fretting I've changed
it to just EXPORT_SYMBOL in this patch, because I feel that we have better
ways to further our goals of free and open software than to try to use
this particular API as a battering ram. Of course it's not my decision, I
just want to register my vote here.

Regards,

Daniel