Linux: Distributed mmap API

Submitted by Amit Shah
on February 26, 2004 - 8:00am

Daniel Phillips recently posted a patch to the Linux kernel mailing
list that implements a distributed mmap() API: "This function by itself
is enough to support a crude but useful form of distributed mmap where
a shared file is cached only on one cluster node at a time."

The mmap() system call maps a portion of a file or device into the virtual address space of the calling process. A distributed mmap API allows one to perform mmap() on files that live on remote machines but are visible in the local namespace via a distributed filesystem.
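For readers who have not used mmap() directly, here is a minimal userspace sketch of the ordinary, single-machine case. It is not taken from the patch: the helper names write_through_mapping() and read_back() are made up for this illustration, and the file path is whatever the caller passes in. It maps a file MAP_SHARED, stores into the mapping, and then reads the same bytes back with read().

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create `path`, size it to `len` bytes, map it
 * MAP_SHARED, and copy `len` bytes of `msg` into the mapping.
 * Returns 0 on success, -1 on failure. */
int write_through_mapping(const char *path, const char *msg, size_t len)
{
    assert(len > 0);                      /* mmap() rejects a zero length */

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) { /* the file must cover the mapping */
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(p, msg, len);                  /* a plain memory store, no write() */
    int rc = msync(p, len, MS_SYNC);      /* push the dirty page to the file */

    munmap(p, len);
    close(fd);
    return rc;
}

/* Hypothetical helper: read `len` bytes back through the ordinary
 * read() path. Returns 0 if exactly `len` bytes were read. */
int read_back(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}
```

On one machine the kernel keeps the mapping and the file consistent by itself. The distributed case is harder because each cluster node has its own page cache, which is the gap the patch below starts to close.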

Daniel's patch is one of two that together would implement the simple core API for distributed mmap(); the second, still to be posted, adds a hook in do_no_page. With this simple API, cache invalidation works only on whole files, not on portions of a file, and multiple readers may not cache the same data simultaneously. An improved version of the kernel API, to be developed later, would address both limitations.
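To make the "cached on one node at a time" rule concrete, the following is a toy userspace model, not kernel code and not part of the patch. Every name in it (struct shared_file, handle_fault(), invalidate_whole_file()) is hypothetical, and the cluster-wide lock that a real distributed filesystem would take is reduced to a single owner field. It only shows the bookkeeping: when a node faults on the file, whichever other node currently caches it loses its whole cached copy first.

```c
#include <assert.h>
#include <stdbool.h>

#define NO_OWNER  (-1)
#define MAX_NODES 8

/* Toy cluster-wide state for one shared file (all names hypothetical). */
struct shared_file {
    int  owner;               /* node currently allowed to cache the file */
    bool cached[MAX_NODES];   /* does node i hold cached pages? */
    int  invalidations;       /* whole-file invalidations done so far */
};

void shared_file_init(struct shared_file *f)
{
    f->owner = NO_OWNER;
    f->invalidations = 0;
    for (int i = 0; i < MAX_NODES; i++)
        f->cached[i] = false;
}

/* Stand-in for invalidating the entire file on one node. The simple
 * API has no sub-file granularity, so the whole cache is dropped. */
static void invalidate_whole_file(struct shared_file *f, int node)
{
    if (f->cached[node]) {
        f->cached[node] = false;
        f->invalidations++;
    }
}

/* Models what a page-fault hook might do on `node`: take ownership of
 * the file, evicting the previous owner's copy if there is one. */
void handle_fault(struct shared_file *f, int node)
{
    assert(node >= 0 && node < MAX_NODES);

    if (f->owner != node) {
        if (f->owner != NO_OWNER)
            invalidate_whole_file(f, f->owner);
        f->owner = node;
    }
    f->cached[node] = true;   /* the faulting node may now cache pages */
}
```

In this model a fault on node 0 followed by a fault on node 1 leaves only node 1 with a cached copy and counts one invalidation. Repeated faults on node 1 cost nothing further, which is why the scheme is useful despite being crude.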


From: Daniel Phillips [email blocked]
Subject: [RFC] Distributed mmap API
Date: Wed, 25 Feb 2004 22:20:11 +0100

This is the function formerly known as invalidate_mmap_range, with the
addition of a new code path in the zap_ call chain to handle MAP_PRIVATE
properly.  This function by itself is enough to support a crude but useful
form of distributed mmap where a shared file is cached only on one cluster
node at a time.

To use this, the distributed filesystem has to hook do_no_page to intercept
page faults and carry out the needed global locking.  The locking itself does
not require any new kernel hooks.  In brief, the patch here and another patch
to be presented for the do_no_page hook, together provide the core kernel API
for a simplified, distributed mmap.  (Note that there may be a workaround for
the lack of a do_no_page hook, but certainly not as simple and robust.)

To put this in perspective, I'll mention the two big limitations of the
simplified API:

  1) Invalidation is always a whole file at a time
  2) Multiple readers may not cache the same data simultaneously

To handle sub-file cache granularity, we also need to be able to flush dirty
data and evict cache pages with sub-file granularity, giving a trio of cache
management functions:

    unmap_mapping_range(mapping, start, length) /* this patch */
    write_mapping_range(mapping, start, length) /* start IO for dirty cache */
    evict_mapping_range(mapping, start, length) /* wait on IO and evict cache */

To handle (2) above, the distributed filesystem will need to hook and modify
the behaviour of do_wp_page so that it can intercept memory writes to shared
cache pages.

To summarize the current proposal, and where we need to go in the future:

  Simple core kernel API for simplistic distributed memory map
  ------------------------------------------------------------

     - unmap_mapping_range export (this patch)
     - do_no_page hook

  Improved core kernel API for optimal distributed memory map
  -----------------------------------------------------------

     - unmap_mapping_range export (this patch)
     - write_mapping_range export
     - evict_mapping_range export
     - do_no_page hook
     - do_wp_page hook

There's no big rush to move on to the optimal version just now, since the simplistic
version is already a big step forward.

I'd like to take this opportunity to apologize to Paul for derailing his more
modest proposal, but unfortunately, the semantics that could be obtained that
way are fatally flawed: private mmaps just won't work.  What I've written here
is about the minimum that supports acceptable mmap semantics.

And finally, the EXPORT_SYMBOL_GPL issue: after much fretting I've changed it
to just EXPORT_SYMBOL in this patch, because I feel that we have better ways
to further our goals of free and open software than to try to use this
particular API as a battering ram.  Of course it's not my decision, I just
want to register my vote here.

Regards,

Daniel
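
As an editorial illustration of the three range functions named in the message above, here is a toy userspace model of how a (start, length) byte range could translate into per-page operations. This is a sketch under stated assumptions, not the kernel implementation: the toy_ names, the 4096-byte page size, the flag values, and the rule that eviction skips pages that are still mapped or dirty are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size for the model */
#define TOY_NPAGES    16u     /* the toy file is 16 pages long */

enum { TOY_CACHED = 1, TOY_MAPPED = 2, TOY_DIRTY = 4 };

/* Hypothetical stand-in for a file's cache state: one flag byte per page. */
struct toy_mapping {
    unsigned char flags[TOY_NPAGES];
};

/* Convert a byte range to an inclusive page range, clamped to the file.
 * Returns 0 if the range covers no page of the file. */
static int toy_page_span(size_t start, size_t length,
                         size_t *first, size_t *last)
{
    if (length == 0 || start / TOY_PAGE_SIZE >= TOY_NPAGES)
        return 0;
    *first = start / TOY_PAGE_SIZE;
    *last = (start + length - 1) / TOY_PAGE_SIZE;
    if (*last >= TOY_NPAGES)
        *last = TOY_NPAGES - 1;
    return 1;
}

/* Remove the pages in the range from all (toy) process mappings. */
void toy_unmap_range(struct toy_mapping *m, size_t start, size_t length)
{
    size_t first, last;
    assert(m != NULL);
    if (!toy_page_span(start, length, &first, &last))
        return;
    for (size_t i = first; i <= last; i++)
        m->flags[i] &= (unsigned char)~TOY_MAPPED;
}

/* "Start IO" for dirty pages; the toy IO completes at once.
 * Returns the number of pages written. */
int toy_write_range(struct toy_mapping *m, size_t start, size_t length)
{
    size_t first, last;
    int written = 0;
    assert(m != NULL);
    if (!toy_page_span(start, length, &first, &last))
        return 0;
    for (size_t i = first; i <= last; i++) {
        if (m->flags[i] & TOY_DIRTY) {
            m->flags[i] &= (unsigned char)~TOY_DIRTY;
            written++;
        }
    }
    return written;
}

/* Drop cached pages that are clean and unmapped (an assumption of this
 * model). Returns the number of pages evicted. */
int toy_evict_range(struct toy_mapping *m, size_t start, size_t length)
{
    size_t first, last;
    int evicted = 0;
    assert(m != NULL);
    if (!toy_page_span(start, length, &first, &last))
        return 0;
    for (size_t i = first; i <= last; i++) {
        if (m->flags[i] == TOY_CACHED) {
            m->flags[i] = 0;
            evicted++;
        }
    }
    return evicted;
}
```

Called in the order unmap, write, evict on the same range, the three functions leave that range uncached while pages outside it are untouched. That is the sub-file granularity the improved API is meant to provide.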

Gee, it's not like DSM (Distr

Anonymous
on
February 27, 2004 - 10:22am

Gee, it's not like DSM (Distributed Shared Memory) hasn't been so well studied that there's even a standard acronym for it. Some of them even use *real* cache-coherency protocols. Whole books have been written on the subject. This is just another example of "Not Invented Here" syndrome from the Linux kernel community.

Today our teacher at Distribu

Anonymous
on
February 27, 2004 - 7:05pm

Today our teacher in Distributed Systems class talked about just this topic. He had a lot to say about it, and none of it was really positive, mostly along the lines of "nice theoretical research back in the 80s, but it dried up after no efficient implementations were produced". So it clearly sucks, 'cause he should know: he works on Unix/Plan 9 at Bell Labs ;P

maybe your teacher sucks.

Anonymous (not verified)
on
January 24, 2007 - 3:32am

maybe your teacher sucks. this thing is cool

Please do an article on how 2.6.3 broke ALSA

Anonymous
on
February 28, 2004 - 2:34pm

I swear all of the ALSA merges from 2.6.2 => 2.6.3 totally screwed everything up.

Use a distro

Anonymous
on
February 28, 2004 - 4:44pm

I don't mean to sound mean, but if you aren't competent at building your own kernel and installing any required userland software then you should be using a distro kernel if you expect it to just work.

ALSA Talk?

midian
on
February 28, 2004 - 5:11pm

Well, that's true, but building your own kernel isn't that hard; you just need to know what you need (what hardware you have, etc.).

When it comes to ALSA, I switched about a month ago, and I must say it sucks: the OSS emulation doesn't really work, and it doesn't let me open two sound sources at the same time (e.g. TeamSpeak2 and Quake3). I mailed them, still no reply. All of these things worked great with OSS, but I was forced to switch before they drop it.

That does give me the idea of writing a how-to on compiling your own kernel from scratch, though.

But I don't think the post about ALSA in 2.6.3 is coming - ever.
I've seen the process of kernel development. They're well aware of this kind of stuff, and they probably have it fixed in 2.6.4-rc1 (have you tried that?). And if they don't fix something that's broken, mail the LKML and describe your problems.

And while we're off topic, maybe I should speak out more: ide-scsi is broken in 2.6.3 (no one fixes it, the answer is "use ide-cd", there's no support for it yet, so I'm forced to boot old kernels).

As for the userland software, the kernel docs document well what you need, which versions, etc.
--
Regards,
Markus

Re: ALSA talk?

gebner
on
March 1, 2004 - 6:34am

"it doesn't let me open two sound sources"

What card do you have? My Audigy2 does hw mixing nicely under ALSA.

"ide-scsi is broken in 2.6.3"

Why do you need that crap?

Re: ALSA talk?

midian
on
March 1, 2004 - 7:15am

"What card do you have? My Audigy2 does hw mixing nicely under ALSA."

I have a SB Live!

"ide-scsi is broken in 2.6.3"

Burning, I still don't find new enough versions of stuff to work with ide-cd from debian sid.

--

Regards,

Markus

Burning, I still don't find n

molo
on
March 1, 2004 - 5:44pm

"Burning, I still don't find new enough versions of stuff to work with ide-cd from debian sid."

cdrdao and cdrecord in sid both support ATAPI burns.

-molo

Re: Burning, I still don't find n

midian
on
March 2, 2004 - 5:47am

(flame)
Ooh, are there some damn DOCS anywhere, then?! And no, I don't burn CDs with cdrecord/cdrdao only; I use k3b too, and it seems to be too old?

I'm so tired of this: everything that works is removed and replaced with something that doesn't.
(/flame)
--
Regards,
Markus
