Why do we want this?
--------------------

That depends on who you ask. My answer is this:

  'foo.tar.gz/foo/bar' or
  'foo.tar.gz/contents/foo/bar'

or something similar.

Others might suggest accessing streams, resource forks or extended attributes through such an interface. However, this patch only deals with the non-directory case, so directories would be excluded from that interface.

But otherwise this patch doesn't limit the uses of the "file as directory" concept in any way. It just adds the infrastructure to support these whacky beasts.

How is it done?
---------------

(See this [1] thread for more discussion on the subject)

When a non-directory object is accessed without a trailing slash, then path resolution returns the object itself as usual.

If a non-directory object is accessed with a trailing slash, then the filesystem may opt to let the file be accessed as a directory. In this case "something" (as supplied by the filesystem) is mounted on top of the non-directory object.

This mount will have special properties:

 - If there's no trailing slash after the file name, the mount won't be followed, even if the path resolution would otherwise follow mounts.

 - The mount only stays there while it is referenced by some external object, like a pwd or an open file. When it is no longer referenced, it is automatically unmounted.

 - Unlike "real" mounts, this won't block unlink(2) or rename(2) on the underlying object.

Compatibility with existing systems
-----------------------------------

Filesystems which enable "file as directory" semantics might break existing applications. For example, an app could conceivably check if an object is a directory by appending a slash to the name and trying some filesystem operation. This application might be confused by allowing such operations to succeed on non-directory objects.

However, in practice this sort of behavior seems to be rare.

The other question is, how well ...
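To make the compatibility concern above concrete, here is a minimal userspace sketch of the kind of probe such an application might perform. The helper name looks_like_dir_by_slash() is hypothetical and not taken from any real program. On current kernels, stat(2) on "name/" fails with ENOTDIR for a non-directory; a filesystem that enables "file as directory" semantics could make the same call succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: decide whether "path" is a directory by
 * appending a slash and calling stat(2) on the result.
 * Returns 1 if stat("<path>/") succeeds, 0 if it fails with ENOTDIR,
 * and -1 on any other error (including an over-long path).
 */
int looks_like_dir_by_slash(const char *path)
{
    char buf[4096];
    struct stat st;
    int n;

    assert(path != NULL);
    n = snprintf(buf, sizeof(buf), "%s/", path);
    if (n < 0 || (size_t)n >= sizeof(buf))
        return -1;              /* does not fit in this sketch's buffer */
    if (stat(buf, &st) == 0)
        return 1;               /* today only directories get here */
    return errno == ENOTDIR ? 0 : -1;
}
```

An application relying on the 0 result to mean "not a directory" is the one that could be confused if the trailing-slash lookup started to succeed on regular files.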
Interesting... How do you deal with mount propagation and things like mount --move? As for unlink... How do you deal with having that thing mounted, mounting something _under_ it (so that vfsmount would be kept busy) and then unlinking that sucker? I'll look through the patch tonight; it sounds interesting, assuming that we don't run into serious crap with locking and <shudder> revalidation logics. -
Moving (or doing other mount operations on) an ancestor shouldn't be a problem. Moving this mount itself is not allowed, and neither is doing bind or pivot_root. Maybe bind could be allowed... Yeah, that's a good point. Current patch doesn't deal with that. Simplest solution could be to disallow submounting these. Don't think Revalidation shouldn't be a problem. We'll just end up with an unhashed dentry with a mount over it, which will be detached when the vfsmount ref is dropped. Miklos -
What about clone copying your namespace? What about MNT_SLAVE stuff being set up prior to that lookup? More interesting question: should independent lookups of that sucker on different paths end up with the same superblock (and vfsmount for each) or should we get fully independent mount on each? Arbitrary limitations... (and that's where revalidate horrors come in, BTW). BTW^2: what if fs mounted that way will happen to have such node itself? I'm not saying that it's unfeasible or won't lead to interesting things, but it really needs semantics done right... -
But these mounts _are_ special. There is really no point in moving or In that case they are cloned, but only those survive which have refs These mounts are not propagated. Or at least I hope so. Propagation I think they should be the same superblock, same dentry. What would I think doing this recursively should be allowed. "Releasing last ref Agreed :) Miklos -
Er... These mounts might not be propagated, but what about a bind Then you are going to have interesting time with locking in final mntput(). BTW, what about having several links to the same file? You have i_mutex Releasing the last reference will lead to cascade of umounts in that case... IOW, need to be careful with locking. -
I don't see any use for that. But indeed, it should not be too hard So your question is, which mount takes priority on the lookup? It probably should be the propagated real mount, rather than the I think it's done right: detach_mnt() with namespace_sem and vfsmount_lock, then release locks, and path_release(&old_nd). If the recursion is extremely deep we could have stack overflow problems though, aargh... Miklos -
I still don't get it where the superblock comes in. The locking is "interesting" in there, yes. And I haven't completely convinced myself it's right, let alone something that won't easily be screwed up in the future. So there's definitely room for thought there. But how does it matter if two different paths have the same sb or a The same dentry is mounted over each one. The contents of the directory should only depend on the contents of the underlying inode. The path leading up to it is completely irrelevant. Miklos -
Because then you get a slew of fun issues with dropping the final reference to vfsmount vs. lookup on another place. What hold do you have on that superblock and when do you switch from "oh, called ->enter() on the same inode again, return vfsmount over the same superblock" to "need to So what kind of exclusion do you have for ->enter()? None? -
So really these issues, are about how do we get hold of the superblock to mount. I think that should be a filesystem internal problem, and I suspect the easiest solution is to just have a permanent meta superblock for these dir-on-file mounts. Miklos -
Maybe this might belong into __link_path_walk() similar to the handling of symbolic links. If the real mount has always higher priority why do we bother in follow_mount() about it. Jan -
Do you mean, that follow_mount() should never descend into the dir-on-file mount but that should always be done by __link_path_walk()? This could make sense. __lookup_mnt() currently returns the first matching mount in the hash list. With your suggestion, we'd need two __lookup_mnt() variants (or a parameter). One, that only matches normal mounts, and one that only matches dir-on-file mounts. Is that it? Miklos -
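The "parameter" variant of the lookup discussed above could look roughly like the toy model below. This is a userspace sketch only: struct toy_mount, the TOY_MNT_* constants and toy_lookup_mnt() are all hypothetical names, and the real __lookup_mnt() hashes on the parent vfsmount and dentry rather than on a single opaque key.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mount kinds; not kernel flags. */
enum toy_mnt_kind { TOY_MNT_NORMAL, TOY_MNT_DIRONFILE };

struct toy_mount {
    const void *mountpoint;     /* stands in for the (parent mnt, dentry) key */
    enum toy_mnt_kind kind;
    struct toy_mount *next;     /* next entry on the same hash chain */
};

/*
 * Return the first mount on the chain that is attached at "mountpoint"
 * and is of the requested kind, or NULL if there is none.
 */
struct toy_mount *toy_lookup_mnt(struct toy_mount *chain,
                                 const void *mountpoint,
                                 enum toy_mnt_kind want)
{
    assert(mountpoint != NULL);
    for (struct toy_mount *m = chain; m != NULL; m = m->next)
        if (m->mountpoint == mountpoint && m->kind == want)
            return m;
    return NULL;
}
```

In this model a follow_mount()-style caller would pass TOY_MNT_NORMAL and the trailing-slash handling in the path walk would pass TOY_MNT_DIRONFILE, so neither ever sees the other kind.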
Moving would be an implementation artefact that doesn't really correspond to any useful operation on the filesystem. AFAIK, most filesystems that have implemented subfiles (excepting Reiser4 of course) do not allow you to rename or move the subfile directory or its contents from one parent file to another. Trond -
If that's about xattr and nothing else, colour me thoroughly uninterested. If it might have other interesting uses, OTOH... -
I get it. It could probably be done with a little added complexity. For example when a real mount is attached onto a dir-on-file mount, the "mountedness" is propagated up to the dentry on the next real mount. So in that case unlink won't be allowed, even if the immediate attachment is a dir-on-file mount. This is tricky to do right though. Other possibility is to detach all mount trees attached to dentry on unlink. Miklos -
Hmmm, cd foo.tgz/bar/baz.tgz/xyzzy makes sense, and it is implemented as a submount, no? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Yes, that certainly makes sense, but it's the same "special" mount, which goes away automatically, so there isn't any problem with unlinking with any number of such submounts. But I don't want to explicitly prohibit submounting by normal mounts either, if it's not too hard to handle, and Al's new vfsmount refcounting scheme should take care of the difficult part of that. Miklos -
here's a possibly stupid question. What about symlinks to dirs? namely the shells tend to treat them differently if postfixed with a slash or not. -
Right. So it only works on non-directory, non-symlink objects. Miklos -
BTW, I'd split that (and matching updates in callers) into separate Ouch. What guarantees that two lookups won't race right here? You are not holding any locks at that point, AFAICS... BTW, why newpath? What's wrong with simply returning a new vfsmount with right ->mnt_root/->mnt_sb (instead of creating it inside You've got to be kidding. nameidata is *big*. If anything, we want to make detach_mnt() take struct path * instead, but even that is lousy due to recursion. I really don't like what's going on here. The thing is, current code is based on assumption that presence in the mount tree => holding a reference. We _might_ deal with that (there was an old plan to change refcounting logics for vfsmounts), but that sort of games with locks spells trouble. What happens, for example, if namespace gets cloned before you grab namespace_sem? There's another problem, BTW - a lot of stuff does stat + open + fstat + compare kind of sequence. You'll end up mounting/umounting between stat and open, which opens you to race with somebody else. Get a different st_dev, eat a nice unreproducible error from application... -
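For reference, the "stat + open + fstat + compare" sequence mentioned above typically looks like the sketch below. The function name open_checked() is made up for illustration. The point is that the comparison includes st_dev, so if a dir-on-file mount came and went between the stat() and the open() and the two calls saw different devices, the application would reject a perfectly good file.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open "path" read-only, but only accept the
 * descriptor if it refers to the same object that stat() saw.
 * Returns the file descriptor, or -1 on error or mismatch.
 */
int open_checked(const char *path)
{
    struct stat before, after;
    int fd;

    assert(path != NULL);
    if (stat(path, &before) != 0)
        return -1;
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    /* Same device and inode number means the same object. */
    if (fstat(fd, &after) != 0 ||
        before.st_dev != after.st_dev ||
        before.st_ino != after.st_ino) {
        close(fd);
        return -1;
    }
    return fd;
}
```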
Right. After locking vfsmount_lock, mount_dironfile() should recheck I don't think the filesystem ought to try _creating_ a vfsmount. I imagine, that the fs has already a kernel-internal mounted for this kind of stuff, and it just supplies a dentry from that. The vfsmount isn't actually important, but it should be readily available, and it's Yes. On namespace cloning the MNT_DIRONFILE will be re-added later. It _should_ work. The mount in the new namespace will be created (with namespace_sem held, so we can't yet free this mount), and then As I said, the superblock should be persistent, so we'll get a stable st_dev for multiple mounts. Miklos -
I don't get it. What's the point of that exercise, then? When do you OK, but then I guess I don't understand the intended use. -
If the cost of ->enter() is low, then it shouldn't really be a problem. We can't use ->i_mutex for locking, and introducing a new lock for When the real superblock is created. It could even be the _same_ super block as the real one. There'd be just the problem of anchoring the dir-on-file dentries somewhere... Or with fuse the dir-on-file mount can just come from any mounted filesystem, again possibly the same one as the parent. I do actually test with this. The userspace filesystem supplies a file descriptor, from which the struct path is extracted and returned from ->enter(). Miklos -
Then I do not understand what this mechanism could be used for, other than an odd way to twist POSIX behaviour and see how much of the userland would survive that. Certainly not useful for your "look into tarball as a tree", unless you seriously want to scan the entire damn fs for tarballs at mount time and set up a superblock for each. And for per-file extended attributes/forks/whatever-you-call-that-abomination it also obviously doesn't help, since you lose them for directories. IOW, what uses do you have in mind? Complete scenario, please... -
Ah... After rereading the thread you've mentioned in the very beginning, I think I understand what you are driving at. However, in that case * I really don't see why bother with returning vfsmount at all. dentry alone is enough to create a new vfsmount, all in fs/namei.c. * the lifetime rules look fscking scary. You call that ->enter() on nearly every damn lookup. OK, so you'll recreate equivalent vfsmount, but... That's a lot of allocations/freeing. Can we do some caching and deal with it on memory pressure? * invalidation on unlink is still an open problem. * locking in final mntput() doesn't look nice; we probably need a new refcounting scheme for vfsmounts to make that work. I have a variant that might work here (and make life much easier for expiry logics in automount/shared trees, which is what it had been initially proposed for), but it still doesn't kill the need to deal with invalidation. And yes, NFS still needs it (and so do all network filesystems, really). The question of caching is related to that. -
Someone might think of a way to make those work with directories. So what's so special about invalidation? Why not just treat dir-on-file mounts the same as any other ref on the dentry? Miklos -
Umm... It is related to detached subtrees, but I'm not sure if it is what you are thinking about. Short version of the story: new counter (mnt_busy) that would be defined in the following way: the number of external references (not due to the vfsmount tree structure or from namespace to root) + the number of children that have non-zero ->mnt_busy. And a per-vfsmount flag ("goner"). The rules for handling ->mnt_busy: * duplicating external reference: increment m->mnt_busy * getting from m to child: increment child->mnt_busy, if it went from 0 to non-zero - increment m->mnt_busy as well (that's done under vfsmount_lock, so we can safely check for zero here). * getting from m to parent: increment parent->mnt_busy. * dropping external reference: decrement m->mnt_busy; if it's still non-zero, we are done. If it's zero, we are in for some work (and had acquired vfsmount_lock by atomic_dec_and_lock()). Here's what we do: * go through ancestors, decrementing ->mnt_busy, until we hit the root or get to one with ->mnt_busy staying non-zero. * find the most remote ancestor that has zero ->mnt_busy and is marked as goner (might be m itself). * if no such beast exists, we are done. * otherwise, detach the subtree rooted in that ancestor from its parent (if any) and unhash its root (if hashed). Now there is no external references to any vfsmount in that subtree. * now we can kill all vfsmounts in that subtree. * detaching m from parent: nothing; we trade a busy child of parent for new external reference to parent. * lazy umount: in addition to detaching everything from parents and dropping resulting external references to parents, mark everything in the subtree as goners. * normal umount: check ->mnt_busy *and* lack of children, detach, mark as goner, drop resulting external reference to parent. * fun new stuff - umount of intact subtree: detach the subtree from parent, do *not* dissolve it, mark everything in subtree as goners. ...
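A toy userspace model of the counting rules described above may help in following them. It is only a sketch under simplifying assumptions: struct toy_mnt and the toy_* functions are hypothetical, no locking is modelled (the real scheme relies on vfsmount_lock and atomic_dec_and_lock()), and "killing" a subtree is reduced to detaching its root and returning it to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mnt {
    struct toy_mnt *parent;  /* NULL for a root or a detached subtree */
    int busy;                /* external refs + children with busy != 0 */
    bool goner;              /* may be destroyed once nothing uses it */
};

/* Duplicate an external reference the caller already holds. */
void toy_get(struct toy_mnt *m)
{
    assert(m->busy > 0);
    m->busy++;
}

/* Go from m (on which the caller holds a reference) to its child. */
void toy_walk_to_child(struct toy_mnt *m, struct toy_mnt *child)
{
    assert(child->parent == m && m->busy > 0);
    if (child->busy++ == 0)
        m->busy++;           /* m has just gained a busy child */
}

/*
 * Drop an external reference. Returns the root of a subtree that has
 * become unused and was marked goner (now detached), or NULL.
 */
struct toy_mnt *toy_put(struct toy_mnt *m)
{
    struct toy_mnt *victim, *p;

    assert(m->busy > 0);
    if (--m->busy != 0)
        return NULL;         /* still in use, nothing more to do */

    /* m went idle: every ancestor loses one busy child in turn. */
    victim = m->goner ? m : NULL;
    for (p = m->parent; p != NULL; p = p->parent) {
        if (--p->busy != 0)
            break;           /* this ancestor stays busy, stop here */
        if (p->goner)
            victim = p;      /* most remote idle goner seen so far */
    }
    if (victim != NULL)
        victim->parent = NULL;   /* detach the dead subtree */
    return victim;
}
```

With a chain root -> a -> b, where a and b are marked goners and only b is still referenced, dropping that last reference makes toy_put() return a: the whole subtree rooted at a can go away, while root merely becomes idle.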
I was thinking of a similar one by Mike Waychison. It had the problem of requiring a spinlock for mntget/mntput. It was also different in that it did not gradually dissolve detached trees, but kept them as How will this work with copy_tree() and namespace duplication, which OK, I'll digest this info. Miklos -
Here the spinlock is needed only when mnt_busy goes to 0, so presumably it won't be a serious problem on more or less common setups; however, Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump mnt_busy on everything (by 1 + number of non-busy children). Then drop vfsmount_lock and do as usual, dropping references in tree being copied as you go. Nothing will get attached or detached due to namespace_sem, nothing will get evicted by anybody other than you since you've got all that stuff pinned down. End of story... -
Right. Do you have some code? Should I try to code something up? Miklos -
I hope to get some breathing space next week, then I'll get back to VFS work. I'd rather do that one myself, since it'll be a long series of equivalent transformations - debugging such rewrite of refcounting done as a single patch is going to be hell. And yes, refcounting rewrite is near the top of the list (another thing is wading through several threads from hell and reviewing unionfs ;-/) -
Sure, don't want to rob you of any fun stuff ;) Miklos -
I have some similar considerations about how userspace should deal with that. Well, *use cases* I can see. I'd like to use that - for loop mounting, archives, possibly using symlinks to remote filesystems "symlink1 => ssh:user@ip" (although that's possible with FUSE anyway - but would be possibly within a .zip, too), ... But I'm not sure how to do the presentation to userspace *right*. How about some special node in eg. /proc (or a new filesystem)? Eg. /fileAsDir/etc/passwd/owner ... would work for all *files*. For directories we do not know whether we're still climbing the hierarchy or would like to see meta-data. Some way like a ".this" entry is not the Right Way IMO ... Well, I cannot imagine a real good way to tell where I'd like to stop following the "normal" filesystem and go into the "generated" hierarchy ... /fileAsDir/level-3/usr/local/bin/owner is not nice. Regards, Phil -
So we need to make *anything* done anywhere in the namespace to modify the dentry tree on that fs. Could you spell "fuck, NO"? -
BTW, I'm not saying I like this. It's pretty ugly and fragile. But it's damn convenient to get rid of these mounts from mntput(). Is there a better alternative? Miklos -
Stole an idea from reiser4. These semantics are quite fragile. Until now, chdir is only possible for directories (otherwise, -ENOTDIR), and opening a directory without O_DIRECTORY gives -EISDIR. You can't just change semantics. That said, with FUSE, something like this should already be possible, shouldn't it? And looking at your example of foo.tar.gz/foo/bar, the tar.gz needs to be read at least once to get at foo/bar. Jan -- -
Hello,
I am working toward a similar goal in my bachelor's thesis. But my approach is
I do:
'foo.zip^/foo/bar' or
'foo.zip^/contents/foo/bar'
where foo.zip is a ZIP file. See the little '^' in the pathname: it's an
escape character. I have a kernel patch which modifies a lookup
resolution function and when a normal lookup fails ('foo.zip^/foo/bar'
doesn't exist) and the pathname contains '^' it *redirects* the lookup to
a FUSE mount.
So say we have a FUSE vfs server (called 'RheaVFS') on '/tmp/shadow'.
When a process tries to access '/home/xx/foo.zip^/foo/bar'
it is in-kernel transparently redirected to
'/tmp/shadow/home/xx/foo.zip^/foo/bar' and the vfs server handles all the
extraction/compression/semi-mounting/semi-umounting/whatsoever...
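The path rewrite described above can be sketched in userspace as follows. This is not the code from the attached patch: redirect_escaped_path() is a hypothetical helper, and it models only the string manipulation. The "normal lookup failed first" condition is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: if "path" contains the escape character '^',
 * write "<shadow_root><path>" into "out" and return 1.
 * Returns 0 (leaving "out" untouched) when there is no '^',
 * and -1 when the result does not fit into "outlen" bytes.
 */
int redirect_escaped_path(const char *shadow_root, const char *path,
                          char *out, size_t outlen)
{
    int n;

    assert(shadow_root != NULL && path != NULL && out != NULL);
    if (strchr(path, '^') == NULL)
        return 0;               /* ordinary path, no redirection */
    n = snprintf(out, outlen, "%s%s", shadow_root, path);
    if (n < 0 || (size_t)n >= outlen)
        return -1;              /* truncated or failed */
    return 1;
}
```

Called with "/tmp/shadow" and "/home/xx/foo.zip^/foo/bar", it produces "/tmp/shadow/home/xx/foo.zip^/foo/bar", which is the path the FUSE server would then be asked to resolve.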
Advantages:
* 99.9% imho backward compatible. No problems with clever programs
doing stat() before open()/opendir().
* you can easily and transparently stack filesystems one on top of another
with a clear semantic. Say we have 'foo.tar.gz'; then:
'foo.tar.gz^' is a decompressed TAR *file*;
'foo.tar.gz^^' is a directory
* you can pass additional parameters to the vfs server after the '^',
eg. 'foo.zip^compresslevel=1/foo/bar'
* works with symlinks too
Drawbacks:
* users must/should be aware of the special escape char '^'
* usually only single vfs server per user handles all "virtual"
directories --> single point of failure. (But I implemented a quirk
which allows restarting the FUSE vfs server with only minor
problems)
* probably tons of others I don't know....
The project tarball is at:
http://veverka.sh.cvut.cz/~sykora/prj/rheavfs-20070523-1239.tar.gz
The kernel patch is in the tarball and for your viewing pleasure
I've attached it to this email.
The patch is against 2.6.20.1 and works with 2.6.21.1 too.
There are two minor failed hunks for 2.6.22-rc2 which I hadn't time to correct.
My project is not completed, there's almost no documentation etc.
Maybe I will put together some simple ...