Re: [RFC PATCH] file as directory

From: Miklos Szeredi
Date: Tuesday, May 22, 2007 - 11:48 am

Why do we want this?
--------------------

That depends on who you ask.  My answer is this:

  'foo.tar.gz/foo/bar' or
  'foo.tar.gz/contents/foo/bar'

or something similar.

Others might suggest accessing streams, resource forks or extended
attributes through such an interface.  However this patch only deals
with the non-directory case, so directories would be excluded from
that interface.

But otherwise this patch doesn't limit the uses of the "file as
directory" concept in any way.  It just adds the infrastructure to
support these whacky beasts.

How is it done?
---------------

(See this [1] thread for more discussion on the subject)

When a non-directory object is accessed without a trailing slash, then
path resolution returns the object itself as usual.

If a non-directory object is accessed with a trailing slash, then the
filesystem may opt to let the file be accessed as a directory.  In
this case "something" (as supplied by the filesystem) is mounted on
top of the non-directory object.

This mount will have special properties:

 - If there is no trailing slash after the file name, the mount
   won't be followed, even if the path resolution would otherwise
   follow mounts.

 - The mount only stays there while it is referenced by some external
   object, like a pwd or an open file.  When it is no longer
   referenced, it is automatically unmounted.

 - Unlike "real" mounts, this won't block unlink(2) or rename(2) on
   the underlying object.
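
As a rough illustration of the trailing-slash rule above, here is a toy
userspace model of path resolution.  It is only a sketch: Node, dir_view
and resolve() are made-up names, not anything from the patch, and the
automatic unmounting and the unlink/rename behaviour are not modelled.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A toy filesystem object (hypothetical, for illustration only)."""
    name: str
    is_dir: bool = False
    children: dict = field(default_factory=dict)
    # Tree the filesystem offers when the file is entered as a directory;
    # None means the filesystem did not opt in for this file.
    dir_view: Optional["Node"] = None


def resolve(root: Node, path: str) -> Node:
    """Walk path from root, entering a file's directory view only when the
    file name is followed by a slash."""
    parts = [p for p in path.split("/") if p]
    trailing_slash = path.endswith("/") and bool(parts)
    node = root
    for i, part in enumerate(parts):
        if not node.is_dir:
            raise NotADirectoryError(path)
        if part not in node.children:
            raise FileNotFoundError(path)
        node = node.children[part]
        followed_by_slash = i < len(parts) - 1 or trailing_slash
        if not node.is_dir and followed_by_slash:
            if node.dir_view is None:
                # Filesystem did not opt in: same error as today.
                raise NotADirectoryError(path)
            node = node.dir_view  # "follow" the special mount
    return node
```

With a file foo.tar.gz that has a directory view, resolve(root,
"/foo.tar.gz") returns the file itself, while "/foo.tar.gz/" and
"/foo.tar.gz/foo/bar" descend into the view.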


Compatibility with existing systems
-----------------------------------

Filesystems which enable "file as directory" semantics might break
existing applications.  For example an app could conceivably check if
an object is a directory by appending a slash to the name and trying
some filesystem operation.  Such an application might be confused if
these operations start succeeding on non-directory objects.

However, in practice this sort of behavior seems to be rare.
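
To make the compatibility concern concrete, this is the kind of probe
such an application might use.  It is a hedged sketch, not taken from
any real program; looks_like_directory() is a made-up helper.  On
current Linux the stat() of a regular file with a slash appended fails
with ENOTDIR, which is exactly the assumption that "file as directory"
semantics could break.

```python
import os
import tempfile


def looks_like_directory(path: str) -> bool:
    """Probe path by appending a slash, as some applications might do."""
    try:
        os.stat(path.rstrip("/") + "/")
    except (NotADirectoryError, FileNotFoundError):
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    regular = os.path.join(tmp, "plain.txt")
    with open(regular, "w") as f:
        f.write("data\n")
    print(looks_like_directory(tmp))      # probing a real directory
    print(looks_like_directory(regular))  # probing a regular file
```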

The other question is, how well ...
-

From: Al Viro
Date: Tuesday, May 22, 2007 - 3:10 pm

Interesting...  How do you deal with mount propagation and things like
mount --move?  As for unlink...  How do you deal with having that thing
mounted, mounting something _under_ it (so that vfsmount would be kept
busy) and then unlinking that sucker?

I'll look through the patch tonight; it sounds interesting, assuming that
we don't run into serious crap with locking and <shudder> revalidation
logics.
-

From: Miklos Szeredi
Date: Tuesday, May 22, 2007 - 11:36 pm

Moving (or doing other mount operations on) an ancestor shouldn't be a
problem.  Moving this mount itself is not allowed, and neither is
doing bind or pivot_root.  Maybe bind could be allowed...


Yeah, that's a good point.  Current patch doesn't deal with that.
Simplest solution could be to disallow submounting these.  Don't think

Revalidation shouldn't be a problem.  We'll just end up with an
unhashed dentry with a mount over it, which will be detached when the
vfsmount ref is dropped.

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 12:03 am

What about clone copying your namespace?  What about MNT_SLAVE stuff being
set up prior to that lookup?  More interesting question: should independent
lookups of that sucker on different paths end up with the same superblock
(and vfsmount for each) or should we get fully independent mount on each?

Arbitrary limitations... (and that's where revalidate horrors come in, BTW).
BTW^2: what if fs mounted that way will happen to have such node itself?

I'm not saying that it's unfeasible or won't lead to interesting things,
but it really needs semantics done right...
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 12:19 am

But these mounts _are_ special.  There is really no point in moving or

In that case they are cloned, but only those survive which have refs

These mounts are not propagated.  Or at least I hope so.  Propagation

I think they should be the same superblock, same dentry.  What would

I think doing this recursively should be allowed.  "Releasing last ref

Agreed :)

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 12:36 am

Er...  These mounts might not be propagated, but what about a bind

Then you are going to have interesting time with locking in final mntput().
BTW, what about having several links to the same file?  You have i_mutex

Releasing the last reference will lead to cascade of umounts in that
case...  IOW, need to be careful with locking.
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 1:05 am

I don't see any use for that.  But indeed, it should not be too hard

So your question is, which mount takes priority on the lookup?  It
probably should be the propagated real mount, rather than the



I think it's done right: detach_mnt() with namespace_sem and
vfsmount_lock, then release locks, and path_release(&old_nd).

If the recursion is extremely deep we could have stack overflow
problems though, aargh...

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 1:29 am

Say /foo/bar/a is such a file.

cd /foo/bar
ln a b

now do lookups on a/ and b/

What happens?
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 2:03 am

I still don't get where the superblock comes in.  The locking is
"interesting" in there, yes.  And I haven't completely convinced
myself it's right, let alone something that won't easily be screwed up
in the future.  So there's definitely room for thought there.

But how does it matter if two different paths have the same sb or a

The same dentry is mounted over each one.  The contents of the
directory should only depend on the contents of the underlying inode.
The path leading up to it is completely irrelevant.
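
The hard link case can be checked from userspace today: both names
resolve to the same (st_dev, st_ino) pair, which is the "underlying
inode" referred to above.  A small sketch, with same_inode() being a
made-up helper name:

```python
import os
import tempfile


def same_inode(path_a: str, path_b: str) -> bool:
    """True if both paths name the same inode on the same device."""
    a, b = os.stat(path_a), os.stat(path_b)
    return (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)


with tempfile.TemporaryDirectory() as tmp:
    a = os.path.join(tmp, "a")
    b = os.path.join(tmp, "b")
    with open(a, "w") as f:
        f.write("payload\n")
    os.link(a, b)                  # equivalent of `ln a b`
    print(same_inode(a, b))        # both names, one inode
    print(os.stat(a).st_nlink)     # link count of that inode
```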

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 2:58 am

Because then you get a slew of fun issues with dropping the final reference
to vfsmount vs. lookup on another place.  What hold do you have on that
superblock and when do you switch from "oh, called ->enter() on the same
inode again, return vfsmount over the same superblock" to "need to

So what kind of exclusion do you have for ->enter()?  None?
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 3:14 am

So really, these issues are about how we get hold of the superblock
to mount.

I think that should be a filesystem internal problem, and I suspect
the easiest solution is to just have a permanent meta superblock for
these dir-on-file mounts.

Miklos
-

From: Jan Blunck
Date: Wednesday, May 23, 2007 - 2:16 am

Maybe this belongs in __link_path_walk(), similar to the handling of
symbolic links. If the real mount always has higher priority, why do
we bother about it in follow_mount()?

Jan
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 2:28 am

Do you mean that follow_mount() should never descend into the
dir-on-file mount but that should always be done by
__link_path_walk()?

This could make sense.

__lookup_mnt() currently returns the first matching mount in the hash
list.  With your suggestion, we'd need two __lookup_mnt() variants (or
a parameter).  One, that only matches normal mounts, and one that only
matches dir-on-file mounts.  Is that it?
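
The "parameter" variant could look roughly like the following toy
model.  This is not kernel code: Mount, lookup_mnt() and the
want_dir_on_file flag are invented for illustration, and a plain list
stands in for the mount hash chain.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Mount:
    mountpoint: str     # stands in for the (parent vfsmount, dentry) hash key
    dir_on_file: bool   # True for the special mounts discussed in this thread
    label: str


def lookup_mnt(table: List[Mount], mountpoint: str,
               want_dir_on_file: bool) -> Optional[Mount]:
    """Return the first mount on mountpoint of the requested kind, if any."""
    for mnt in table:
        if mnt.mountpoint == mountpoint and mnt.dir_on_file == want_dir_on_file:
            return mnt
    return None
```

In this model follow_mount() would pass want_dir_on_file=False and the
trailing-slash handling in the path walk would pass True, so neither
caller ever sees the other kind of mount.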

Miklos
-

From: Trond Myklebust
Date: Wednesday, May 23, 2007 - 5:34 am

Moving would be an implementation artefact that doesn't really
correspond to any useful operation on the filesyst

AFAIK, most filesystems that have implemented subfiles (excepting
Reiser4 of course) do not allow you to rename or move the subfile
directory or its contents from one parent file to another.

Trond

-

From: Al Viro
Date: Wednesday, May 23, 2007 - 5:40 am

If that's about xattr and nothing else, colour me thoroughly uninterested.
If it might have other interesting uses, OTOH...
-

From: Jan Blunck
Date: Wednesday, May 23, 2007 - 2:21 am

Hmm, think about /your/path/qemu-disk1.img/boot ,
/your/path/qemu-disk2.img/usr , ...

Jan
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 2:35 am

I get it.

It could probably be done with a little added complexity.  For example
when a real mount is attached onto a dir-on-file mount, the
"mountedness" is propagated up to the dentry on the next real mount.

So in that case unlink won't be allowed, even if the immediate
attachment is a dir-on-file mount.

This is tricky to do right though.

Other possibility is to detach all mount trees attached to dentry on
unlink.

Miklos
-

From: Pavel Machek
Date: Thursday, May 24, 2007 - 5:07 am

Hmmm, cd foo.tgz/bar/baz.tgz/xyzzy makes sense, and it is implemented
as a submount, no?
							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
-

From: Miklos Szeredi
Date: Monday, May 28, 2007 - 7:43 am

Yes, that certainly makes sense, but it's the same "special" mount,
which goes away automatically, so there isn't any problem with
unlinking with any number of such submounts.

But I don't want to explicitly prohibit submounting by normal mounts
either, if it's not too hard to handle, and Al's new vfsmount
refcounting scheme should take care of the difficult part of that.

Miklos
-

From: Shaya Potter
Date: Tuesday, May 22, 2007 - 4:26 pm

Here's a possibly stupid question.  What about symlinks to dirs?  Namely,
shells tend to treat them differently depending on whether or not they
are postfixed with a slash.
-

From: Miklos Szeredi
Date: Tuesday, May 22, 2007 - 11:39 pm

Right.  So it only works on non-directory, non-symlink objects.

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 2:51 am

BTW, I'd split that (and matching updates in callers) into separate


Ouch.  What guarantees that two lookups won't race right here?  You are
not holding any locks at that point, AFAICS...

BTW, why newpath?  What's wrong with simply returning a new vfsmount
with right ->mnt_root/->mnt_sb (instead of creating it inside


You've got to be kidding.  nameidata is *big*.  If anything, we want
to make detach_mnt() take struct path * instead, but even that is
lousy due to recursion.

I really don't like what's going on here.  The thing is, current code
is based on assumption that presence in the mount tree => holding a
reference.  We _might_ deal with that (there was an old plan to change
refcounting logics for vfsmounts), but that sort of games with locks
spells trouble.  What happens, for example, if namespace gets cloned
before you grab namespace_sem?

There's another problem, BTW - a lot of stuff does stat + open + fstat +
compare kind of sequence.  You'll end up mounting/umounting between stat
and open, which opens you to race with somebody else.  Get a different
st_dev, eat a nice unreproducible error from application...
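
For reference, the stat + open + fstat + compare sequence looks roughly
like this in userspace.  A minimal sketch; open_checked() is a made-up
name.  If st_dev changed between the two calls because a mount came or
went, the comparison fails even though nothing was tampered with.

```python
import os
import tempfile


def open_checked(path: str) -> int:
    """Open path and verify it is still the object that stat() saw."""
    before = os.stat(path)
    fd = os.open(path, os.O_RDONLY)
    after = os.fstat(fd)
    if (before.st_dev, before.st_ino) != (after.st_dev, after.st_ino):
        os.close(fd)
        raise OSError(f"{path} changed between stat() and open()")
    return fd


with tempfile.NamedTemporaryFile() as tmp:
    fd = open_checked(tmp.name)
    os.close(fd)
```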
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 3:09 am

Right.  After locking vfsmount_lock, mount_dironfile() should recheck

I don't think the filesystem ought to try _creating_ a vfsmount.  I
imagine that the fs already has a kernel-internal mount for this
kind of stuff, and it just supplies a dentry from that.  The vfsmount
isn't actually important, but it should be readily available, and it's

Yes.  On namespace cloning the MNT_DIRONFILE will be re-added later.

It _should_ work.  The mount in the new namespace will be created
(with namespace_sem held, so we can't yet free this mount), and then

As I said, the superblock should be persistent, so we'll get a stable
st_dev for multiple mounts.

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 3:24 am

I don't get it.  What's the point of that exercise, then?  When do you

OK, but then I guess I don't understand the intended use.
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 3:40 am

If the cost of ->enter() is low, then it shouldn't really be a problem.
We can't use ->i_mutex for locking, and introducing a new lock for

When the real superblock is created.  It could even be the _same_
super block as the real one.  There'd be just the problem of anchoring
the dir-on-file dentries somewhere...

Or with fuse the dir-on-file mount can just come from any mounted
filesystem, again possibly the same one as the parent.  I do actually
test with this.  The userspace filesystem supplies a file descriptor,
from which the struct path is extracted and returned from ->enter().

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 4:39 am

Then I do not understand what this mechanism could be used for, other
than an odd way to twist POSIX behaviour and see how much of the userland
would survive that.  Certainly not useful for your "look into tarball
as a tree", unless you seriously want to scan the entire damn fs for
tarballs at mount time and set up a superblock for each.  And for per-file
extended attributes/forks/whatever-you-call-that-abomination it also
obviously doesn't help, since you lose them for directories.

IOW, what uses do you have in mind?  Complete scenario, please...
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 5:16 am

Ah... After rereading the thread you've mentioned in the very beginning,
I think I understand what you are driving at.  However, in that case
	* I really don't see why bother with returning vfsmount at all.
dentry alone is enough to create a new vfsmount, all in fs/namei.c.
	* the lifetime rules look fscking scary.  You call that ->enter()
on nearly every damn lookup.  OK, so you'll recreate equivalent vfsmount,
but...  That's a lot of allocations/freeing.  Can we do some caching and
deal with it on memory pressure?
	* invalidation on unlink is still an open problem.
	* locking in final mntput() doesn't look nice; we probably need
a new refcounting scheme for vfsmounts to make that work.  I have a variant
that might work here (and make life much easier for expiry logics in
automount/shared trees, which is what it had been initially proposed for),
but it still doesn't kill the need to deal with invalidation.  And yes,
NFS still needs it (and so do all network filesystems, really).  The question
of caching is related to that.
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 6:01 am

Someone might think of a way to make those work with directories.



So what's so special about invalidation?  Why not just treat
dir-on-file mounts the same as any other ref on the dentry?

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 6:51 am

Umm...  It is related to detached subtrees, but I'm not sure if it is what
you are thinking about.

Short version of the story: new counter (mnt_busy) that would be defined
in the following way: the number of external references (not due to the
vfsmount tree structure or from namespace to root) + the number of
children that have non-zero ->mnt_busy.  And a per-vfsmount flag ("goner").

The rules for handling ->mnt_busy:
	* duplicating external reference: increment m->mnt_busy
	* getting from m to child: increment child->mnt_busy, if it went
from 0 to non-zero - increment m->mnt_busy as well (that's done under
vfsmount_lock, so we can safely check for zero here).
	* getting from m to parent: increment parent->mnt_busy.
	* dropping external reference: decrement m->mnt_busy; if it's still
non-zero, we are done.  If it's zero, we are in for some work (and had
acquired vfsmount_lock by atomic_dec_and_lock()).  Here's what we do:
		* go through ancestors, decrementing ->mnt_busy, until we
		  hit the root or get to one with ->mnt_busy staying
		  non-zero.
		* find the most remote ancestor that has zero ->mnt_busy
		  and is marked as goner (might be m itself).
		* if no such beast exists, we are done.
		* otherwise, detach the subtree rooted in that ancestor
		  from its parent (if any) and unhash its root (if hashed).
		  Now there are no external references to any vfsmount in that
		  subtree.
		* now we can kill all vfsmounts in that subtree.
	* detaching m from parent: nothing; we trade a busy child of parent
for new external reference to parent.
	* lazy umount: in addition to detaching everything from parents
and dropping resulting external references to parents, mark everything
in the subtree as goners.
	* normal umount: check ->mnt_busy *and* lack of children, detach,
mark as goner, drop resulting external reference to parent.
	* fun new stuff - umount of intact subtree: detach the subtree from
parent, do *not* dissolve it, mark everything in subtree as goners.  ...
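
A toy model of part of the scheme described above may help in following
the counter rules.  It is only a sketch under several assumptions:
VfsMount, get_ref(), put_ref() and mark_goners() are made-up names,
get_ref() generalises the "m to child" rule by propagating every
0 -> non-zero transition upward (which follows from the definition of
mnt_busy), and all locking (vfsmount_lock, atomic_dec_and_lock()), lazy
umount and the detach bookkeeping for busy subtrees are left out.

```python
class VfsMount:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.mnt_busy = 0    # external refs + children with non-zero mnt_busy
        self.goner = False
        self.alive = True
        if parent is not None:
            parent.children.append(self)


def get_ref(m):
    """Take an external reference on m, keeping ancestors' counters in sync."""
    while m is not None:
        m.mnt_busy += 1
        if m.mnt_busy > 1:   # m was already busy, the parent already counts it
            break
        m = m.parent


def mark_goners(m):
    """Mark every vfsmount in the subtree rooted at m as a goner."""
    m.goner = True
    for child in m.children:
        mark_goners(child)


def kill_subtree(m):
    """Kill m and everything below it; return the names in walk order."""
    m.alive = False
    killed = [m.name]
    for child in m.children:
        killed.extend(kill_subtree(child))
    return killed


def put_ref(m):
    """Drop an external reference; return the names of vfsmounts killed."""
    m.mnt_busy -= 1
    if m.mnt_busy:
        return []
    # m went idle: each ancestor loses a busy child until one stays non-zero.
    idle = [m]
    p = m.parent
    while p is not None:
        p.mnt_busy -= 1
        if p.mnt_busy:
            break
        idle.append(p)
        p = p.parent
    # Most remote idle ancestor marked as goner (idle is ordered nearest first).
    victim = None
    for node in idle:
        if node.goner:
            victim = node
    if victim is None:
        return []
    if victim.parent is not None:
        victim.parent.children.remove(victim)
        victim.parent = None
    return kill_subtree(victim)
```

For a chain root -> a -> b with one reference on b and the subtree at a
marked as goners, dropping that last reference makes put_ref(b) detach
and kill both a and b in one go.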
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 7:32 am

I was thinking of a similar one by Mike Waychison.  It had the problem
of requiring a spinlock for mntget/mntput.  It was also different in
that it did not gradually dissolve detached trees, but kept them as

How will this work with copy_tree() and namespace duplication, which

OK, I'll digest this info.

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 8:06 am

Here the spinlock is needed only when mnt_busy goes to 0, so presumably
it won't be a serious problem on more or less common setups; however,

Easy - grab namespace_sem, grab vfsmount_lock, walk the subtree and bump
mnt_busy on everything (by 1 + number of non-busy children).  Then drop
vfsmount_lock and do as usual, dropping references in tree being copied
as you go.  Nothing will get attached or detached due to namespace_sem,
nothing will get evicted by anybody other than you since you've got all
that stuff pinned down.  End of story...
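
The "1 + number of non-busy children" bump can be checked against the
definition of mnt_busy with a small model.  This is a sketch only: Mnt,
walk() and pin_subtree() are made-up names, mnt_busy is recomputed from
the definition rather than stored, and namespace_sem / vfsmount_lock are
not modelled.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Mnt:
    external: int = 0                         # external references held
    children: List["Mnt"] = field(default_factory=list)

    def busy(self) -> int:
        """mnt_busy by definition: external refs + children with non-zero mnt_busy."""
        return self.external + sum(1 for c in self.children if c.busy() != 0)


def walk(m):
    """Yield m and every vfsmount below it."""
    yield m
    for c in m.children:
        yield from walk(c)


def pin_subtree(root: Mnt) -> dict:
    """Compute the bump for each vfsmount (1 + number of non-busy children),
    then take one external reference on everything in the subtree."""
    bumps = {id(m): 1 + sum(1 for c in m.children if c.busy() == 0)
             for m in walk(root)}
    for m in walk(root):
        m.external += 1
    return bumps
```

After pinning, every vfsmount in the subtree is busy, so each counter
has grown by one for the new reference plus one for every child that
was idle before, which is the bump stated above.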
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 8:25 am

Right.

Do you have some code?

Should I try to code something up?

Miklos
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 8:37 am

I hope to get some breathing space next week, then I'll get back to
VFS work.  I'd rather do that one myself, since it'll be a long series
of equivalent transformations - debugging such rewrite of refcounting
done as a single patch is going to be hell.  And yes, refcounting rewrite
is near the top of the list (another thing is wading through several
threads from hell and reviewing unionfs ;-/)
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 8:55 am

Sure, don't want to rob you of any fun stuff ;)

Miklos
-

From: Ph. Marek
Date: Wednesday, May 23, 2007 - 6:23 am

I have some similar considerations about how userspace should deal with that.

Well, *use cases* I can see. I'd like to use that - for loop mounting, 
archives, possibly using symlinks to remote filesystems "symlink1 => 
ssh:user@ip" (although that's possible with FUSE anyway - but would be 
possibly within a .zip, too), ...


But I'm not sure how to do the presentation to userspace *right*.


How about some special node in eg. /proc (or a new filesystem)?
Eg.
   /fileAsDir/etc/passwd/owner ...
would work for all *files*. For directories we do not know whether we're still 
climbing the hierarchy or would like to see meta-data.

Some way like a ".this" entry is not the Right Way IMO ...
Well, I cannot imagine a real good way to tell where I'd like to stop 
following the "normal" filesystem and go into the "generated" hierarchy ...

   /fileAsDir/level-3/usr/local/bin/owner
is not nice.


Regards,

Phil
-

From: Al Viro
Date: Wednesday, May 23, 2007 - 6:54 am

So we need to make *anything* done anywhere in the namespace to modify
the dentry tree on that fs.  Could you spell "fuck, NO"?
-

From: Miklos Szeredi
Date: Wednesday, May 23, 2007 - 3:24 am

BTW, I'm not saying I like this.  It's pretty ugly and fragile.  But
it's damn convenient to get rid of these mounts from mntput().

Is there a better alternative?

Miklos
-

From: Jan Engelhardt
Date: Wednesday, May 23, 2007 - 5:01 am

Stole an idea from reiser4.
These semantics are quite fragile. Until now, chdir is only possible
for directories (otherwise, -ENOTDIR), and opening a directory for
writing gives -EISDIR. You can't just change semantics.
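
The two error codes can be observed from userspace with a short probe.
A hedged sketch with made-up helper names; note that on Linux EISDIR is
what open(2) returns when a directory is opened for writing, so that is
the case probed here.

```python
import errno
import os
import tempfile


def chdir_error(path):
    """Return the errno name raised by chdir(path), or None on success."""
    cwd = os.getcwd()
    try:
        os.chdir(path)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    os.chdir(cwd)   # restore the working directory after a successful chdir
    return None


def open_for_write_error(path):
    """Return the errno name raised by open(path, O_WRONLY), or None."""
    try:
        fd = os.open(path, os.O_WRONLY)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    os.close(fd)
    return None


with tempfile.TemporaryDirectory() as tmp:
    regular = os.path.join(tmp, "plain.txt")
    open(regular, "w").close()
    print(chdir_error(regular))        # chdir into a regular file
    print(open_for_write_error(tmp))   # open a directory for writing
```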

That said, with FUSE, something like this should already be possible,
shouldn't it?

And looking at your example of foo.tar.gz/foo/bar, the tar.gz needs to
be read at least once to get at foo/bar.


	Jan
-- 
-

From: Jaroslav Sykora
Date: Wednesday, May 23, 2007 - 6:20 am

Hello,


I work for a similar goal in my bachelor's thesis. But my approach is 

I do:
   'foo.zip^/foo/bar' or
   'foo.zip^/contents/foo/bar'

where foo.zip is a ZIP file. See the little '^' in the pathname: it's an
escape character. I have a kernel patch which modifies a lookup 
resolution function and when a normal lookup fails ('foo.zip^/foo/bar'
doesn't exist) and the pathname contains '^' it *redirects* the lookup to 
a FUSE mount.

So say we have a FUSE vfs server (called 'RheaVFS') on '/tmp/shadow'. 
When a process tries to access '/home/xx/foo.zip^/foo/bar' 
it is in-kernel transparently redirected to 
'/tmp/shadow/home/xx/foo.zip^/foo/bar' and the vfs server handles all the
extraction/compression/semi-mounting/semi-umounting/whatsoever...
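
The redirect rule described above can be sketched in userspace as
follows.  This is an illustration only, not the kernel patch:
redirect_lookup() and its parameters are made-up names, the default
shadow root is the '/tmp/shadow' from the example, and only absolute
paths are considered.

```python
import os

ESCAPE = "^"


def redirect_lookup(path: str, shadow_root: str = "/tmp/shadow",
                    exists=os.path.lexists) -> str:
    """Return the path a lookup should use under the '^' escape scheme.

    A path that exists, or that contains no escape character, is used
    unchanged; otherwise the lookup is redirected under shadow_root.
    """
    if exists(path) or ESCAPE not in path:
        return path
    return shadow_root.rstrip("/") + "/" + path.lstrip("/")


# A lookup of a path that does not exist but contains '^' is redirected.
print(redirect_lookup("/home/xx/foo.zip^/foo/bar", exists=lambda p: False))
```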

Advantages:
* 99.9% imho backward compatible. No problems with clever programs 
  doing stat() before open()/opendir().
* you can easily and transparently stack filesystems one on top of another
  with clear semantics. Say we have 'foo.tar.gz'; then:
	'foo.tar.gz^' is a decompressed TAR *file*;
	'foo.tar.gz^^' is a directory
* you can pass additional parameters to the vfs server after the '^', 
  eg. 'foo.zip^compresslevel=1/foo/bar'
* works with symlinks too

Drawbacks:
* users must/should be aware of the special escape char '^'
* usually only single vfs server per user handles all "virtual"
  directories --> single point of failure. (But I implemented a quirk
  which allows restarting the FUSE vfs server with only minor
  problems)
* probably tons of others I don't know....

The project tarball is at:

http://veverka.sh.cvut.cz/~sykora/prj/rheavfs-20070523-1239.tar.gz

The kernel patch is in the tarball and for your viewing pleasure 
I've attached it to this email.
The patch is against 2.6.20.1 and works with 2.6.21.1 too.
There are two minor failed hunks for 2.6.22-rc2 which I haven't had time to correct.

My project is not completed, there's almost no documentation etc.
Maybe I will put together some simple ...