While your description above seems completely correct to me and I have no objections to the patch, I'd prefer it if someone having more experience with the VFS looked at it. Miklos, can you have a look, please? Thanks, --
The description is missing some details: why is the filesystem frozen before suspend? AFAICS this can happen when DM calls bdev_freeze() on the device before the task freezing begins. Is this the case? Also, while the patch might solve some of the symptoms of the fuse vs. process freezer interaction, it will not fully fix that problem. As such it's just a hack to hide the problem, making it less likely to appear. So I'm not at all conviced about this patch. Thanks, Miklos --
Hi Miklos. It doesn't matter why a process is sitting in that wait_event call. What does matter is that one can be there. In the case where I saw it, I was working on fuse freezing. I don't remember the details, as it's a year No, it's part of the solution. I haven't posted the full fuse freezing patch because I thought this could be profitably merged without the rest of the patch. Regards, Nigel --
Well, I guess it's better if you post the entire thing so that we can see what the role of the $subject patch is in it, even if this patch finally gets merged separately. Thanks, Rafael --
Hi.
Ah.. that makes me see how vfs_check_frozen was getting triggered...
(fs/namei.c, below).
Regards,
Nigel
fs/buffer.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
fs/fuse/control.c | 1
fs/fuse/dev.c | 7 +++
fs/fuse/dir.c | 35 +++++++++++++++--
fs/fuse/file.c | 14 +++++++
fs/fuse/fuse.h | 13 ++++++
fs/fuse/inode.c | 4 +-
fs/namei.c | 2 +
include/linux/buffer_head.h | 5 ++
include/linux/freezer.h | 15 +++++++
include/linux/fs.h | 10 ++++-
kernel/power/process.c | 48 +++++++++++++++++++++---
12 files changed, 227 insertions(+), 14 deletions(-)
diff -ruNp 700-BRANCH-fuse-freezing.patch-old/fs/buffer.c 700-BRANCH-fuse-freezing.patch-new/fs/buffer.c
--- 700-BRANCH-fuse-freezing.patch-old/fs/buffer.c 2008-10-19 04:57:22.000000000 +1100
+++ 700-BRANCH-fuse-freezing.patch-new/fs/buffer.c 2008-10-21 13:16:08.000000000 +1100
@@ -247,6 +247,93 @@ void thaw_bdev(struct block_device *bdev
}
EXPORT_SYMBOL(thaw_bdev);
+#if 0
+#define FS_PRINTK(fmt, args...) printk(fmt, ## args)
+#else
+#define FS_PRINTK(fmt, args...)
+#endif
+
+/* #define DEBUG_FS_FREEZING */
+
+/**
+ * freeze_filesystems - lock all filesystems and force them into a consistent
+ * state
+ * @which: What combination of fuse & non-fuse to freeze.
+ */
+void freeze_filesystems(int which)
+{
+ struct super_block *sb;
+
+ lockdep_off();
+
+ /*
+ * Freeze in reverse order so filesystems dependant upon others are
+ * frozen in the right order (eg. loopback on ext3).
+ */
+ list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+ FS_PRINTK(KERN_INFO "Considering %s.%s: (root %p, bdev %x)",
+ sb->s_type->name ? sb->s_type->name : "?",
+ sb->s_subtype ? sb->s_subtype : "", sb->s_root,
+ sb->s_bdev ? sb->s_bdev->bd_dev : 0);
+
+ if (sb->s_type->fs_flags & FS_IS_FUSE &&
+ sb->s_frozen == SB_UNFROZEN &&
+ ...Nigel, thanks for the patch, it makes thinks a lot clearer. From a quick look through the patch it seems to solve a bunch of cases where new fuse requests during the freezing could cause problems. But it doesn't do anything with requests that are already under way when the freezing starts, which would still result in all the same problems. Take this scenario: 1) process A does rename("/mnt/fuse/a", "/mnt/fuse/b") 2) request goes to process B serving the fuse filesystem 3) filesystems are frozen, no new fuse requests 4) processes are frozen, let's say B first, then A 5) freezing A will fail, since it's still waiting for the request to finish Several solutions have been posted, none of which really solve the problem: a) Let's tag fuse server processes and freeze them later. This is basically impossible, because many processes could be interoperating and there's no way to know which is depending on which (example: sshfs uses ssh for communication, which possibly relies on openvpn process for packet transmission). b) While waiting for replies to fuse request, allow process to freeze. Does not fully solve the problem, as VFS might be holding locks, and other processes waiting for those locks will not be freezable. Miklos --
Hi. I'll have a look at the code again. I deliberately didn't stop existing requests, but perhaps that's the wrong behaviour. In the above scenario, process B won't see process A's new request until post-resume if the filesystem is already frozen, so steps 4 & 5 don't happen. Process B will also always be frozen before process A if process A is userspace (most likely in the above scenario) or was mounted after process B. (I've had this patch distributed as is for almost a year, I agree that these solutions won't work. Regards, Nigel --
No, I mean the filesystem is only frozen at 3 not before, so B is Both A and B are userspace processes. The order of freezing userspace processes is basically random, there's no way to make sure that processes doing work on behalf of a fuse filesystem (B) are frozen after processes accessing the fuse filesystem (A). Miklos --
Hi. Sorry. You're right - I was confusing things in what I said, but I do have a (unconfused) solution: The answer is to freeze the fuse filesystems first, stopping new requests (freezing the processes making them) before they are passed on to userspace and allowing existing requests to complete or freeze. Once that is done, the userspace fuse processes will be idle, at least as far as fuse activity is concerned. After fuse activity is stopped, userspace can be frozen without worrying about what processes are fuse and what aren't. This is what my patch implements so far. To deal with requests that are already in progress, I'd suggest three possibilities, in the order I think they should be preferred. 1) Use wait_event_freezeable[_timeout] in fuse code. Probably preferable to #2, but I thought of it later :) 2) Use freezer_do_not_count as part of FUSE_MIGHT_FREEZE (resetting before exiting the callers, of course). If the request doesn't complete during the freezing period, it must be because the userspace thread was already frozen. If the request does complete, we're counted again during the normal freezing of userspace that follows the freezing of filesystems. 3) Adding a means to check whether processes being frozen are fuse requests. The code could then wait for fc->num_waiting - (say) fc->num_frozen == 0. Regards, Nigel --
Yup, these fix the freezing of tasks which have outstanding fuse requests. However it does not fix the freezing of tasks which are waiting for VFS locks (i.e. inode->i_mutex) held by the outstanding fuse requests. This is the tricky part... Miklos --
Hi. Convert them all to wait_event_freezeable. If you know the locks they're waiting on definitely aren't going to be free until post-resume and we're going to be trying to freeze them while they're waiting, that would be the right behaviour. Regards, Nigel --
You mean convert mutexes to event queues? Not a very good idea. I fear we are going down the same path as the last time. I still don't think rewriting the VFS is the right solution to the freezing problem. But hey, if you want, sumbit a patch or an RFD and lets see what others think. Miklos --
So, what solution would you prefer? Rafael --
I would prefer a freezer-less solution. Suspend to ram doesn't need the freezer, and with the kexec approach hibernate could be done without it also. I don't think adding hacks to the VFS to work around the issues with the freezer is the right way to solve this. But this is just my personal opinion, the VFS maintainers may think otherwise. Miklos --
There are problems with that too, although they aren't directly related to Well, my personal opinion is that we need filesystems to support suspend, this way or another and the sooner it happens, the better. Still, I'm rather not going to make that happen myself. :-) Apparently, Nigel is willing to work in this direction and we can use this as an opportunity to learn what exactly is necessary for this purpose and _then_ decide if this is reasonable or not instead of dismissing it upfront. Thanks, Rafael --
Umm, OK. Last I remember everybody agreed that there's absolutely no reason why processes need to be frozen, and the only important thing is that drivers are not twiddling the hardware during suspend, and I haven't dismissed it, just voiced my opinion that I think it's the wrong direction. Miklos --
Well, this turned out not to be the case in the meantime. In fact to handle that without the freezer we'd have to synchronize every driver's suspend/resume callbacks with every possible way in which applications can access the device for regular I/O (for example for PCI devices this means any I/O other than configuration space accesses). While this is possible in theory, I don't see this happening any time soon, especially that we're going to keep the bar for accepting new drivers relatively low. Thanks, Rafael --
Not all callbacks. I don't know what the current model is but AFAIR it should be something like this: 1) call drivers to prepare for suspend (allocate space, etc) 2) stop all driver activity (plug queues, disable interrupts, etc) 3) call drivers to actually save state and power down 4) suspend The part we are concerned is stopping driver activity. It could be done with a mutex, or it could be done by freezing tasks. Adding a mutex or other mechanism is the one I most like, but it's probably the biggest work, so lets look at how to fix the freezing: Currently the criteria for freezing is that userspace task has to exit kernelspace, and kernel task has to hit a specific "freeze point". This causes problems where we want to freeze tasks which are "stuck" inside filesystems or other non-driver parts of the kernel. We can fix this two ways: a) mark additional places to freeze for userspace tasks as well. This is the direction Nigel seems to be taking. b) or instead we could allow freezing anywhere in uninterruptible sleep, _except_ where explicity marked. Which is easier? I don't know. But I very storgly feel that marking un-freezable places instead of marking freezable places is a much cleaner solution. It only affects parts of the kernel which have something to do with suspend, instead of affecting parts of the kernel which have absolutely nothing to do with suspend. Miklos --
Not only is adding a mutex the biggest amount of work, it has has the largest impact. Every I/O pathway would have to acquire the The problem with unrestricted freezing shows up when you freeze tasks that hold a mutex or other sort of lock. If this mutex is needed later on for suspending a device then the suspend will hang, because a frozen task can't release any mutexes. I suppose you could try to categorize mutexes as "freezable" and "non-freezable". Ones used by device drives would generally be non-freezable, whereas others (such as those used by VFS) would be freezable. Still, it would be pretty difficult. Among other things, it would be necessary to verify that a task holding a non-freezable mutex never tries to acquire a freezable mutex. Alan Stern --
Actually I was thinking of an rw-semaphore, not a mutex. But yeah that still has scalability problems. But it could be done with custom locking primitives, optimized for this case: suspend_disable(); /* driver stuff */ I did a random sampling of ->suspend() callbacks, and they don't seem to be taking mutexes. Does that happen at all? Did anybody ever try modifying the freezer for suspend (not hibernate), so that it allows tasks not in running state to freeze? If not, I think that's an experiment worth doing. Miklos --
Yes, it could be done. And the overhead could be minimized by using per-CPU variables. It would still be an awful lot of work, and easy to It does, particularly among drivers that do runtime PM, which is becoming more and more important. Besides, suspend has to synchronize with I/O somehow. Right now that is handled by making suspend wait until no tasks are doing I/O (because they are all frozen). If you allow tasks to be frozen at more or less arbitrary times, while holding arbitrary locks, then you may end up freezing a task that's in the middle of I/O. That should certainly What happens if the reason the task isn't running is because it's waiting for I/O to complete? I just don't think this can be made to work. Alan Stern --
What is the middle of I/O? Depending the type of I/O it could be messed up regardless of whether tasks happen to be in userspace or not (e.g. printing). And some types of I/O are already mostly decoupled from userspace (file I/O, networking), so the userspace freezing shoudln't make any Don't know. I've never written a driver, and I'm not familiar with runtime PM, etc. So I can't come up with a detailed design for solving the freezer issues there. But I do think that the solution does not lie in "fixing" the VFS. Miklos --
I don't know. Maybe it slipped through the cracks. It wouldn't be I'll wriggle out of this by saying that "in the middle of I/O" means the task has gotten a device into a state from which it can't successfully be suspended. An example would be if the system had sent the device part of a command. Even if you did manage to suspend the device and resume it later, you wouldn't expect it to work right when it received the second half of the command. In general, drivers don't leave devices in this kind of intermediate state for long. A driver might sleep at such times, but it wouldn't True. We're mostly talking about character device I/O. There are a I tend to agree. The problems posed by fuse are very difficult. It is not at all obvious how to attack them. Alan Stern --
OK, getting back to this, as it seems to be the only way that we agree is doable. How about this, a) identify syscalls that may make drivers do I/O: - read - write - ioctl ??? b) add the suspend_disable/enable() primitives to these syscalls c) push primitives inside the implementation c) is slightly tricky, but could be done for example by setting a flag on open: FMODE_NO_SUSPEND_DISABLE (better name required), saying that implementation is responsible for getting the suspend disable magic right. For starters this flag could be set for all non-device opens (maybe all non-char-dev opens?), solving the fuse vs. freezer issues without any complicated trickery. Miklos --
I discussed this last summer with Rafael. It's a lot harder than it looks, for all sorts of reasons. For example, what about user tasks I don't know. There are other interfaces too, like sysfs attributes, that would have to be handled specially. On the whole, the freezer seems much, much simpler. Regarding fuse, something like Nigel's scheme for preventing new requests and then waiting for old requests to complete might work out. Especially if you combine it with a strategy for making the freezer back and retry after a delay when something goes wrong. Alan Stern --
Yeah. The current design of the freezer is rather simplistic and I'm not really sure it's the best one possible. Perhaps we can redesign the freezer to work differently and handle the cases like fuse. I didn't have the time to think about that yet, though. Thanks, Rafael --
Hi. Why redo what I've already done? In the full patch, you have the basis of what you're talking about. I haven't seen a failure to freeze fuse or anything else in a year of use. Nigel --
Still, Miklos noticed some problems with it. I'm not talking about doing things on top of the current signal-based freezing mechanism, but rather about moving the freezer a bit closer towards the scheduler. Maybe this is the way to go. I don't know. In any case, the freezing of user space seems to be much simpler than modifying all drivers to add suspend synchronization. Thanks, Rafael --
As I mentioned, we don't _have_ to modify all drivers. It could start out by just modifying syscalls that result in calls to drivers. That would be a 10-liner patch. Miklos --
If you can solve current freezer vs. fuse in 10 lines, that would be useful... -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Well yeah, your patch handles the straightforward cases. But it doesn't help with the more tricky cases, where one fuse filesystem is using another, and as those may become more widespread, this approach will fail. Miklos --
Hi. At the moment, yes. But it's not impossible for us to modify the patch to handle that as well. Regards, Nigel --
It depends on what you mean. The most direct reading of your statement is simply wrong: It is _theoretically_ impossible to find the correct order for freezing filesystems. To do so would be equivalent to solving the halting problem. Alan Stern --
Hi Alan. I'm not sure that's true. You see, I'm thinking of this as not that different to the problem of unmounting filesystems. There, too, we need to unmount in a particular order, and let transactions on each filesystem stop cleanly before we can unmount them. Even if there are differences, perhaps looking at how we handle unmounting will help with handling freezing. Regards, Nigel --
There's nothing magic about umount, it just uses a refcount on the fs. But umount changes the namespace, that's the big difference. For example if a process is accessing path P which has a component inside the mount, it _will_ get different results before and after the umount. This is not acceptable for freezing. For freezing to work with such a refcounting scheme, we'd have to count _future_ uses of the fs as well, not just current ones, which is obviously impossible. Miklos --
Hi. I must be missing something. If you're freezing future users of the filesystem before they can start anything new, doesn't that deal with this problem? Regards, Nigel --
How do you determine which are the future users? Miklos --
Hi again. You don't. You just use FUSE_MIGHT_FREEZE to stop them before they take locks that might cause problems. So my suggestion is: 1) Stop new requests at FUSE_MIGHT_FREEZE 2) Handle existing requests by using freezer_do_not_count in request_send and request_send_nowait before the spin_lock and freezer_count after the spin_unlock. With #2, we don't need to care about whether the request is completed before freezing completes or post-resume. If the userspace process tries to use an already frozen fuse filesystem and gets frozen, that's okay because we'll sit waiting for an answer, not be counted by the freezer and so not contribute to any failure to freeze processes. If the userspace process completes its work, we'll either get caught at the freezer_count (if we've already been told to freeze) or be gotten later, after exiting the fuse code. Regards, Nigel --
Before? FUSE_MIGHT_FREEZE is called _after_ i_mutex is taken by the Yes, this is the variant of categorizing sleeps to freezing and non-freezing. The problem is, you need to do that with all mutex_lock(&inode->i_mutex) instances as well. Try grepping for that in fs/*.c! It _is_ possible, it's just not practical. Miklos --
Hi. Let's get practical. You know fuse a lot better than me. Can you give me the most complicated, ugly configuration you can think of, and I'll work on making it freeze first time, every time. Regards, Nigel --
I don't think it will work out, because to be able to do this some ordering between freezing the filesystems must be done. But this is basically impossible, for all the same reasons it's impossible to order the freezing of userspace tasks. Also this is not just a fuse issue: we have userspace network devices, we have userspace USB drivers, etc, affected by this problem. If there _is_ a solution with the freezer that does solve all of this, I haven't yet heard it. But yeah, in the end the simpler solution should win. Miklos --
Sure it does. A frozen process can't touch a memory-mapped I/O region, whereas a non-frozen process can. Although as Matthew pointed out, Would you like to write a first-pass patch? I don't think it will To do this properly, it is necessary to categorize sleeping states. Some should count as frozen and others shouldn't. So for instance, all the mutexes and wait queues involved in VFS would count as frozen; a process waiting on one of them can simply be kept from waking up until the suspend is over. Others (those involved in device I/O) wouldn't count as frozen; a task waiting on one of them would have to wake up and move forward before the system could suspend. Doing that seems like a lot of work, just as modifying every driver does. Changing a few kernel entry points is simpler, but I'm pretty sure it won't work. For instance, tasks can block arbitrarily long on read calls (waiting for data to arrive); you can't allow such things to prevent the system from suspending. Alan Stern --
If somebody doesn't beat me to it, I'll do that (first implemented
But we already do: either
a) it's in interruptible sleep (I/O on sockets, pipes, etc), and
freezing simply interrupts it, or
b) it's in uninterruptible sleep and suspend will wait it out (or
time out).
In the new scheme we could retain that part of the freezer: interrupt
all tasks which are inside the critical region and wait for them to
exit the critical region.
To put it in another way: it's still the freezer, it does all the same
things as the old freezer, except that the condition for freezing is
not that the task is out of the kernel, rather that it's out of the
disable_supend - enable_suspend region. As such it's not a big change
to the whole suspend system, and so there shouldn't be anything big
going wrong there.
Miklos
--
Okay. Don't forget things like ioctl for sockets -- they often involve doing I/O directly to the network interface device. What happens to a task accessing a non-regular file on a fuse filesystem? :-) Alan Stern --
Yeah, ioctls should probably just always be protected (at least The same as on any other filesystem, i.e. the fs is only involved as far as calling init_special_inode(), the rest is taken care of by the VFS. Tejun Heo recently posted patches to fuse which enable emulating a char device from userspace. That is another matter, obviously we'd want to keep the "allow suspend during I/O" property of fuse in that case, even though there's a char device involved (but no hardware, at least not on that level). Miklos --
I like this idea. I was thinking about using task flags to mark processes as "not freezable at the moment", which would make the freezer to wait for such tasks (currently a task may only be always freezable or not freezable at all, which is not very flexible). Unfortunately, I didn't have the time to implement this. Thanks, Rafael --
Cool! Disadvantage is it will add overhead to regular syscalls, at least initialy. That's why I implemented freezer initialy... Of course, suspend is more important than it was back then, and disadvantages of freezer are now well known, so maybe a little overhead in exchange of cleaner design is worth it...? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
The only interesting cases of this I know of are X (which ought to stop doing so in the near future) and certain sound operations (which are going via a well-defined API anyway and need to handle devices going away at random, so not a problem). There may be some other niche cases, but really - if you're mmapping hardware then it's generally because you haven't written a proper kernel driver. Do that instead. Runtime power management's already going to make you wildly unhappy. -- Matthew Garrett | mjg59@srcf.ucam.org --
Hi again. Okay. Looking back on our conversation brought me back to this message, which I think needs another reply. If we take the strategy of holding new requests and allowing existing ones to complete, then am I right in thinking that we only need to worry about where inode->i_mutex is taken in fs/fuse/file.c (I don't see it taken in other fs/fuse/*.c files). If that's correct, dealing with that issue looks relatively straight forward: we need some more FUSE_MIGHT_FREEZE calls for those functions, and something done to the vfs_check_frozen call - I'm a bit confused by that - inode->i_sb will refer to the fuse filesystem, won't it? Regards, Nigel --
Nope, i_mutex is usually taken by the VFS not the filesystem. That means that the filesystem is called with the mutex already held. Try "grep i_mutex fs/*.c". There's also sb->s_vfs_rename_mutex, for all the gory details see Documentation/filesystems/Locking. So it's not just having to fix fuse, it's having to "fix" the VFS as well. Miklos --
Hi. Remember, though, that we're only freezing fuse at the moment, and strictly one filesystem at a time. We can thus happily wait for the i_mutex taken by some other process to be released. Regards, Nigel --
Not going to work: you need to wait for all requests to be finished, but those might depend on some other fuse filesystem which has already been frozen. Miklos --
Hi. Okay. In that case, am I right in thinking that the request waiting on the frozen filesystem will be stuck in request_wait_answer, and the userspace process that was trying to satisfy the request will be stuck in the FUSE_MIGHT_FREEZE call that was invoked for the frozen filesystem? Regards, Nigel --
No, it already passed that, before the filesystem got frozen. But it doesn't matter, in either case i_mutex will already have been taken by the VFS and it won't be released until the request completely finishes. Miklos --
Sorry, I misunderstood this. Yes you're right, in the case of one fuse filesystem relying on another to complete the request the already frozen one will be stuck in FUSE_MIGHT_FREEZE(). How does that help? Miklos --
Hi. Well, my next question was going to be: can we find a way to know that the userspace process we're waiting on was frozen? If we can know that, then perhaps we can apply that knowledge in this thread to avoid a freezing failure. Regards, Nigel --
Hi Miklos. I think we're making different assumptions. I'm assuming that one of those solutions we already discussed is implemented, such that we don't start freezing a new filesystem until all the requests for the current filesystem are dealt with. Perhaps I should come up with a new version of the patch that implements that. Nigel --
Right, but it could also be sleeping in any other wait_event(), Please post the full patch, it'd be a lot better to see the big picture instead of just a small part of it. Thanks, Miklos --
