Hello,
the following three patches implement a recursive mtime feature for ext3.
The first two patches are mostly clean-ups; the third patch implements the
feature itself. If anybody is interested in testing this (or even in writing
support for the feature in rsync and similar tools), please contact me.
Attached are the sources of two simple tools, set_recmod and get_recmod, for
testing the feature, and also a patch implementing basic support for the
feature in e2fsprogs.
Comments welcome.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
Hello,
the following patch makes more lightweight handling of
EXT3_I(inode)->i_flags possible.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---
Implement atomic updates of EXT3_I(inode)->i_flags. So far the i_flags access
was guarded mostly by i_mutex but this is quite heavy-weight. We now use
inode->i_lock to protect i_flags reading and updates in ext3. This patch
introduces a bogus warning that jflag and oldflags may be uninitialized -
does anyone know how to cleanly get rid of it?
Signed-off-by: Jan Kara <jack@suse.cz>
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/dir.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c
--- linux-2.6.23/fs/ext3/dir.c 2007-10-11 12:01:23.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c 2007-11-05 14:04:56.000000000 +0100
@@ -108,10 +108,10 @@ static int ext3_readdir(struct file * fi
sb = inode->i_sb;
#ifdef CONFIG_EXT3_INDEX
- if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
- EXT3_FEATURE_COMPAT_DIR_INDEX) &&
- ((EXT3_I(inode)->i_flags & EXT3_INDEX_FL) ||
- ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
+ if (is_dx(inode) ||
+ (EXT3_HAS_COMPAT_FEATURE(inode->i_sb, \
+ EXT3_FEATURE_COMPAT_DIR_INDEX) &&
+ (inode->i_size >> sb->s_blocksize_bits) == 1)) {
err = ext3_dx_readdir(filp, dirent, filldir);
if (err != ERR_BAD_DX_DIR) {
ret = err;
@@ -121,7 +121,9 @@ static int ext3_readdir(struct file * fi
* We don't set the inode dirty flag since it's not
* critical that it get flushed back to the disk.
*/
+ spin_lock(&inode->i_lock);
EXT3_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT3_INDEX_FL;
+ spin_unlock(&inode->i_lock);
}
#endif
stored = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ialloc.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c
--- linux-2.6.23/fs/ext3/ialloc.c 2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.000000000 +0100
@@ ...
Mark the space reserved for fragments as unused, as fragments were never
implemented. Also remove the related initializations. We later use the space
for recursive mtime.
Signed-off-by: Jan Kara <jack@suse.cz>
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c 2007-11-05 14:14:50.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c 2007-11-05 14:37:33.000000000 +0100
@@ -576,11 +576,6 @@ got:
/* dirsync only applies to directories */
if (!S_ISDIR(mode))
ei->i_flags &= ~EXT3_DIRSYNC_FL;
-#ifdef EXT3_FRAGMENTS
- ei->i_faddr = 0;
- ei->i_frag_no = 0;
- ei->i_frag_size = 0;
-#endif
ei->i_file_acl = 0;
ei->i_dir_acl = 0;
ei->i_dtime = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c 2007-11-05 14:24:39.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c 2007-11-05 14:38:05.000000000 +0100
@@ -2651,11 +2651,6 @@ void ext3_read_inode(struct inode * inod
}
inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);
ei->i_flags = le32_to_cpu(raw_inode->i_flags);
-#ifdef EXT3_FRAGMENTS
- ei->i_faddr = le32_to_cpu(raw_inode->i_faddr);
- ei->i_frag_no = raw_inode->i_frag;
- ei->i_frag_size = raw_inode->i_fsize;
-#endif
ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
if (!S_ISREG(inode->i_mode)) {
ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
@@ -2790,11 +2785,6 @@ static int ext3_do_update_inode(handle_t
spin_lock(&inode->i_lock);
raw_inode->i_flags = cpu_to_le32(ei->i_flags);
spin_unlock(&inode->i_lock);
-#ifdef EXT3_FRAGMENTS
- raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
- raw_inode->i_frag = ei->i_frag_no;
- raw_inode->i_fsize = ei->i_frag_size;
-#endif
raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
if ...
Implement the recursive mtime (rtime) feature for ext3. The feature works as follows: in each directory we keep a flag EXT3_RTIME_FL (modifiable by a user) that says whether rtime should be updated. When a directory or a file in it is modified and the flag is set, the directory's rtime is updated, the flag is cleared, and we move to the parent. If the flag is set there, we clear it, update rtime, and continue upwards, up to the root of the filesystem. When a regular file or symlink is modified, we pick an arbitrary one of its parents (actually the one that happens to be at the head of the i_dentry list) and start the rtime update algorithm there.

As the flag is always cleared after updating rtime, and we don't climb up the tree if the flag is cleared, we have constant amortized complexity of rtime updates. That's for theoretical time consumption ;) Practically, there's no measurable performance impact for a test case like: touch every file in a kernel tree where every directory has the RTIME flag set.

The intended use case is that an application which wants to watch for any modification in a subtree scans the subtree and sets flags for all inodes there. Next time, it just needs to recurse into directories having rtime newer than the start of the previous scan. There it can handle modifications and set the flag again. It is up to the application to watch out for hardlinked files. It can e.g. build a list of them and check their mtime separately (when a hardlink to a file is created, its inode is modified and rtimes are properly updated, and thus any application has an effective way of finding new hardlinked files).
Signed-off-by: Jan Kara <jack@suse.cz>
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c 2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c 2007-11-05 16:58:53.000000000 +0100
@@ -569,7 +569,7 @@ got:
/* Guard reading of ...
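The propagation rule in the patch description can be illustrated with a small user-space model. This is only a sketch under stated assumptions: struct dir_node and rtime_propagate() are hypothetical names invented for illustration, the flag and timestamp are plain in-memory fields, and nothing here is taken from the actual ext3 code, which is truncated above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a directory inode (not an ext3 type). */
struct dir_node {
    struct dir_node *parent;  /* NULL for the root of the filesystem */
    bool rtime_flag;          /* models EXT3_RTIME_FL */
    long rtime;               /* models the recursive mtime */
};

/*
 * Model of the update described above: starting at the directory in which
 * something was modified, update rtime and clear the flag for as long as the
 * flag is set, moving to the parent each time. The walk stops at the first
 * directory whose flag is already clear, or after the root.
 * Returns the number of directories whose rtime was updated.
 */
int rtime_propagate(struct dir_node *dir, long now)
{
    int updated = 0;

    while (dir != NULL && dir->rtime_flag) {
        dir->rtime = now;
        dir->rtime_flag = false;
        updated++;
        dir = dir->parent;
    }
    return updated;
}
```

In this model a second modification under the same directory does no work at all until somebody sets the flag again, which is the reason given above for the constant amortized cost.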
On Tue, 6 Nov 2007 18:19:45 +0100

OK, since mtime (and rtime) are part of the inode and not the dentry... how do you deal with hardlinks? And with cases of files that have been unlinked? (OK, the latter is a wash, obviously, other than not crashing.)

--
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, visit http://www.lesswatts.org
There is only one possible answer... he only updates the directory path that was used to touch the particular file involved. Thus, the semantics get grotty not just in the presence of hard links, but also in the presence of bind- and other non-root mounts.

-hpa
Unlinked files are easy - you just don't propagate the rtime anywhere.

Update of recursive mtime does not pass filesystem boundaries (i.e. mountpoints), so bind mounts and such are a non-issue (hmm, at least that was my original idea, but as I'm looking now I don't handle bind mounts properly, so that needs to be fixed).

With hardlinks, you are right that the behaviour is nondeterministic - I tried to argue in the text of the mail that this does not actually matter - there are not many hardlinks on a usual system, and so the application can check hardlinked files in the old way - i.e. look at mtime.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
*ewwww* You know, you can do that with aush^H^Hdit right now... -
Oh yes, there is :) But I tried to argue it does not really matter - application would have to handle hardlinks in a special way but I find that

Interesting idea, no, I have not thought about this. I guess you mean watching all the VFS modification events and then doing the checking and propagation from user space... My first feeling is that the performance penalty would be considerably higher (currently I am at a 1% performance penalty for quite a pessimistic test case), but in case the current patch is considered unacceptable, I can try to see how large the penalty would be. Thanks for the suggestion.

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
Umm, yuck. What if more than one application wants to use this facility?

The application is using a global per-inode flag that is written out to disk. So sweeping the entire subtree and setting this flag will involve a lot of disk i/o; as does setting a mod-time, since it could potentially require a large number of inode updates, and then the application needs to sweep through the subtree and reset the flags (resulting in more disk i/o). The performance would seem to me to be really pessimal.

In addition, after you crash, there might not be any application waiting to watch modifications in that subtree, and yet the flags would still be set, so the system would still be paying the performance penalties of needing to propagate modtimes until all of the flags disappear --- and for a large subtree, that might not be for a long, long time.

So if the goal is some kind of modification notification system that watches a subtree efficiently, avoiding some of the deficiencies of inotify and dnotify, the interface doesn't seem to be the right way to go about things. The fact that only one application at a time can use this interface, even if you ignore the issues of hard links and the performance problems and the lack of cleanup after a reboot, seems in my mind to just be an irreparable fatal flaw in this particular scheme.

Regards,
- Ted
That should be fine - let's see: Each application keeps somewhere the time when it started a scan of a subtree (or it can actually remember the time when it set the flag for each directory); during the scan, it sets the flag on each directory. When it wakes up to recheck the subtree, it just compares the rtime against the stored time - if rtime is greater, the subtree has been modified since the last scan, so we recurse into it, and when we are finished with it we set the flag. Now notice that we don't care about the flag when we check for changes - we care only about rtime - so if there are several applications interested in the same subtree, the flag just gets set more often and thus the update of rtime happens more often but the same scheme

I don't get it here - you need to scan the whole subtree and set the flag only during the initial scan. Later, you need to scan and set the flag only for directories in whose subtree something changed. Similarly, rtime needs to be updated for each inode at most once after the scan.

Maybe we have different ideas of use-cases: I consider this useful for larger subtrees which change only seldom (or only in small parts), or where you want to check for changes only once per some longer time - so uses like backup with rsync, updatedb, cachefiles for trees with config files (like KDE has) etc. There the penalty for additional IO during rtime updates is quite negligible - if you have some usecase you'd like to measure, please propose it and I'll measure it.

I have tested the following: Create a tree of depth 5 where each directory has 5 subdirectories and the leaf directories have 10 files in them. You set the flag on all directories (umount and mount again) and then touch one file in every directory. With the feature enabled this takes 36.1176s (average from 5 tests) with deviation 0.29509. Without the feature it takes 35.75480 with deviation 0.15433. So the difference in performance is 1%, which is just slightly above the error, and I'd find this test ...
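The scan scheme described in this reply can be sketched as a user-space model. Everything here is an assumption made for illustration: struct dir_node, add_child(), initial_scan(), touch_in() and rescan() are hypothetical names, touch_in() merely imitates the kernel-side propagation from the patch description, and the "stored time" is just an integer supplied by the caller.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 8  /* arbitrary limit for this model */

/* Hypothetical in-memory stand-in for a directory (not an ext3 type). */
struct dir_node {
    struct dir_node *parent;
    struct dir_node *child[MAX_CHILDREN];
    int nchildren;
    bool rtime_flag;  /* models EXT3_RTIME_FL */
    long rtime;       /* models the recursive mtime */
};

/* Link child under parent; returns false if the parent is full. */
bool add_child(struct dir_node *parent, struct dir_node *child)
{
    if (parent->nchildren >= MAX_CHILDREN)
        return false;
    parent->child[parent->nchildren++] = child;
    child->parent = parent;
    return true;
}

/* Initial scan: set the flag on every directory; returns how many were set. */
int initial_scan(struct dir_node *dir)
{
    int n = 1;

    dir->rtime_flag = true;
    for (int i = 0; i < dir->nchildren; i++)
        n += initial_scan(dir->child[i]);
    return n;
}

/* Imitation of the kernel side: a modification inside dir at time now. */
void touch_in(struct dir_node *dir, long now)
{
    while (dir != NULL && dir->rtime_flag) {
        dir->rtime = now;
        dir->rtime_flag = false;
        dir = dir->parent;
    }
}

/*
 * Recheck: only rtime is compared against the stored scan time; the flag is
 * not consulted. Directories with newer rtime are visited and re-armed once
 * their subtree is finished. Returns the number of directories visited.
 */
int rescan(struct dir_node *dir, long last_scan)
{
    int visited;

    if (dir->rtime <= last_scan)
        return 0;  /* nothing below changed since the stored time */
    visited = 1;
    for (int i = 0; i < dir->nchildren; i++)
        visited += rescan(dir->child[i], last_scan);
    dir->rtime_flag = true;
    return visited;
}
```

Because rescan() never reads the flag, a second application that re-arms the same directories would, in this model, only cause rtime to be updated more often; each application still compares rtime against its own stored time.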
OK, so in this case you don't need to set rtime on every single file inode, but only on directory inodes, right? Because you're only checking the rtime at the directory level, and not the flag. And it's just as easy for you to check the rtime flag for the file's containing directory (modulo magic vis-a-vis hard links) as the file's inode.

I'm just really wishing that rtime and the rtime flag didn't have to live on disk, but could rather be in memory. If you only needed to save the directory flags and rtimes, that might actually be doable.

Note by the way that since you need to own the file/directory to set flags, this means that only programs that are running as root or running as the uid who owns the entire subtree will be able to use this scheme. One advantage of doing it in kernel memory is that you might be able to support watching a tree that is not owned by the

OK, so in the worst case every single file in a kernel source tree might change after doing an extreme git checkout. That means around 36k files get updated. So if you have to set/clear the rtime flag during the checkout process, 36k file inodes would have to have their rtime flag cleared, plus 2k worth of directory inodes; but those would probably be folded into other changes made to the inodes anyway. But then when trackerd goes back and scans the subtree, if you are actually setting rtime flags for every single file inode, then that's 38k inodes that need updating. If you only need to set the rtime flags for directories, that's only 2k worth of extra gratuitous inode updates.

- Ted
Yes, that's actually what I'm doing - sorry if I didn't make it clear

I already gave some thought to this but there seemed to be some drawbacks. The query I want to support is: given a directory, tell me which of its subdirectories (arbitrarily deep below) have been modified since time T. That is what you need to support faster rsync, updatedb and similar loads. Also I want to allow a reboot to happen in between the modification and a query (handling a crash correctly would be nice too, but honestly my current implementation is not completely reliable in this regard either), so some permanent storage is needed in any case.

What I can imagine we could do is to report all modifications to userspace - that has the problem that there are *many* possible modifications but we are interested only in whether some happened since time T. We could improve this by an in-memory inode flag "I'm not interested in modifications any further" and reporting the change only if the parent directory does not have this flag set (note that this flag gets lost when we evict the inode from memory). But I would say that in the end all this message passing, climbing the tree from userspace and maintaining a data structure in memory and on disk would cost us more than the current implementation... Also it has the disadvantage that we miss the modifications which happen before we start the userspace daemon catching the events.

Doing this in kernel memory has the problem of how to solve the persistency across reboots (dumping mods to userspace on request?), and also on my system you'd have roughly a few MB of pinned memory for these purposes...

Yes, that is the advantage. On the other hand we could allow setting that particular flag even without being an owner of the inode. In fact, I don't currently see a use case where you won't be either root (rsync, updatedb) or an owner of the files (watching config file trees) but I guess

Yes, here the impact is hardly measurable as I've written in the previous

As I wrote ...
Ah, OK, so the two things that I didn't get from your patch description are:

1) the rtime flag and rtime field are only set on directories
2) the intended use is not trackerd and its ilk, but rsync and updatedb, so it is desirable that scan/queries be persistent across reboots

But then the major hole in this scheme is still the issue of hard links. The rsync program is still going to have to scan the entire subtree looking for hard links, since an inode with multiple links into the directory tree can't guarantee that all of its parent directories will have their rtime field updated. A program like updatedb which only cares about filenames will be OK, since that means it really only cares about knowing when directories have changed, and you can't have hard links to directories.

The other problem, of course, is that this feature would become ext 2/3/4 specific, and I could see future filesystems possibly wanting this. So this raises the question of whether the interface should be at the VFS layer or not --- and if so, how to handle querying whether a particular filesystem supports it, and what happens if you have a subtree which is covered by a filesystem that doesn't support rtime? So a program like rsync would need to scan /proc/self/mounts to see whether or not it would be safe to use this feature in the first place.

And, of course, rsync would need to know whether it has write access to the tree in order to set flags in the directory, and what to do if some portion of the subtree isn't writeable by rsync. Sometimes people like to use rsync to copy a subtree to which they have read access but not write access. (And here note that it's not enough to have write access, you actually need to *own* all of the directories in the subtree.) Yes, it's safe to let any user *set* the rtime flag, but we couldn't let them clear the rtime flag, since then they would be able to hide a file modification from some other (potentially privileged) process.

Speaking of ...
Not really - initially rsync can scan a tree for hardlinks and remember where they are. If a hardlink to a file is created, an rtime update is sent up the tree via the path used to create the link. So during the next scan, rsync will see the file is modified, find out that its nlink is > 1, and add it to the list of hardlinked files. So for things like regular backups hardlinks can be dealt with

Yes, being filesystem specific and thus requiring special handling of

Yes, the cases where we cannot modify the flag in a tree would have to be handled (similarly to the cases where the filesystem simply does not support the feature). I don't think it would be too complicated but I have

Yes, so in such cases my feature won't be able to help. But I think

No, the patch does not allow this. But anyway in case user has enough

Hardlinks can be worked around as I wrote above, and there would have to be a fallback in case we cannot set the flag. So I agree the code would be more complicated, but I think it could be done in quite a clean way - but of course that has to be proven by a patch, which I don't have yet. I have not spoken to the rsync maintainers about this - first I want to have at least a preliminary version of a patch for rsync so that we have something to talk about...

Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
