In a recent posting to the lkml, Al Viro announced an up-to-date guide for porting filesystems from the 2.4 stable kernel to the 2.5 development kernel. In the 2.5.5-pre1 source tree or later, the guide lives at 'Documentation/filesystems/porting'. He stressed that it will be kept up to date. It also includes notes on changes that break things.
As detailed in his email, two filesystems are currently broken by these changes: umsdos and intermezzo. Al says, "The former will be fixed after the next series of file_system_type cleanups. The latter is a victim of current changes in locking scheme. Help from intermezzo folks would be a good idea - preferably in the form that would reduce the dependency on the VFS guts".
What follows is Al's email, and a snapshot of the following pieces of documentation from 2.5.5-pre1: porting, directory-locking, and Locking.
From: Alexander Viro
Date: Thu, 14 Feb 2002 23:11:38 -0500 (EST)
To: linux-fsdevel
Cc: linux-kernel
Subject: [ANNOUNCE] new VFS documentation

First of all, since 2.5.5-pre1 there is an up-to-date guide for porting filesystems from 2.4 to 2.5. Location: Documentation/filesystems/porting

It WILL be kept up-to-date. IOW, submit an API change that may require filesystem changes without a corresponding patch to that file and I will hunt you down and hurt you. Badly.

The same document covers "what do I need to change to keep my out-of-tree filesystem uptodate". So watch for changes there.

Normally when API change happens, the person doing it is responsible for updating all in-tree filesystems or, at least, warning people about the breakage. Applying the list of broken filesystems.

Right now that list consists of umsdos and intermezzo. The former will be fixed after the next series of file_system_type cleanups. The latter is a victim of current changes in locking scheme. Help from intermezzo folks would be a good idea - preferably in the form that would reduce the dependency on the VFS guts.

New locking scheme is described in Documentation/filesystems/directory-locking. In details and with proof of correctness. It doesn't change the exclusion warranties for filesystems, so unless they mess with locking in non-trivial ways (intermezzo was the only in-tree example) they shouldn't need any changes. Some things might become simpler, actually (i.e. in some cases private locking became redundant and can be dropped). Again, see Documentation/filesystems/porting for details of changes.

Documentation/filesystems/Locking is slowly getting up-to-date. Descriptions of several superblock methods are still missing and I would really appreciate it if folks who had introduced them would document them.
Documentation/filesystems/porting:

Changes since 2.5.0:

---
[recommended]

New helpers: sb_bread(), sb_getblk(), sb_get_hash_table(), set_bh(), sb_set_blocksize() and sb_min_blocksize().

Use them.

---
[recommended]

New methods: ->alloc_inode() and ->destroy_inode().

Remove inode->u.foo_inode_i

Declare

	struct foo_inode_info {
		/* fs-private stuff */
		struct inode vfs_inode;
	};
	static inline struct foo_inode_info *FOO_I(struct inode *inode)
	{
		return list_entry(inode, struct foo_inode_info, vfs_inode);
	}

Use FOO_I(inode) instead of &inode->u.foo_inode_i;

Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate foo_inode_info and return the address of ->vfs_inode, the latter should free FOO_I(inode) (see in-tree filesystems for examples).

Make them ->alloc_inode and ->destroy_inode in your super_operations.

Keep in mind that now you need explicit initialization of private data - typically in ->read_inode() and after getting an inode from new_inode().

At some point that will become mandatory.

---
[mandatory]

Change of file_system_type method (->read_super to ->get_sb)

->read_super() is no more. Ditto for DECLARE_FSTYPE and DECLARE_FSTYPE_DEV.

Turn your foo_read_super() into a function that would return 0 in case of success and negative number in case of error (-EINVAL unless you have more informative error value to report). Call it foo_fill_super(). Now declare

	struct super_block *foo_get_sb(struct file_system_type *fs_type,
				int flags, char *dev_name, void *data)
	{
		return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super);
	}

(or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of filesystem).

Replace DECLARE_FSTYPE... with explicit initializer and have ->get_sb set as foo_get_sb.

---
[mandatory]

Locking change: ->s_vfs_rename_sem is taken only by cross-directory renames. Most likely there is no need to change anything, but if you relied on global exclusion between renames for some internal purpose - you need to change your internal locking. Otherwise exclusion warranties remain the same (i.e. parents and victim are locked, etc.).

---
[informational]

Now we have the exclusion between ->lookup() and directory removal (by ->rmdir() and ->rename()). If you used to need that exclusion and do it by internal locking (most of filesystems couldn't care less) - you can relax your locking.

---
[mandatory]

->lookup() is called without BKL now. Grab it on the entry, drop upon return - that will guarantee the same locking you used to have. If your ->lookup() or its parts do not need BKL - better yet, now you can shift lock_kernel()/unlock_kernel() so that they would protect exactly what needs to be protected.

---
[mandatory]

->truncate() is called without BKL now (same as above).
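
The ->alloc_inode()/->destroy_inode() entry in the porting guide above relies on embedding the generic inode inside the filesystem-private structure and stepping back from one to the other. What follows is only a user-space sketch of that pattern, not kernel code: struct inode is a one-field stand-in, container_of() is defined locally to play the role of list_entry(), foo_flags is a hypothetical private field, and foo_alloc_inode() takes no arguments here (the real method is passed kernel objects and uses a kernel allocator).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Stand-in for the kernel's struct inode (assumption: one field is enough). */
struct inode {
	unsigned long i_ino;
};

/* Local equivalent of list_entry(): step back from a member to its container. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct foo_inode_info {
	int foo_flags;			/* fs-private stuff (hypothetical field) */
	struct inode vfs_inode;		/* generic inode embedded in the private one */
};

static inline struct foo_inode_info *FOO_I(struct inode *inode)
{
	return container_of(inode, struct foo_inode_info, vfs_inode);
}

/* Allocate the private structure and hand back the embedded generic inode. */
static struct inode *foo_alloc_inode(void)
{
	struct foo_inode_info *fi = malloc(sizeof(*fi));

	if (!fi)
		return NULL;
	fi->foo_flags = 0;		/* private data must be initialized explicitly */
	fi->vfs_inode.i_ino = 0;
	return &fi->vfs_inode;
}

/* Free the whole private structure, found via the embedded inode. */
static void foo_destroy_inode(struct inode *inode)
{
	assert(inode != NULL);
	free(FOO_I(inode));
}
```

Generic code only ever sees the struct inode pointer, while the filesystem recovers its private data with FOO_I(inode), which is why the union member inode->u.foo_inode_i is no longer needed.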
Documentation/filesystems/directory-locking:

Locking scheme used for directory operations is based on two kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem).

For our purposes all operations fall in 5 classes:

1) read access. Locking rules: caller locks directory we are accessing.

2) object creation. Locking rules: same as above.

3) object removal. Locking rules: caller locks parent, finds victim, locks victim and calls the method.

4) rename() that is _not_ cross-directory. Locking rules: caller locks the parent, finds source and target, if target already exists - locks it and then calls the method.

5) cross-directory rename. The trickiest in the whole bunch. Locking rules:
	* lock the filesystem
	* lock parents in "ancestors first" order.
	* find source and target.
	* if old parent is equal to or is a descendent of target fail with -ENOTEMPTY
	* if new parent is equal to or is a descendent of source fail with -ELOOP
	* if target exists - lock it.
	* call the method.

The rules above obviously guarantee that all directories that are going to be read, modified or removed by method will be locked by caller.

If no directory is its own ancestor, the scheme above is deadlock-free. Proof:

First of all, at any moment we have a partial ordering of the objects - A < B iff A is an ancestor of B.

That ordering can change. However, the following is true:

(1) if operation different from cross-directory rename holds lock on A and attempts to acquire lock on B, A will remain the parent of B until we acquire the lock on B. (Proof: only cross-directory rename can change the parent of object and it would have to lock the parent).

(2) if cross-directory rename holds the lock on filesystem, order will not change until rename acquires all locks. (Proof: other cross-directory renames will be blocked on filesystem lock and we don't start changing the order until we had acquired all locks).

Now consider the minimal deadlock. Each process is blocked on attempt to acquire some lock and already holds at least one lock. Let's consider the set of contended locks. First of all, filesystem lock is not contended, since any process blocked on it is not holding any locks. Thus all processes are blocked on ->i_sem.

Any contended object is either held by cross-directory rename or has a child that is also contended. Indeed, suppose that it is held by operation other than cross-directory rename. Then the lock this operation is blocked on belongs to child of that object due to (1).

It means that one of the operations is cross-directory rename. Otherwise the set of contended objects would be infinite - each of them would have a contended child and we had assumed that no object is its own descendent. Moreover, there is exactly one cross-directory rename (see above).

Consider the object blocking the cross-directory rename. One of its descendents is locked by cross-directory rename (otherwise we would again have an infinite set of contended objects). But that means that cross-directory rename is taking locks out of order. Due to (2) the order hadn't changed since we had acquired filesystem lock. But locking rules for cross-directory rename guarantee that we do not try to acquire lock on descendent before the lock on ancestor. Contradiction. I.e. deadlock is impossible. Q.E.D.

These operations are guaranteed to avoid loop creation. Indeed, the only operation that could introduce loops is cross-directory rename. Since the only new (parent, child) pair added by rename() is (new parent, source), such loop would have to contain these objects and the rest of it would have to exist before rename(). I.e. at the moment of loop creation rename() responsible for that would be holding filesystem lock and new parent would have to be equal to or a descendent of source. But that means that new parent had been equal to or a descendent of source since the moment when we had acquired filesystem lock and rename() would fail with -ELOOP in that case.

While this locking scheme works for arbitrary DAGs, it relies on ability to check that directory is a descendent of another object. Current implementation assumes that directory graph is a tree. This assumption is also preserved by all operations (cross-directory rename on a tree that would not introduce a cycle will leave it a tree and link() fails for directories).

Notice that "directory" in the above == "anything that might have children", so if we are going to introduce hybrid objects we will need either to make sure that link(2) doesn't work for them or to make changes in is_subdir() that would make it work even in presence of such beasts.
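
The cross-directory rename rules described above (class 5) can be sketched in user space roughly as follows. This is a toy model under stated assumptions, not the VFS implementation: struct dir, lock_dir(), unlock_dir(), lock_parents() and toy_cross_dir_rename() are hypothetical names, an int flag stands in for ->i_sem, the is_subdir() here is a simple parent-pointer walk rather than the kernel function of the same name, and the filesystem lock is assumed to be held by the caller already.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct dir {
	struct dir *parent;	/* NULL for the root */
	int locked;		/* toy stand-in for ->i_sem */
};

static void lock_dir(struct dir *d)   { assert(!d->locked); d->locked = 1; }
static void unlock_dir(struct dir *d) { assert(d->locked);  d->locked = 0; }

/* Returns 1 if d is equal to or a descendant of anc, 0 otherwise. */
static int is_subdir(const struct dir *d, const struct dir *anc)
{
	for (; d; d = d->parent)
		if (d == anc)
			return 1;
	return 0;
}

/* Lock both parents, ancestor first; *first_locked reports which went first. */
static void lock_parents(struct dir *old_parent, struct dir *new_parent,
			 struct dir **first_locked)
{
	struct dir *first = old_parent, *second = new_parent;

	if (is_subdir(old_parent, new_parent)) {	/* new parent is the ancestor */
		first = new_parent;
		second = old_parent;
	}
	lock_dir(first);
	lock_dir(second);
	*first_locked = first;
}

/* Assumes the per-filesystem lock is already held by the caller. */
static int toy_cross_dir_rename(struct dir *old_parent, struct dir *source,
				struct dir *new_parent, struct dir *target,
				struct dir **first_locked)
{
	int err = 0;

	lock_parents(old_parent, new_parent, first_locked);
	if (target && is_subdir(old_parent, target)) {
		err = -ENOTEMPTY;
	} else if (is_subdir(new_parent, source)) {
		err = -ELOOP;		/* would make source its own ancestor */
	} else {
		if (target)
			lock_dir(target);
		source->parent = new_parent;	/* "call the method"; a real one
						   would also drop the target */
		if (target)
			unlock_dir(target);
	}
	unlock_dir(new_parent);
	unlock_dir(old_parent);
	return err;
}
```

Because both checks run while the parents are locked and the filesystem lock is held, the ancestor relation they test cannot change underneath them, which is the property step (2) of the proof depends on.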
Documentation/filesystems/Locking:

The text below describes the locking rules for VFS-related methods. It is (believed to be) up-to-date. *Please*, if you change anything in prototypes or locking protocols - update this file. And update the relevant instances in the tree, don't leave that to maintainers of filesystems/devices/ etc. At the very least, put the list of dubious cases in the end of this file. Don't turn it into log - maintainers of out-of-the-tree code are supposed to be able to use diff(1).

Thing currently missing here: socket operations. Alexey?

--------------------------- dentry_operations --------------------------
prototypes:
	int (*d_revalidate)(struct dentry *, int);
	int (*d_hash) (struct dentry *, struct qstr *);
	int (*d_compare) (struct dentry *, struct qstr *, struct qstr *);
	int (*d_delete)(struct dentry *);
	void (*d_release)(struct dentry *);
	void (*d_iput)(struct dentry *, struct inode *);

locking rules:
	none have BKL
                dcache_lock     may block
d_revalidate:   no              yes
d_hash:         no              yes
d_compare:      yes             no
d_delete:       yes             no
d_release:      no              yes
d_iput:         no              yes

--------------------------- inode_operations ---------------------------
prototypes:
	int (*create) (struct inode *,struct dentry *,int);
	struct dentry * (*lookup) (struct inode *,struct dentry *);
	int (*link) (struct dentry *,struct inode *,struct dentry *);
	int (*unlink) (struct inode *,struct dentry *);
	int (*symlink) (struct inode *,struct dentry *,const char *);
	int (*mkdir) (struct inode *,struct dentry *,int);
	int (*rmdir) (struct inode *,struct dentry *);
	int (*mknod) (struct inode *,struct dentry *,int,int);
	int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *);
	int (*readlink) (struct dentry *, char *,int);
	int (*follow_link) (struct dentry *, struct nameidata *);
	void (*truncate) (struct inode *);
	int (*permission) (struct inode *, int);
	int (*revalidate) (struct dentry *);
	int (*setattr) (struct dentry *, struct iattr *);
	int (*getattr) (struct dentry *, struct iattr *);

locking rules:
	all may block
                BKL     i_sem(inode)
lookup:         no      yes
create:         yes     yes
link:           yes     yes
mknod:          yes     yes
mkdir:          yes     yes
unlink:         yes     yes (both)
rmdir:          yes     yes (both)      (see below)
rename:         yes     yes (all)       (see below)
readlink:       no      no
follow_link:    no      no
truncate:       yes     yes             (see below)
setattr:        yes     if ATTR_SIZE
permission:     yes     no
getattr:                                (see below)
revalidate:     no                      (see below)
setxattr:       DOCUMENT_ME
getxattr:       DOCUMENT_ME
removexattr:    DOCUMENT_ME

	Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on victim.
	cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
	->revalidate(), it may be called both with and without the i_sem on dentry->d_inode.
	->truncate() is never called directly - it's a callback, not a method. It's called by vmtruncate() - library function normally used by ->setattr(). Locking information above applies to that call (i.e. is inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been passed).
	->getattr() is currently unused.

See Documentation/filesystems/directory-locking for more detailed discussion of the locking scheme for directory operations.

--------------------------- super_operations ---------------------------
prototypes:
	void (*read_inode) (struct inode *);
	void (*write_inode) (struct inode *, int);
	void (*put_inode) (struct inode *);
	void (*delete_inode) (struct inode *);
	void (*put_super) (struct super_block *);
	void (*write_super) (struct super_block *);
	int (*statfs) (struct super_block *, struct statfs *);
	int (*remount_fs) (struct super_block *, int *, char *);
	void (*clear_inode) (struct inode *);
	void (*umount_begin) (struct super_block *);

locking rules:
	All may block.
                BKL     s_lock  mount_sem
read_inode:     yes                             (see below)
write_inode:    no
put_inode:      no
delete_inode:   no
clear_inode:    no
put_super:      yes     yes     maybe           (see below)
write_super:    yes     yes     maybe           (see below)
statfs:         yes     no      no
remount_fs:     yes     yes     maybe           (see below)
umount_begin:   yes     no      maybe           (see below)

->read_inode() is not a method - it's a callback used in iget()/iget4().
rules for mount_sem are not too nice - it is going to die and be replaced by better scheme anyway.

--------------------------- file_system_type ---------------------------
prototypes:
	struct super_block *(*read_super) (struct super_block *, void *, int);

locking rules:
may block       BKL     ->s_lock        mount_sem
yes             yes     yes             maybe

--------------------------- address_space_operations --------------------------
prototypes:
	int (*writepage)(struct page *);
	int (*readpage)(struct file *, struct page *);
	int (*sync_page)(struct page *);
	int (*prepare_write)(struct file *, struct page *, unsigned, unsigned);
	int (*commit_write)(struct file *, struct page *, unsigned, unsigned);
	int (*bmap)(struct address_space *, long);

locking rules:
	All may block
                BKL     PageLocked(page)
writepage:      no      yes, unlocks
readpage:       no      yes, unlocks
sync_page:      no      maybe
prepare_write:  no      yes
commit_write:   no      yes
bmap:           yes

	->prepare_write(), ->commit_write(), ->sync_page() and ->readpage() may be called from the request handler (/dev/loop).
	->readpage() and ->writepage() unlock the page.
	->sync_page() locking rules are not well-defined - usually it is called with lock on page, but that is not guaranteed. Considering the currently existing instances of this method ->sync_page() itself doesn't look well-defined...
	->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some filesystems and by the swapper. The latter will eventually go away. All instances do not actually need the BKL. Please, keep it that way and don't breed new callers.

	Note: currently almost all instances of address_space methods are using BKL for internal serialization and that's one of the worst sources of contention. Normally they are calling library functions (in fs/buffer.c) and pass foo_get_block() as a callback (on local block-based filesystems, indeed). BKL is not needed for library stuff and is usually taken by foo_get_block(). It's an overkill, since block bitmaps can be protected by internal fs locking and real critical areas are much smaller than the areas filesystems protect now.

--------------------------- file_lock ------------------------------------
prototypes:
	void (*fl_notify)(struct file_lock *);	/* unblock callback */
	void (*fl_insert)(struct file_lock *);	/* lock insertion callback */
	void (*fl_remove)(struct file_lock *);	/* lock removal callback */

locking rules:
                BKL     may block
fl_notify:      yes     no
fl_insert:      yes     maybe
fl_remove:      yes     maybe

	Currently only NLM provides instances of this class. None of them block. If you have out-of-tree instances - please, show up. Locking in that area will change.

--------------------------- buffer_head -----------------------------------
prototypes:
	void (*b_end_io)(struct buffer_head *bh, int uptodate);

locking rules:
	called from interrupts. In other words, extreme care is needed here. bh is locked, but that's all warranties we have here. Currently only RAID1, highmem and fs/buffer.c are providing these. Block devices call this method upon the IO completion.

--------------------------- block_device_operations -----------------------
prototypes:
	int (*open) (struct inode *, struct file *);
	int (*release) (struct inode *, struct file *);
	int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long);
	int (*check_media_change) (kdev_t);
	int (*revalidate) (kdev_t);

locking rules:
                        BKL     bd_sem
open:                   yes     yes
release:                yes     yes
ioctl:                  yes     no
check_media_change:     yes     no
revalidate:             yes     no

The last two are called only from check_disk_change(). Prototypes are very bad - as soon as we'll get disk_struct they will change (and methods will become per-disk instead of per-partition).

--------------------------- file_operations -------------------------------
prototypes:
	loff_t (*llseek) (struct file *, loff_t, int);
	ssize_t (*read) (struct file *, char *, size_t, loff_t *);
	ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
	int (*readdir) (struct file *, void *, filldir_t);
	unsigned int (*poll) (struct file *, struct poll_table_struct *);
	int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *);
	int (*release) (struct inode *, struct file *);
	int (*fsync) (struct file *, struct dentry *, int datasync);
	int (*fasync) (int, struct file *, int);
	int (*lock) (struct file *, int, struct file_lock *);
	ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
	ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

locking rules:
	All except ->poll() may block.
                BKL
llseek:         yes     (see below)
read:           no
write:          no
readdir:        yes     (see below)
poll:           no
ioctl:          yes     (see below)
mmap:           no
open:           maybe   (see below)
flush:          yes
release:        no
fsync:          yes     (see below)
fasync:         yes     (see below)
lock:           yes
readv:          no
writev:         no

->llseek() locking has moved from llseek to the individual llseek implementations. If your fs is not using generic_file_llseek, you need to acquire and release the appropriate locks in your ->llseek(). For many filesystems, it is probably safe to acquire the inode semaphore. Note some filesystems (i.e. remote ones) provide no protection for i_size so you will need to use the BKL.

->open() locking is in-transit: big lock partially moved into the methods. The only exception is ->open() in the instances of file_operations that never end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices (chrdev_open() takes lock before replacing ->f_op and calling the secondary method). As soon as we fix the handling of module reference counters all instances of ->open() will be called without the BKL.

Note: ext2_release() was *the* source of contention on fs-intensive loads and dropping BKL on ->release() helps to get rid of that (we still grab BKL for cases when we close a file that had been opened r/w, but that can and should be done using the internal locking with smaller critical areas). Current worst offender is ext2_get_block()...

->fasync() is a mess. This area needs a big cleanup and that will probably affect locking.

->readdir() and ->ioctl() on directories must be changed. Ideally we would move ->readdir() to inode_operations and use a separate method for directory ->ioctl() or kill the latter completely. One of the problems is that for anything that resembles union-mount we won't have a struct file for all components. And there are other reasons why the current interface is a mess...

->read on directories probably must go away - we should just enforce -EISDIR in sys_read() and friends.

->fsync() has i_sem on inode.
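
As a rough illustration of the ->llseek() note above ("acquire and release the appropriate locks in your ->llseek()"), here is a user-space sketch in which the seek implementation takes a per-inode lock itself instead of relying on the caller. Everything in it is an assumption made for illustration: struct toy_inode, struct toy_file and foo_llseek() are hypothetical, a pthread mutex stands in for the inode semaphore, and off_t stands in for loff_t. It is not the kernel's generic_file_llseek().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>	/* SEEK_SET, SEEK_CUR, SEEK_END */
#include <sys/types.h>	/* off_t */

struct toy_inode {
	pthread_mutex_t i_sem;	/* stand-in for the inode semaphore */
	off_t i_size;
};

struct toy_file {
	struct toy_inode *inode;
	off_t f_pos;
};

/* Returns the new position, or a negative errno value on failure. */
static off_t foo_llseek(struct toy_file *file, off_t offset, int whence)
{
	struct toy_inode *inode;
	off_t ret = -EINVAL;

	assert(file != NULL && file->inode != NULL);
	inode = file->inode;

	/* The method, not its caller, serializes access to f_pos and i_size. */
	pthread_mutex_lock(&inode->i_sem);
	switch (whence) {
	case SEEK_SET:
		break;
	case SEEK_CUR:
		offset += file->f_pos;
		break;
	case SEEK_END:
		offset += inode->i_size;
		break;
	default:
		goto out;	/* unknown whence: leave ret as -EINVAL */
	}
	if (offset >= 0) {
		file->f_pos = offset;
		ret = offset;
	}
out:
	pthread_mutex_unlock(&inode->i_sem);
	return ret;
}
```

The critical section covers only the position arithmetic, which is the kind of narrow, filesystem-private locking the text recommends over holding the BKL around the whole call.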

--------------------------- dquot_operations -------------------------------
prototypes:
	void (*initialize) (struct inode *, short);
	void (*drop) (struct inode *);
	int (*alloc_block) (const struct inode *, unsigned long, char);
	int (*alloc_inode) (const struct inode *, unsigned long);
	void (*free_block) (const struct inode *, unsigned long);
	void (*free_inode) (const struct inode *, unsigned long);
	int (*transfer) (struct dentry *, struct iattr *);

locking rules:
                BKL
initialize:     no
drop:           no
alloc_block:    yes
alloc_inode:    yes
free_block:     yes
free_inode:     yes
transfer:       no

--------------------------- vm_operations_struct -----------------------------
prototypes:
	void (*open)(struct vm_area_struct*);
	void (*close)(struct vm_area_struct*);
	struct page *(*nopage)(struct vm_area_struct*, unsigned long, int);

locking rules:
                BKL     mmap_sem
open:           no      yes
close:          no      yes
nopage:         no      yes

================================================================================
			Dubious stuff

(if you break something or notice that it is broken and do not fix it yourself - at least put it here)

ipc/shm.c::shm_delete() - may need BKL.
->read() and ->write() in many drivers are (probably) missing BKL.
drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL.
UVFS
Slightly-Offtopic info:
There's a nifty User-space filesystem implementation - UVFS - at http://www.sciencething.org/geekthings/index.html