Chris Jepeway recently posted an interesting patch to the NetBSD tech-perform mailing list. His patch provides "disk-level transaction clustering", bunching together a series of contiguous disk reads or writes into a single read or write. This logic is added at the disk driver level, after the filesystem has performed its own read-ahead or write-behind logic, and in his early tests it reduced the number of individual disk transfers by roughly 4% to 5%. Chris says, "None of this is earth-shattering news, of course, but it does demonstrate that clusters can be missed at the FS / VM interface. And that's when only one process uses the disk in question."
I contacted Chris, who was kind enough to answer a few questions about his efforts. To learn more, read on...
Jeremy Andrews: Can you explain what disk-level transaction clustering is, and how it works?
Chris Jepeway: What it is:
It gloms together a bunch of contiguous disk writes or disk reads into a single write or read. It would turn, e.g., 4 separate reads of disk blocks 17, 18, 19, and 20 into a single 4-block read that starts with block 17. It trades the CPU time spent building and tearing down the cluster against the bus-arbitration overhead of each of the separate transactions.
It's "disk-level" because it's performed in the disk driver, after the filesystem code has already done its own read-ahead or write-behind. It's "transaction clustering" because it replaces N transactions to/from adjacent sectors on the disk with 1 larger transaction spanning all the adjacent sectors.
How it works:
It uses the VM system to map the pages in the individual buffers into a contiguous range of virtual address space. It then cobbles up a fresh struct buf using the new address and schedules that new buffer instead of the N individual buffers. This is done via UVM and the pmap interface.
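A rough sketch of what that remapping might look like is below. It is not Chris's actual code: it assumes the 1.6-era UVM entry points uvm_km_valloc(), pmap_extract(), and pmap_kenter_pa(), assumes each member buffer's data is page-aligned, and omits both error handling and the teardown path that would unmap the cluster and complete each member buffer:

    #include <sys/param.h>
    #include <sys/buf.h>
    #include <sys/malloc.h>
    #include <sys/systm.h>
    #include <uvm/uvm_extern.h>

    /*
     * Sketch only: build one buf whose b_data is a fresh, contiguous
     * kernel mapping of the pages already backing n queued bufs.
     */
    static struct buf *
    cluster_by_remap(struct buf **bps, int n)
    {
        struct buf *cbp;
        vaddr_t cva, va;
        paddr_t pa;
        vsize_t len = 0, off;
        int i;

        for (i = 0; i < n; i++)
            len += bps[i]->b_bcount;

        /* Reserve a contiguous range of kernel VA for the cluster. */
        cva = uvm_km_valloc(kernel_map, round_page(len));

        /* Alias each member buffer's pages into the new range. */
        va = cva;
        for (i = 0; i < n; i++) {
            for (off = 0; off < bps[i]->b_bcount; off += PAGE_SIZE) {
                pmap_extract(pmap_kernel(),
                    (vaddr_t)bps[i]->b_data + off, &pa);
                pmap_kenter_pa(va, pa, VM_PROT_READ | VM_PROT_WRITE);
                va += PAGE_SIZE;
            }
        }
        pmap_update(pmap_kernel());

        /* Cobble up the replacement buffer. */
        cbp = malloc(sizeof(*cbp), M_DEVBUF, M_WAITOK);
        memset(cbp, 0, sizeof(*cbp));
        cbp->b_data = (void *)cva;
        cbp->b_bcount = len;
        cbp->b_blkno = bps[0]->b_blkno;
        cbp->b_flags = bps[0]->b_flags;
        /* b_iodone would unmap cva and biodone() each member buf. */
        return cbp;
    }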
JA: Have you received much of a response from fellow NetBSD hackers?
Chris Jepeway: Three or four responses, I think, mostly suggestions for improvement and a couple of questions. I haven't heard from anybody who has said "hey, I tried it and here's what happened." It may not apply to -current anymore, since I was working with source synced up just days before the gehenna-devsw merge.
The biggest problem I see with it is that it won't work on systems that have a virtually indexed memory cache that doesn't handle VA aliases. Mucking with the VM on such systems the way it's now done can leave stale entries in the cache, as I (perhaps mis-) understand it. Jason Thorpe pointed this out, and he suggested substituting scatter-gather for VM tricks. However, Manuel Bouyer noted that not all bus architectures implement the scatter/gather API for bus_dma(9). Ah, well, I'll work something out there.
JA: What further plans do you have for this patch?
Chris Jepeway: I hope to keep working on it until it can get into NetBSD proper.
JA: Have you performed any further benchmarks beyond what's described in your email?
Chris Jepeway: None that are worth reporting, really. This is gonna be the first thing I address when I get back to it RSN. My feeling is the NetBSD community is waiting for some numbers before taking this patch as more than a possibly interesting hack. *I*'m kinda waiting to hear about benchmarks folks believe b/4 I invest in the rigor I'd like when doing solid tests.
The stats in the e-mail were more along the lines of "here's this really simple use of the filesystem where clustering found some work to do" and less of "look at the performance gain you get from using this." There was some question as to whether clustering would even be needed, since FFS, for example, does its own batching up of disk requests via UBC (see Chuck Silvers's post when UBC went into 1.5-current). These stats show that there are clusters missed by the filesystem, but they don't show any speed gains. I wouldn't really expect them to, since a kernel compile is so CPU bound that doing some I/O a bit faster would get swamped by the compiler.
For the single-disk case (which is all I can easily test), a benchmark would only find improvement if it issued async reads/writes to a single file, or if it read/wrote to multiple files. I suppose I should just punt and start with the obvious ones like bonnie, dig into the results, interpret them, throw them out there and let people tell me what they think.
The biggest interest has been in using this stuff with ccd or RAIDframe. I don't have the h/w to test either without busting up my test environment, so, hey, if anybody reading [this] article could run some benchmarks on a multi-disk setup like that, I'll be way happy to put them up on the web page at http://www.blasted-heath.com/nbsd/cluster/.
From: Chris Jepeway
To: tech-perform AT netbsd.org
Subject: Disk-level Transaction Clustering
Date: Sat, 07 Sep 2002 02:42:19 -0400

I've whacked disk-level transaction clustering into the sd and wd drivers of -current from about 3 days ago. This is before the gehenna-devsw merge, so I dunno whether the patches I've put up will apply as of today. I've only tested the sd driver; I haven't yet tried compiling the wd driver with clustering enabled. And only on an FFS partition w/o softdeps enabled.

For a simple benchmark, I used ssh/pax to copy a full-ish /usr/src/sys tree (it had the kernels from a release build in it) onto a test machine where sd clustering was enabled. About 99K total xfers were done to disk. Of these, about 1300 were clusters built by the sd driver. These 1300 clusters held about 5100 buffers that would have been individually scheduled if the driver weren't combining them. So, clustering saved about 3800 xfers, roughly a 4% savings.

I then built the GENERIC kernel with clustering disabled. About 12800 xfers were done during the build. Building GENERIC again with clustering turned on did about 12200 xfers, where 1000 buffers or so were combined into 300 clusters. That's about a 5% savings. CPU time and wall time for both compiles were comparable.

None of this is earth-shattering news, of course, but it does demonstrate that clusters can be missed at the FS/VM interface. And that's when only one process uses the disk in question.

If someone points me at some benchmarks enjoyed by the powers that be, I'll be glad to generate harder numbers. See a msg posted a few days back on what/how I'd test.

Further info and code/patches at

    http://www.blasted-heath.com/nbsd/cluster/

I had to hand edit the patch to remove some lines in sys/conf/files and the like that weren't relevant to clustering, so there's a chance that part of the patch might not apply cleanly. If you try it, let me know how it goes.

Chris
From: Jason R Thorpe
Subject: Re: Disk-level Transaction Clustering
Date: Sat, 7 Sep 2002 09:32:55 -0700

On Sat, Sep 07, 2002 at 02:42:19AM -0400, Chris Jepeway wrote:

> Further info and code/patches at
>
>     http://www.blasted-heath.com/nbsd/cluster/

This is pretty cool stuff, but I have some suggestions on how to make it better :-)

You really don't want to use a VM map to make the clusters. This can have painful side-effects on some architectures, esp. since you are using kmappings ... this is basically not going to work on any platform which has a virtually-indexed cache.

Instead, I suggest using uios to describe the clusters. Make a flag called B_UIO for the buf structure, and when that is set, b_data points to a uio structure. When you build a cluster, allocate a uio and an iovec array (maybe always allocate an iovec array large enough to handle up to some max_cluster requests).

...then modify the SCSI HBA drivers to use bus_dmamap_load_uio instead of bus_dmamap_load when they see B_UIO. Note that there is already some #if 0'd code for this in some HBA drivers (historical reasons).

It would also be nice if the building of clusters were hidden inside the BUFQ interface. I suggest adding a new flag when the bufq is allocated, BUFQ_CLUSTER, or something.

Now, for devices which aren't using bus_dma, we could just avoid setting BUFQ_CLUSTER in those cases. They won't get the benefit of clustering, but they will also continue to work.

--
Jason R. Thorpe
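For readers following along, here is a sketch of what Jason is proposing; it is not committed NetBSD code. B_UIO (and the flag value chosen here) is the hypothetical flag from his mail, cluster_bufs_uio() is an invented helper name, and error handling and I/O completion are omitted:

    #include <sys/param.h>
    #include <sys/buf.h>
    #include <sys/malloc.h>
    #include <sys/systm.h>
    #include <sys/uio.h>

    #define B_UIO 0x01000000    /* proposed flag; value illustrative */

    /*
     * Sketch of the suggested scheme: describe the cluster with a
     * uio/iovec pair instead of remapping pages.
     */
    static struct buf *
    cluster_bufs_uio(struct buf **bps, int n)
    {
        struct buf *cbp;
        struct uio *uio;
        struct iovec *iov;
        int i;

        cbp = malloc(sizeof(*cbp), M_DEVBUF, M_WAITOK);
        memset(cbp, 0, sizeof(*cbp));
        uio = malloc(sizeof(*uio) + n * sizeof(*iov), M_DEVBUF, M_WAITOK);
        memset(uio, 0, sizeof(*uio));
        iov = (struct iovec *)(uio + 1);

        for (i = 0; i < n; i++) {
            iov[i].iov_base = bps[i]->b_data;
            iov[i].iov_len = bps[i]->b_bcount;
            uio->uio_resid += bps[i]->b_bcount;
        }
        uio->uio_iov = iov;
        uio->uio_iovcnt = n;
        uio->uio_segflg = UIO_SYSSPACE;
        uio->uio_rw = (bps[0]->b_flags & B_READ) ? UIO_READ : UIO_WRITE;

        cbp->b_data = (void *)uio;      /* per the proposal: B_UIO set */
        cbp->b_bcount = uio->uio_resid;
        cbp->b_blkno = bps[0]->b_blkno;
        cbp->b_flags = bps[0]->b_flags | B_UIO;
        return cbp;
    }

    /*
     * An HBA driver would then choose its load routine on B_UIO:
     *
     *    if (bp->b_flags & B_UIO)
     *        error = bus_dmamap_load_uio(sc->sc_dmat, map,
     *            (struct uio *)bp->b_data, BUS_DMA_NOWAIT);
     *    else
     *        error = bus_dmamap_load(sc->sc_dmat, map,
     *            bp->b_data, bp->b_bcount, NULL, BUS_DMA_NOWAIT);
     */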
From: Chris Jepeway
Subject: Re: Disk-level Transaction Clustering
Date: Thu, 12 Sep 2002 16:57:14 -0400

> This is pretty cool stuff, but I have some suggestions on how to make
> it better :-)

Cool.

> You really don't want to use a VM map to make the clusters. This can have
> painful side-effects on some architectures, esp. since you are using
> kmappings ... this is basically not going to work on any platform which has
> a virtually-indexed cache.

OK. That's b/c these machines can't handle VA aliases in the cache, so aliases aren't allowed on them? Is there some way to inval the cache, in that case? And are there other reasons why it won't work? I ask to both understand and to try to support clusters on those configs that don't bus_dma.

> Instead, I suggest using uios to describe the clusters. Make a flag
> called B_UIO for the buf structure, and when that is set, b_data points
> to a uio structure. When you build a cluster, allocate a uio and an iovec
> array (maybe always allocate an iovec array large enough to handle up to
> some max_cluster requests).
>
> ...then modify the SCSI HBA drivers to use bus_dmamap_load_uio instead
> of bus_dmamap_load when they see B_UIO. Note that there is already some
> #if 0'd code for this in some HBA drivers (historical reasons).

A buddy of mine had pointed out that code and suggested this approach, too. Makes sense to me, so that's what I'll aim for.

> It would also be nice if the building of clusters were hidden inside the
> BUFQ interface. I suggest adding a new flag when the bufq is allocated,
> BUFQ_CLUSTER, or something.

I think I like this, too.

> Now, for devices which aren't using bus_dma, we could just avoid setting
> BUFQ_CLUSTER in those cases. They won't get the benefit of clustering, but
> they will also continue to work.

Hm. I could have 2 clustering methods, one that uses bus_dma, one that uses VM tricks. Prefer bus_dma over VM, prefer VM over nothing, and force nothing if the VM h/w for the system can't dtrt? If BUFQ_CLUSTER is set on machines that can't support them, it'd just be ignored.

Chris
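The preference order Chris sketches could be captured in a few lines; all of the names below are purely illustrative:

    /* Illustrative only: pick the best clustering method the
     * hardware supports, in Chris's proposed preference order. */
    enum cluster_method { CL_NONE, CL_VM, CL_BUSDMA };

    static enum cluster_method
    choose_cluster_method(int has_busdma_sg, int va_aliases_safe)
    {
        if (has_busdma_sg)      /* scatter/gather via bus_dma(9) */
            return CL_BUSDMA;
        if (va_aliases_safe)    /* VM remapping won't leave stale cache */
            return CL_VM;
        return CL_NONE;         /* BUFQ_CLUSTER quietly ignored */
    }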
From: Chuck Silvers
Subject: Re: Disk-level Transaction Clustering
Date: Sat, 7 Sep 2002 12:54:04 -0700

hi,

hmm, that's interesting, could you find out what was in the blocks that you were able to cluster? I'd guess it's inode data, but it could be something else.

it's kind of disappointing that there was no measurable improvement in performance, though. could you try experimenting with ccd or raidframe and see if it helps noticeably in that context? it'll probably help if you use a machine with a slower CPU as well.

my point with trying to see a performance improvement is that if we think there should be a performance improvement but there isn't one, then maybe something isn't working correctly.

-Chuck