Hello, all. These patches are available in the following git tree but it's on top of the previous blk-map-related-fixes patchset which needs some updating, so this posting is just for review and comments. This patchset needs to spend quite some time in RCs even if it gets acked eventually, so definitely no .30 material. Please also note that I haven't updated the comment about bio chaining violating queue limit, so please ignore that part. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git blk-map http://git.kernel.org/?p=linux/kernel/git/tj/misc.git;a=shortlog;h=blk-map The goals of this patchset are * Clean up and refactor blk/bio PC map API interface and implementation. * Add blk_rq_map_kern_sg() so that in-kernel user can issue multi-segment PC request. scsi_lib.c currently does this directly by building rq itself. * This is also in the general direction of making block layer interface more strictk. For PC requests, users only get to deal with requests and buffers. This patchset uses sg list to pass multi-segment memory area into blk_rq_map_kern_sg() and also uses sgl to simplify internal implementation. Boaz Harrosh has some objections against both of them. For the former, I can't really think of using anything other than sgl for the API. For the second point, the use of sgl inside bio for map implementation does add overhead for blk_rq_map_user*() calls because it first maps to sgl and then gets interpreted to bvec. The reason that sgl is used for implementation is because it has richer API, mainly sg_mapping_iter. So, it basically comes down to whether reducing the overhead of alloc/copying and freeing sgl one more time for each SG_IO call is worth the added complexity. I don't really think it will show up anywhere but if this is a problem, switching over to bvec for internal implementation wouldn't be too difficult although then copying back and forth would require some hairy code. Also, please note that the new code is ...
Impact: cleanup
bio is about to go through major update. Take the chance and cleanup
bio.h such that
* forward declaration of structs are in one place.
* collect bio_copy/map*() prototypes in one place.
* function prototypes don't have unncessary extern in front of them
and have their parameters named. (dropping extern makes it much
easier to have named parameters)
* dummy integrity APIs are inline functions instead of macros so that
type check still occurs and unused variable warnings aren't
triggered.
* fix return values of dummy bio_integrity_set/get_tag(),
bio_integrity_prep() and bio_integrity_clone().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
include/linux/bio.h | 178 +++++++++++++++++++++++++++++++--------------------
1 files changed, 108 insertions(+), 70 deletions(-)
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 8647dd9..4bf7442 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -354,56 +354,64 @@ struct bio_pair {
atomic_t cnt;
int error;
};
-extern struct bio_pair *bio_split(struct bio *bi, int first_sectors);
-extern void bio_pair_release(struct bio_pair *dbio);
-extern struct bio_set *bioset_create(unsigned int, unsigned int);
-extern void bioset_free(struct bio_set *);
-
-extern struct bio *bio_alloc(gfp_t, int);
-extern struct bio *bio_kmalloc(gfp_t, int);
-extern struct bio *bio_alloc_bioset(gfp_t, int, struct bio_set *);
-extern void bio_put(struct bio *);
-extern void bio_free(struct bio *, struct bio_set *);
-
-extern void bio_endio(struct bio *, int);
struct request_queue;
-extern int bio_phys_segments(struct request_queue *, struct bio *);
-
-extern void __bio_clone(struct bio *, struct bio *);
-extern struct bio *bio_clone(struct bio *, gfp_t);
-
-extern void bio_init(struct bio *);
-
-extern int bio_add_page(struct bio *, struct page *, unsigned int,unsigned int);
-extern int bio_add_pc_page(struct request_queue *, struct bio *, struct page *,
- unsigned int, ...Impact: cleanup
blk-map and bio use sg_iovec for addr-len segments although there
isn't anything sg-specific about the API. This is mostly due to
historical reasons. sg_iovec is by definition identical to iovec.
Use iovec instead. This removes bogus dependency on scsi sg and will
allow use of iovec helpers.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 5 ++---
block/scsi_ioctl.c | 8 +++-----
fs/bio.c | 23 +++++++++++------------
include/linux/bio.h | 6 +++---
include/linux/blkdev.h | 8 ++++----
5 files changed, 23 insertions(+), 27 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 29aa60d..4f0221a 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -5,7 +5,6 @@
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
-#include <scsi/sg.h> /* for struct sg_iovec */
#include "blk.h"
@@ -64,7 +63,7 @@ static int __blk_rq_unmap_user(struct bio *bio)
* unmapping.
*/
int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
- struct rq_map_data *map_data, struct sg_iovec *iov,
+ struct rq_map_data *map_data, struct iovec *iov,
int iov_count, unsigned int len, gfp_t gfp_mask)
{
struct bio *bio = ERR_PTR(-EINVAL);
@@ -130,7 +129,7 @@ int blk_rq_map_user(struct request_queue *q, struct request *rq,
struct rq_map_data *map_data, void __user *ubuf,
unsigned long len, gfp_t gfp_mask)
{
- struct sg_iovec iov;
+ struct iovec iov;
iov.iov_base = ubuf;
iov.iov_len = len;
diff --git a/block/scsi_ioctl.c b/block/scsi_ioctl.c
index c8e8868..73cfd91 100644
--- a/block/scsi_ioctl.c
+++ b/block/scsi_ioctl.c
@@ -289,7 +289,7 @@ static int sg_io(struct request_queue *q, struct gendisk *bd_disk,
if (hdr->iovec_count) {
const int size = sizeof(struct sg_iovec) * hdr->iovec_count;
size_t iov_data_len;
- struct sg_iovec *iov;
+ struct iovec *iov;
iov = kmalloc(size, GFP_KERNEL);
if (!iov) {
@@ ...OK, The actual one user in sg.c passes a void*, so no casts are
needed. (I couldn't find where are the type-casts of old users)
Should we make this a part of a bigger cleanup that removes
sg_iovec, from Kernel altogether and only makes a #define for
user-mode?
BTW:
user-mode scsi/sg.h does not come from the Kernels exported
headers. It comes with the gcc distribution.
If we remove it alltogether it will not affect anybody.
If you want I can help with this little chore?
--
Hello, Sure, that would be a nice cleanup. If dropping sg_iovec doesn't affect userland, I think it would better to just drop it. Thanks. -- tejun --
Impact: cleanup
bio confusingly uses @write_to_vm and @reading for data directions,
both of which mean the opposite of the usual block/bio convention of
using READ and WRITE w.r.t. IO devices. The only place where the
inversion is necessary is when caling get_user_pages_fast() in
bio_copy_user_iov() as the gup uses the VM convention of read/write
w.r.t. VM.
This patch converts all bio functions to use READ/WRITE rw parameter
and let the one place where inversion is necessary to rw == READ.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 10 +++++-----
fs/bio.c | 50 +++++++++++++++++++++++++-------------------------
include/linux/bio.h | 18 +++++++++---------
3 files changed, 39 insertions(+), 39 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index b0b65ef..29aa60d 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -68,15 +68,15 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
int iov_count, unsigned int len, gfp_t gfp_mask)
{
struct bio *bio = ERR_PTR(-EINVAL);
- int read = rq_data_dir(rq) == READ;
+ int rw = rq_data_dir(rq);
if (!iov || iov_count <= 0)
return -EINVAL;
if (!map_data)
- bio = bio_map_user_iov(q, NULL, iov, iov_count, read, gfp_mask);
+ bio = bio_map_user_iov(q, NULL, iov, iov_count, rw, gfp_mask);
if (bio == ERR_PTR(-EINVAL))
- bio = bio_copy_user_iov(q, map_data, iov, iov_count, read,
+ bio = bio_copy_user_iov(q, map_data, iov, iov_count, rw,
gfp_mask);
if (IS_ERR(bio))
return PTR_ERR(bio);
@@ -177,7 +177,7 @@ EXPORT_SYMBOL(blk_rq_unmap_user);
int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
unsigned int len, gfp_t gfp_mask)
{
- int reading = rq_data_dir(rq) == READ;
+ int rw = rq_data_dir(rq);
int do_copy = 0;
struct bio *bio;
@@ -188,7 +188,7 @@ int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
do_copy = !blk_rq_aligned(q, kbuf, len) || ...Hi one more nit picking just if you are at it. If you want I can do the work and send it to you so you can squash it into this patch. can we pleas have an inline that does that? Like bio_set_dir()? and change all users. You will be surprised how many there are. It gives me an hart attack every time I have to write yet another Boaz --
Things like this are actually what I'm trying to clean up. If you look at the whole series, at the end, there remains only one place which sets the flag in the blk/bio map paths and all users of blk_rq_map_*() interface couldn't care less about the flag because they provide only the necessary information through strictly defined API. Thanks. -- tejun --
Right, I know. and its great. I'm saying, all these other block-layer places and all these places above block like stacking drivers and filesystems That use generic_make_request and friends. If at it these can be I'll do it, no problems. Would you push it with your patchset? Thanks Boaz --
Impact: better atomic mapping handling
sg-miter supported atomic mapping using single flag - SG_MITER_ATOMIC.
It implicly used KM_BIO_SRC_IRQ and required irq to be disabled for
each iteration to protect the kernel mapping. This was too limiting
and didn't allow multiple iterators to be used concurrently (e.g. for
sgl -> sgl copying).
This patch adds a new interface sg_miter_start_atomic() which takes
km_type argument and drops @flags from sg_miter_start() so that
km_type can be specified by the caller for atomic iterators.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Pierre Ossman <drzeus-mmc@drzeus.cx>
---
drivers/mmc/host/sdhci.c | 4 +-
include/linux/scatterlist.h | 11 +++++++--
lib/scatterlist.c | 49 ++++++++++++++++++++++++++++++++----------
3 files changed, 47 insertions(+), 17 deletions(-)
diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
index accb592..559aca7 100644
--- a/drivers/mmc/host/sdhci.c
+++ b/drivers/mmc/host/sdhci.c
@@ -704,8 +704,8 @@ static void sdhci_prepare_data(struct sdhci_host *host, struct mmc_data *data)
}
if (!(host->flags & SDHCI_REQ_USE_DMA)) {
- sg_miter_start(&host->sg_miter,
- data->sg, data->sg_len, SG_MITER_ATOMIC);
+ sg_miter_start_atomic(&host->sg_miter,
+ data->sg, data->sg_len, KM_BIO_SRC_IRQ);
host->blocks = data->blocks;
}
diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h
index e599698..2249caa 100644
--- a/include/linux/scatterlist.h
+++ b/include/linux/scatterlist.h
@@ -6,6 +6,7 @@
#include <linux/mm.h>
#include <linux/string.h>
#include <asm/io.h>
+#include <asm/kmap_types.h>
struct sg_table {
struct scatterlist *sgl; /* the list */
@@ -254,11 +255,15 @@ struct sg_mapping_iter {
struct scatterlist *__sg; /* current entry */
unsigned int __nents; /* nr of remaining entries */
unsigned int __offset; /* offset within sg */
- unsigned int __flags;
+ unsigned int __flags; /* internal flags */
+ enum ...Impact: more modular implementation
Break down bio_copy_user_iov() into the following steps.
1. bci and page allocation
2. copying data if WRITE
3. create bio accordingly
bci is now responsible for managing any copy related resources. Given
source iov, bci_create() allocates bci and fills it with enough pages
to cover the source iov. The allocated pages are described with a
sgl.
Note that new allocator always rounds up rq_map_data->offset to page
boundary to simplify implementation and guarantee enough DMA padding
area at the end. As the only user, scsi sg, always passes in zero
offset, this doesn't cause any actual behavior difference. Also,
nth_page() is used to walk to the next page rather than directly
adding to struct page *.
Copying back and forth is done using bio_memcpy_sgl_uiov() which is
implemented using sg mapping iterator and iov iterator.
The last step is done using bio_create_from_sgl().
This patch by itself adds one more level of indirection via sgl and
more code but components factored out here will be used for future
code refactoring.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/bio.c | 465 +++++++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 293 insertions(+), 172 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index 1cd97e3..f13aef0 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -26,6 +26,7 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/blktrace_api.h>
+#include <linux/scatterlist.h>
#include <linux/pfn.h>
#include <trace/block.h>
@@ -653,102 +654,303 @@ int bio_add_page(struct bio *bio, struct page *page, unsigned int len,
return __bio_add_page(q, bio, page, len, offset, q->max_sectors);
}
-struct bio_copy_info {
- struct bio_vec *iovecs;
- struct iovec *src_iov;
- int src_count;
- int is_our_pages;
-};
+/**
+ * bio_sgl_free_pages - free pages from sgl
+ * @sgl: target sgl
+ * @nents: nents for @sgl
+ *
+ * Free all pages pointed to by @sgl. Note ...Hi dear Tejun I've looked hard and deep into your patchset, and I would like to suggest an improvement. [Option 1] What your code is actually using from sgl-code base is: for_each_sg sg_mapping_iter and it's sg_miter_start, sg_miter_next ... (what else) I would like if you can define above for bvec(s) just the way you like them. Then code works directly on the destination bvect inside the final bio. One less copy no intermediate allocation, and no kmalloc of bigger-then-page buffers. These are all small inlines, duplicating those will not affect Kernel size at all. You are not using the chaining ability of sgl(s) so it can be simplified. You will see that not having the intermediate copy simplifies the code even more. Since no out-side user currently needs sgl(s) no functionality is lost. [Option 2] Keep pointer to sgl and not bvec at bio, again code works on final destination. Later users of block layer that call blk_rq_fill_sgl (blk_rq_map_sg) will just get a copy of the pointer and another allocation and copy is gained. This option will spill outside of the current patches scope. Into bvec hacking code. I do like your long term vision of separating the DMA part from the virtual part of scatterlists. Note how they are actually two disjoint lists altogether. After the dma_map does its thing the dma physical list might be shorter then virtual and sizes might not correspond at all. The dma mapping code regards the dma part as an empty list that gets appended while processing, any segments match is accidental. (That is: inside the scatterlist the virtual address most probably does not match the dma address) So [option 1] matches more closely to that vision. Historically code was doing Many-sources => scatterlist => biovec => scatterlist => dma-scatterlist Only at 2.6.30 we can say that we shorten a step to do: Many-sources => biovec => scatterlist => dma-scatterlist Now you want to return the extra step, I hate it. [Option 2] can make that ...
Hello, Boaz. Yeah, we can switch to dealing with bvecs instead of sgls. For kernel mappings, it wouldn't make any difference. For user single mappings, it'll remove an intermediate step. For user sg mappings, if we make gup use sgl, it won't make much difference. Frankly, for copy paths, I don't think all this matters at all. If each request is small, other overheads will dominate. For large requests, copying to and from sgl isn't something worth worrying about. For mapping paths, if we change gup to use sgl (we really can't make it swallow bvec), I don't think it will make whole lot of difference one way or the other. It might end up using slightly more cachelines, but that will be about it. I do agree using bvec as internal page carrier would be nicer (sans having to implement stuff manually) but I'm not convinced whether It's not my favorite part either but I'm just not convinced it matters Yeah, this is the root cause. We don't have a common API to carry segments so we end up doing impedence matching when crossing boundaries. Drivers expect sgl at the interface. Filesystems obviously expect bio with bvecs. gup expects array of pages (this one I'm not hugely against using bvec inside. I just didn't see much difference and went for something easier, so yeah, it definitely is an option. Then, that's a bug. Yeah, high order allocations fail more easily as time passes. But low order (1 maybe 2) allocations aren't too bad. If it matters we can make bio_kmalloc() use vmalloc() for bvecs if it goes over PAGE_SIZE but again given that nobody reported the spurious GFP_DMA in bounce_gfp which triggers frenzy OOM killing spree for large SG_IOs, I don't think this matters all that much. Thanks. -- tejun --
"gup use sgl" was just a stupid suggestion to try eliminate one more allocation. Realy Realy stupid. for gup there is a much better, safer, way. Just allocate on the stack a constant size say 64 pointers and add an inner-loop on the 64 pages. The extra code loop is much cheaper then an allocation and is one less point of failure. So now that we got gup out of the way, will you go directly to bvecs? It is. Again, I tested that. Each new allocation in the order of the number of pages in a request adds 3-5% on a RAM I/O gup to use sgl, is out of the way, that was stupid idea to try eliminate We can reach a point where there is no intermediate dynamic alloctions and no point of failures, but for the allocation of the final bio+biovec. And that "implement stuff manually" will also enable internal bio code cleanups since now biovec is that much smarter. Much much higher, not just 100% higher. And it will kill the Fixing the bug will not help, the allocation will fail and the IO will No, we are not talking about 1 or 2. Today since Jens put the scatterlist chaining we can have lots and lots of them chained together. At the time Jens said that bio had the chaining option for a long time, only scatterlists No that is a regression, vmaloc is an order of a magnitude slower then just plain BIO chaining This BUG was only for HW, ie ata. But for stuff like iscsi FB sas they did fine. The fact of the matter is that people, me included, run in systems with very large request for a while now, large like 4096 pages which is I love your cleanups, and your courage, which I don't have. I want to help in any way I can. If you need just tell me what, and I'll be glad to try or test anything. This is fun I like to work and think on these things Thanks Boaz --
Hello, Boaz. As said before, I don't think exposing bio at the driver interface is a good idea copy or not. For internal implementation, if there's Sure, if you do things against ram, anything will show up. Do you That is not the only measure. Things are always traded off using multiple evaluation metrics. The same goes with the _NO_ allocation There is no need to obsess about no point of failure. Obsessing about Didn't know nommu would kill high order allocation or were you talking Under enough memory pressure, Non-fs IOs are gonna fail somewhere along the line. If you don't like that, use mempool backed allocation or preallocated data structures. Also, I believe it is generally Alright, let's than chain bios but let's please do it inside bio The bug was for all controllers which couldn't do 64bit DMA on 64bit The patchset was written the way it's written because I couldn't see any pressing performance requirement for the current existing users. Optimizing for SG_IO against memory is simply not a good idea. If you have in-kernel users which often require large transfers, sure, chaining bio would be better. Still, let's do it _inside_ bio. Would your use case be happy with something like blk_rq_map_kern_bvec(rq, bvec) which can append to rq? If we convert internal implementation over to bvec, blk_rq_map_kern_sg() can be sg->bvec converting wrapper Thanks a lot but it's not like the code and Jens are scary monsters I Great, your reviews are comments are very helpful, so please keep them coming. :-) Jens, Tomo, what do you guys think? Thanks. -- tejun --
I want to draw your attention, to something you might be forgetting sorry not to make it clearer before. There is no users for blk_rq_map_kern_sgl(), and no need for any such driver interface. This is because there is a pending patchest done by TOMO, which is in it's 4th incarnation by now, which removes the very last users of scsi_execute_async() and it's use of sgl. together with all that mess at scsi_lib.c. Removing scsi_execute_async() will remove the only potential user of blk_rq_map_kern_sgl(). So it can be killed at birth. The pachset was tested and debugged by all scsi ULD maintainers and is in very stable state. I hope it can make it into 2.6.30 but for sure it will make it into 2.6.31, which is the time frame we are talking about. James any chance this patchset can make 2.6.30? Please note that the important historical users of scsi_execute_async() sg.c and sr.c don't use it already in 2.6.28-29, and the last bugs of that change, where wrinkled out. Reads from some SSDs/FLASH is just as fast as RAM. Battery backup RAM is very common. Pete Wyckoff reached speeds over RDMA which come close I agree completely. "no mess" - first priority, "no point of failure" - far behind second. "Lots of work" - not considered I agree, exactly my point, eliminate any extra allocation is the Yes, I completely agree. This can/should be internal bio business. Doing it this way will also add robustness to other users of bio like filesystems raid-engines and staking drivers. When they come they come in groups. I'm talking about steady stream of these that come for minutes. think any streaming application like backup or plain cp of very large files, video prepossessing, ... See above I answered that. No need for blk_rq_map_kern_bvec() nor blk_rq_map_kern_sg(), currently. OSD absolutely needs the direct use of bios via the fs route, and the other users are going out fast. And yes bio chaining at the bio level is a dream come true for me, and it drops 3 ...
Hello, Boaz. One quick question. Don't the above paths map kernel pages to a request? Thanks. -- tejun --
In that case it would go through blk_rq_map_kern_sg(), so unless you wanna change the blk_rq_map_kern_sg() interface such that it uses bvec, the sgl overhead doesn't matter anyway. Of course, if it has been buidling bios directly, that is a different story. Thanks. -- tejun --
Impact: add multi segment support to kernel bio mapping
Implement bio_{map|copy}_kern_sgl() and implement
bio_{map|copy}_kern() in terms of them. As all map/copy helpers
support sgl, it's quite simple. The sgl versions will be used to
extend blk_rq_map_kern*() interface.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/bio.c | 254 +++++++++++++++++++++++++++++++-------------------
include/linux/bio.h | 5 +
2 files changed, 162 insertions(+), 97 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index 04bc5c2..9c921f9 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -26,7 +26,6 @@
#include <linux/mempool.h>
#include <linux/workqueue.h>
#include <linux/blktrace_api.h>
-#include <linux/scatterlist.h>
#include <linux/pfn.h>
#include <trace/block.h>
@@ -899,13 +898,55 @@ static int bio_memcpy_sgl_uiov(struct scatterlist *sgl, int nents,
}
WARN_ON_ONCE(slen); /* iov too short */
}
- WARN_ON_ONCE(iov_iter_count(&ii)); /* bio too short */
+ WARN_ON_ONCE(iov_iter_count(&ii)); /* sgl too short */
done:
sg_miter_stop(&si);
return ret;
}
/**
+ * bio_memcpy_sgl_sgl - copy data betweel two sgls
+ * @dsgl: destination sgl
+ * @dnents: number of entries in @dsgl
+ * @ssgl: source sgl
+ * @snents: number of entries in @ssgl
+ * @d_km_type; km_type to use for mapping @dsgl
+ * @s_km_type; km_type to use for mapping @ssgl
+ *
+ * Copy data from @ssgl to @dsgl. The areas should be of the
+ * same size.
+ */
+static void bio_memcpy_sgl_sgl(struct scatterlist *dsgl, int dnents,
+ struct scatterlist *ssgl, int snents,
+ enum km_type d_km_type, enum km_type s_km_type)
+{
+ struct sg_mapping_iter si, di;
+
+ sg_miter_start_atomic(&di, dsgl, dnents, d_km_type);
+ sg_miter_start_atomic(&si, ssgl, snents, s_km_type);
+
+ while (sg_miter_next(&di)) {
+ void *daddr = di.addr;
+ size_t dlen = di.length;
+
+ while (dlen && sg_miter_next(&si)) {
+ size_t copy = min(dlen, si.length);
+
+ memcpy(daddr, si.addr, ...Impact: cleanup
Collapse double underbar prefixed internal functions which have only
single user into their respective call sites. This prepares for
reimplementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/bio.c | 129 +++++++++++++++++++++++--------------------------------------
1 files changed, 49 insertions(+), 80 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index f13aef0..4540afc 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1044,10 +1044,20 @@ struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *md,
return bio_copy_user_iov(q, md, &iov, 1, rw, gfp);
}
-static struct bio *__bio_map_user_iov(struct request_queue *q,
- struct block_device *bdev,
- struct iovec *iov, int count, int rw,
- gfp_t gfp)
+/**
+ * bio_map_user_iov - map user iovec table into bio
+ * @q: the struct request_queue for the bio
+ * @bdev: destination block device
+ * @iov: the iovec.
+ * @count: number of elements in the iovec
+ * @rw: READ or WRITE
+ * @gfp: memory allocation flags
+ *
+ * Map the user space address into a bio suitable for io to a block
+ * device. Returns an error pointer in case of error.
+ */
+struct bio *bio_map_user_iov(struct request_queue *q, struct block_device *bdev,
+ struct iovec *iov, int count, int rw, gfp_t gfp)
{
int i, j;
size_t tot_len = 0;
@@ -1141,6 +1151,15 @@ static struct bio *__bio_map_user_iov(struct request_queue *q,
bio->bi_bdev = bdev;
bio->bi_flags |= (1 << BIO_USER_MAPPED);
+
+ /*
+ * subtle -- if __bio_map_user() ended up bouncing a bio,
+ * it would normally disappear when its bi_end_io is run.
+ * however, we need it for the unmap, so grab an extra
+ * reference to it
+ */
+ bio_get(bio);
+
return bio;
out_unmap:
@@ -1180,38 +1199,15 @@ struct bio *bio_map_user(struct request_queue *q, struct block_device *bdev,
}
/**
- * bio_map_user_iov - map user iovec table into bio
- * @q: the struct request_queue for the bio
- * @bdev: ...Impact: add new features to internal functions to prepare for future changes
Expand bci such that it supports sgl as source and implement
bio_memcpy_sgl_sgl() which can copy data between two sgls. These will
be used to implement blk_rq_copy/map_kern_iov().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/bio.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 74 insertions(+), 8 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index 1ca8b16..04bc5c2 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -758,8 +758,10 @@ err:
struct bio_copy_info {
struct scatterlist *copy_sgl; /* one sg per page */
struct iovec *src_iov; /* source iovec from userland */
+ struct scatterlist *src_sgl; /* source sgl */
int copy_nents; /* #entries in copy_sgl */
int src_count; /* #entries in src_iov */
+ int src_nents; /* #entries in src_sgl */
size_t len; /* total length in bytes */
int is_our_pages; /* do we own copied pages? */
};
@@ -776,6 +778,7 @@ static void bci_destroy(struct bio_copy_info *bci)
bio_sgl_free_pages(bci->copy_sgl, bci->copy_nents);
kfree(bci->copy_sgl);
kfree(bci->src_iov);
+ kfree(bci->src_sgl);
kfree(bci);
}
@@ -783,6 +786,8 @@ static void bci_destroy(struct bio_copy_info *bci)
* bci_create - create a bci
* @src_iov: source iovec
* @src_count: number of entries in @src_iov
+ * @src_sgl: source sgl
+ * @src_nents: number of entries in @src_sgl
* @gfp: gfp for data structure allocations
* @page_gfp: gfp for page allocations
* @md: optional preallocated page pool
@@ -795,28 +800,47 @@ static void bci_destroy(struct bio_copy_info *bci)
* Pointer to the new bci on success, NULL on failure.
*/
static struct bio_copy_info *bci_create(struct iovec *src_iov, int src_count,
- gfp_t gfp, gfp_t page_gfp,
- struct rq_map_data *md)
+ struct scatterlist *src_sgl, int src_nents,
+ gfp_t gfp, gfp_t page_gfp,
+ struct rq_map_data *md)
{
struct bio_copy_info *bci;
+ struct ...Impact: hack removal
SCSI needs to map sgl into rq for kernel PC requests; however, block
API didn't have such feature so it used its own rq mapping function
which hooked into block/bio internals and is generally considered an
ugly hack. The private function may also produce requests which are
bigger than queue per-rq limits.
Block blk_rq_map_kern_sgl(). Kill the private implementation and use
it.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
drivers/scsi/scsi_lib.c | 108 +----------------------------------------------
1 files changed, 1 insertions(+), 107 deletions(-)
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 3196c83..3fa5589 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -296,112 +296,6 @@ static void scsi_end_async(struct request *req, int uptodate)
__blk_put_request(req->q, req);
}
-static int scsi_merge_bio(struct request *rq, struct bio *bio)
-{
- struct request_queue *q = rq->q;
-
- bio->bi_flags &= ~(1 << BIO_SEG_VALID);
- if (rq_data_dir(rq) == WRITE)
- bio->bi_rw |= (1 << BIO_RW);
- blk_queue_bounce(q, &bio);
-
- return blk_rq_append_bio(q, rq, bio);
-}
-
-static void scsi_bi_endio(struct bio *bio, int error)
-{
- bio_put(bio);
-}
-
-/**
- * scsi_req_map_sg - map a scatterlist into a request
- * @rq: request to fill
- * @sgl: scatterlist
- * @nsegs: number of elements
- * @bufflen: len of buffer
- * @gfp: memory allocation flags
- *
- * scsi_req_map_sg maps a scatterlist into a request so that the
- * request can be sent to the block layer. We do not trust the scatterlist
- * sent to use, as some ULDs use that struct to only organize the pages.
- */
-static int scsi_req_map_sg(struct request *rq, struct scatterlist *sgl,
- int nsegs, unsigned bufflen, gfp_t gfp)
-{
- struct request_queue *q = rq->q;
- int nr_pages = (bufflen + sgl[0].offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned int data_len = bufflen, len, bytes, off;
- struct scatterlist *sg;
- struct page ...Impact: cleanup blk-map and bio are about to go through major update. Make the following renames to ease future changes. * more concise/wieldy names: s/gfp_mask/gfp/, s/map_data/md/, s/iov_count/count/. Note that count is and will continue to be used only for number of entries in iovec. Similarly, nents will be used for number of entries in sgl. * less confusing names: bio_map_data doesn't have much to do with mapping per-se but is aux info for copying. Also, it's very confusing with rq_map_data which BTW is reserved pool of pages for copying. Rename bio_map_data to bio_copy_info and everything related to bci_*(). This part of API is gonna receive major overhaul. The names and semantics will get clearer with future changes. Signed-off-by: Tejun Heo <tj@kernel.org> --- block/blk-map.c | 37 ++++---- fs/bio.c | 231 +++++++++++++++++++++++------------------------ include/linux/bio.h | 37 ++++---- include/linux/blkdev.h | 4 +- 4 files changed, 150 insertions(+), 159 deletions(-) diff --git a/block/blk-map.c b/block/blk-map.c index 4f0221a..eb206df 100644 --- a/block/blk-map.c +++ b/block/blk-map.c @@ -43,11 +43,11 @@ static int __blk_rq_unmap_user(struct bio *bio) * blk_rq_map_user_iov - map user data to a request, for REQ_TYPE_BLOCK_PC usage * @q: request queue where request should be inserted * @rq: request to map data to - * @map_data: pointer to the rq_map_data holding pages (if necessary) + * @md: pointer to the rq_map_data holding pages (if necessary) * @iov: pointer to the iovec - * @iov_count: number of elements in the iovec + * @count: number of elements in the iovec * @len: I/O byte count - * @gfp_mask: memory allocation flags + * @gfp: memory allocation flags * * Description: * Data will be mapped directly for zero copy I/O, if possible. Otherwise @@ -63,20 +63,19 @@ static int __blk_rq_unmap_user(struct bio *bio) * unmapping. */ int ...
Impact: API cleanup
blk_rq_map_user_iov() took @len parameter which contains duplicate
information as the total length is available as the sum of all iov
segments. This doesn't save anything either as the mapping function
should walk the whole iov on entry to check for alignment anyway.
Remove the superflous parameter.
Removing the superflous parameter removes the pathological corner case
where the caller passes in shorter @len than @iov but @iov mappings
ends up capped due to queue limits and bio->bi_size ends up matching
@len thus resulting in successful map. With the superflous parameter
gone, blk-map/bio can now simply fail partial mappings.
Move partial mapping detection to bio_create_from_sgl() which is
shared by all map/copy paths and remove partial mapping handling from
all other places.
This change removes one of the two users of __blk_rq_unmap_user() and
it gets collapsed into blk_rq_unmap_user().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 47 +++++++++++++++--------------------------------
block/scsi_ioctl.c | 11 +++--------
drivers/scsi/sg.c | 2 +-
fs/bio.c | 43 +++++++++++++++++--------------------------
include/linux/blkdev.h | 2 +-
5 files changed, 37 insertions(+), 68 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index dc4097c..f60f439 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -8,20 +8,6 @@
#include "blk.h"
-static int __blk_rq_unmap_user(struct bio *bio)
-{
- int ret = 0;
-
- if (bio) {
- if (bio_flagged(bio, BIO_USER_MAPPED))
- bio_unmap_user(bio);
- else
- ret = bio_uncopy_user(bio);
- }
-
- return ret;
-}
-
/**
* blk_rq_map_user_iov - map user data to a request, for REQ_TYPE_BLOCK_PC usage
* @q: request queue where request should be inserted
@@ -29,7 +15,6 @@ static int __blk_rq_unmap_user(struct bio *bio)
* @md: pointer to the rq_map_data holding pages (if necessary)
* @iov: pointer to the iovec
* ...I did not go to the bottom of this but how does that do not change user-space API. It would now fail in code that would work before. What if I want to try variable size commands, where I try the shorter first, then a longer one (or vis versa). I would setup memory mapping to the biggest command but issue a length that correspond to the encoded command inside the CDB. The bi->bi_size is not only mapping size it is also the command size I want to actually read/write. --
Hello, blk_rq_map_user_iov() never had the shorter data len functionality. It fails requests with -EINVAL if iov_length(iov) != dxfer_len, so a patch in the previous patchset adds the above iov_shorten() call to trim iov if dxfer_len is shorter. In this patch, it gets a bit simpler as the length parameter isn't necessary. Thanks. -- tejun --
Impact: cleanup, removal of unused / undesirable API
With recent changes, the following functions aren't used anymore.
* bio_{map|copy}_{user|kern}()
* blk_rq_append_bio()
The following functions aren't used outside of block layer.
* bio_add_pc_page()
* bio_un{map|copy}_user()
Kill the first group and unexport the second. As bio_add_pc_page() is
used only inside fs/bio.c, it's made static.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 17 --------
fs/bio.c | 102 +----------------------------------------------
include/linux/bio.h | 12 ------
include/linux/blkdev.h | 6 ---
4 files changed, 3 insertions(+), 134 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index 0474c09..dc4097c 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -8,23 +8,6 @@
#include "blk.h"
-int blk_rq_append_bio(struct request_queue *q, struct request *rq,
- struct bio *bio)
-{
- if (!rq->bio)
- blk_rq_bio_prep(q, rq, bio);
- else if (!ll_back_merge_fn(q, rq, bio))
- return -EINVAL;
- else {
- rq->biotail->bi_next = bio;
- rq->biotail = bio;
-
- rq->data_len += bio->bi_size;
- }
- return 0;
-}
-EXPORT_SYMBOL(blk_rq_append_bio);
-
static int __blk_rq_unmap_user(struct bio *bio)
{
int ret = 0;
diff --git a/fs/bio.c b/fs/bio.c
index 9c921f9..fe796dc 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -627,8 +627,9 @@ static int __bio_add_page(struct request_queue *q, struct bio *bio, struct page
* smaller than PAGE_SIZE, so it is always possible to add a single
* page to an empty bio. This should only be used by REQ_PC bios.
*/
-int bio_add_pc_page(struct request_queue *q, struct bio *bio, struct page *page,
- unsigned int len, unsigned int offset)
+static int bio_add_pc_page(struct request_queue *q, struct bio *bio,
+ struct page *page, unsigned int len,
+ unsigned int offset)
{
return __bio_add_page(q, bio, page, len, offset, q->max_hw_sectors);
}
@@ ...This function is used by drivers/scsi/osd/osd_initiator.c Currently in mainline. Please use allmodconfig to compile I've posted propositions on how to fix osd_initiator.c which involves making blk_map_kern() append to the request, and a new blk_make_request(bio). Any suggestions are welcome Thanks Boaz --
Hello, The patchse is against Jens's tree which hasn't received the changes Yeap, I'll take a look at that. Thanks. -- tejun --
Impact: new API
Implement blk_rq_map_kern_sgl() using bio_copy_{map|kern}_sgl() and
reimplement blk_rq_map_kern() in terms of it. As the bio helpers
already have all the necessary checks, all blk_rq_map_kern_sgl() has
to do is wrap them and initialize rq accordingly. The implementation
closely resembles blk_rq_msp_user_iov().
This is an exported API and will be used to replace hack in scsi ioctl
implementation.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 54 ++++++++++++++++++++++++++++++-----------------
include/linux/blkdev.h | 2 +
2 files changed, 36 insertions(+), 20 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index eb206df..0474c09 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -161,47 +161,61 @@ int blk_rq_unmap_user(struct bio *bio)
EXPORT_SYMBOL(blk_rq_unmap_user);
/**
- * blk_rq_map_kern - map kernel data to a request, for REQ_TYPE_BLOCK_PC usage
+ * blk_rq_map_kern_sg - map kernel data to a request, for REQ_TYPE_BLOCK_PC
* @q: request queue where request should be inserted
* @rq: request to fill
- * @kbuf: the kernel buffer
- * @len: length of user data
+ * @sgl: area to map
+ * @nents: number of elements in @sgl
* @gfp: memory allocation flags
*
* Description:
* Data will be mapped directly if possible. Otherwise a bounce
* buffer is used.
*/
-int blk_rq_map_kern(struct request_queue *q, struct request *rq, void *kbuf,
- unsigned int len, gfp_t gfp)
+int blk_rq_map_kern_sg(struct request_queue *q, struct request *rq,
+ struct scatterlist *sgl, int nents, gfp_t gfp)
{
int rw = rq_data_dir(rq);
- int do_copy = 0;
struct bio *bio;
- if (len > (q->max_hw_sectors << 9))
- return -EINVAL;
- if (!len || !kbuf)
+ if (!sgl || nents <= 0)
return -EINVAL;
- do_copy = !blk_rq_aligned(q, kbuf, len) || object_is_on_stack(kbuf);
- if (do_copy)
- bio = bio_copy_kern(q, kbuf, len, gfp, rw);
- else
- bio = bio_map_kern(q, kbuf, len, ...You might want to call bio_copy_kern_sg from within bio_map_kern_sg and remove yet another export from bio layer It could be nice to call blk_rq_append_bio() here and support multiple calls to this member. This will solve half of my problem with osd_initiator Hmm .. but you wanted to drop multiple bio chaining --
Hello, I don't want to drop multiple bio chaining at all in itself. I just didn't see the current uses as, well, useful. If building a sgl for a request at once is difficult for your purpose, making blk_rq_map_*() functions accumulate bios sounds like a good idea. The primary goal was to remove direct bio visibility/manipulation from low level Yeap, will make it inline. Thanks. -- tejun --
Impact: cleanup
As bio_map_user_iov() is unique in the way it acquires pages to map
from all other three operations (user-copy, kern-map, kern-copy), the
function still needs to get the pages itself but the bio creation can
use bio_create_from_sgl(). Create sgl of mapped pages and use it to
create bio from it. The code will be further simplified with future
changes to bio_create_from_sgl().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
fs/bio.c | 79 ++++++++++++++++++++++++++------------------------------------
1 files changed, 33 insertions(+), 46 deletions(-)
diff --git a/fs/bio.c b/fs/bio.c
index 4540afc..1ca8b16 100644
--- a/fs/bio.c
+++ b/fs/bio.c
@@ -1059,13 +1059,13 @@ struct bio *bio_copy_user(struct request_queue *q, struct rq_map_data *md,
struct bio *bio_map_user_iov(struct request_queue *q, struct block_device *bdev,
struct iovec *iov, int count, int rw, gfp_t gfp)
{
- int i, j;
size_t tot_len = 0;
int nr_pages = 0;
+ int nents = 0;
+ struct scatterlist *sgl;
struct page **pages;
struct bio *bio;
- int cur_page = 0;
- int ret, offset;
+ int i, ret;
for (i = 0; i < count; i++) {
unsigned long uaddr = (unsigned long)iov[i].iov_base;
@@ -1088,70 +1088,57 @@ struct bio *bio_map_user_iov(struct request_queue *q, struct block_device *bdev,
if (!nr_pages || tot_len & q->dma_pad_mask)
return ERR_PTR(-EINVAL);
- bio = bio_kmalloc(gfp, nr_pages);
- if (!bio)
+ sgl = kmalloc(nr_pages * sizeof(sgl[0]), gfp);
+ if (!sgl)
return ERR_PTR(-ENOMEM);
+ sg_init_table(sgl, nr_pages);
ret = -ENOMEM;
- pages = kcalloc(nr_pages, sizeof(struct page *), gfp);
+ pages = kcalloc(nr_pages, sizeof(pages[0]), gfp);
if (!pages)
- goto out;
+ goto err_sgl;
for (i = 0; i < count; i++) {
unsigned long uaddr = (unsigned long)iov[i].iov_base;
unsigned long len = iov[i].iov_len;
- unsigned long end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT;
- unsigned long start = uaddr >> PAGE_SHIFT;
- const int ...Fast look at all users of get_user_pages_fast(). This can be converted to take an sglist instead of pages* array. Outside of this code all users either have a fixed-size allocated array or hard coded one page. This here is the main user. (Out of 5) --
Yeah, that would be nice, although it's understandable that gup taking page array as parameter. Thanks. -- tejun --
Impact: API cleanup
All initialized requests must and are guaranteed to have rq->q set.
Taking both @q and @rq parameters doesn't improve anything. It only
adds pathological corner case where @q != !@rq->q. Kill the
superflous @q parameter from blk_rq_map_{user|kern}*().
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-map.c | 27 +++++++++++----------------
block/bsg.c | 5 ++---
block/scsi_ioctl.c | 6 +++---
drivers/block/pktcdvd.c | 2 +-
drivers/cdrom/cdrom.c | 2 +-
drivers/scsi/device_handler/scsi_dh_alua.c | 2 +-
drivers/scsi/device_handler/scsi_dh_emc.c | 2 +-
drivers/scsi/device_handler/scsi_dh_rdac.c | 2 +-
drivers/scsi/scsi_lib.c | 7 +++----
drivers/scsi/scsi_tgt_lib.c | 3 +--
drivers/scsi/sg.c | 6 +++---
drivers/scsi/st.c | 3 +--
include/linux/blkdev.h | 15 +++++++--------
13 files changed, 36 insertions(+), 46 deletions(-)
diff --git a/block/blk-map.c b/block/blk-map.c
index f60f439..885d359 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -10,7 +10,6 @@
/**
* blk_rq_map_user_iov - map user data to a request, for REQ_TYPE_BLOCK_PC usage
- * @q: request queue where request should be inserted
* @rq: request to map data to
* @md: pointer to the rq_map_data holding pages (if necessary)
* @iov: pointer to the iovec
@@ -30,10 +29,10 @@
* original bio must be passed back in to blk_rq_unmap_user() for proper
* unmapping.
*/
-int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
- struct rq_map_data *md, struct iovec *iov, int count,
- gfp_t gfp)
+int blk_rq_map_user_iov(struct request *rq, struct rq_map_data *md,
+ struct iovec *iov, int count, gfp_t gfp)
{
+ struct request_queue *q = rq->q;
struct bio *bio = ...