Prev: undefined reference to `mmc_pm_notify' when selected MMC without PM.
Next: 2.6.35-git12 fails compile (update_wall_time)
From: Christoph Hellwig on 13 Aug 2010 07:50 The patchset looks functionally correct to me, and with a small patch to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the underlying qemu shows that we actually get the flush requests where we should. No performance or power fail testing done yet. But I do not like the transition very much. The new WRITE_FUA_FLUSH request is exactly what filesystems expect from a current barrier request, so I'd rather move to that functionality without breaking stuff inbetween. So if it was to me I'd keep patches 1, 2, 4 and 5 from your series, than a main one to relax barrier semantics, then have the renaming patches 7 and 8, and possible keep patch 11 separate from the main implementation change, and if absolutely also a separate one to introduce REQ_FUA and REQ_FLUSH in the bio interface, but keep things working while doing this. Then we can patches do disable the reiserfs barrier "optimization" as the very first one, and DM/MD support which I'm currently working on as the last one and we can start doing the heavy testing. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on 13 Aug 2010 09:20 On Fri, Aug 13, 2010 at 04:55:33PM +0400, Vladislav Bolkhovitin wrote: > I'm not mentioning the obvious that a common functionality (enforcing > requests ordering in this case) should be handled by a common library, > but not internally by a zillion file systems Linux has. I/O ordering is still handled mostly by common code, that is the pagecache and the buffercache, although a few filesystems like XFS and btrfs have their own implementation of the second one. The current ordered semantics of barriers have only successfull implemented by a complete queue drain, and not effectively been used by filesystems. This patchset removes the bogus global ordering enforced by the block layer whenever a filesystems wants to be able to use cache flushes, and because of that allows deeper outstanding queue depth I/O with less latency. Now I know you in particular are a fan of scsi ordered tags. And as I told you before I'm open to review such an implementation if it shows us any advantages. Adding it after this patch is in fact not any more complicated than before, I'd almost be tempted it's easier as you don't have to plug it into the complex state machine we used for barriers, and more importantly we drop the requirement for the barrier sequence to be atomic, which in fact made implementing barriers using tagged queues impossible with the current scsi layer. As far as playing with ordered tags it's just adding a new flag for it on the bio that gets passed down to the driver. For a final version you'd need a queue-level feature if it's supported, but you don't even need that for the initial work. Then you can implement a variant of blk_do_flush that does away with queueing additional requests once finish but queues all two or three at the same time with your new ordered flag set, at which point you are back to the level or ordered tag usage that the old code allows. You're still left with all the hard problems of actually implementing error handling for it and using it higher up in the filesystem and generic page cache code. I'd really love to see your results, up to the point of just trying that once I get a little spare time. But my theory is that it won't help us - the problem with ordered tags is that they enforce global ordering while we currently have local ordering. While it will reduce the latency for the process waiting for an fsync or similar it will affect other I/O going on in the background and reduce the devices ability to reorder that I/O. So for now this patch set is a massive improvement of performance for workloads we care about, while removing the interface we put in place to allow a theoretical optimization that didn't show up for 8 years before, and in fact made the interface just complicated enough to make that optimization so hard. -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
From: Christoph Hellwig on 13 Aug 2010 10:40
On Fri, Aug 13, 2010 at 03:48:59PM +0200, Tejun Heo wrote: > There are two reason to avoid changing the meaning of REQ_HARDBARRIER > and just deprecate it. One is to avoid breaking filesystems' > expectations underneath it. Please note that there are out-of-tree > filesystems too. I think it would be too dangerous to relax > REQ_HARDBARRIER. Note that the renaming patch would include a move from REQ_HARDBARRIER to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to compile. And while out of tree filesystems do exist they it's their problem to keep up with kernel changes. They decide not to be part of the Linux kernel, so it'll be their job to keep up with it. > Another is that pseudo block layer drivers (loop, virtio_blk, > md/dm...) have assumptions about REQ_HARDBARRIER behavior and things > would be broken in obscure ways between REQ_HARDBARRIER semantics > change and updates to each of those drivers, so I don't really think > changing the semantics while the mechanism is online is a good idea. I don't think doing those changes in a separate commit is a good idea. > > Then we can patches do disable the reiserfs barrier "optimization" as > > the very first one, and DM/MD support which I'm currently working on > > as the last one and we can start doing the heavy testing. > > Oops, I've already converted loop, virtio_blk/lguest and am working on > md/dm right now too. I'm almost done with md and now doing dm. :-) > Maybe we should post them right now so that we don't waste too much > time trying to solve the same problems? Here's the dm patch. It only handles normal bio based dm yet, which I understand and can test. request based dm (multipath) still needs work. Index: linux-2.6/drivers/md/dm-crypt.c =================================================================== --- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-13 16:11:04.207010218 +0200 +++ linux-2.6/drivers/md/dm-crypt.c 2010-08-13 16:11:10.048003862 +0200 @@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t struct dm_crypt_io *io; struct crypt_config *cc; - if (unlikely(bio_empty_barrier(bio))) { + if (bio_empty_flush(bio)) { cc = ti->private; bio->bi_bdev = cc->dev->bdev; return DM_MAPIO_REMAPPED; Index: linux-2.6/drivers/md/dm-io.c =================================================================== --- linux-2.6.orig/drivers/md/dm-io.c 2010-08-13 16:11:04.213011894 +0200 +++ linux-2.6/drivers/md/dm-io.c 2010-08-13 16:11:10.049003792 +0200 @@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned */ for (i = 0; i < num_regions; i++) { *dp = old_pages; - if (where[i].count || (rw & REQ_HARDBARRIER)) + if (where[i].count || (rw & REQ_FLUSH)) do_region(rw, i, where + i, dp, io); } @@ -412,8 +412,8 @@ retry: } set_current_state(TASK_RUNNING); - if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) { - rw &= ~REQ_HARDBARRIER; + if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) { + rw &= ~REQ_FLUSH; goto retry; } Index: linux-2.6/drivers/md/dm-raid1.c =================================================================== --- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-13 16:11:04.220013431 +0200 +++ linux-2.6/drivers/md/dm-raid1.c 2010-08-13 16:11:10.054018319 +0200 @@ -670,7 +670,7 @@ static void do_writes(struct mirror_set bio_list_init(&requeue); while ((bio = bio_list_pop(writes))) { - if (unlikely(bio_empty_barrier(bio))) { + if (bio_empty_flush(bio)) { bio_list_add(&sync, bio); continue; } @@ -1203,7 +1203,7 @@ static int mirror_end_io(struct dm_targe * We need to dec pending if this was a write. */ if (rw == WRITE) { - if (likely(!bio_empty_barrier(bio))) + if (!bio_empty_flush(bio)) dm_rh_dec(ms->rh, map_context->ll); return error; } Index: linux-2.6/drivers/md/dm-region-hash.c =================================================================== --- linux-2.6.orig/drivers/md/dm-region-hash.c 2010-08-13 16:11:04.228004631 +0200 +++ linux-2.6/drivers/md/dm-region-hash.c 2010-08-13 16:11:10.060003932 +0200 @@ -399,7 +399,7 @@ void dm_rh_mark_nosync(struct dm_region_ region_t region = dm_rh_bio_to_region(rh, bio); int recovering = 0; - if (bio_empty_barrier(bio)) { + if (bio_empty_flush(bio)) { rh->barrier_failure = 1; return; } @@ -524,7 +524,7 @@ void dm_rh_inc_pending(struct dm_region_ struct bio *bio; for (bio = bios->head; bio; bio = bio->bi_next) { - if (bio_empty_barrier(bio)) + if (bio_empty_flush(bio)) continue; rh_inc(rh, dm_rh_bio_to_region(rh, bio)); } Index: linux-2.6/drivers/md/dm-snap.c =================================================================== --- linux-2.6.orig/drivers/md/dm-snap.c 2010-08-13 16:11:04.238004701 +0200 +++ linux-2.6/drivers/md/dm-snap.c 2010-08-13 16:11:10.067005677 +0200 @@ -1581,7 +1581,7 @@ static int snapshot_map(struct dm_target chunk_t chunk; struct dm_snap_pending_exception *pe = NULL; - if (unlikely(bio_empty_barrier(bio))) { + if (bio_empty_flush(bio)) { bio->bi_bdev = s->cow->bdev; return DM_MAPIO_REMAPPED; } @@ -1685,7 +1685,7 @@ static int snapshot_merge_map(struct dm_ int r = DM_MAPIO_REMAPPED; chunk_t chunk; - if (unlikely(bio_empty_barrier(bio))) { + if (bio_empty_flush(bio)) { if (!map_context->flush_request) bio->bi_bdev = s->origin->bdev; else @@ -2123,7 +2123,7 @@ static int origin_map(struct dm_target * struct dm_dev *dev = ti->private; bio->bi_bdev = dev->bdev; - if (unlikely(bio_empty_barrier(bio))) + if (bio_empty_flush(bio)) return DM_MAPIO_REMAPPED; /* Only tell snapshots if this is a write */ Index: linux-2.6/drivers/md/dm-stripe.c =================================================================== --- linux-2.6.orig/drivers/md/dm-stripe.c 2010-08-13 16:11:04.247011266 +0200 +++ linux-2.6/drivers/md/dm-stripe.c 2010-08-13 16:11:10.072026629 +0200 @@ -214,7 +214,7 @@ static int stripe_map(struct dm_target * sector_t offset, chunk; uint32_t stripe; - if (unlikely(bio_empty_barrier(bio))) { + if (bio_empty_flush(bio)) { BUG_ON(map_context->flush_request >= sc->stripes); bio->bi_bdev = sc->stripe[map_context->flush_request].dev->bdev; return DM_MAPIO_REMAPPED; Index: linux-2.6/drivers/md/dm.c =================================================================== --- linux-2.6.orig/drivers/md/dm.c 2010-08-13 16:11:04.256004631 +0200 +++ linux-2.6/drivers/md/dm.c 2010-08-13 16:11:37.152005462 +0200 @@ -139,17 +139,6 @@ struct mapped_device { spinlock_t deferred_lock; /* - * An error from the barrier request currently being processed. - */ - int barrier_error; - - /* - * Protect barrier_error from concurrent endio processing - * in request-based dm. - */ - spinlock_t barrier_error_lock; - - /* * Processing queue (flush/barriers) */ struct workqueue_struct *wq; @@ -194,9 +183,6 @@ struct mapped_device { /* sysfs handle */ struct kobject kobj; - - /* zero-length barrier that will be cloned and submitted to targets */ - struct bio barrier_bio; }; /* @@ -505,10 +491,6 @@ static void end_io_acct(struct dm_io *io part_stat_add(cpu, &dm_disk(md)->part0, ticks[rw], duration); part_stat_unlock(); - /* - * After this is decremented the bio must not be touched if it is - * a barrier. - */ dm_disk(md)->part0.in_flight[rw] = pending = atomic_dec_return(&md->pending[rw]); pending += atomic_read(&md->pending[rw^0x1]); @@ -621,7 +603,7 @@ static void dec_pending(struct dm_io *io */ spin_lock_irqsave(&md->deferred_lock, flags); if (__noflush_suspending(md)) { - if (!(io->bio->bi_rw & REQ_HARDBARRIER)) + if (!(io->bio->bi_rw & (REQ_FLUSH|REQ_FUA))) bio_list_add_head(&md->deferred, io->bio); } else @@ -633,25 +615,13 @@ static void dec_pending(struct dm_io *io io_error = io->error; bio = io->bio; - if (bio->bi_rw & REQ_HARDBARRIER) { - /* - * There can be just one barrier request so we use - * a per-device variable for error reporting. - * Note that you can't touch the bio after end_io_acct - */ - if (!md->barrier_error && io_error != -EOPNOTSUPP) - md->barrier_error = io_error; - end_io_acct(io); - free_io(md, io); - } else { - end_io_acct(io); - free_io(md, io); + end_io_acct(io); + free_io(md, io); - if (io_error != DM_ENDIO_REQUEUE) { - trace_block_bio_complete(md->queue, bio); + if (io_error != DM_ENDIO_REQUEUE) { + trace_block_bio_complete(md->queue, bio); - bio_endio(bio, io_error); - } + bio_endio(bio, io_error); } } } @@ -744,23 +714,6 @@ static void end_clone_bio(struct bio *cl blk_update_request(tio->orig, 0, nr_bytes); } -static void store_barrier_error(struct mapped_device *md, int error) -{ - unsigned long flags; - - spin_lock_irqsave(&md->barrier_error_lock, flags); - /* - * Basically, the first error is taken, but: - * -EOPNOTSUPP supersedes any I/O error. - * Requeue request supersedes any I/O error but -EOPNOTSUPP. - */ - if (!md->barrier_error || error == -EOPNOTSUPP || - (md->barrier_error != -EOPNOTSUPP && - error == DM_ENDIO_REQUEUE)) - md->barrier_error = error; - spin_unlock_irqrestore(&md->barrier_error_lock, flags); -} - /* * Don't touch any member of the md after calling this function because * the md may be freed in dm_put() at the end of this function. @@ -798,13 +751,11 @@ static void free_rq_clone(struct request static void dm_end_request(struct request *clone, int error) { int rw = rq_data_dir(clone); - int run_queue = 1; - bool is_barrier = clone->cmd_flags & REQ_HARDBARRIER; struct dm_rq_target_io *tio = clone->end_io_data; struct mapped_device *md = tio->md; struct request *rq = tio->orig; - if (rq->cmd_type == REQ_TYPE_BLOCK_PC && !is_barrier) { + if (rq->cmd_type == REQ_TYPE_BLOCK_PC) { rq->errors = clone->errors; rq->resid_len = clone->resid_len; @@ -818,15 +769,8 @@ static void dm_end_request(struct reques } free_rq_clone(clone); - - if (unlikely(is_barrier)) { - if (unlikely(error)) - store_barrier_error(md, error); - run_queue = 0; - } else - blk_end_request_all(rq, error); - - rq_completed(md, rw, run_queue); + blk_end_request_all(rq, error); + rq_completed(md, rw, 1); } static void dm_unprep_request(struct request *rq) @@ -1113,7 +1057,7 @@ static struct bio *split_bvec(struct bio clone->bi_sector = sector; clone->bi_bdev = bio->bi_bdev; - clone->bi_rw = bio->bi_rw & ~REQ_HARDBARRIER; + clone->bi_rw = bio->bi_rw; clone->bi_vcnt = 1; clone->bi_size = to_bytes(len); clone->bi_io_vec->bv_offset = offset; @@ -1140,7 +1084,6 @@ static struct bio *clone_bio(struct bio clone = bio_alloc_bioset(GFP_NOIO, bio->bi_max_vecs, bs); __bio_clone(clone, bio); - clone->bi_rw &= ~REQ_HARDBARRIER; clone->bi_destructor = dm_bio_destructor; clone->bi_sector = sector; clone->bi_idx = idx; @@ -1186,7 +1129,7 @@ static void __flush_target(struct clone_ __map_bio(ti, clone, tio); } -static int __clone_and_map_empty_barrier(struct clone_info *ci) +static int __clone_and_map_empty_flush(struct clone_info *ci) { unsigned target_nr = 0, flush_nr; struct dm_target *ti; @@ -1208,8 +1151,8 @@ static int __clone_and_map(struct clone_ sector_t len = 0, max; struct dm_target_io *tio; - if (unlikely(bio_empty_barrier(bio))) - return __clone_and_map_empty_barrier(ci); + if (bio_empty_flush(bio)) + return __clone_and_map_empty_flush(ci); ti = dm_table_find_target(ci->map, ci->sector); if (!dm_target_is_valid(ti)) @@ -1308,11 +1251,7 @@ static void __split_and_process_bio(stru ci.map = dm_get_live_table(md); if (unlikely(!ci.map)) { - if (!(bio->bi_rw & REQ_HARDBARRIER)) - bio_io_error(bio); - else - if (!md->barrier_error) - md->barrier_error = -EIO; + bio_io_error(bio); return; } @@ -1326,7 +1265,7 @@ static void __split_and_process_bio(stru spin_lock_init(&ci.io->endio_lock); ci.sector = bio->bi_sector; ci.sector_count = bio_sectors(bio); - if (unlikely(bio_empty_barrier(bio))) + if (bio_empty_flush(bio)) ci.sector_count = 1; ci.idx = bio->bi_idx; @@ -1420,8 +1359,7 @@ static int _dm_request(struct request_qu * If we're suspended or the thread is processing barriers * we have to queue this io for later. */ - if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags)) || - unlikely(bio->bi_rw & REQ_HARDBARRIER)) { + if (unlikely(test_bit(DMF_QUEUE_IO_TO_THREAD, &md->flags))) { up_read(&md->io_lock); if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags)) && @@ -1873,7 +1811,6 @@ static struct mapped_device *alloc_dev(i init_rwsem(&md->io_lock); mutex_init(&md->suspend_lock); spin_lock_init(&md->deferred_lock); - spin_lock_init(&md->barrier_error_lock); rwlock_init(&md->map_lock); atomic_set(&md->holders, 1); atomic_set(&md->open_count, 0); @@ -2233,38 +2170,6 @@ static int dm_wait_for_completion(struct return r; } -static void dm_flush(struct mapped_device *md) -{ - dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); - - bio_init(&md->barrier_bio); - md->barrier_bio.bi_bdev = md->bdev; - md->barrier_bio.bi_rw = WRITE_BARRIER; - __split_and_process_bio(md, &md->barrier_bio); - - dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); -} - -static void process_barrier(struct mapped_device *md, struct bio *bio) -{ - md->barrier_error = 0; - - dm_flush(md); - - if (!bio_empty_barrier(bio)) { - __split_and_process_bio(md, bio); - dm_flush(md); - } - - if (md->barrier_error != DM_ENDIO_REQUEUE) - bio_endio(bio, md->barrier_error); - else { - spin_lock_irq(&md->deferred_lock); - bio_list_add_head(&md->deferred, bio); - spin_unlock_irq(&md->deferred_lock); - } -} - /* * Process the deferred bios */ @@ -2290,12 +2195,8 @@ static void dm_wq_work(struct work_struc if (dm_request_based(md)) generic_make_request(c); - else { - if (c->bi_rw & REQ_HARDBARRIER) - process_barrier(md, c); - else - __split_and_process_bio(md, c); - } + else + __split_and_process_bio(md, c); down_write(&md->io_lock); } @@ -2326,8 +2227,6 @@ static int dm_rq_barrier(struct mapped_d struct dm_target *ti; struct request *clone; - md->barrier_error = 0; - for (i = 0; i < num_targets; i++) { ti = dm_table_get_target(map, i); for (j = 0; j < ti->num_flush_requests; j++) { @@ -2341,7 +2240,7 @@ static int dm_rq_barrier(struct mapped_d dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE); dm_table_put(map); - return md->barrier_error; + return 0; } static void dm_rq_barrier_work(struct work_struct *work) Index: linux-2.6/include/linux/bio.h =================================================================== --- linux-2.6.orig/include/linux/bio.h 2010-08-13 16:11:04.268004351 +0200 +++ linux-2.6/include/linux/bio.h 2010-08-13 16:11:10.082005677 +0200 @@ -66,8 +66,8 @@ #define bio_offset(bio) bio_iovec((bio))->bv_offset #define bio_segments(bio) ((bio)->bi_vcnt - (bio)->bi_idx) #define bio_sectors(bio) ((bio)->bi_size >> 9) -#define bio_empty_barrier(bio) \ - ((bio->bi_rw & REQ_HARDBARRIER) && \ +#define bio_empty_flush(bio) \ + ((bio->bi_rw & REQ_FLUSH) && \ !bio_has_data(bio) && \ !(bio->bi_rw & REQ_DISCARD)) -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |