From: Jakob Oestergaard on 13 Sep 2006 03:20

On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> Neil,
> ...
>
> Concerning the context switching performance concerns raised at the
> previous release, I have observed the following. For the hardware
> accelerated case it appears that performance is always better with the
> work queue than without since it allows multiple stripes to be operated
> on simultaneously. I expect the same for an SMP platform, but so far my
> testing has been limited to IOPs. For a single-processor
> non-accelerated configuration I have not observed performance
> degradation with work queue support enabled, but in the Kconfig option
> help text I recommend disabling it (CONFIG_MD_RAID456_WORKQUEUE).

Out of curiosity; how does accelerated compare to non-accelerated?

--
 / jakob
From: Dan Williams on 13 Sep 2006 15:20

On 9/13/06, Jakob Oestergaard <jakob@unthought.net> wrote:
> On Mon, Sep 11, 2006 at 04:00:32PM -0700, Dan Williams wrote:
> > [...]
>
> Out of curiosity; how does accelerated compare to non-accelerated?

One quick example:
4-disk SATA array rebuild on iop321 without acceleration - 'top'
reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
utilization.

With acceleration - 'top' reports md0_resync cpu utilization at ~90%
with the rest split between md0_raid5 and md0_raid5_ops.

The sync speed reported by /proc/mdstat is ~40% higher in the
accelerated case.

That being said, array resync is a special case, so your mileage may
vary with other applications.

I will put together some data from bonnie++, iozone, maybe contest,
and post it on SourceForge.

Dan
From: Jakob Oestergaard on 14 Sep 2006 03:50

On Wed, Sep 13, 2006 at 12:17:55PM -0700, Dan Williams wrote:
...
> > Out of curiosity; how does accelerated compare to non-accelerated?
>
> One quick example:
> 4-disk SATA array rebuild on iop321 without acceleration - 'top'
> reports md0_resync and md0_raid5 dueling for the CPU each at ~50%
> utilization.
>
> With acceleration - 'top' reports md0_resync cpu utilization at ~90%
> with the rest split between md0_raid5 and md0_raid5_ops.
>
> The sync speed reported by /proc/mdstat is ~40% higher in the
> accelerated case.

Ok, nice :)

> That being said, array resync is a special case, so your mileage may
> vary with other applications.

Every-day usage I/O performance data would be nice indeed :)

> I will put together some data from bonnie++, iozone, maybe contest,
> and post it on SourceForge.

Great!

--
 / jakob
From: Neil Brown on 8 Oct 2006 20:40

On Monday September 11, dan.j.williams@intel.com wrote:
> Neil,
>
> The following patches implement hardware accelerated raid5 for the Intel
> Xscale(r) series of I/O Processors. The MD changes allow stripe
> operations to run outside the spin lock in a work queue. Hardware
> acceleration is achieved by using a dma-engine-aware work queue routine
> instead of the default software only routine.

Hi Dan,
 Sorry for the delay in replying.
 I've looked through these patches at last (mostly the raid-specific
bits) and while there is clearly a lot of good stuff here, it doesn't
'feel' right - it just seems too complex.

The particular issues that stand out to me are:
 - 33 new STRIPE_OP_* flags. I'm sure there doesn't need to be that
   many new flags.
 - the "raid5 dma client" patch moves far too much internal
   knowledge about raid5 into drivers/dma.

Clearly there are some complex issues being dealt with and some
complexity is to be expected, but I feel there must be room for some
serious simplification.

Let me try to describe how I envisage it might work.

As you know, the theory-of-operation of handle_stripe is that it
assesses the state of a stripe, deciding what actions to perform, and
then performs them. Synchronous actions (e.g. current parity calcs)
are performed 'in-line'. Async actions (reads, writes) and actions
that cannot be performed under a spinlock (->b_end_io) are recorded
as being needed and then are initiated at the end of handle_stripe,
outside of the sh->lock.

The proposal is to bring the parity and other bulk-memory operations
out of the spinlock and make them optionally asynchronous.

The set of tasks that might need to be performed on a stripe are:
 - clear a target cache block
 - pre-xor various cache blocks into a target
 - copy data out of bios into cache blocks (drain)
 - post-xor various cache blocks into a target
 - copy data into bios out of cache blocks (fill)
 - test if a cache block is all zeros
 - start a read on a cache block
 - start a write on a cache block

(There is also a memcpy when expanding raid5. I think I would try to
simply avoid that copy and move pointers around instead.)

Some of these steps require sequencing, e.g.
  clear, pre-xor, copy, post-xor, write
for an rmw cycle.

We could require handle_stripe to be called again for each step,
i.e. the first call just clears the target and flags it as clear, the
next call initiates the pre-xor and flags that as done, etc. However
I think that would make the non-offloaded case too slow, or at least
too clumsy.

So instead we set flags to say what needs to be done and have a
workqueue system that does it.

(So far this is all quite similar to what you have done.)
So handle_stripe would set various flags and other things (like
identifying which block is the 'target' block) and run the following
in a workqueue:

static void raid5_do_stuff(struct stripe_head *sh)
{
	raid5_conf_t *conf = sh->raid_conf;
	int rv;

	if (test_bit(CLEAR_TARGET, &sh->ops.pending)) {
		struct page *p = sh->dev[sh->ops.target].page;

		rv = async_memset(p, 0, 0, PAGE_SIZE, ops_done, sh);
		if (rv != BUSY)
			clear_bit(CLEAR_TARGET, &sh->ops.pending);
		if (rv != COMPLETE)
			goto out;
	}

	while (test_bit(PRE_XOR, &sh->ops.pending)) {
		struct page *plist[XOR_MAX];
		int offset[XOR_MAX];
		int pos = 0;
		int d;

		for (d = sh->ops.nextdev;
		     d < conf->raid_disks && pos < XOR_MAX;
		     d++) {
			if (d == sh->ops.target)
				continue;
			if (!test_bit(R5_WantPreXor, &sh->dev[d].flags))
				continue;
			plist[pos] = sh->dev[d].page;
			offset[pos++] = 0;
		}
		if (pos) {
			struct page *p = sh->dev[sh->ops.target].page;

			rv = async_xor(p, 0, plist, offset, pos, PAGE_SIZE,
				       ops_done, sh);
			if (rv != BUSY)
				sh->ops.nextdev = d;
			if (rv != COMPLETE)
				goto out;
		} else {
			clear_bit(PRE_XOR, &sh->ops.pending);
			sh->ops.nextdev = 0;
		}
	}

	while (test_bit(COPY_IN, &sh->ops.pending)) {
		...
	}
	....

	if (test_bit(START_IO, &sh->ops.pending)) {
		int d;
		for (d = 0; d < conf->raid_disks; d++) {
			/* all that code from the end of handle_stripe */
		}
	}

	release_stripe(conf, sh);
	return;

 out:
	if (rv == BUSY) {
		/* wait on something and try again ??? */
	}
	return;
}

static void ops_done(struct stripe_head *sh)
{
	queue_work(....whatever..);
}

Things to note:

 - We keep track of where we are up to in sh->ops.
     .pending is flags saying what is left to be done
     .nextdev is the next device to process for operations that
      work on several devices
     .next_bio, .next_iov will be needed for copy operations that
      cross multiple bios and iovecs.

 - Each sh->dev has R5_Want* flags reflecting which multi-device
   operations are wanted on each device.

 - Async bulk-memory operations take pages, offsets, and lengths,
   and can return COMPLETE (if the operation was performed
   synchronously), IN_PROGRESS (if it has been started, or at least
   queued), or BUSY if it couldn't even be queued. Exactly what to
   do in that case I'm not sure. Probably we need a waitqueue to
   wait on.

 - The interface between the client and the ADMA hardware is a
   collection of async_ functions: async_memcpy, async_xor,
   async_memset, etc. I gather there needs to be some understanding
   about whether the pages are already appropriately mapped for DMA
   or whether a mapping is needed. Maybe an extra flag argument
   should be passed. I imagine that any piece of ADMA hardware would
   register with the 'async_*' subsystem, and a call to async_X
   would be routed as appropriate, or be run in-line.

This approach introduces 8 flags for sh->ops.pending and maybe two
or three new R5_Want* flags. It also keeps the raid5 knowledge
firmly in the raid5 code base, so it seems to keep the complexity
under control.

Would this approach make sense to you? Is there something really
important I have missed?

(I'll try and be more responsive next time.)

Thanks,
NeilBrown
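
[Archive note: Neil describes the async_ interface only in prose, so
the header-style sketch below spells out one possible reading of it.
The enum, the callback type, and the exact signatures are assumptions
reconstructed from his description and his calls above; they are not
code from the posted patches.]

	/* return codes as described in Neil's notes -- assumed names */
	enum async_status {
		COMPLETE,	/* performed synchronously; result ready now */
		IN_PROGRESS,	/* started or at least queued; callback fires later */
		BUSY,		/* could not even be queued; caller must retry */
	};

	/* completion callback, e.g. ops_done() re-queueing the stripe */
	typedef void (*async_cb_t)(void *arg);

	/* set 'len' bytes at 'offset' within 'page' to 'val' */
	enum async_status async_memset(struct page *page, int val,
				       unsigned int offset, size_t len,
				       async_cb_t done, void *arg);

	/* xor 'count' source pages into 'dest' (pre-xor and post-xor) */
	enum async_status async_xor(struct page *dest, unsigned int dest_off,
				    struct page **src_list, int *src_offs,
				    int count, size_t len,
				    async_cb_t done, void *arg);

	/* copy between pages, e.g. a bio segment and a cache block
	 * (drain and fill) */
	enum async_status async_memcpy(struct page *dest, unsigned int dest_off,
				       struct page *src, unsigned int src_off,
				       size_t len, async_cb_t done, void *arg);

The COMPLETE return is what keeps the non-offloaded case fast: a
software-only build computes the result in-line and raid5_do_stuff
falls straight through to the next pending operation without a
context switch.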
From: Dan Williams on 10 Oct 2006 14:30

On 10/8/06, Neil Brown <neilb@suse.de> wrote:
> Hi Dan,
>  Sorry for the delay in replying.
>  I've looked through these patches at last (mostly the raid-specific
> bits) and while there is clearly a lot of good stuff here, it doesn't
> 'feel' right - it just seems too complex.
>
> The particular issues that stand out to me are:
>  - 33 new STRIPE_OP_* flags. I'm sure there doesn't need to be that
>    many new flags.
>  - the "raid5 dma client" patch moves far too much internal
>    knowledge about raid5 into drivers/dma.
>
> Clearly there are some complex issues being dealt with and some
> complexity is to be expected, but I feel there must be room for some
> serious simplification.

A valid criticism. There was definitely a push to just get it
functional, so I can now see how the complexity crept into the
implementation. The primary cause was the choice to explicitly handle
channel switching in raid5-dma. However, relieving "client" code from
this responsibility is something I am taking care of in the async api
changes.

> [...]
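
[Archive note: Dan's remark about moving channel switching out of the
raid5 "client" suggests a routing shim inside the async layer itself.
The sketch below illustrates that idea only; dma_find_channel() and
dma_submit_xor() are hypothetical placeholders, not functions from
the posted series, and do_sync_xor() stands in for the existing
software xor path.]

	enum async_status async_xor(struct page *dest, unsigned int dest_off,
				    struct page **src_list, int *src_offs,
				    int count, size_t len,
				    async_cb_t done, void *arg)
	{
		/* hypothetical lookup of a registered xor-capable engine */
		struct dma_chan *chan = dma_find_channel(DMA_XOR);

		if (!chan) {
			/* no engine registered: compute in-line and return
			 * COMPLETE so the caller falls through synchronously */
			do_sync_xor(dest, dest_off, src_list, src_offs,
				    count, len);
			return COMPLETE;
		}

		/* queue a descriptor; the engine calls done(arg) on
		 * completion, which re-queues the stripe's work item */
		if (dma_submit_xor(chan, dest, dest_off, src_list, src_offs,
				   count, len, done, arg) < 0)
			return BUSY;	/* ring full; caller must wait */

		return IN_PROGRESS;
	}

With the switch hidden here, the raid5 code never needs to know which
channel (or whether any channel) performed the operation, which is
exactly the knowledge Neil objected to seeing pushed into drivers/dma.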