From: Christoph Hellwig on 29 Jul 2010 04:50

Btw, I'm very happy with all this writeback-related progress we've made
for the 2.6.36 cycle. The only major thing that's really missing, and
which should help dramatically with the I/O patterns, is stopping direct
writeback from balance_dirty_pages(). I've seen patches from Wu and
Jan for this and lots of discussion. If we get either variant in,
this should be one of the best VM releases from the filesystem point of
view.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Wu Fengguang on 3 Aug 2010 03:40

On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> Btw, I'm very happy with all this writeback-related progress we've made
> for the 2.6.36 cycle. The only major thing that's really missing, and
> which should help dramatically with the I/O patterns, is stopping direct
> writeback from balance_dirty_pages(). I've seen patches from Wu and
> Jan for this and lots of discussion. If we get either variant in,
> this should be one of the best VM releases from the filesystem point of
> view.

Sorry for the delay. But I'm not feeling good about the current
patches, both mine and Jan's.

Accounting overheads/accuracy are the obvious problem. Both patches do
not perform well on large NUMA machines and fast storage, and they have
proven hard to improve in previous discussions.

We might do dirty throttling based on throughput, ignoring the
writeback completions totally. The basic idea is: for the current process,
we already have a per-bdi-and-task threshold B as the local throttle
target. When dirty pages go beyond B*80%, for example, we start
throttling the task's writeback throughput. The closer to B, the
lower the throughput; on reaching B or the global threshold, we completely
stop it. The hope is that the throughput will be sustained at some balance
point. This will need careful calculation to be stable and robust.

In this way, the throttling can be made very smooth. My old experiments
show that the current writeback-completion-based throttling fluctuates
a lot in stall time. In particular it makes writeback bumpy for
NFS, so that sometimes the network pipe is not active at all and
performance is impacted noticeably.

By the way, we'll harvest a writeback IO controller :)

Thanks,
Fengguang
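Wu's throttling curve can be sketched as a toy user-space model (the function name, the B*80% knee, and the linear ramp are illustrative assumptions drawn from the description above, not code from any posted patch):

```c
/*
 * Toy model of throughput-based dirty throttling: given a task's
 * current dirty page count and its local per-bdi-and-task limit B,
 * return a throttle factor in [0.0, 1.0] -- 1.0 means full-speed
 * dirtying, 0.0 means a complete stop.  Throttling starts at B*80%
 * and ramps down linearly to zero at B.
 */
static double throttle_factor(unsigned long dirty, unsigned long B)
{
	unsigned long knee = B * 4 / 5;		/* B*80% */

	if (dirty <= knee)
		return 1.0;			/* below the knee: unthrottled */
	if (dirty >= B)
		return 0.0;			/* at/over the limit: full stop */
	/* linear ramp between the knee and the limit */
	return (double)(B - dirty) / (double)(B - knee);
}
```

The open question in the thread is whether such an open-loop curve really settles at a balance point; the linear ramp here is only the simplest candidate.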
From: Jan Kara on 3 Aug 2010 09:00

On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> On Thu, Jul 29, 2010 at 04:45:23PM +0800, Christoph Hellwig wrote:
> > [...]
>
> Sorry for the delay. But I'm not feeling good about the current
> patches, both mine and Jan's.
>
> Accounting overheads/accuracy are the obvious problem. Both patches do
> not perform well on large NUMA machines and fast storage, and they have
> proven hard to improve in previous discussions.

Yes, my patch for balance_dirty_pages() has a problem with percpu counter
(im)precision, and resorting to a pure atomic type could result in the
cache line bouncing among the CPUs completing the IO (at least I believe
that is the reason why all the other BDI stats are per-cpu).

We could solve the problem by doing the accounting at page IO submission
time (there, using an atomic type should be fine, as we mostly submit IO
from the flusher thread anyway). It's just that doing the accounting at
completion time has the nice property that we really hold the throttled
thread up to the moment when the VM can really reuse the pages.

> We might do dirty throttling based on throughput, ignoring the
> writeback completions totally. The basic idea is: for the current process,
> we already have a per-bdi-and-task threshold B as the local throttle
> target.

Do we? The limit is currently just per-bdi, isn't it? Or do you mean the
ratelimiting - i.e. how often we call balance_dirty_pages()? That is
per-cpu if I'm right.

> When dirty pages go beyond B*80%, for example, we start
> throttling the task's writeback throughput. The closer to B, the
> lower the throughput; on reaching B or the global threshold, we completely
> stop it. The hope is that the throughput will be sustained at some balance
> point. This will need careful calculation to be stable and robust.

But what exactly do you mean by throttling the task in your scenario?
What would it wait on?

> In this way, the throttling can be made very smooth. My old experiments
> show that the current writeback-completion-based throttling fluctuates
> a lot in stall time. In particular it makes writeback bumpy for
> NFS, so that sometimes the network pipe is not active at all and
> performance is impacted noticeably.
>
> By the way, we'll harvest a writeback IO controller :)

Honza
--
Jan Kara <jack(a)suse.cz>
SUSE Labs, CR
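Jan's precision-vs-contention tradeoff can be seen in a toy model of a batched per-CPU counter (loosely patterned on the kernel's percpu_counter; NR_CPUS, BATCH, and all names here are illustrative): each CPU accumulates a local delta and folds it into the shared total only once the delta crosses a batch size, so a cheap read of the shared total can lag the true value by up to NR_CPUS * (BATCH - 1).

```c
#define NR_CPUS	4
#define BATCH	32

/* Toy batched per-CPU counter: cheap reads, bounded imprecision. */
struct pcpu_counter {
	long global;		/* folded-in total, shared */
	long local[NR_CPUS];	/* per-CPU deltas, kept below BATCH in magnitude */
};

static void pcpu_add(struct pcpu_counter *c, int cpu, long n)
{
	c->local[cpu] += n;
	if (c->local[cpu] >= BATCH || c->local[cpu] <= -BATCH) {
		c->global += c->local[cpu];	/* fold into the shared total */
		c->local[cpu] = 0;
	}
}

/* Fast read: no cross-CPU traffic, but may lag the true value. */
static long pcpu_read_fast(const struct pcpu_counter *c)
{
	return c->global;
}

/* Exact read: must visit every CPU's delta (expensive on large NUMA). */
static long pcpu_read_exact(const struct pcpu_counter *c)
{
	long sum = c->global;

	for (int i = 0; i < NR_CPUS; i++)
		sum += c->local[i];
	return sum;
}
```

The dilemma in the thread is exactly this pair of reads: the fast one is too fuzzy to throttle against near the limit, and the exact one (or a single atomic counter) grows expensive as the CPU count grows.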
From: Wu Fengguang on 3 Aug 2010 11:10

On Tue, Aug 03, 2010 at 08:52:49PM +0800, Jan Kara wrote:
> On Tue 03-08-10 15:34:49, Wu Fengguang wrote:
> > [...]
>
> Yes, my patch for balance_dirty_pages() has a problem with percpu counter
> (im)precision, and resorting to a pure atomic type could result in the
> cache line bouncing among the CPUs completing the IO (at least I believe
> that is the reason why all the other BDI stats are per-cpu).
>
> We could solve the problem by doing the accounting at page IO submission
> time (there, using an atomic type should be fine, as we mostly submit IO
> from the flusher thread anyway). It's just that doing the accounting at
> completion time has the nice property that we really hold the throttled
> thread up to the moment when the VM can really reuse the pages.

We could try this and check how it works with NFS. The attached patch
will also be necessary for the test: it implements a writeback wait
queue for NFS; without it, all dirty pages may be put to writeback.

I suspect the resulting fluctuations will be the same, because
balance_dirty_pages() will wait on some background writeback (as you
proposed), which will block on the NFS writeback queue, which in turn
waits for the completion of COMMIT RPCs (the current patches wait here
directly). On the completion of one COMMIT, lots of pages may be freed
in a burst, which makes the progress of the whole stack very bumpy.

> > We might do dirty throttling based on throughput, ignoring the
> > writeback completions totally. The basic idea is: for the current process,
> > we already have a per-bdi-and-task threshold B as the local throttle
> > target.
> Do we? The limit is currently just per-bdi, isn't it? Or do you mean the
> ratelimiting - i.e. how often we call balance_dirty_pages()? That is
> per-cpu if I'm right.

bdi_dirty_limit() calls task_dirty_limit(), so it's also related to
the current task. For convenience we called it per-bdi writeback :)

> > When dirty pages go beyond B*80%, for example, we start
> > throttling the task's writeback throughput. The closer to B, the
> > lower the throughput; on reaching B or the global threshold, we completely
> > stop it. The hope is that the throughput will be sustained at some balance
> > point. This will need careful calculation to be stable and robust.
> But what exactly do you mean by throttling the task in your scenario?
> What would it wait on?

It will simply wait for, e.g., 10ms for every N pages written. The
closer to B, the smaller N will be.

Thanks,
Fengguang
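Wu's answer — sleep a fixed interval every N pages, with N shrinking as dirty pages approach B — can be sketched as a toy model (the 10ms interval is from the mail; N_MAX, the B*80% knee, and the linear scaling of N are illustrative assumptions):

```c
#define PAUSE_MS	10	/* fixed sleep per throttle point (from the mail) */
#define N_MAX		1024	/* pages between pauses far below B (assumed) */

/*
 * Toy model: how many pages a task may dirty before its next PAUSE_MS
 * sleep.  Far below the knee (B*80%) it pauses only every N_MAX pages;
 * at the limit B it pauses after every single page.
 */
static unsigned long pages_per_pause(unsigned long dirty, unsigned long B)
{
	unsigned long knee = B * 4 / 5;

	if (dirty <= knee)
		return N_MAX;
	if (dirty >= B)
		return 1;
	/* scale N down linearly as dirty approaches B */
	return 1 + (N_MAX - 1) * (B - dirty) / (B - knee);
}
```

Because the sleep interval is fixed, shrinking N smoothly lowers the task's effective dirtying throughput without ever waiting on a specific IO completion — which is what would make the stall time predictable for NFS.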
From: Wu Fengguang on 3 Aug 2010 11:10
Sorry, forgot the attachment :)

Thanks,
Fengguang

On Tue, Aug 03, 2010 at 11:04:46PM +0800, Wu Fengguang wrote:
> [...]