From: KAMEZAWA Hiroyuki on 14 Mar 2010 22:50

On Mon, 15 Mar 2010 00:26:37 +0100
Andrea Righi <arighi(a)develer.com> wrote:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>
> The overall design is the following:
>
>  - account dirty pages per cgroup
>  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
>    and memory.dirty_background_ratio / memory.dirty_background_bytes in
>    cgroupfs
>  - start to write-out (background or actively) when the cgroup limits are
>    exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in the VM
> layer and enforce a write-out before any cgroup will consume the global
> amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~
>  * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
>    is never called under tree_lock (no strict accounting, but better overall
>    performance)
>  * do not account file cache statistics for the root cgroup (zero
>    overhead for the root cgroup)
>  * fix: evaluate cgroup free pages as the minimum free pages of all
>    its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
>  - root cgroup:  11m51.983s
>  - child cgroup: 11m56.596s
>
> <after>
>  - root cgroup:  11m51.742s
>  - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with
> lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small
> difference is within the error range). I expected to see less overhead for
> the child cgroup; I'll do more testing and try to figure out better what's
> happening.

Okay, thanks. This seems like a good result. Optimization for children can be
done under the -mm tree, I think. (If there is no NACK, this seems ready for
testing in -mm.)

> In the meanwhile, it would be great if someone could perform some tests on a
> larger system... unfortunately at the moment I don't have a big system
> available for this kind of tests...

I hope so, too.

Thanks,
-Kame

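The trylock_page_cgroup() mentioned in the v7 changelog is a non-blocking
counterpart to lock_page_cgroup(). A minimal sketch of how such a helper and
its caller could look, assuming the existing PCG_LOCK bit-spinlock convention
from include/linux/page_cgroup.h; the caller name and the skip-the-update
policy are illustrative, not taken from the patch itself:

#include <linux/bit_spinlock.h>
#include <linux/page_cgroup.h>

/* Non-blocking counterpart to lock_page_cgroup(); non-zero on success. */
static inline int trylock_page_cgroup(struct page_cgroup *pc)
{
	return bit_spin_trylock(PCG_LOCK, &pc->flags);
}

/*
 * Hypothetical caller: when a page-stat update may run under tree_lock,
 * give up instead of spinning, so lock_page_cgroup() is never taken there.
 * Accounting becomes approximate, matching the "no strict accounting, but
 * better overall performance" trade-off described in the changelog.
 */
static void memcg_update_page_stat_noblock(struct page_cgroup *pc,
					   int idx, int val)
{
	if (!trylock_page_cgroup(pc))
		return;		/* skip the update, keep lock ordering safe */
	/* ... adjust the per-cgroup statistic 'idx' by 'val' here ... */
	unlock_page_cgroup(pc);
}
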
From: Andrea Righi on 15 Mar 2010 06:10

On Mon, Mar 15, 2010 at 11:36:12AM +0900, KAMEZAWA Hiroyuki wrote:
> On Mon, 15 Mar 2010 00:26:37 +0100
> Andrea Righi <arighi(a)develer.com> wrote:
>
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> >
> > Per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> >
> > The overall design is the following:
> >
> >  - account dirty pages per cgroup
> >  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> >    and memory.dirty_background_ratio / memory.dirty_background_bytes in
> >    cgroupfs
> >  - start to write-out (background or actively) when the cgroup limits are
> >    exceeded
> >
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in the VM
> > layer and enforce a write-out before any cgroup will consume the global
> > amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> >
> > Changelog (v6 -> v7)
> > ~~~~~~~~~~~~~~~~~~~~
> >  * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> >    is never called under tree_lock (no strict accounting, but better overall
> >    performance)
> >  * do not account file cache statistics for the root cgroup (zero
> >    overhead for the root cgroup)
> >  * fix: evaluate cgroup free pages as the minimum free pages of all
> >    its parents
> >
> > Results
> > ~~~~~~~
> > The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> > 1.2GHz:
> >
> > <before>
> >  - root cgroup:  11m51.983s
> >  - child cgroup: 11m56.596s
> >
> > <after>
> >  - root cgroup:  11m51.742s
> >  - child cgroup: 12m5.016s
> >
> > In the previous version of this patchset, using the "complex" locking scheme
> > with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the
> > child cgroup required 11m57.896s and 12m9.920s with
> > lock_page_cgroup()+irq_disabled.
> >
> > With this version there's no overhead for the root cgroup (the small
> > difference is within the error range). I expected to see less overhead for
> > the child cgroup; I'll do more testing and try to figure out better what's
> > happening.
>
> Okay, thanks. This seems like a good result. Optimization for children can be
> done under the -mm tree, I think. (If there is no NACK, this seems ready for
> testing in -mm.)

OK, I'll wait a bit to see if someone has other fixes or issues, and then post
a new version soon including these small changes.

Thanks,
-Andrea

From: Vivek Goyal on 15 Mar 2010 13:20

On Mon, Mar 15, 2010 at 12:26:37AM +0100, Andrea Righi wrote:
> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>

For me, even with this version, I see that the group with the 100M limit is
getting much more BW.

root cgroup
===========
# time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s

real    0m56.209s

test1 cgroup with memory limit of 100M
======================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s

real    0m21.096s

Note: these two jobs are not running in parallel. They are running one after
the other.

Vivek

> The overall design is the following:
>
>  - account dirty pages per cgroup
>  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
>    and memory.dirty_background_ratio / memory.dirty_background_bytes in
>    cgroupfs
>  - start to write-out (background or actively) when the cgroup limits are
>    exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in the VM
> layer and enforce a write-out before any cgroup will consume the global
> amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~
>  * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
>    is never called under tree_lock (no strict accounting, but better overall
>    performance)
>  * do not account file cache statistics for the root cgroup (zero
>    overhead for the root cgroup)
>  * fix: evaluate cgroup free pages as the minimum free pages of all
>    its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
>  - root cgroup:  11m51.983s
>  - child cgroup: 11m56.596s
>
> <after>
>  - root cgroup:  11m51.742s
>  - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with
> lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small
> difference is within the error range). I expected to see less overhead for
> the child cgroup; I'll do more testing and try to figure out better what's
> happening.
>
> In the meanwhile, it would be great if someone could perform some tests on a
> larger system... unfortunately at the moment I don't have a big system
> available for this kind of tests...
>
> Thanks,
> -Andrea
>
>  Documentation/cgroups/memory.txt |   36 +++
>  fs/nfs/write.c                   |    4 +
>  include/linux/memcontrol.h       |   87 ++++++-
>  include/linux/page_cgroup.h      |   35 +++
>  include/linux/writeback.h        |    2 -
>  mm/filemap.c                     |    1 +
>  mm/memcontrol.c                  |  542 +++++++++++++++++++++++++++++++++++---
>  mm/page-writeback.c              |  215 ++++++++++------
>  mm/rmap.c                        |    4 +-
>  mm/truncate.c                    |    1 +
>  10 files changed, 806 insertions(+), 121 deletions(-)

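The memory.dirty_bytes / memory.dirty_ratio pair quoted above is meant to
mirror the global vm.dirty_bytes / vm.dirty_ratio semantics: an absolute byte
value, when set, takes precedence, otherwise the ratio is applied to the
memory the cgroup can actually use. A rough sketch of that calculation (the
function name and parameters are illustrative, not the patch's actual
interface):

#include <linux/kernel.h>	/* DIV_ROUND_UP() */
#include <linux/mm.h>		/* PAGE_SIZE */

/*
 * Illustrative only: derive a cgroup's dirty threshold in pages from its
 * dirty_bytes / dirty_ratio settings, in the same way the global knobs are
 * handled in mm/page-writeback.c. 'available_pages' would be the memory the
 * cgroup may use (bounded by its own limit and its parents' limits).
 */
static unsigned long memcg_dirty_limit_pages(unsigned long available_pages,
					     unsigned long dirty_bytes,
					     unsigned int dirty_ratio)
{
	if (dirty_bytes)
		return DIV_ROUND_UP(dirty_bytes, PAGE_SIZE);

	return (available_pages * dirty_ratio) / 100;
}

The write-out trigger described in the design would then compare the cgroup's
dirty page count against this threshold (and the corresponding background
threshold) instead of only against the global one.
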
From: Vivek Goyal on 15 Mar 2010 13:30

On Mon, Mar 15, 2010 at 01:12:09PM -0400, Vivek Goyal wrote:
> On Mon, Mar 15, 2010 at 12:26:37AM +0100, Andrea Righi wrote:
> > Control the maximum amount of dirty pages a cgroup can have at any given time.
> >
> > Per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> > page cache used by any cgroup. So, in case of multiple cgroup writers, they
> > will not be able to consume more than their designated share of dirty pages and
> > will be forced to perform write-out if they cross that limit.
> >
>
> For me, even with this version, I see that the group with the 100M limit is
> getting much more BW.
>
> root cgroup
> ===========
> # time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
> 4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s
>
> real    0m56.209s
>
> test1 cgroup with memory limit of 100M
> ======================================
> # time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
> 4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s
>
> real    0m21.096s
>
> Note: these two jobs are not running in parallel. They are running one after
> the other.
>

OK, here is the strange part. I am seeing similar behavior even without your
patches applied.

root cgroup
===========
# time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 56.098 s, 76.6 MB/s

real    0m56.614s

test1 cgroup with memory limit 100M
===================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 19.8097 s, 217 MB/s

real    0m19.992s

Vivek

> > The overall design is the following:
> >
> >  - account dirty pages per cgroup
> >  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
> >    and memory.dirty_background_ratio / memory.dirty_background_bytes in
> >    cgroupfs
> >  - start to write-out (background or actively) when the cgroup limits are
> >    exceeded
> >
> > This feature is supposed to be strictly connected to any underlying IO
> > controller implementation, so we can stop increasing dirty pages in the VM
> > layer and enforce a write-out before any cgroup will consume the global
> > amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> > /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
> >
> > Changelog (v6 -> v7)
> > ~~~~~~~~~~~~~~~~~~~~
> >  * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
> >    is never called under tree_lock (no strict accounting, but better overall
> >    performance)
> >  * do not account file cache statistics for the root cgroup (zero
> >    overhead for the root cgroup)
> >  * fix: evaluate cgroup free pages as the minimum free pages of all
> >    its parents
> >
> > Results
> > ~~~~~~~
> > The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> > 1.2GHz:
> >
> > <before>
> >  - root cgroup:  11m51.983s
> >  - child cgroup: 11m56.596s
> >
> > <after>
> >  - root cgroup:  11m51.742s
> >  - child cgroup: 12m5.016s
> >
> > In the previous version of this patchset, using the "complex" locking scheme
> > with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the
> > child cgroup required 11m57.896s and 12m9.920s with
> > lock_page_cgroup()+irq_disabled.
> >
> > With this version there's no overhead for the root cgroup (the small
> > difference is within the error range). I expected to see less overhead for
> > the child cgroup; I'll do more testing and try to figure out better what's
> > happening.
> >
> > In the meanwhile, it would be great if someone could perform some tests on a
> > larger system... unfortunately at the moment I don't have a big system
> > available for this kind of tests...
> >
> > Thanks,
> > -Andrea
> >
> >  Documentation/cgroups/memory.txt |   36 +++
> >  fs/nfs/write.c                   |    4 +
> >  include/linux/memcontrol.h       |   87 ++++++-
> >  include/linux/page_cgroup.h      |   35 +++
> >  include/linux/writeback.h        |    2 -
> >  mm/filemap.c                     |    1 +
> >  mm/memcontrol.c                  |  542 +++++++++++++++++++++++++++++++++++---
> >  mm/page-writeback.c              |  215 ++++++++++------
> >  mm/rmap.c                        |    4 +-
> >  mm/truncate.c                    |    1 +
> >  10 files changed, 806 insertions(+), 121 deletions(-)

From: Balbir Singh on 17 Mar 2010 02:50

* Andrea Righi <arighi(a)develer.com> [2010-03-15 00:26:37]:

> Control the maximum amount of dirty pages a cgroup can have at any given time.
>
> Per-cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
> page cache used by any cgroup. So, in case of multiple cgroup writers, they
> will not be able to consume more than their designated share of dirty pages and
> will be forced to perform write-out if they cross that limit.
>
> The overall design is the following:
>
>  - account dirty pages per cgroup
>  - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
>    and memory.dirty_background_ratio / memory.dirty_background_bytes in
>    cgroupfs
>  - start to write-out (background or actively) when the cgroup limits are
>    exceeded
>
> This feature is supposed to be strictly connected to any underlying IO
> controller implementation, so we can stop increasing dirty pages in the VM
> layer and enforce a write-out before any cgroup will consume the global
> amount of dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
> /proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.
>
> Changelog (v6 -> v7)
> ~~~~~~~~~~~~~~~~~~~~
>  * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
>    is never called under tree_lock (no strict accounting, but better overall
>    performance)
>  * do not account file cache statistics for the root cgroup (zero
>    overhead for the root cgroup)
>  * fix: evaluate cgroup free pages as the minimum free pages of all
>    its parents
>
> Results
> ~~~~~~~
> The testcase is a kernel build (2.6.33 x86_64_defconfig) on an Intel Core 2 @
> 1.2GHz:
>
> <before>
>  - root cgroup:  11m51.983s
>  - child cgroup: 11m56.596s
>
> <after>
>  - root cgroup:  11m51.742s
>  - child cgroup: 12m5.016s
>
> In the previous version of this patchset, using the "complex" locking scheme
> with the _locked and _unlocked versions of mem_cgroup_update_page_stat(), the
> child cgroup required 11m57.896s and 12m9.920s with
> lock_page_cgroup()+irq_disabled.
>
> With this version there's no overhead for the root cgroup (the small
> difference is within the error range). I expected to see less overhead for
> the child cgroup; I'll do more testing and try to figure out better what's
> happening.

I like that the root overhead is going away.

> In the meanwhile, it would be great if someone could perform some tests on a
> larger system... unfortunately at the moment I don't have a big system
> available for this kind of tests...
>

I'll test this. I have a small machine to test on at the moment; I'll report
back with data.

-- 
Three Cheers,
Balbir

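The "evaluate cgroup free pages as the minimum free pages of all its parents"
fix in the changelog can be pictured as a walk up the hierarchy that keeps the
most restrictive value. The sketch below is illustrative only: it assumes it
lives in mm/memcontrol.c (where struct mem_cgroup is visible), and the
parent_of() helper is a made-up stand-in for the patch's actual parent walk.

#include <linux/kernel.h>
#include <linux/mm.h>
#include <linux/res_counter.h>

/* Hypothetical: walk to the parent via the res_counter hierarchy. */
static struct mem_cgroup *parent_of(struct mem_cgroup *memcg)
{
	if (!memcg->res.parent)
		return NULL;
	return container_of(memcg->res.parent, struct mem_cgroup, res);
}

/*
 * Illustrative only: a cgroup can never have more usable free pages than
 * any of its ancestors allows, so take the minimum along the path to root.
 */
static unsigned long memcg_min_free_pages(struct mem_cgroup *memcg)
{
	unsigned long min_free = ULONG_MAX;

	for (; memcg; memcg = parent_of(memcg)) {
		u64 limit = res_counter_read_u64(&memcg->res, RES_LIMIT);
		u64 usage = res_counter_read_u64(&memcg->res, RES_USAGE);
		unsigned long free = limit > usage ?
			(unsigned long)((limit - usage) >> PAGE_SHIFT) : 0;

		min_free = min(min_free, free);
	}

	return min_free;
}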