From: Christoph Hellwig on 22 Jul 2010 02:00

On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> On high end storage (I got an HP EVA storage array with 12 SATA disks in
> RAID 5),

That's actually quite low end storage for a server these days :)

> So this is not the default mode. This new tunable group_idle allows one to
> set slice_idle=0 to disable some of the CFQ features and use primarily the
> group service differentiation feature.

While this is better than before, needing a sysfs tweak to get any
performance out of any kind of server class hardware is still pretty
horrible. And slice_idle=0 is not exactly the most obvious parameter I
would look for either. So having some way to automatically disable this
mode based on hardware characteristics would be really useful, and if
that's not possible, at least make sure it's very obviously documented
and easily found using web searches.

Btw, what effect does slice_idle=0 with your patches have on single SATA
disk and single SSD setups?
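For context, the sysfs tweak being discussed is CFQ's slice_idle tunable,
exposed per block device. A minimal sketch of where it lives, with sdb as a
placeholder device name:

    # Confirm cfq is the active scheduler and inspect its tunables
    cat /sys/block/sdb/queue/scheduler           # e.g. noop deadline [cfq]
    ls /sys/block/sdb/queue/iosched/             # slice_idle, quantum, ... live here
    cat /sys/block/sdb/queue/iosched/slice_idle  # default is 8 (ms)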
From: Vivek Goyal on 22 Jul 2010 10:10

On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)

Yes it is. Just that this is the best I got access to. :-)

> > So this is not the default mode. This new tunable group_idle allows one to
> > set slice_idle=0 to disable some of the CFQ features and use primarily the
> > group service differentiation feature.
>
> While this is better than before, needing a sysfs tweak to get any
> performance out of any kind of server class hardware is still pretty
> horrible. And slice_idle=0 is not exactly the most obvious parameter I
> would look for either. So having some way to automatically disable this
> mode based on hardware characteristics would be really useful,

An IO scheduler able to change its behavior based on underlying storage
properties is the ideal and most convenient thing. For that we will need
some kind of auto tuning feature in CFQ, where we monitor the ongoing IO
(for sequentiality, for block size) and then try to make some predictions
about the storage properties.

Auto tuning is a little hard to implement. So I thought that as a first
step we can make sure things work reasonably well with the help of
tunables, and then look into auto tuning the stuff.

I was actually thinking of writing a user space utility which can issue
some specific IO patterns to the disk/lun and set up some IO scheduler
tunables automatically.

> and if that's not possible, at least make sure it's very obviously
> documented and easily found using web searches.

Sure. I think I will create a new file Documentation/block/cfq-iosched.txt
and document this new mode there. Because this mode is primarily useful
for group scheduling, I will also add some info in
Documentation/cgroups/blkio-controller.txt.

> Btw, what effect does slice_idle=0 with your patches have on single SATA
> disk and single SSD setups?

I am not expecting any major effect of IOPS mode on a non-group setup on
any kind of storage.

IOW, currently if one sets slice_idle=0 in CFQ, then we kind of become
almost like deadline (with some differences here and there). The notion of
ioprio almost disappears, except that in some cases you can still see some
service differentiation among queues of different prio levels.

With this patchset, one would switch to IOPS mode with slice_idle=0. We
will still show deadline-ish behavior. The only difference will be that
there will be no service differentiation among ioprio levels. I am not
bothering about fixing it currently because in slice_idle=0 mode the
notion of ioprio is so weak and unpredictable that I think it is not worth
fixing at this point of time.

If somebody is looking for service differentiation with slice_idle=0,
using cgroups might turn out to be a better bet.

In summary, in a non-cgroup setup with slice_idle=0, one should not see a
significant change with this patchset on any kind of storage. With
slice_idle=0, CFQ stops idling and achieves much better throughput, and
even in IOPS mode it will continue doing that. The difference is primarily
visible to cgroup users, where we get better accounting done in IOPS mode
and are able to provide service differentiation among groups in a more
predictable manner.
Thanks
Vivek
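A minimal sketch of the tunable combination Vivek describes, assuming the
cfq scheduler is active on a placeholder device sdb and that group_idle
only exists once this patchset is applied:

    # Switch CFQ toward the group-oriented IOPS mode described above
    echo 0 > /sys/block/sdb/queue/iosched/slice_idle  # stop idling per cfq queue
    echo 8 > /sys/block/sdb/queue/iosched/group_idle  # still idle per group (new tunable from this patchset)
    echo 8 > /sys/block/sdb/queue/iosched/quantum     # allow deeper per-queue dispatch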
From: Vivek Goyal on 22 Jul 2010 10:50

On Thu, Jul 22, 2010 at 03:08:00PM +0800, Gui Jianfeng wrote:
> Vivek Goyal wrote:
> > Hi,
> >
> > This is V3 of the group_idle and CFQ IOPS mode implementation patchset.
> > Since V2 I have cleaned up the code a bit to clarify the confusion
> > lingering around in what cases we charge the time slice and in what
> > cases we charge the number of requests.
> >
> > What's the problem
> > ------------------
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5), CFQ's model of dispatching requests from a single queue at a
> > time (sequential readers/sync writers etc.) becomes a bottleneck.
> > Often we don't drive enough request queue depth to keep all the disks
> > busy and suffer a lot in terms of overall throughput.
> >
> > All these problems primarily originate from two things: idling on a
> > per-cfq-queue basis, and the quantum (dispatching only a limited number
> > of requests from a single queue) while not allowing dispatch from other
> > queues till then. Once you set slice_idle=0 and quantum to a higher
> > value, most of CFQ's problems on higher end storage disappear.
> >
> > This problem also becomes visible in the IO controller, where one
> > creates multiple groups and gets the fairness, but overall throughput is
> > less. In the following table, I am running an increasing number of
> > sequential readers (1, 2, 4, 8) in 8 groups of weight 100 to 800.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1        NRGRP=8
> > DIR=/mnt/iostestmnt/fio   DEV=/dev/dm-4
> > Workload=bsr       iosched=cfq    Filesz=512M  bs=4K
> > group_isolation=1  slice_idle=8   group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr]    [bw in KB/s]
> > -------
> > job  Set NR  cgrp1  cgrp2  cgrp3  cgrp4  cgrp5  cgrp6  cgrp7  cgrp8   total
> > ---  --- --  -----------------------------------------------------------------
> > bsr  3   1   6186   12752  16568  23068  28608  35785  42322  48409   213701
> > bsr  3   2   5396   10902  16959  23471  25099  30643  37168  42820   192461
> > bsr  3   4   4655   9463   14042  20537  24074  28499  34679  37895   173847
> > bsr  3   8   4418   8783   12625  19015  21933  26354  29830  36290   159249
> >
> > Notice that overall throughput is just around 160MB/s with 8 sequential
> > readers in each group.
> >
> > With this patch set, I have set slice_idle=0 and re-ran the same test.
> >
> > Kernel=2.6.35-rc5-iops+
> > GROUPMODE=1        NRGRP=8
> > DIR=/mnt/iostestmnt/fio   DEV=/dev/dm-4
> > Workload=bsr       iosched=cfq    Filesz=512M  bs=4K
> > group_isolation=1  slice_idle=0   group_idle=8 quantum=8
> > =========================================================================
> > AVERAGE[bsr]    [bw in KB/s]
> > -------
> > job  Set NR  cgrp1  cgrp2  cgrp3  cgrp4  cgrp5  cgrp6  cgrp7  cgrp8   total
> > ---  --- --  -----------------------------------------------------------------
> > bsr  3   1   6523   12399  18116  24752  30481  36144  42185  48894   219496
> > bsr  3   2   10072  20078  29614  38378  46354  52513  58315  64833   320159
> > bsr  3   4   11045  22340  33013  44330  52663  58254  63883  70990   356520
> > bsr  3   8   12362  25860  37920  47486  61415  47292  45581  70828   348747
> >
> > Notice how overall throughput has shot up to 348MB/s while retaining the
> > ability to do the IO control.
> >
> > So this is not the default mode. This new tunable group_idle allows one
> > to set slice_idle=0 to disable some of the CFQ features and use
> > primarily the group service differentiation feature.
> >
> > If you have thoughts on other ways of solving the problem, I am all
> > ears to it.
>
> Hi Vivek,
>
> Would you attach your fio job config file?
>

Hi Gui,

I have written a fio based test script, "iostest", to be able to do cgroup
and other IO scheduler testing more smoothly, and I am using that. I am
attaching the compressed script with the mail. Try using it, and if it
works for you and you find it useful, I can think of hosting a git tree
somewhere.

I used the following command lines to run the above tests.

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total

With slice idle disabled:

# iostest <block-device> -G -w bsr -m 8 -c --nrgrp 8 --total -I 0

Thanks
Vivek
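Vivek's actual fio job file is not shown in the thread; the iostest script
drives fio internally. As a rough, hypothetical sketch of an equivalent
manual setup (the mount point, group names, file paths and the use of
fio's cgroup option are assumptions, not taken from the thread):

    # Rough sketch only -- not Vivek's iostest script or his actual fio job.
    # Mounts the blkio controller and creates 8 groups with weights 100..800.
    mount -t cgroup -o blkio none /cgroup/blkio

    for i in 1 2 3 4 5 6 7 8; do
        mkdir -p /cgroup/blkio/grp$i
        echo $((i * 100)) > /cgroup/blkio/grp$i/blkio.weight
    done

    # One buffered sequential reader (the "bsr" workload) per group,
    # assuming 512M test files already exist under /mnt/iostestmnt/fio.
    for i in 1 2 3 4 5 6 7 8; do
        fio --name=bsr$i --rw=read --bs=4k --size=512M \
            --filename=/mnt/iostestmnt/fio/file$i \
            --cgroup=grp$i --output=bsr$i.log &
    done
    wait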
From: Vivek Goyal on 22 Jul 2010 17:00

On Thu, Jul 22, 2010 at 01:56:02AM -0400, Christoph Hellwig wrote:
> On Wed, Jul 21, 2010 at 03:06:18PM -0400, Vivek Goyal wrote:
> > On high end storage (I got an HP EVA storage array with 12 SATA disks in
> > RAID 5),
>
> That's actually quite low end storage for a server these days :)
>
> > So this is not the default mode. This new tunable group_idle allows one to
> > set slice_idle=0 to disable some of the CFQ features and use primarily the
> > group service differentiation feature.
>
> While this is better than before, needing a sysfs tweak to get any
> performance out of any kind of server class hardware is still pretty
> horrible. And slice_idle=0 is not exactly the most obvious parameter I
> would look for either. So having some way to automatically disable this
> mode based on hardware characteristics would be really useful, and if
> that's not possible, at least make sure it's very obviously documented
> and easily found using web searches.
>
> Btw, what effect does slice_idle=0 with your patches have on single SATA
> disk and single SSD setups?

Well, after responding to your mail in the morning, I realized that it was
a twisted answer and not very clear. That forced me to change the patch a
bit.

With the new patches (yet to be posted), the answer to your question is
that nothing will change for SATA or SSD setups with slice_idle=0 with my
patches.

Why? CFQ is using two different algorithms for cfq queue and cfq group
scheduling. This IOPS mode will only affect group scheduling and not the
cfqq scheduling. So switching to IOPS mode should not change anything for
non-cgroup users on all kinds of storage. It will impact only group
scheduling users, who will start seeing fairness among groups in terms of
IOPS and not time.

Of course slice_idle needs to be set to 0 only on high end storage, so
that we get fairness among groups in IOPS and at the same time achieve the
full potential of the storage box.

Thanks
Vivek
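One way to observe the distinction Vivek describes is through the
per-group statistics exported by the blkio controller. A sketch, assuming
a blkio hierarchy mounted at /cgroup/blkio with groups grp1 and grp2
already created:

    # Time-based fairness shows up in blkio.time (disk time per group);
    # IOPS-based fairness shows up in blkio.io_serviced (requests per group).
    for g in grp1 grp2; do
        echo "== $g =="
        cat /cgroup/blkio/$g/blkio.time         # disk time allocated, per device
        cat /cgroup/blkio/$g/blkio.io_serviced  # number of IOs serviced, per device
    done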
From: Christoph Hellwig on 24 Jul 2010 05:00
To me this sounds like slice_idle=0 is the right default then, as it gives
useful behaviour for all systems Linux runs on. Setups with more than a
few spindles are for sure more common than setups making use of cgroups,
especially given that cgroups are more of a high end feature you'd rarely
use on a single SATA spindle anyway. So setting a parameter to make this
useful sounds like the much better option.

Especially given that the block cgroup code doesn't work particularly well
in the presence of barriers, which are on for any kind of real life
production setup anyway.