From: Ryo Tsuruta on 5 Oct 2009 11:10 Hi Vivek, Vivek Goyal <vgoyal(a)redhat.com> wrote: > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > Hi, > > > > Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote: > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > going through the request based dm-multipath paper. Will it make sense > > > > to implement request based dm-ioband? So basically we implement all the > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > to take the request and break it back into bios. This way we can keep > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > has passed group control and prio control. Because dm-ioband is a device > > > > mapper target, one can put it on higher level devices (practically taking > > > > CFQ at higher level device), and provide fairness there. One can also > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > them to use the IO scheduler.) > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > CFQ thinks that there is one device beneath it and dipsatches requests > > > > from one queue (in case of idling) and that would kill parallelism at > > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > > > Thanks > > > > Vivek > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > group isolation in your patch is to have per-group IO scheduler internal > > > queue even with as, deadline, and noop scheduler. I think this is > > > great idea, and to implement generic code for all IO schedulers was > > > concluded when we had so many IO scheduler specific proposals. > > > If we will still need per-group IO scheduler internal queues with > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > out of scope of dm. > > > I might miss something... > > > > IIUC, the request based device-mapper could not break back a request > > into bio, so it could not work with block devices which don't use the > > IO scheduler. > > > > I think current request based multipath drvier does not do it but can't it > be implemented that requests are broken back into bio? I guess it would be hard to implement it, and we need to hold requests and throttle them at there and it would break the ordering by CFQ. > Anyway, I don't feel too strongly about this approach as it might > introduce more serialization at higher layer. Yes, I know it. > > How about adding a callback function to the higher level controller? > > CFQ calls it when the active queue runs out of time, then the higer > > level controller use it as a trigger or a hint to move IO group, so > > I think a time-based controller could be implemented at higher level. > > > > Adding a call back should not be a big issue. But that means you are > planning to run only one group at higher layer at one time and I think > that's the problem because than we are introducing serialization at higher > layer. So any higher level device mapper target which has multiple > physical disks under it, we might be underutilizing these even more and > take a big hit on overall throughput. 
>
> The whole design of doing proportional weight at lower layer is optimal
> usage of system.

But I think that the higher level approach makes it easy to configure
against striped software raid devices. If one would like to combine some
physical disks into one logical device like a dm-linear, I think one
should map the IO controller on each physical device and combine them
into one logical device.

> > My requirements for IO controller are:
> > - Implement a higher level controller, which is located at block
> >   layer and bio is grabbed in generic_make_request().
>
> How are you planning to handle the issue of buffered writes Andrew raised?

I think that it would be better to use the higher-level controller along
with the memory controller and have limits on memory usage for each
cgroup. And as Kamezawa-san said, having limits on dirty pages would be
better, too.

> > - Can work with any type of IO scheduler.
> > - Can work with any type of block devices.
> > - Support multiple policies: proportional weight, max rate, time
> >   based, and so on.
> >
> > The IO controller mini-summit will be held next week, and I'm
> > looking forward to meeting you all and discussing the IO controller.
> > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
>
> Is there a new version of dm-ioband now where you have solved the issue of
> sync/async dispatch within a group? Before meeting at the mini-summit, I am
> trying to run some tests and come up with numbers so that we have a more
> clear picture of pros/cons.

Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
dm-ioband handles sync/async IO requests separately and the
write-starve-read issue you pointed out is fixed. I would appreciate it
if you would try them.
http://sourceforge.net/projects/ioband/files/

Thanks,
Ryo Tsuruta
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
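To make the stacking Ryo describes above concrete, here is a rough sketch of putting a per-physical-device bandwidth-control target under a two-disk dm-linear volume. The disk names are examples, and the ioband table line follows the dm-ioband documentation of that era, so treat the exact parameters as assumptions rather than a definitive recipe.

-------------------------------------------------------------------------
# One control device per physical disk, then a dm-linear device on top.
# /dev/sdb and /dev/sdc are placeholders; the "ioband" table layout
# (device, ioband-device id, io_throttle, io_limit, group type, policy,
# token base, :weight) is taken from dm-ioband's docs of the time and
# may differ between versions.
SZ_B=$(blockdev --getsz /dev/sdb)
SZ_C=$(blockdev --getsz /dev/sdc)

echo "0 $SZ_B ioband /dev/sdb 1 0 0 none weight 0 :100" | dmsetup create ioband-sdb
echo "0 $SZ_C ioband /dev/sdc 1 0 0 none weight 0 :100" | dmsetup create ioband-sdc

# Concatenate the two controlled devices into one logical device.
dmsetup create combined <<EOF
0 $SZ_B linear /dev/mapper/ioband-sdb 0
$SZ_B $SZ_C linear /dev/mapper/ioband-sdc 0
EOF
-------------------------------------------------------------------------

This per-device setup step is exactly what Vivek objects to in the follow-ups, and what a per-request-queue implementation would be intended to remove.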
From: Vivek Goyal on 5 Oct 2009 13:20 On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote: > Hi Vivek, > > Vivek Goyal <vgoyal(a)redhat.com> wrote: > > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: > > > Hi, > > > > > > Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote: > > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: > > > > > Before finishing this mail, will throw a whacky idea in the ring. I was > > > > > going through the request based dm-multipath paper. Will it make sense > > > > > to implement request based dm-ioband? So basically we implement all the > > > > > group scheduling in CFQ and let dm-ioband implement a request function > > > > > to take the request and break it back into bios. This way we can keep > > > > > all the group control at one place and also meet most of the requirements. > > > > > > > > > > So request based dm-ioband will have a request in hand once that request > > > > > has passed group control and prio control. Because dm-ioband is a device > > > > > mapper target, one can put it on higher level devices (practically taking > > > > > CFQ at higher level device), and provide fairness there. One can also > > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing > > > > > them to use the IO scheduler.) > > > > > > > > > > I am sure that will be many issues but one big issue I could think of that > > > > > CFQ thinks that there is one device beneath it and dipsatches requests > > > > > from one queue (in case of idling) and that would kill parallelism at > > > > > higher layer and throughput will suffer on many of the dm/md configurations. > > > > > > > > > > Thanks > > > > > Vivek > > > > > > > > As long as using CFQ, your idea is reasonable for me. But how about for > > > > other IO schedulers? In my understanding, one of the keys to guarantee > > > > group isolation in your patch is to have per-group IO scheduler internal > > > > queue even with as, deadline, and noop scheduler. I think this is > > > > great idea, and to implement generic code for all IO schedulers was > > > > concluded when we had so many IO scheduler specific proposals. > > > > If we will still need per-group IO scheduler internal queues with > > > > request-based dm-ioband, we have to modify elevator layer. It seems > > > > out of scope of dm. > > > > I might miss something... > > > > > > IIUC, the request based device-mapper could not break back a request > > > into bio, so it could not work with block devices which don't use the > > > IO scheduler. > > > > > > > I think current request based multipath drvier does not do it but can't it > > be implemented that requests are broken back into bio? > > I guess it would be hard to implement it, and we need to hold requests > and throttle them at there and it would break the ordering by CFQ. > > > Anyway, I don't feel too strongly about this approach as it might > > introduce more serialization at higher layer. > > Yes, I know it. > > > > How about adding a callback function to the higher level controller? > > > CFQ calls it when the active queue runs out of time, then the higer > > > level controller use it as a trigger or a hint to move IO group, so > > > I think a time-based controller could be implemented at higher level. > > > > > > > Adding a call back should not be a big issue. But that means you are > > planning to run only one group at higher layer at one time and I think > > that's the problem because than we are introducing serialization at higher > > layer. 
So any higher level device mapper target which has multiple > > physical disks under it, we might be underutilizing these even more and > > take a big hit on overall throughput. > > > > The whole design of doing proportional weight at lower layer is optimial > > usage of system. > > But I think that the higher level approch makes easy to configure > against striped software raid devices. How does it make easier to configure in case of higher level controller? In case of lower level design, one just have to create cgroups and assign weights to cgroups. This mininum step will be required in higher level controller also. (Even if you get rid of dm-ioband device setup step). > If one would like to > combine some physical disks into one logical device like a dm-linear, > I think one should map the IO controller on each physical device and > combine them into one logical device. > In fact this sounds like a more complicated step where one has to setup one dm-ioband device on top of each physical device. But I am assuming that this will go away once you move to per reuqest queue like implementation. I think it should be same in principal as my initial implementation of IO controller on request queue and I stopped development on it because of FIFO dispatch. So you seem to be suggesting that you will move dm-ioband to request queue so that setting up additional device setup is gone. You will also enable it to do time based groups policy, so that we don't run into issues on seeky media. Will also enable dispatch from one group only at a time so that we don't run into isolation issues and can do time accounting accruately. If yes, then that has the potential to solve the issue. At higher layer one can think of enabling size of IO/number of IO policy both for proportional BW and max BW type of control. At lower level one can enable pure time based control on seeky media. I think this will still left with the issue of prio with-in group as group control is separate and you will not be maintatinig separate queues for each process. Similarly you will also have isseus with read vs write ratios as IO schedulers underneath change. So I will be curious to see that implementation. > > > My requirements for IO controller are: > > > - Implement s a higher level controller, which is located at block > > > layer and bio is grabbed in generic_make_request(). > > > > How are you planning to handle the issue of buffered writes Andrew raised? > > I think that it would be better to use the higher-level controller > along with the memory controller and have limits memory usage for each > cgroup. And as Kamezawa-san said, having limits of dirty pages would > be better, too. > Ok. So if we plan to co-mount memory controller with per memory group dirty_ratio implemented, that can work with both higher level as well as low level controller. Not sure if we also require some kind of a per memory group flusher thread infrastructure also to make sure higher weight group gets more job done. > > > - Can work with any type of IO scheduler. > > > - Can work with any type of block devices. > > > - Support multiple policies, proportional wegiht, max rate, time > > > based, ans so on. > > > > > > The IO controller mini-summit will be held in next week, and I'm > > > looking forard to meet you all and discuss about IO controller. > > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit > > > > Is there a new version of dm-ioband now where you have solved the issue of > > sync/async dispatch with-in group? 
Before meeting at mini-summit, I am > > trying to run some tests and come up with numbers so that we have more > > clear picture of pros/cons. > > Yes, I've released new versions of dm-ioband and blkio-cgroup. The new > dm-ioband handles sync/async IO requests separately and > the write-starve-read issue you pointed out is fixed. I would > appreciate it if you would try them. > http://sourceforge.net/projects/ioband/files/ Cool. Will get to testing it. Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
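For comparison, the cgroup-only configuration Vivek refers to ("one just has to create cgroups and assign weights") looks roughly like the sketch below. The controller and file names follow the blkio cgroup controller that was merged into mainline later; the patch set under discussion exposed similar per-cgroup weight files, so treat the exact names as assumptions.

-------------------------------------------------------------------------
# Minimal sketch: create two cgroups and give group1 twice the IO share.
mkdir -p /cgroup/blkio
mount -t cgroup -o blkio none /cgroup/blkio

mkdir /cgroup/blkio/group1 /cgroup/blkio/group2
echo 200 > /cgroup/blkio/group1/blkio.weight    # 2:1 share versus group2
echo 100 > /cgroup/blkio/group2/blkio.weight

echo $$ > /cgroup/blkio/group1/tasks            # move this shell into group1
-------------------------------------------------------------------------

No per-device mapper target is created here; the weights apply to whatever devices the groups end up doing IO against.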
From: Nauman Rafique on 5 Oct 2009 14:20 On Mon, Oct 5, 2009 at 10:10 AM, Vivek Goyal <vgoyal(a)redhat.com> wrote: > On Mon, Oct 05, 2009 at 11:55:35PM +0900, Ryo Tsuruta wrote: >> Hi Vivek, >> >> Vivek Goyal <vgoyal(a)redhat.com> wrote: >> > On Mon, Oct 05, 2009 at 07:38:08PM +0900, Ryo Tsuruta wrote: >> > > Hi, >> > > >> > > Munehiro Ikeda <m-ikeda(a)ds.jp.nec.com> wrote: >> > > > Vivek Goyal wrote, on 10/01/2009 10:57 PM: >> > > > > Before finishing this mail, will throw a whacky idea in the ring. I was >> > > > > going through the request based dm-multipath paper. Will it make sense >> > > > > to implement request based dm-ioband? So basically we implement all the >> > > > > group scheduling in CFQ and let dm-ioband implement a request function >> > > > > to take the request and break it back into bios. This way we can keep >> > > > > all the group control at one place and also meet most of the requirements. >> > > > > >> > > > > So request based dm-ioband will have a request in hand once that request >> > > > > has passed group control and prio control. Because dm-ioband is a device >> > > > > mapper target, one can put it on higher level devices (practically taking >> > > > > CFQ at higher level device), and provide fairness there. One can also >> > > > > put it on those SSDs which don't use IO scheduler (this is kind of forcing >> > > > > them to use the IO scheduler.) >> > > > > >> > > > > I am sure that will be many issues but one big issue I could think of that >> > > > > CFQ thinks that there is one device beneath it and dipsatches requests >> > > > > from one queue (in case of idling) and that would kill parallelism at >> > > > > higher layer and throughput will suffer on many of the dm/md configurations. >> > > > > >> > > > > Thanks >> > > > > Vivek >> > > > >> > > > As long as using CFQ, your idea is reasonable for me. �But how about for >> > > > other IO schedulers? �In my understanding, one of the keys to guarantee >> > > > group isolation in your patch is to have per-group IO scheduler internal >> > > > queue even with as, deadline, and noop scheduler. �I think this is >> > > > great idea, and to implement generic code for all IO schedulers was >> > > > concluded when we had so many IO scheduler specific proposals. >> > > > If we will still need per-group IO scheduler internal queues with >> > > > request-based dm-ioband, we have to modify elevator layer. �It seems >> > > > out of scope of dm. >> > > > I might miss something... >> > > >> > > IIUC, the request based device-mapper could not break back a request >> > > into bio, so it could not work with block devices which don't use the >> > > IO scheduler. >> > > >> > >> > I think current request based multipath drvier does not do it but can't it >> > be implemented that requests are broken back into bio? >> >> I guess it would be hard to implement it, and we need to hold requests >> and throttle them at there and it would break the ordering by CFQ. >> >> > Anyway, I don't feel too strongly about this approach as it might >> > introduce more serialization at higher layer. >> >> Yes, I know it. >> >> > > How about adding a callback function to the higher level controller? >> > > CFQ calls it when the active queue runs out of time, then the higer >> > > level controller use it as a trigger or a hint to move IO group, so >> > > I think a time-based controller could be implemented at higher level. >> > > >> > >> > Adding a call back should not be a big issue. 
But that means you are >> > planning to run only one group at higher layer at one time and I think >> > that's the problem because than we are introducing serialization at higher >> > layer. So any higher level device mapper target which has multiple >> > physical disks under it, we might be underutilizing these even more and >> > take a big hit on overall throughput. >> > >> > The whole design of doing proportional weight at lower layer is optimial >> > usage of system. >> >> But I think that the higher level approch makes easy to configure >> against striped software raid devices. > > How does it make easier to configure in case of higher level controller? > > In case of lower level design, one just have to create cgroups and assign > weights to cgroups. This mininum step will be required in higher level > controller also. (Even if you get rid of dm-ioband device setup step). > >> If one would like to >> combine some physical disks into one logical device like a dm-linear, >> I think one should map the IO controller on each physical device and >> combine them into one logical device. >> > > In fact this sounds like a more complicated step where one has to setup > one dm-ioband device on top of each physical device. But I am assuming > that this will go away once you move to per reuqest queue like implementation. > > I think it should be same in principal as my initial implementation of IO > controller on request queue and I stopped development on it because of FIFO > dispatch. > > So you seem to be suggesting that you will move dm-ioband to request queue > so that setting up additional device setup is gone. You will also enable > it to do time based groups policy, so that we don't run into issues on > seeky media. Will also enable dispatch from one group only at a time so > that we don't run into isolation issues and can do time accounting > accruately. Will that approach solve the problem of doing bandwidth control on logical devices? What would be the advantages compared to Vivek's current patches? > > If yes, then that has the potential to solve the issue. At higher layer one > can think of enabling size of IO/number of IO policy both for proportional > BW and max BW type of control. At lower level one can enable pure time > based control on seeky media. > > I think this will still left with the issue of prio with-in group as group > control is separate and you will not be maintatinig separate queues for > each process. Similarly you will also have isseus with read vs write > ratios as IO schedulers underneath change. > > So I will be curious to see that implementation. > >> > > My requirements for IO controller are: >> > > - Implement s a higher level controller, which is located at block >> > > � layer and bio is grabbed in generic_make_request(). >> > >> > How are you planning to handle the issue of buffered writes Andrew raised? >> >> I think that it would be better to use the higher-level controller >> along with the memory controller and have limits memory usage for each >> cgroup. And as Kamezawa-san said, having limits of dirty pages would >> be better, too. >> > > Ok. So if we plan to co-mount memory controller with per memory group > dirty_ratio implemented, that can work with both higher level as well as > low level controller. Not sure if we also require some kind of a per > memory group flusher thread infrastructure also to make sure higher weight > group gets more job done. > >> > > - Can work with any type of IO scheduler. 
>> > > - Can work with any type of block devices. >> > > - Support multiple policies, proportional wegiht, max rate, time >> > > � based, ans so on. >> > > >> > > The IO controller mini-summit will be held in next week, and I'm >> > > looking forard to meet you all and discuss about IO controller. >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit >> > >> > Is there a new version of dm-ioband now where you have solved the issue of >> > sync/async dispatch with-in group? Before meeting at mini-summit, I am >> > trying to run some tests and come up with numbers so that we have more >> > clear picture of pros/cons. >> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new >> dm-ioband handles sync/async IO requests separately and >> the write-starve-read issue you pointed out is fixed. I would >> appreciate it if you would try them. >> http://sourceforge.net/projects/ioband/files/ > > Cool. Will get to testing it. > > Thanks > Vivek > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
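The BE7 and BE0 labels in the measurements Ryo posts in the next message are ionice best-effort priorities (class 2, levels 7 and 0); class 1 is the real-time class used for one of the readers in his script. As a quick reference, a minimal sketch of launching fio readers at those priorities (job names, path, and sizes are made up):

-------------------------------------------------------------------------
ARGS="--rw=read --directory=/mnt1 --size=256M --runtime=30 --time_based \
      --group_reporting"

ionice -c 2 -n 7 fio $ARGS --name=be7 --numjobs=16 &  # best-effort, prio 7 (lowest)
ionice -c 2 -n 0 fio $ARGS --name=be0 --numjobs=1  &  # best-effort, prio 0 (highest)
ionice -c 1 -n 0 fio $ARGS --name=rt  --numjobs=1  &  # real-time class
wait
-------------------------------------------------------------------------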
From: Ryo Tsuruta on 6 Oct 2009 03:30

Hi Vivek and Nauman,

Nauman Rafique <nauman(a)google.com> wrote:
> >> > > How about adding a callback function to the higher level controller?
> >> > > CFQ calls it when the active queue runs out of time, then the higher
> >> > > level controller uses it as a trigger or a hint to move the IO group, so
> >> > > I think a time-based controller could be implemented at higher level.
> >> > >
> >> >
> >> > Adding a call back should not be a big issue. But that means you are
> >> > planning to run only one group at higher layer at one time and I think
> >> > that's the problem because then we are introducing serialization at higher
> >> > layer. So any higher level device mapper target which has multiple
> >> > physical disks under it, we might be underutilizing these even more and
> >> > take a big hit on overall throughput.
> >> >
> >> > The whole design of doing proportional weight at lower layer is optimal
> >> > usage of system.
> >>
> >> But I think that the higher level approach makes it easy to configure
> >> against striped software raid devices.
> >
> > How does it make it easier to configure in case of higher level controller?
> >
> > In case of lower level design, one just has to create cgroups and assign
> > weights to cgroups. This minimum step will be required in higher level
> > controller also. (Even if you get rid of dm-ioband device setup step).

In the case of the lower level controller, if we need to assign weights
on a per-device basis, we have to assign weights to all devices of which
a raid device consists, but in the case of the higher level controller,
we just assign weights to the raid device only.

> >> If one would like to
> >> combine some physical disks into one logical device like a dm-linear,
> >> I think one should map the IO controller on each physical device and
> >> combine them into one logical device.
> >>
> >
> > In fact this sounds like a more complicated step where one has to setup
> > one dm-ioband device on top of each physical device. But I am assuming
> > that this will go away once you move to per request queue like implementation.

I don't understand why the per request queue implementation makes it
go away. If dm-ioband is integrated into the LVM tools, it could allow
users to skip the complicated steps to configure dm-linear devices.

> > I think it should be same in principle as my initial implementation of IO
> > controller on request queue and I stopped development on it because of FIFO
> > dispatch.

I think that FIFO dispatch seldom leads to priority inversion, because
the holding period for throttling is not long enough to break the IO
priority. I did some tests to see whether priority inversion happens.

The first test ran fio sequential readers on the same group. The BE0
reader got the highest throughput as I expected.

nr_threads       16          |    16         |    1
ionice           BE7         |    BE7        |    BE0
-----------------------------+---------------+---------------
vanilla          10,076KiB/s |    9,779KiB/s |    32,775KiB/s
ioband            9,576KiB/s |    9,367KiB/s |    34,154KiB/s

The second test ran fio sequential readers on two different groups and
gave weights of 20 and 10 to each group respectively. The bandwidth
was distributed according to their weights and the BE0 reader got
higher throughput than the BE7 readers in the same group. IO priority
was preserved within the IO group.

group                group1  |             group2
weight                   20  |                 10
-----------------------------+---------------+---------------
nr_threads       16          |    16         |    1
ionice           BE7         |    BE7        |    BE0
-----------------------------+---------------+---------------
ioband           27,513KiB/s |    3,524KiB/s |    10,248KiB/s
                             |    Total = 13,772KiB/s

Here is my test script.
-------------------------------------------------------------------------
arg="--time_based --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
     --group_reporting"

sync
echo 3 > /proc/sys/vm/drop_caches

echo $$ > /cgroup/1/tasks
ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
echo $$ > /cgroup/2/tasks
ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
echo $$ > /cgroup/tasks
wait
-------------------------------------------------------------------------

Be that as it may, I think that if every bio can point to the iocontext
of the process, then it makes it possible to handle IO priority in the
higher level controller. A patchset has already been posted by
Takahashi-san. What do you think about this idea?

Date: Tue, 22 Apr 2008 22:51:31 +0900 (JST)
Subject: [RFC][PATCH 1/10] I/O context inheritance
From: Hirokazu Takahashi <>
http://lkml.org/lkml/2008/4/22/195

> > So you seem to be suggesting that you will move dm-ioband to request queue
> > so that setting up additional device setup is gone. You will also enable
> > it to do time based groups policy, so that we don't run into issues on
> > seeky media. Will also enable dispatch from one group only at a time so
> > that we don't run into isolation issues and can do time accounting
> > accurately.
>
> Will that approach solve the problem of doing bandwidth control on
> logical devices? What would be the advantages compared to Vivek's
> current patches?

I will only move the point where dm-ioband grabs bios; the rest of
dm-ioband's mechanism and functionality will still be the same.
The advantages over scheduler-based controllers are:
- can work with any type of block devices
- can work with any type of IO scheduler, and no need for a big change.

> > If yes, then that has the potential to solve the issue. At higher layer one
> > can think of enabling size of IO/number of IO policy both for proportional
> > BW and max BW type of control. At lower level one can enable pure time
> > based control on seeky media.
> >
> > I think this will still be left with the issue of prio within group as group
> > control is separate and you will not be maintaining separate queues for
> > each process. Similarly you will also have issues with read vs write
> > ratios as IO schedulers underneath change.
> >
> > So I will be curious to see that implementation.
> >
> >> > > My requirements for IO controller are:
> >> > > - Implement a higher level controller, which is located at block
> >> > >   layer and bio is grabbed in generic_make_request().
> >> >
> >> > How are you planning to handle the issue of buffered writes Andrew raised?
> >>
> >> I think that it would be better to use the higher-level controller
> >> along with the memory controller and have limits on memory usage for each
> >> cgroup. And as Kamezawa-san said, having limits on dirty pages would
> >> be better, too.
> >>
> >
> > Ok. So if we plan to co-mount memory controller with per memory group
> > dirty_ratio implemented, that can work with both higher level as well as
> > low level controller. Not sure if we also require some kind of a per
> > memory group flusher thread infrastructure also to make sure higher weight
> > group gets more job done.

I'm not sure either that a per memory group flusher is necessary. And
we have to consider not only pdflush but also other threads which issue
IOs from multiple groups.

> >> > > - Can work with any type of IO scheduler.
> >> > > - Can work with any type of block devices.
> >> > > - Support multiple policies: proportional weight, max rate, time
> >> > >   based, and so on.
> >> > >
> >> > > The IO controller mini-summit will be held next week, and I'm
> >> > > looking forward to meeting you all and discussing the IO controller.
> >> > > https://sourceforge.net/apps/trac/ioband/wiki/iosummit
> >> >
> >> > Is there a new version of dm-ioband now where you have solved the issue of
> >> > sync/async dispatch within a group? Before meeting at the mini-summit, I am
> >> > trying to run some tests and come up with numbers so that we have a more
> >> > clear picture of pros/cons.
> >>
> >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> >> dm-ioband handles sync/async IO requests separately and
> >> the write-starve-read issue you pointed out is fixed. I would
> >> appreciate it if you would try them.
> >> http://sourceforge.net/projects/ioband/files/
> >
> > Cool. Will get to testing it.

Thanks for your help in advance.

Thanks,
Ryo Tsuruta
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
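A rough sketch of the co-mounting idea discussed above: one cgroup hierarchy carrying both an IO controller and the memory controller, so each group's memory, and hence dirty page, usage can be bounded. The blkio names stand in for whichever IO controller is chosen, and the per-group dirty_ratio knob was only a proposal at this point, so it is shown as a hypothetical.

-------------------------------------------------------------------------
mkdir -p /cgroup
mount -t cgroup -o blkio,memory none /cgroup      # co-mount IO + memory controllers

mkdir /cgroup/grp1
echo 200  > /cgroup/grp1/blkio.weight             # IO share for the group
echo 512M > /cgroup/grp1/memory.limit_in_bytes    # caps page cache, and so dirty pages
# echo 10 > /cgroup/grp1/memory.dirty_ratio       # hypothetical per-group knob, not merged then
echo $$   > /cgroup/grp1/tasks
-------------------------------------------------------------------------

Whether a per-group flusher thread is also needed, as Vivek wonders above, is a separate question this sketch does not address.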
From: Vivek Goyal on 6 Oct 2009 07:30
On Tue, Oct 06, 2009 at 04:17:44PM +0900, Ryo Tsuruta wrote: > Hi Vivek and Nauman, > > Nauman Rafique <nauman(a)google.com> wrote: > > >> > > How about adding a callback function to the higher level controller? > > >> > > CFQ calls it when the active queue runs out of time, then the higer > > >> > > level controller use it as a trigger or a hint to move IO group, so > > >> > > I think a time-based controller could be implemented at higher level. > > >> > > > > >> > > > >> > Adding a call back should not be a big issue. But that means you are > > >> > planning to run only one group at higher layer at one time and I think > > >> > that's the problem because than we are introducing serialization at higher > > >> > layer. So any higher level device mapper target which has multiple > > >> > physical disks under it, we might be underutilizing these even more and > > >> > take a big hit on overall throughput. > > >> > > > >> > The whole design of doing proportional weight at lower layer is optimial > > >> > usage of system. > > >> > > >> But I think that the higher level approch makes easy to configure > > >> against striped software raid devices. > > > > > > How does it make easier to configure in case of higher level controller? > > > > > > In case of lower level design, one just have to create cgroups and assign > > > weights to cgroups. This mininum step will be required in higher level > > > controller also. (Even if you get rid of dm-ioband device setup step). > > In the case of lower level controller, if we need to assign weights on > a per device basis, we have to assign weights to all devices of which > a raid device consists, but in the case of higher level controller, > we just assign weights to the raid device only. > This is required only if you need to assign different weights to different devices. This is just additional facility and not a requirement. Normally you will not be required to do that and devices will inherit the cgroup weights automatically. So one has to only assign the cgroup weights. > > >> If one would like to > > >> combine some physical disks into one logical device like a dm-linear, > > >> I think one should map the IO controller on each physical device and > > >> combine them into one logical device. > > >> > > > > > > In fact this sounds like a more complicated step where one has to setup > > > one dm-ioband device on top of each physical device. But I am assuming > > > that this will go away once you move to per reuqest queue like implementation. > > I don't understand why the per request queue implementation makes it > go away. If dm-ioband is integrated into the LVM tools, it could allow > users to skip the complicated steps to configure dm-linear devices. > Those who are not using dm-tools will be forced to use dm-tools for bandwidth control features. > > > I think it should be same in principal as my initial implementation of IO > > > controller on request queue and I stopped development on it because of FIFO > > > dispatch. > > I think that FIFO dispatch seldom lead to prioviry inversion, because > holding period for throttling is not too long to break the IO priority. > I did some tests to see whether priority inversion is happened. > > The first test ran fio sequential readers on the same group. The BE0 > reader got the highest throughput as I expected. 
>
> nr_threads       16          |    16         |    1
> ionice           BE7         |    BE7        |    BE0
> -----------------------------+---------------+---------------
> vanilla          10,076KiB/s |    9,779KiB/s |    32,775KiB/s
> ioband            9,576KiB/s |    9,367KiB/s |    34,154KiB/s
>
> The second test ran fio sequential readers on two different groups and
> gave weights of 20 and 10 to each group respectively. The bandwidth
> was distributed according to their weights and the BE0 reader got
> higher throughput than the BE7 readers in the same group. IO priority
> was preserved within the IO group.
>
> group                group1  |             group2
> weight                   20  |                 10
> -----------------------------+---------------+---------------
> nr_threads       16          |    16         |    1
> ionice           BE7         |    BE7        |    BE0
> -----------------------------+---------------+---------------
> ioband           27,513KiB/s |    3,524KiB/s |    10,248KiB/s
>                              |    Total = 13,772KiB/s
>

Interesting. In all the test cases you always test with sequential
readers. I have changed the test case a bit (I have already reported the
results in another mail, now running the same test again with dm-ioband
version 1.14). I made all the readers do direct IO and in the other group
I put a buffered writer. So the setup looks as follows. In group1, I
launch 1 prio0 reader and an increasing number of prio4 readers. In
group2 I just run a dd doing buffered writes. Weights of both the groups
are 100 each. Following are the results on the 2.6.31 kernel.

With-dm-ioband
==============
    <------------ prio4 readers ----------------------->   <---- prio0 reader ---->
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency     Agg-bdwidth  Max-latency
1   9992KiB/s    9992KiB/s    9992KiB/s    413K usec       4621KiB/s    369K usec
2   4859KiB/s    4265KiB/s    9122KiB/s    344K usec       4915KiB/s    401K usec
4   2238KiB/s    1381KiB/s    7703KiB/s    532K usec       3195KiB/s    546K usec
8   504KiB/s     46KiB/s      1439KiB/s    399K usec       7661KiB/s    220K usec
16  131KiB/s     26KiB/s      638KiB/s     492K usec       4847KiB/s    359K usec

With vanilla CFQ
================
    <------------ prio4 readers ----------------------->   <---- prio0 reader ---->
nr  Max-bdwidth  Min-bdwidth  Agg-bdwidth  Max-latency     Agg-bdwidth  Max-latency
1   10779KiB/s   10779KiB/s   10779KiB/s   407K usec       16094KiB/s   808K usec
2   7045KiB/s    6913KiB/s    13959KiB/s   538K usec       18794KiB/s   761K usec
4   7842KiB/s    4409KiB/s    20967KiB/s   876K usec       12543KiB/s   443K usec
8   6198KiB/s    2426KiB/s    24219KiB/s   1469K usec      9483KiB/s    685K usec
16  5041KiB/s    1358KiB/s    27022KiB/s   2417K usec      6211KiB/s    1025K usec

The above results show how bandwidth got distributed between the prio4
readers and the prio0 reader within the group as we increased the number
of prio4 readers in the group. In another group a buffered writer is
continuously going on as a competitor.

Notice how, with dm-ioband, bandwidth allocation is broken. With 1 prio4
reader, the prio4 reader got more bandwidth than the prio0 reader. With
2 prio4 readers, it looks like prio4 got almost the same BW as prio0.
With 8 and 16 prio4 readers, it looks like the prio0 reader takes over
and the prio4 readers starve. As we increase the number of prio4 readers
in the group, their total aggregate BW share should increase. Instead it
is decreasing.

So to me, in the face of competition with a writer in the other group,
BW is all over the place. Some of these might be dm-ioband bugs and some
of these might be coming from the fact that buffering takes place in a
higher layer and dispatch is FIFO?

> Here is my test script.
> -------------------------------------------------------------------------
> arg="--time_based --rw=read --runtime=30 --directory=/mnt1 --size=1024M \
>      --group_reporting"
>
> sync
> echo 3 > /proc/sys/vm/drop_caches
>
> echo $$ > /cgroup/1/tasks
> ionice -c 2 -n 0 fio $arg --name=read1 --output=read1.log --numjobs=16 &
> echo $$ > /cgroup/2/tasks
> ionice -c 2 -n 0 fio $arg --name=read2 --output=read2.log --numjobs=16 &
> ionice -c 1 -n 0 fio $arg --name=read3 --output=read3.log --numjobs=1 &
> echo $$ > /cgroup/tasks
> wait
> -------------------------------------------------------------------------
>
> Be that as it may, I think that if every bio can point to the iocontext
> of the process, then it makes it possible to handle IO priority in the
> higher level controller. A patchset has already been posted by
> Takahashi-san. What do you think about this idea?
>
> Date: Tue, 22 Apr 2008 22:51:31 +0900 (JST)
> Subject: [RFC][PATCH 1/10] I/O context inheritance
> From: Hirokazu Takahashi <>
> http://lkml.org/lkml/2008/4/22/195

So far you have been denying that there are issues with ioprio within a
group in the higher level controller. Here you seem to be saying that
there are issues with ioprio and we need to take this patch in to solve
the issue? I am confused.

Anyway, if you think that the above patch is needed to solve the issue of
ioprio in the higher level controller, why are you not posting it as part
of your patch series regularly, so that we can also apply this patch
along with the other patches and test the effects?

> > > So you seem to be suggesting that you will move dm-ioband to request queue
> > > so that setting up additional device setup is gone. You will also enable
> > > it to do time based groups policy, so that we don't run into issues on
> > > seeky media. Will also enable dispatch from one group only at a time so
> > > that we don't run into isolation issues and can do time accounting
> > > accurately.
> >
> > Will that approach solve the problem of doing bandwidth control on
> > logical devices? What would be the advantages compared to Vivek's
> > current patches?
>
> I will only move the point where dm-ioband grabs bios; the rest of
> dm-ioband's mechanism and functionality will still be the same.
> The advantages over scheduler-based controllers are:
> - can work with any type of block devices
> - can work with any type of IO scheduler, and no need for a big change.
>

Whether a big change is needed we will come to know for sure when we
have the implementation for the timed groups done and have shown that it
works as well as my patches. There are so many subtle things with the
time-based approach.

[..]
> > >> > Is there a new version of dm-ioband now where you have solved the issue of
> > >> > sync/async dispatch within a group? Before meeting at the mini-summit, I am
> > >> > trying to run some tests and come up with numbers so that we have a more
> > >> > clear picture of pros/cons.
> > >>
> > >> Yes, I've released new versions of dm-ioband and blkio-cgroup. The new
> > >> dm-ioband handles sync/async IO requests separately and
> > >> the write-starve-read issue you pointed out is fixed. I would
> > >> appreciate it if you would try them.
> > >> http://sourceforge.net/projects/ioband/files/
> > >
> > > Cool. Will get to testing it.
>
> Thanks for your help in advance.

Against what kernel version do the above patches apply? The biocgroup
patches I tried against 2.6.31 as well as 2.6.32-rc1 and they do not
apply cleanly against either of these. So for the time being I am doing
testing with biocgroup patches.
Thanks Vivek -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo(a)vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ |
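For reference, here is a sketch that approximates the mixed-workload test Vivek describes above: direct-IO readers of two priorities in one group and a buffered writer in the other, both groups at weight 100. Paths, sizes, and the blkio-style file names are placeholders; the actual runs used Vivek's IO-controller patches and dm-ioband respectively.

-------------------------------------------------------------------------
# Assumes a cgroup hierarchy with an IO controller is already mounted at /cgroup.
mkdir -p /cgroup/grp1 /cgroup/grp2
echo 100 > /cgroup/grp1/blkio.weight
echo 100 > /cgroup/grp2/blkio.weight

RARGS="--rw=read --direct=1 --directory=/mnt1 --size=512M --runtime=30 \
       --time_based --group_reporting"

echo $$ > /cgroup/grp1/tasks
# vary --numjobs (1..16) to reproduce each row of the tables above
ionice -c 2 -n 4 fio $RARGS --name=prio4 --numjobs=8 --output=prio4.log &
ionice -c 2 -n 0 fio $RARGS --name=prio0 --numjobs=1 --output=prio0.log &

echo $$ > /cgroup/grp2/tasks
dd if=/dev/zero of=/mnt2/dirty.img bs=1M count=4096 &   # buffered competitor

echo $$ > /cgroup/tasks
wait
-------------------------------------------------------------------------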