From: Vivek Goyal on 25 Sep 2009 10:40

On Fri, Sep 25, 2009 at 06:07:24PM +0900, Ryo Tsuruta wrote:
> Hi Vivek,
>
> Vivek Goyal <vgoyal(a)redhat.com> wrote:
> > Higher level solutions are not keeping track of time slices. Time slices
> > will be allocated by CFQ, which does not have any idea about grouping.
> > The higher level controller just keeps track of the size of IO done at
> > group level and then runs either a leaky bucket or token bucket algorithm.
> >
> > IO throttling is a max BW controller, so it will not even care about what
> > is happening in the other group. It will just be concerned with the rate
> > of IO in one particular group, and if we exceed the specified limit,
> > throttle it. So until and unless the sequential reader group hits its max
> > bw limit, it will keep sending reads down to CFQ, and CFQ will happily
> > assign 100ms slices to readers.
> >
> > dm-ioband will not try to choke the high throughput sequential reader
> > group for the slow random reader group because that would just kill the
> > throughput of rotational media. Every sequential reader will run for a
> > few ms and then be throttled, and this goes on. Disk will soon be seek
> > bound.
>
> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?

Hi Ryo,

Fairness in terms of size of IO or number of requests is probably not the
best thing to do on rotational media, where seek latencies are significant.
It should work well on media with very low seek latencies, like SSD.

So on rotational media, either you will not provide fairness to random
readers because they are too slow, or you will choke the sequential readers
in the other group and also bring down the overall disk throughput.

If you decide not to choke/throttle the sequential reader group for the sake
of the random reader in the other group, then you will not have good control
over random reader latencies, because the IO scheduler now sees the IO from
both the sequential readers and the random reader, and the sequential
readers have not been throttled. So the dispatch pattern/time slices will
again look like

SR1 SR2 SR3 SR4 SR5 RR.....

instead of

SR1 RR SR2 RR SR3 RR SR4 RR ....

SR --> sequential reader, RR --> random reader

> > > > Buffering at higher layer can delay read requests for more than the
> > > > slice idle period of CFQ (default 8 ms). That means it is possible
> > > > that we are waiting for a request from the queue but it is buffered
> > > > at the higher layer, and then the idle timer will fire. It means that
> > > > the queue will lose its share and at the same time overall throughput
> > > > will be impacted as we lost those 8 ms.
> > >
> > > That sounds like a bug.
> >
> > Actually this probably is a limitation of the higher level controller. It
> > most likely is sitting so high in the IO stack that it has no idea what
> > the underlying IO scheduler is and what the IO scheduler's policies are.
> > So it can't keep up with the IO scheduler's policies. Secondly, it might
> > be a low weight group and tokens might not be available fast enough to
> > release the request.
> >
> > > > Read Vs Write
> > > > -------------
> > > > Writes can overwhelm readers, hence a second level controller's FIFO
> > > > release will run into issues here. If there is a single queue
> > > > maintained, then reads will suffer large latencies. If there are
> > > > separate queues for reads and writes, then it will be hard to decide
> > > > in what ratio to dispatch reads and writes, as it is the IO
> > > > scheduler's decision when and how much read/write to dispatch. This
> > > > is another place where the higher level controller will not be in
> > > > sync with the lower level io scheduler and can change the effective
> > > > policies of the underlying io scheduler.
> > >
> > > The IO schedulers already take care of read-vs-write and already take
> > > care of preventing large writes-starve-reads latencies (or at least,
> > > they're supposed to).
> >
> > True. Actually this is a limitation of the higher level controller. A
> > higher level controller will most likely implement some kind of
> > queuing/buffering mechanism where it buffers requests when it decides to
> > throttle the group. Now once a fair number of read and write requests are
> > buffered, and the controller is ready to dispatch some requests from the
> > group, which requests/bios should it dispatch? Reads first, writes first,
> > or reads and writes in a certain ratio?
>
> The write-starve-reads on dm-ioband, that you pointed out before, was
> not caused by FIFO release, it was caused by IO flow control in
> dm-ioband. When I turned off the flow control, then the read
> throughput was quite improved.

What was flow control doing?

> Now I'm considering separating dm-ioband's internal queue into sync
> and async and giving a certain priority of dispatch to async IOs.

Even if you maintain separate queues for sync and async, in what ratio will
you dispatch reads and writes to the underlying layer once fresh tokens
become available to the group and you decide to unthrottle it? Whatever
policy you adopt for read and write dispatch, it might not match the policy
of the underlying IO scheduler, because every IO scheduler seems to have its
own way of determining how reads and writes should be dispatched. Now
somebody might start complaining that their job inside the group is not
getting the same reader/writer ratio as it was getting outside the group.

Thanks
Vivek
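For concreteness, here is a toy userspace sketch of the token-bucket check
that a max-BW controller of the sort described above runs per group. It is
illustration only: the struct, field names, and numbers are invented here
and are not dm-ioband or io-throttle code; a real controller would refill
tokens and queue bios in kernel context.

/*
 * Toy token-bucket admission check for one IO group.  Invented names and
 * numbers; not taken from any of the controllers discussed in this thread.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

struct bw_group {
        uint64_t rate_bps;      /* configured limit, bytes per second */
        uint64_t burst;         /* bucket depth, bytes */
        uint64_t tokens;        /* tokens currently available, bytes */
        struct timespec last;   /* last refill time */
};

static uint64_t ns_since(const struct timespec *then, const struct timespec *now)
{
        return (now->tv_sec - then->tv_sec) * 1000000000ull +
               (now->tv_nsec - then->tv_nsec);
}

/* Refill tokens according to elapsed time, capped at the burst size. */
static void refill(struct bw_group *g)
{
        struct timespec now;
        uint64_t add;

        clock_gettime(CLOCK_MONOTONIC, &now);
        add = g->rate_bps * ns_since(&g->last, &now) / 1000000000ull;
        g->tokens = g->tokens + add > g->burst ? g->burst : g->tokens + add;
        g->last = now;
}

/*
 * Decide whether an IO of 'bytes' may be dispatched now.  Note what is
 * missing: no notion of time slices, seeks, or sync-vs-async - the group is
 * throttled purely on bytes, which is exactly the limitation discussed above.
 */
static bool may_dispatch(struct bw_group *g, uint64_t bytes)
{
        refill(g);
        if (g->tokens < bytes)
                return false;   /* caller must queue/delay the bio */
        g->tokens -= bytes;
        return true;
}

int main(void)
{
        /* 10 MB/s limit, 1 MB bucket, bucket initially full. */
        struct bw_group grp = {
                .rate_bps = 10 << 20,
                .burst    = 1 << 20,
                .tokens   = 1 << 20,
        };

        clock_gettime(CLOCK_MONOTONIC, &grp.last);
        /* Pretend a 128k read arrives: admit it or throttle the group. */
        printf("dispatch 128k now? %s\n",
               may_dispatch(&grp, 128 << 10) ? "yes" : "no (throttled)");
        return 0;
}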
From: Rik van Riel on 25 Sep 2009 11:10

Ryo Tsuruta wrote:
> Because dm-ioband provides fairness in terms of how many IO requests
> are issued or how many bytes are transferred, this behaviour is to
> be expected. Do you think fairness in terms of IO requests and size is
> not fair?

When there are two workloads competing for the same resources, I would
expect each of the workloads to run at about 50% of the speed at which
it would run on an uncontended system.

Having one of the workloads run at 95% of the uncontended speed and the
other workload at 5% is "not fair" (to put it diplomatically).

--
All rights reversed.
From: Vivek Goyal on 25 Sep 2009 16:30

On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> Vivek Goyal wrote:
> > Notes:
> > - With vanilla CFQ, random writers can overwhelm a random reader.
> >   Bring down its throughput and bump up latencies significantly.
>
> IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> too.
>
> I'm basing this assumption on the observations I made on both OpenSuse
> 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> titled: "Poor desktop responsiveness with background I/O-operations" of
> 2009-09-20.
> (Message ID: 4AB59CBB.8090907(a)datenparkplatz.de)
>
> Thus, I'm posting this to show that your work is greatly appreciated,
> given the rather disappointing status quo of Linux's fairness when it
> comes to disk IO time.
>
> I hope that your efforts lead to a change in performance of current
> userland applications, the sooner, the better.

[Please don't remove people from original CC list. I am putting them back.]

Hi Ulrich,

I quickly went through that mail thread and I tried the following on my
desktop.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
sleep 5
time firefox
# close firefox once gui pops up.
##########################################

It was taking close to 1 minute 30 seconds to launch firefox and dd got
the following.

4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s

(Results do vary across runs, especially if the system is booted fresh.
Don't know why...)

Then I tried putting both the applications in separate groups and assigned
them weights of 200 each.

##########################################
dd if=/home/vgoyal/4G-file of=/dev/null &
echo $! > /cgroup/io/test1/tasks
sleep 5
echo $$ > /cgroup/io/test2/tasks
time firefox
# close firefox once gui pops up.
##########################################

Now firefox pops up in 27 seconds, so it cut down the time by 2/3.

4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s

Notice that the throughput of dd also improved.

I ran the block trace and noticed that in many cases firefox threads
immediately preempted the "dd", probably because it was a file system
request. So in this case latency will arise from seek time.

In some other cases, threads had to wait for up to 100ms because dd was
not preempted. In this case latency will arise both from waiting on the
queue as well as from seek time.

With the cgroup thing, we will run a 100ms slice for the group in which
firefox is being launched and then give a 100ms uninterrupted time slice
to dd. So it should cut down on the number of seeks happening, and that's
why we probably see this improvement.

So grouping can help in such cases. Maybe you can move your X session into
one group and launch the big IO in the other group. Most likely you should
have a better desktop experience without compromising on dd thread output.

Thanks
Vivek
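As a rough illustration of the grouping step above, the small wrapper below
moves itself into one of the cgroups before exec'ing a command, doing in C
what the `echo $$ > /cgroup/io/test2/tasks` line does in the scriptlet. It is
a sketch under the same assumptions as Vivek's example: a cgroup-v1 style
hierarchy for the IO controller mounted at /cgroup/io, with the groups
(test1, test2) already created and weights already assigned; the file name
iorun.c is invented here.

/*
 * iorun.c - run a command inside an existing IO cgroup, equivalent to the
 * `echo $$ > /cgroup/io/test2/tasks` step above.  Assumes the hierarchy is
 * mounted at /cgroup/io and the named group already exists.
 *
 *   cc -o iorun iorun.c
 *   ./iorun test2 firefox
 */
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        char path[256];
        FILE *f;

        if (argc < 3) {
                fprintf(stderr, "usage: %s <group> <command> [args...]\n", argv[0]);
                return 1;
        }

        snprintf(path, sizeof(path), "/cgroup/io/%s/tasks", argv[1]);
        f = fopen(path, "w");
        if (!f) {
                perror(path);
                return 1;
        }
        /* Writing our pid to the tasks file moves this process into the
         * group; the exec'd command inherits the membership. */
        fprintf(f, "%d\n", getpid());
        fclose(f);

        execvp(argv[2], &argv[2]);
        perror("execvp");
        return 1;
}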
From: Mike Galbraith on 26 Sep 2009 11:00

On Fri, 2009-09-25 at 16:26 -0400, Vivek Goyal wrote:
> On Fri, Sep 25, 2009 at 04:20:14AM +0200, Ulrich Lukas wrote:
> > Vivek Goyal wrote:
> > > Notes:
> > > - With vanilla CFQ, random writers can overwhelm a random reader.
> > >   Bring down its throughput and bump up latencies significantly.
> >
> > IIRC, with vanilla CFQ, sequential writing can overwhelm random readers,
> > too.
> >
> > I'm basing this assumption on the observations I made on both OpenSuse
> > 11.1 and Ubuntu 9.10 alpha6 which I described in my posting on LKML
> > titled: "Poor desktop responsiveness with background I/O-operations" of
> > 2009-09-20.
> > (Message ID: 4AB59CBB.8090907(a)datenparkplatz.de)
> >
> > Thus, I'm posting this to show that your work is greatly appreciated,
> > given the rather disappointing status quo of Linux's fairness when it
> > comes to disk IO time.
> >
> > I hope that your efforts lead to a change in performance of current
> > userland applications, the sooner, the better.
>
> [Please don't remove people from original CC list. I am putting them back.]
>
> Hi Ulrich,
>
> I quickly went through that mail thread and I tried the following on my
> desktop.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> sleep 5
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> It was taking close to 1 minute 30 seconds to launch firefox and dd got
> the following.
>
> 4294967296 bytes (4.3 GB) copied, 100.602 s, 42.7 MB/s
>
> (Results do vary across runs, especially if the system is booted fresh.
> Don't know why...)
>
> Then I tried putting both the applications in separate groups and assigned
> them weights of 200 each.
>
> ##########################################
> dd if=/home/vgoyal/4G-file of=/dev/null &
> echo $! > /cgroup/io/test1/tasks
> sleep 5
> echo $$ > /cgroup/io/test2/tasks
> time firefox
> # close firefox once gui pops up.
> ##########################################
>
> Now firefox pops up in 27 seconds, so it cut down the time by 2/3.
>
> 4294967296 bytes (4.3 GB) copied, 84.6138 s, 50.8 MB/s
>
> Notice that the throughput of dd also improved.
>
> I ran the block trace and noticed that in many cases firefox threads
> immediately preempted the "dd", probably because it was a file system
> request. So in this case latency will arise from seek time.
>
> In some other cases, threads had to wait for up to 100ms because dd was
> not preempted. In this case latency will arise both from waiting on the
> queue as well as from seek time.

Hm, with tip, I see ~10ms max wakeup latency running the scriptlet below.

> With the cgroup thing, we will run a 100ms slice for the group in which
> firefox is being launched and then give a 100ms uninterrupted time slice
> to dd. So it should cut down on the number of seeks happening, and that's
> why we probably see this improvement.

I'm not testing with group IO/CPU, but my numbers kinda agree that it's seek
latency that's THE killer. What the compiled numbers from the cheezy script
below _seem_ to be telling me is that the default setting of the CFQ quantum
is allowing too many write requests through, inflicting too much read
latency... for the disk where my binaries live. The longer the seeky burst,
the more it hurts both reader and writer, so cutting down the number of
queueable requests helps the reader (which I think can't queue anywhere near
as much per unit time as the writer can) finish and get out of the writer's
way sooner.

'nuff possibly useless words, onward to possibly useless numbers :)

dd pre    == number dd emits upon receiving USR1, before execing perf.
perf stat == time to load/execute perf stat konsole -e exit.
dd post   == same dd number, after perf finishes.

quantum = 1                                         Avg
dd pre      58.4    52.5    56.1    61.6    52.3    56.1  MB/s
perf stat    2.87    0.91    1.64    1.41    0.90    1.5  Sec
dd post     56.6    61.0    66.3    64.7    60.9    61.9

quantum = 2
dd pre      59.7    62.4    58.9    65.3    60.3    61.3
perf stat    5.81    6.09    6.24   10.13    6.21    6.8
dd post     64.0    62.6    64.2    60.4    61.1    62.4

quantum = 3
dd pre      65.5    57.7    54.5    51.1    56.3    57.0
perf stat   14.01   13.71    8.35    5.35    8.57    9.9
dd post     59.2    49.1    58.8    62.3    62.1    58.3

quantum = 4
dd pre      57.2    52.1    56.8    55.2    61.6    56.5
perf stat   11.98    1.61    9.63   16.21   11.13   10.1
dd post     57.2    52.6    62.2    49.3    50.2    54.3

Nothing pinned btw, 4 cores available, but only 1 drive.

#!/bin/sh

DISK=sdb
QUANTUM=/sys/block/$DISK/queue/iosched/quantum
END=$(cat $QUANTUM)

for q in `seq 1 $END`; do
	echo $q > $QUANTUM
	LOGFILE=quantum_log_$q
	rm -f $LOGFILE
	for i in `seq 1 5`; do
		echo 2 > /proc/sys/vm/drop_caches
		sh -c "dd if=/dev/zero of=./deleteme.dd 2>&1|tee -a $LOGFILE" &
		sleep 30
		sh -c "echo quantum $(cat $QUANTUM) loop $i" 2>&1|tee -a $LOGFILE
		perf stat -- killall -q get_stuf_into_ram >/dev/null 2>&1
		sleep 1
		killall -q -USR1 dd &
		sleep 1
		sh -c "perf stat -- konsole -e exit" 2>&1|tee -a $LOGFILE
		sleep 1
		killall -q -USR1 dd &
		sleep 5
		killall -qw dd
		rm -f ./deleteme.dd
		sync
		sh -c "echo" 2>&1|tee -a $LOGFILE
	done
done
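One way to put a direct number on the read latency that the konsole/perf
load times above are a proxy for would be to time a single cold-cache read
while the dd writer streams. The probe below is not part of Mike's script;
it is a rough sketch, and the default file name, the 4k read size, and the
drop_caches step (which needs root) are assumptions about how one might run
it against a file on the loaded disk.

/*
 * latprobe.c - rough probe of cold-cache read latency on a loaded disk.
 * Run as root while the dd writer is streaming, e.g.:  ./latprobe /bin/ls
 */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        const char *path = argc > 1 ? argv[1] : "testfile";
        char buf[4096];
        struct timespec t0, t1;
        int fd;

        /* Drop the page cache so the read really hits the disk. */
        fd = open("/proc/sys/vm/drop_caches", O_WRONLY);
        if (fd >= 0) {
                if (write(fd, "1\n", 2) < 0)
                        perror("drop_caches");
                close(fd);
        }

        fd = open(path, O_RDONLY);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (read(fd, buf, sizeof(buf)) < 0)
                perror("read");
        clock_gettime(CLOCK_MONOTONIC, &t1);

        printf("first 4k read of %s took %.3f ms\n", path,
               (t1.tv_sec - t0.tv_sec) * 1e3 +
               (t1.tv_nsec - t0.tv_nsec) / 1e6);
        close(fd);
        return 0;
}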
From: Mike Galbraith on 27 Sep 2009 03:00
My dd vs load-non-cached-binary woes seem to be coming from backmerge.

#if 0 /*MIKEDIDIT sand in gearbox?*/
	/*
	 * See if our hash lookup can find a potential backmerge.
	 */
	__rq = elv_rqhash_find(q, bio->bi_sector);
	if (__rq && elv_rq_merge_ok(__rq, bio)) {
		*req = __rq;
		return ELEVATOR_BACK_MERGE;
	}
#endif

- = stock = 0
+ = /sys/block/sdb/queue/nomerges = 1
x = backmerge disabled

quantum = 1                                                  Avg
dd pre      58.4    52.5    56.1    61.6    52.3    56.1-    MB/s
virgin/foo  59.6    54.4    53.0    56.1    58.6    56.3+    1.003
            53.8    56.6    54.7    50.7    59.3    55.0x     .980
perf stat    2.87    0.91    1.64    1.41    0.90    1.5-    Sec
             2.61    1.14    1.45    1.43    1.47    1.6+    1.066
             1.07    1.19    1.20    1.24    1.37    1.2x     .800
dd post     56.6    61.0    66.3    64.7    60.9    61.9-
            54.0    59.3    61.1    58.3    58.9    58.3+     .941
            54.3    60.2    59.6    60.6    60.3    59.0x     .953

quantum = 2
dd pre      59.7    62.4    58.9    65.3    60.3    61.3-
            49.4    51.9    58.7    49.3    52.4    52.3+     .853
            58.3    52.8    53.1    50.4    59.9    54.9x     .895
perf stat    5.81    6.09    6.24   10.13    6.21    6.8-
             2.48    2.10    3.23    2.29    2.31    2.4+     .352
             2.09    2.73    1.72    1.96    1.83    2.0x     .294
dd post     64.0    62.6    64.2    60.4    61.1    62.4-
            52.9    56.2    49.6    51.3    51.2    52.2+     .836
            54.7    60.9    56.0    54.0    55.4    56.2x     .900

quantum = 3
dd pre      65.5    57.7    54.5    51.1    56.3    57.0-
            58.1    53.9    52.2    58.2    51.8    54.8+     .961
            60.5    56.5    56.7    55.3    54.6    56.7x     .994
perf stat   14.01   13.71    8.35    5.35    8.57    9.9-
             1.84    2.30    2.14    2.10    2.45    2.1+     .212
             2.12    1.63    2.54    2.23    2.29    2.1x     .212
dd post     59.2    49.1    58.8    62.3    62.1    58.3-
            59.8    53.2    55.2    50.9    53.7    54.5+     .934
            56.1    61.9    51.9    54.3    53.1    55.4x     .950

quantum = 4
dd pre      57.2    52.1    56.8    55.2    61.6    56.5-
            48.7    55.4    51.3    49.7    54.5    51.9+     .918
            55.8    54.5    50.3    56.4    49.3    53.2x     .941
perf stat   11.98    1.61    9.63   16.21   11.13   10.1-
             2.29    1.94    2.68    2.46    2.45    2.3+     .227
             3.01    1.84    2.11    2.27    2.30    2.3x     .227
dd post     57.2    52.6    62.2    49.3    50.2    54.3-
            50.1    54.5    58.4    54.1    49.0    53.2+     .979
            52.9    53.2    50.6    53.2    50.5    52.0x     .957
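The "+" rows above come from flipping the block layer's nomerges tunable
rather than carrying the #if 0 patch; a single write to the sysfs file is
all it takes. Below is a minimal sketch of that step in C; the device name
(sdb, Mike's test disk) is an assumption for any other machine.

/*
 * Disable merge lookups for one disk from userspace - the "+" case above -
 * equivalent to `echo 1 > /sys/block/sdb/queue/nomerges`.
 */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/sys/block/sdb/queue/nomerges", "w");

        if (!f) {
                perror("nomerges");
                return 1;
        }
        fputs("1\n", f);        /* 0 = stock merging, 1 = skip merge lookups */
        return fclose(f) ? 1 : 0;
}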