From: Kay Diederichs on 2 Aug 2010 11:30

Ted Ts'o wrote:
> On Wed, Jul 28, 2010 at 09:51:48PM +0200, Kay Diederichs wrote:
>> When looking at the I/O statistics while the benchmark is running, we
>> see very choppy patterns for 2.6.32, but quite smooth stats for
>> 2.6.27-stable.
>
> Could you try to do two things for me? Using a recent e2fsprogs
> (preferably 1.41.11 or 12), run filefrag -v on the files created from
> your 2.6.27 run and your 2.6.32 run?
>
> Secondly, can you capture blktrace results from 2.6.27 and 2.6.32? That
> would be very helpful to understand what might be going on.
>
> Either would be helpful; both would be greatly appreciated.
>
> Thanks,
>
> - Ted

Ted,

we pared down the benchmark to the last step (called "run xds_par in nfs
directory (reads 600M, and writes 50M)") because this captures most of
the problem. Here we report kernel messages with a stacktrace, and the
blktrace output that you requested.

Kernel messages: with 2.6.32.16 we observe

[ 6961.838032] INFO: task jbd2/md5-8:2010 blocked for more than 120 seconds.
[ 6961.838111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6961.838191] jbd2/md5-8    D 00000634     0  2010      2 0x00000000
[ 6961.838200]  f5171e78 00000046 231a9999 00000634 ddf91f2c f652cc4c 00000001 f5171e1c
[ 6961.838307]  c0a6f140 c0a6f140 c0a6f140 c0a6a6ac f6354c20 f6354ecc c2008140 00000000
[ 6961.838412]  00637f84 00000003 f652cc58 00000000 00000292 00000048 c20036ac f6354ecc
[ 6961.838518] Call Trace:
[ 6961.838556]  [<c056c39e>] jbd2_journal_commit_transaction+0x1d9/0x1187
[ 6961.838627]  [<c040220a>] ? __switch_to+0xd5/0x147
[ 6961.838681]  [<c07a390a>] ? schedule+0x837/0x885
[ 6961.838734]  [<c0455e5f>] ? autoremove_wake_function+0x0/0x38
[ 6961.838799]  [<c0448c84>] ? try_to_del_timer_sync+0x58/0x60
[ 6961.838859]  [<c0572426>] kjournald2+0xa2/0x1be
[ 6961.838909]  [<c0455e5f>] ? autoremove_wake_function+0x0/0x38
[ 6961.838971]  [<c0572384>] ? kjournald2+0x0/0x1be
[ 6961.839035]  [<c0455c11>] kthread+0x66/0x6b
[ 6961.839089]  [<c0455bab>] ? kthread+0x0/0x6b
[ 6961.839139]  [<c0404167>] kernel_thread_helper+0x7/0x10
[ 6961.839215] INFO: task sync:11600 blocked for more than 120 seconds.
[ 6961.839286] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 6961.839367] sync          D 00000632     0 11600  11595 0x00000000
[ 6961.839375]  ddf91ea4 00000086 dca59d8b 00000632 76f570ee 00000048 00001dec 773fd796
[ 6961.839486]  c0a6f140 c0a6f140 c0a6f140 c200819c f4ce0000 f4ce02ac c2008140 00000000
[ 6961.839600]  ddf91ea8 dca5c36b 00000632 dca5bb77 c2008180 773fd796 00000282 f4ce02ac
[ 6961.839727] Call Trace:
[ 6961.839762]  [<c04f4de6>] bdi_sched_wait+0xd/0x11
[ 6961.841438]  [<c07a3ede>] __wait_on_bit+0x3b/0x62
[ 6961.843109]  [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11
[ 6961.844782]  [<c07a3fb5>] out_of_line_wait_on_bit+0xb0/0xb8
[ 6961.846479]  [<c04f4dd9>] ? bdi_sched_wait+0x0/0x11
[ 6961.848181]  [<c0455e97>] ? wake_bit_function+0x0/0x48
[ 6961.849906]  [<c04f4c75>] wait_on_bit+0x25/0x31
[ 6961.851601]  [<c04f4e5d>] sync_inodes_sb+0x73/0x121
[ 6961.853287]  [<c04f8acc>] __sync_filesystem+0x48/0x69
[ 6961.854983]  [<c04f8b72>] sync_filesystems+0x85/0xc7
[ 6961.856670]  [<c04f8c04>] sys_sync+0x20/0x32
[ 6961.858363]  [<c040351b>] sysenter_do_call+0x12/0x28

Blktrace: blktrace was run for 2.6.27.48, 2.6.32.16 and a patched
2.6.32.16 (called 2.6.32.16p below and in the .tar file), where the
patch just reverts "ext4: Avoid group preallocation for closed files".
This revert removes a substantial part of the regression.
For 2.6.32.16p and 2.6.27.48 there are two runs: run1 is done directly
after booting, then the directory is unexported, unmounted, mounted and
exported again, and then run2 is done. For 2.6.32.16 there is just run1;
all subsequent runs yield approximately the same results, i.e. they are
as slow as run1.

Some numbers (wallclock time, and number of lines matching
flush|nfsd|sync in the blkparse output; blktrace was only captured for
run1 of 2.6.32.16):

             2.6.27.48       2.6.32.16       2.6.32.16p
             run1    run2    run1    run2    run1    run2
wallclock    113s    61s     280s    ~280s   137s    61s
flush        25362   9285    71861   -       32656   12066
nfsd         7595    8580    8685    -       8359    8444
sync         2860    3925    303     -       212     169

The total time seems to be dominated by the number of flushes. It should
be noted that all these runs used barrier=0; barrier=1 does not have a
significant effect, though.

So we find:

a) in 2.6.32.16 there is a problem which manifests itself in kernel
   messages associated with the jbd2/md5-8 and sync tasks, and in a
   vastly increased number of flush operations

b) reverting the patch "ext4: Avoid group preallocation for closed
   files" cures part of the problem

c) even after reverting that patch, the first run takes much longer than
   the subsequent runs, despite "sync", "echo 3 > /proc/sys/vm/drop_caches",
   and umounting/re-mounting the filesystem.

The blktrace files are at
http://strucbio.biologie.uni-konstanz.de/~dikay/blktraces.tar.bz2 .

Should we test any other patches?

thanks,
Kay
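As an aside, a minimal sketch of how a capture and the per-thread line
counts like the ones above might be reproduced. Only the tool names
(blktrace, blkparse, filefrag, grep) come from the thread; the device,
trace basename and file pattern are assumptions:

  # Assumed device, trace basename and benchmark paths; adjust to the setup.
  DEV=/dev/md5
  TRACE=md5.xds_par.run1

  # Capture block-layer events while the benchmark step runs
  # (blktrace stops cleanly on SIGINT and writes ${TRACE}.blktrace.<cpu>).
  blktrace -d "$DEV" -o "$TRACE" &
  BTPID=$!
  # ... run the benchmark step here, e.g. xds_par in the exported directory ...
  kill -INT "$BTPID"
  wait "$BTPID"

  # Turn the binary trace into text and count the lines attributed to the
  # flush, nfsd and sync threads -- roughly how a table like the one above
  # can be built.
  blkparse -i "$TRACE" > blkparse.out
  for who in flush nfsd sync; do
      printf '%-6s %d\n' "$who" "$(grep -c "$who" blkparse.out)"
  done

  # Extent layout of the freshly written files, as Ted requested:
  filefrag -v /mnt/md5/*.img    # file pattern is an assumption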
From: Eric Sandeen on 2 Aug 2010 12:20

On 08/02/2010 09:52 AM, Kay Diederichs wrote:
> Dave,
>
> as you suggested, we reverted "ext4: Avoid group preallocation for
> closed files" and this indeed fixes a big part of the problem: after
> booting the NFS server we get
>
> NFS-Server: turn5 2.6.32.16p i686
> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>
> exported directory on the nfs-server:
> /dev/md5 /mnt/md5 ext4
> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>
>  48 seconds for preparations
>  28 seconds to rsync 100 frames with 597M from nfs directory
>  57 seconds to rsync 100 frames with 595M to nfs directory
>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
> 133 seconds to run xds_par in nfs directory
> 425 seconds to run the script

Interesting, I had found this commit to be a problem for small files
which are constantly created & deleted; the commit had the effect of
packing the newly created files into the first free space that could be
found, rather than walking down the disk and leaving potentially
fragmented freespace behind (see the attached seekwatcher graph).
Reverting the patch sped things up for this test, but left the
filesystem freespace in bad shape.

But you seem to see one of the largest effects here:

261 seconds to rsync 100 frames with 595M to nfs directory
vs
57 seconds to rsync 100 frames with 595M to nfs directory

with the patch reverted making things go faster. So you are doing 100
6MB writes to the server, correct?

Is the filesystem mkfs'd fresh before each test, or is it aged? If not
mkfs'd, is it at least completely empty prior to the test, or does data
remain on it? I'm just wondering whether fragmented freespace is
contributing to this behavior as well. If there is fragmented freespace,
then with the patch I think the allocator is more likely to hunt around
for small discontiguous chunks of free space, rather than going further
out on the disk looking for a large area to allocate from.

It might be interesting to use seekwatcher on the server to visualize
the allocation/IO patterns for the test run just this far?

-Eric
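For reference, a sketch of how seekwatcher might be run for this kind of
test, either against an existing blktrace capture or tracing a command
directly. The device, trace basename and command are assumptions, and
the options shown (-t trace, -o output, -d device, -p program) are the
commonly documented ones rather than anything taken from the thread:

  # Plot an existing blktrace capture (the per-CPU files share the basename):
  seekwatcher -t md5.xds_par.run1 -o xds_par_run1.png

  # Or trace a command on the server while it runs and plot the result:
  seekwatcher -d /dev/md5 -t xds_par.trace -o xds_par.png -p 'xds_par'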
From: Kay Diederichs on 2 Aug 2010 17:10

On 02.08.2010 18:12, Eric Sandeen wrote:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files into the first free space that could be
> found, rather than walking down the disk and leaving potentially
> fragmented freespace behind (see the attached seekwatcher graph).
> Reverting the patch sped things up for this test, but left the
> filesystem freespace in bad shape.
>
> But you seem to see one of the largest effects here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct?

correct.

> Is the filesystem mkfs'd fresh before each test, or is it aged?

It is too big to "just create it freshly". It was actually created a
week ago and filled by a single ~10-hour rsync job run on the server,
such that the filesystem should be filled in the most linear way
possible. Since then, the benchmarking has created and deleted lots of
files.

> If not mkfs'd, is it at least completely empty prior to the test, or
> does data remain on it?

It's not empty: df -h reports

Filesystem            Size  Used Avail Use% Mounted on
/dev/md5              3.7T  2.8T  712G  80% /mnt/md5

e2freefrag-1.41.12 reports:

Device: /dev/md5
Blocksize: 4096 bytes
Total blocks: 976761344
Free blocks: 235345984 (24.1%)

Min. free extent: 4 KB
Max. free extent: 99348 KB
Avg. free extent: 1628 KB

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range :  Free extents   Free Blocks  Percent
    4K...    8K-  :           1858          1858    0.00%
    8K...   16K-  :           3415          8534    0.00%
   16K...   32K-  :           9952         54324    0.02%
   32K...   64K-  :          23884        288848    0.12%
   64K...  128K-  :          27901        658130    0.28%
  128K...  256K-  :          25761       1211519    0.51%
  256K...  512K-  :          35863       3376274    1.43%
  512K... 1024K-  :          48643       9416851    4.00%
    1M...    2M-  :         150311      60704033   25.79%
    2M...    4M-  :         244895     148283666   63.01%
    4M...    8M-  :           3970       5508499    2.34%
    8M...   16M-  :            187        551835    0.23%
   16M...   32M-  :            302       1765912    0.75%
   32M...   64M-  :            282       2727162    1.16%
   64M...  128M-  :             42        788539    0.34%

> I'm just wondering whether fragmented freespace is contributing to this
> behavior as well. If there is fragmented freespace, then with the patch
> I think the allocator is more likely to hunt around for small
> discontiguous chunks of free space, rather than going further out on
> the disk looking for a large area to allocate from.

The last step of the benchmark, "xds_par", reads 600MB and writes 50MB.
It has 16 threads, which might put some additional pressure on the
freespace hunting. That step is also fast in 2.6.27.48 but slow in
2.6.32+.
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test run just this far?
>
> -Eric

will try to install seekwatcher.

thanks,
Kay
From: Kay Diederichs on 3 Aug 2010 09:40

Eric Sandeen wrote:
> On 08/02/2010 09:52 AM, Kay Diederichs wrote:
>> Dave,
>>
>> as you suggested, we reverted "ext4: Avoid group preallocation for
>> closed files" and this indeed fixes a big part of the problem: after
>> booting the NFS server we get
>>
>> NFS-Server: turn5 2.6.32.16p i686
>> NFS-Client: turn10 2.6.18-194.8.1.el5 x86_64
>>
>> exported directory on the nfs-server:
>> /dev/md5 /mnt/md5 ext4
>> rw,seclabel,noatime,barrier=1,stripe=512,data=writeback 0 0
>>
>>  48 seconds for preparations
>>  28 seconds to rsync 100 frames with 597M from nfs directory
>>  57 seconds to rsync 100 frames with 595M to nfs directory
>>  70 seconds to untar 24353 kernel files with 323M to nfs directory
>>  57 seconds to rsync 24353 kernel files with 323M from nfs directory
>> 133 seconds to run xds_par in nfs directory
>> 425 seconds to run the script
>
> Interesting, I had found this commit to be a problem for small files
> which are constantly created & deleted; the commit had the effect of
> packing the newly created files into the first free space that could be
> found, rather than walking down the disk and leaving potentially
> fragmented freespace behind (see the attached seekwatcher graph).
> Reverting the patch sped things up for this test, but left the
> filesystem freespace in bad shape.
>
> But you seem to see one of the largest effects here:
>
> 261 seconds to rsync 100 frames with 595M to nfs directory
> vs
> 57 seconds to rsync 100 frames with 595M to nfs directory
>
> with the patch reverted making things go faster. So you are doing 100
> 6MB writes to the server, correct? Is the filesystem mkfs'd fresh
> before each test, or is it aged? If not mkfs'd, is it at least
> completely empty prior to the test, or does data remain on it? I'm just
> wondering whether fragmented freespace is contributing to this behavior
> as well. If there is fragmented freespace, then with the patch I think
> the allocator is more likely to hunt around for small discontiguous
> chunks of free space, rather than going further out on the disk looking
> for a large area to allocate from.
>
> It might be interesting to use seekwatcher on the server to visualize
> the allocation/IO patterns for the test run just this far?
>
> -Eric

Eric,

seekwatcher does not seem to understand the blktrace output of old
kernels, so I rolled my own primitive plotting, e.g.

blkparse -i md5.xds_par.2.6.32.16p_run1 > blkparse.out
grep flush blkparse.out | grep W > flush_W
grep flush blkparse.out | grep R > flush_R
grep nfsd blkparse.out | grep R > nfsd_R
grep nfsd blkparse.out | grep W > nfsd_W
grep sync blkparse.out | grep R > sync_R
grep sync blkparse.out | grep W > sync_W
gnuplot <<EOF
set term png
set out '2.6.32.16p_run1.png'
set key outside
set title "2.6.32.16p_run1"
plot 'nfsd_W' us 4:8, 'flush_W' us 4:8, 'sync_W' us 4:8, 'nfsd_R' us 4:8, 'flush_R' us 4:8
EOF

I attach the resulting plots for 2.6.27.48_run1 (after booting) and
2.6.27.48_run2 (after run1, sync, and drop_caches). They show seconds on
the x axis (horizontal) and block numbers on the y axis (vertical);
these are 512-byte blocks, I suppose, since the ext4 filesystem has
976761344 4096-byte blocks, which would be about 8e+09 512-byte blocks.
You'll have to do the real interpretation of the plots yourself, but
even someone who does not know exactly what the pdflush (in 2.6.27.48)
or flush (in 2.6.32+) kernel threads are supposed to do can tell that
the kernels behave _very_ differently.

In particular, stock 2.6.32.16 has the flush thread visiting all of the
filesystem every time (only run1 is shown, but run2 is the same), in
steps of 263168 blocks. I have no idea why it does this. Roughly the
first 1/3 of the filesystem is also visited by kernel 2.6.27.48 and by
the patched 2.6.32.16 that Dave Chinner suggested, but only in the first
run after booting. Subsequent runs are fast and do not employ the flush
thread much.

Hope this helps to pin down the regression.

thanks,
Kay
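Incidentally, the grep pipeline in the message above can be collapsed
into a single pass over blkparse.out. A sketch, relying on blkparse's
default text output where field 4 is the timestamp and field 8 the
starting sector (which is why the gnuplot script plots 'us 4:8'); the
guard that skips blkparse's trailing summary lines is an addition of
mine, not something from the thread:

  # One pass over blkparse.out: write "time sector" pairs into the same
  # flush_{R,W}, nfsd_{R,W} and sync_{R,W} files the grep pipeline produces.
  awk '
      $4 !~ /^[0-9]+\./ { next }                 # skip blkparse summary lines
      {
          who = ""
          if (index($0, "flush"))      who = "flush"
          else if (index($0, "nfsd"))  who = "nfsd"
          else if (index($0, "sync"))  who = "sync"
          if (who == "") next
          dir = index($7, "W") ? "W" : (index($7, "R") ? "R" : "")
          if (dir == "") next
          print $4, $8 > (who "_" dir)           # field 4: time, field 8: sector
      }
  ' blkparse.out

With these two-column files, the gnuplot plot line would use 'us 1:2'
instead of 'us 4:8'.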
From: Kay Diederichs on 4 Aug 2010 04:20

On 02.08.2010 22:21, Ted Ts'o wrote:
> On Mon, Aug 02, 2010 at 05:30:03PM +0200, Kay Diederichs wrote:
>>
>> we pared down the benchmark to the last step (called "run xds_par in nfs
>> directory (reads 600M, and writes 50M)") because this captures most of
>> the problem. Here we report kernel messages with stacktrace, and the
>> blktrace output that you requested.
>
> Thanks, I'll take a look at it.
>
> Is NFS required to reproduce the problem? If you simply copy the 100
> files using rsync, or cp -r while logged onto the server, do you
> notice the performance regression?
>
> Thanks, regards,
>
> - Ted

Ted,

we've run the benchmarks directly on the file server; it turns out that
NFS is not required to reproduce the problem.

We also took the opportunity to try 2.6.32.17, which just came out.
2.6.32.17 behaves similarly to the patched 2.6.32.16 (i.e. with
"ext4: Avoid group preallocation for closed files" reverted); 2.6.32.17
has quite a few ext4 patches, so one or several of those seem to have a
similar effect as reverting "ext4: Avoid group preallocation for closed
files".

These are the times for the second (and later) benchmark runs; the first
run is always slower. The last step ("run xds_par") is slower than in
the NFS case because it is heavy in CPU usage (total CPU time is more
than 200 seconds); the NFS client is an 8-core (+HT) Nehalem-type
machine, whereas the NFS server is just a 2-core Pentium D @ 3.40GHz.

Local machine: turn5 2.6.27.48 i686
Raid5: /dev/md5 /mnt/md5 ext4dev rw,noatime,barrier=1,stripe=512,data=writeback 0 0
 32 seconds for preparations
 19 seconds to rsync 100 frames with 597M from raid5,ext4 directory
 17 seconds to rsync 100 frames with 595M to raid5,ext4 directory
 36 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
 31 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
267 seconds to run xds_par in raid5,ext4 directory
427 seconds to run the script

Local machine: turn5 2.6.32.16 i686 (vanilla, i.e. not patched)
Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0
 36 seconds for preparations
 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory
 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory
 68 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
 40 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
489 seconds to run xds_par in raid5,ext4 directory
714 seconds to run the script

Local machine: turn5 2.6.32.17 i686
Raid5: /dev/md5 /mnt/md5 ext4 rw,seclabel,noatime,barrier=0,stripe=512,data=writeback 0 0
 38 seconds for preparations
 18 seconds to rsync 100 frames with 597M from raid5,ext4 directory
 33 seconds to rsync 100 frames with 595M to raid5,ext4 directory
 67 seconds to untar 24353 kernel files with 323M to raid5,ext4 directory
 41 seconds to rsync 24353 kernel files with 323M from raid5,ext4 directory
266 seconds to run xds_par in raid5,ext4 directory
492 seconds to run the script

So even if the patches that went into 2.6.32.17 seem to fix the worst
stalls, it is obvious that untarring and rsyncing kernel files is still
significantly slower on 2.6.32.17 than on 2.6.27.48.

HTH,
Kay
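To pin down which of the 2.6.32.17 ext4 changes has this effect, one
could look at just the ext4/jbd2 commits in that stable release and, if
needed, bisect over them. A sketch against the linux-2.6.32.y stable git
tree, assuming the usual vX.Y.Z release tags are available; the revert
is shown via a subject-line lookup because the commit id of the
suspected patch is not given in the thread:

  # ext4/jbd2 commits that went into the 2.6.32.17 stable release:
  git log --oneline v2.6.32.16..v2.6.32.17 -- fs/ext4 fs/jbd2

  # Find the suspected commit by its subject line and revert it on top of
  # the stable tree (roughly what the "2.6.32.16p" kernel above amounts to):
  COMMIT=$(git log --format=%H \
           --grep='ext4: Avoid group preallocation for closed files' \
           v2.6.32.16 | head -n1)
  git revert --no-edit "$COMMIT"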