From: Asdo on 27 Feb 2010 22:20

Justin Piszcz wrote:
>
>
> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>
>> Justin Piszcz <jpiszcz(a)lucidpixels.com> writes:
>>
>>> Hello,
>>>
>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>> I see about half the performance as XFS for sequential writes.
>>>
>>> I have checked the doc and tried several options, a few of which are
>>> shown below (I have also tried the commit/journal_async/etc options but
>>> none of them get the write speeds anywhere near XFS)?
>>>
>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>
>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>> volume.
>>>
>>> How do I 'speed' up ext4? Is it possible?

Hi Justin
sorry for being OT in my reply (I can't answer your question unfortunately)

You can really get 550MiB/sec through a 10gigabit ethernet connection?
I didn't think it was possible. Just a few years ago it seems to me there
were problems in obtaining a full gigabit out of 1Gigabit ethernet
adapters...
Is it running some kind of offloading like TOE, or RDMA or other magic
things? (maybe by default... you can check something with ethtool
--show-offload eth0, but TOE isn't there)
Or really computers became so fast and I missed something...?

Sorry for the stupid question
(pls note: I removed most CC recipients because I went OT)

Thank you
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: tytso on 28 Feb 2010 00:50

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>
> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Can you run "filefrag -v <filename>" on the large file you created
using dd? Part of the problem may be the block allocator simply not
being well optimized for super large writes. To be honest, that's not
something we've tried (at all) to optimize, mainly because for most
users of ext4 they're more interested in much more reasonable sized
files, and we only have so many hours in a day to hack on ext4. :-)

XFS in contrast has in the past had plenty of paying customers
interested in writing really large scientific data sets, so this is
something Irix *has* spent time optimizing. As far as I know none of
the ext4 developers' day jobs are currently focused on really large
files using ext4. Some of us do use ext4 to support really large
files, but it's via some kind of cluster or parallel file system
layered on top of ext4 (i.e., Sun/Clusterfs Lustre File Systems, or
Google's GFS) --- and so what gets actually stored in ext4 isn't a
single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no
one has really noticed this performance problem until you brought it
up. I'd like to see ext4 be a good general purpose file system, which
includes handling the really big files stored in a single system. But
it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that
would be a great first step. Something else that would be useful is
gathering blktrace information, so we can see how we are scheduling
the writes and whether we have something bad going on there. I
wouldn't be surprised if there is some stupidity going on in the
generic FS/MM writeback code which is throttling us, and which XFS has
worked around. Ext4 has worked around some writeback brain-damage
already, but I've been focused on much smaller files (files in the
tens or hundreds of megabytes) since that's what I tend to use much
more frequently.

It's great to see that you're really interested in this; if you're
willing to do some investigative work, hopefully it's something we can
address.

Best Regards,

- Ted

P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
in one of your earlier threads. Does using nodelalloc actually speed
things up? There were a bunch of numbers being thrown around, and in
some configurations I thought you were getting around 300 MB/s without
using nodelalloc? Or am I misunderstanding your numbers and what
configurations you used with each test run?

If nodelalloc is actually speeding things up, then we almost certainly
have some kind of writeback problem. So filefrag and blktrace are
definitely the tools we need to look at to understand what is going
on.
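As a rough aid for reading the kind of "filefrag -v" listing requested here, the shell sketch below counts the rows in a listing, the number of physically contiguous runs, and the total blocks. It assumes the old column layout that appears later in this thread (ext, logical, physical, an optional "expected" column, length, flags); newer filefrag versions print ranges and would need a different parser. The name filefrag_summary is made up for illustration and is not part of e2fsprogs.

```sh
#!/bin/sh
# Hypothetical helper: summarize old-style `filefrag -v` output read from stdin.
# Assumed row layout: ext logical physical [expected] length [flags]
filefrag_summary() {
    awk '
        $1 ~ /^[0-9]+$/ {
            n = 0
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) { n++; last = $i }   # last numeric field is the length
            rows++
            blocks += last
            if (n == 5) breaks++      # an "expected" value marks a physical jump
        }
        END {
            runs = rows ? breaks + 1 : 0
            printf "rows=%d runs=%d blocks=%d\n", rows, runs, blocks
        }
    '
}

# Usage: filefrag -v /r1/bigfile | filefrag_summary
```

Applied by hand to the ordered-mode listing later in the thread, this counting gives 81 rows of which 12 carry an "expected" value, i.e. 13 runs, which matches the "13 extents found" line and suggests that this filefrag version reports contiguous runs rather than rows.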
From: Justin Piszcz on 28 Feb 2010 05:00

On Sun, 28 Feb 2010, Asdo wrote:

> Justin Piszcz wrote:
>>
>>
>> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>>
>>> Justin Piszcz <jpiszcz(a)lucidpixels.com> writes:
>>>
>>>> Hello,
>>>>
>>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>>> I see about half the performance as XFS for sequential writes.
>>>>
>>>> I have checked the doc and tried several options, a few of which are
>>>> shown below (I have also tried the commit/journal_async/etc options but
>>>> none of them get the write speeds anywhere near XFS)?
>>>>
>>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>>
>>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>>> volume.
>>>>
>>>> How do I 'speed' up ext4? Is it possible?
> Hi Justin
> sorry for being OT in my reply (I can't answer your question unfortunately)
> You can really get 550MiB/sec through a 10gigabit ethernet connection?
Yes, I am capped by the disk I/O; the network card itself does ~1 gigabyte
per second over iperf. If I had two raid systems that did >= 1Gbyte/sec
read+write AND enough PCI-e bandwidth, it would be plausible to see large
files transferring at 10Gbps speeds.

> I didn't think it was possible. Just a few years ago it seems to me there
> were problems in obtaining a full gigabit out of 1Gigabit ethernet
> adapters...
I have been running gigabit for a while now and have been able to saturate
it for some time between Linux hosts. If you are referring to Windows and
the transfer rates via samba, their networking stack did not get 'fixed'
until Windows 7; before that it seemed like it was 'capped' at 40-60MiB/s,
regardless of the HW. With 7, you always get ~100MiB/s if your HW is fast
enough.

A single Intel X25-E SSD can read > 200MiB/s, as can many of the newer SSDs
being released, with the Micron 6Gbps pushing 300MiB/s. As SSDs become more
mainstream, gigabit will become more and more of a bottleneck.

> Is it running some kind of offloading like TOE, or RDMA or other magic
> things? (maybe by default... you can check something with ethtool
Yes, check the features here (page 2/4), half way down:
http://www.intel.com/Assets/PDF/prodbrief/318349.pdf

> --show-offload eth0, but TOE isn't there)
> Or really computers became so fast and I missed something...?
PCI-express (for the bandwidth) (not PCI-X), jumbo frames (mtu=9000) and the
2.6 kernel.

> Sorry for the stupid question
> (pls note: I removed most CC recipients because I went OT)
>
> Thank you
>
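To put the link speeds discussed above in the same units as the disk figures, here is a small shell sketch that converts a line rate in Gbit/s to a theoretical ceiling in MiB/s. It deliberately ignores all Ethernet/IP/TCP overhead, so real transfers will come in lower; the function name link_mib_per_s is invented for this example.

```sh
#!/bin/sh
# Hypothetical helper: theoretical payload ceiling of a link, ignoring all
# protocol overhead. Input is the line rate in Gbit/s (decimal giga).
link_mib_per_s() {
    awk -v gbps="$1" 'BEGIN { printf "%.1f\n", gbps * 1e9 / 8 / 1048576 }'
}

link_mib_per_s 1     # gigabit Ethernet
link_mib_per_s 10    # 10 gigabit Ethernet
```

This gives about 119 MiB/s for gigabit and about 1192 MiB/s for 10 gigabit. The ~100MiB/s quoted for a saturated gigabit link and the ~1 gigabyte per second iperf result both sit a little under those ceilings, and the 550MiB/s reads are less than half of the 10 gigabit figure, which fits the statement that the disks rather than the network are the cap.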
From: Justin Piszcz on 28 Feb 2010 10:00

On Sun, 28 Feb 2010, tytso(a)mit.edu wrote:

> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>
>> I still would like to know however, why 350MiB/s seems to be the maximum
>> performance I can get from two different md raids (that easily do 600MiB/s
>> with XFS).

> Can you run "filefrag -v <filename>" on the large file you created
> using dd? Part of the problem may be the block allocator simply not
> being well optimized for super large writes. To be honest, that's not
> something we've tried (at all) to optimize, mainly because for most
> users of ext4 they're more interested in much more reasonable sized
> files, and we only have so many hours in a day to hack on ext4. :-)
> XFS in contrast has in the past had plenty of paying customers
> interested in writing really large scientific data sets, so this is
> something Irix *has* spent time optimizing.
Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.

[ .. ]

> So if you can gather some data, such as the filefrag information, that
> would be a great first step. Something else that would be useful is
> gathering blktrace information, so we can see how we are scheduling
> the writes and whether we have something bad going on there. I
> wouldn't be surprised if there is some stupidity going on in the
> generic FS/MM writeback code which is throttling us, and which XFS has
> worked around. Ext4 has worked around some writeback brain-damage
> already, but I've been focused on much smaller files (files in the
> tens or hundreds of megabytes) since that's what I tend to use much
> more frequently.
>
> It's great to see that you're really interested in this; if you're
> willing to do some investigative work, hopefully it's something we can
> address.

[ .. ]

> P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
> in one of your earlier threads. Does using nodelalloc actually speed
> things up? There were a bunch of numbers being thrown around, and in
> some configurations I thought you were getting around 300 MB/s without
> using nodelalloc? Or am I misunderstanding your numbers and what
> configurations you used with each test run?
This is more dramatic on the software raid (mdadm) RAID-5 configuration.

Without -o nodelalloc, I see roughly 200MiB/s. With -o nodelalloc, I hit
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less
dramatic.

The full tests and output appear at the bottom of this e-mail; however, for
brevity, the example below shows 55MiB/s and 132MiB/s performance increases
with RAID-0 and RAID-5 respectively:

For the RAID-0:
-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.

For the RAID-5 (from earlier testing):
-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
An increase of 132MiB/s.

>
> If nodelalloc is actually speeding things up, then we almost certainly
> have some kind of writeback problem. So filefrag and blktrace are
> definitely the tools we need to look at to understand what is going
> on.
>

=== CREATE RAID-0 WITH 11 DISKS

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
    level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~# === SHOW MDADM RAID-0 p63:~# mdadm -D /dev/md0 /dev/md0: Version : 0.90 Creation Time : Sun Feb 28 06:31:41 2010 Raid Level : raid0 Array Size : 5372223296 (5123.35 GiB 5501.16 GB) Raid Devices : 11 Total Devices : 11 Preferred Minor : 0 Persistence : Superblock is persistent Update Time : Sun Feb 28 06:31:41 2010 State : clean Active Devices : 11 Working Devices : 11 Failed Devices : 0 Spare Devices : 0 Chunk Size : 64K UUID : 077d4d5c:5acbcb29:26614430:c3345183 (local to host p63) Events : 0.1 Number Major Minor RaidDevice State 0 8 17 0 active sync /dev/sdb1 1 8 33 1 active sync /dev/sdc1 2 8 49 2 active sync /dev/sdd1 3 8 65 3 active sync /dev/sde1 4 8 81 4 active sync /dev/sdf1 5 8 97 5 active sync /dev/sdg1 6 8 113 6 active sync /dev/sdh1 7 8 129 7 active sync /dev/sdi1 8 8 145 8 active sync /dev/sdj1 9 8 161 9 active sync /dev/sdk1 10 8 177 10 active sync /dev/sdl1 p63:~# === KERNEL CONFIGURATION BASELINE The following kernel configuration was used: http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-baseline.txt === ESTABLISH CONTROL / BASELINE p63:~# mkfs.xfs /dev/md0 -f meta-data=/dev/md0 isize=256 agcount=32, agsize=41970496 blks = sectsz=512 attr=2 data = bsize=4096 blocks=1343055824, imaxpct=5 = sunit=16 swidth=176 blks naming =version 2 bsize=4096 ascii-ci=0 log =internal log bsize=4096 blocks=521728, version=2 = sectsz=512 sunit=16 blks, lazy-count=1 realtime =none extsz=4096 blocks=0, rtextents=0 p63:~# mount /dev/md0 /r1 -o nobarrier p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 17.9816 s, 597 MB/s 0.03user 16.10system 0:17.99elapsed 89%CPU (0avgtext+0avgdata 7312maxresident)k 0inputs+0outputs (1major+495minor)pagefaults 0swaps p63:/r1# p63:/r1# xfs_bmap -v /r1/bigfile /r1/bigfile: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..20971519]: 671528064..692499583 2 (128..20971647) 20971520 00011 p63:/r1# === CREATE EXT4 FILESYSTEM ON 
ARRAY

(Note: the stripe/width appears to be irrelevant to the speed problem, as the
cap is '350MiB/s' whether it is aligned or not; see the following URL for
those tests.)

http://lkml.org/lkml/2010/2/27/77
NOTE: It compares ext2 vs. ext3 vs. ext4 vs. XFS.
NOTE: nobarrier does not seem to be a factor either, but I will include it
below to ensure it is not somehow impacting the tests performed.

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
    4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
    102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
6.50user 83.89system 2:01.86elapsed 74%CPU (0avgtext+0avgdata 829552maxresident)k 0inputs+0outputs (5major+51889minor)pagefaults 0swaps p63:~# === MOUNT FILESYSTEM WITH NOBARRIER, ORDERED (DEFAULT) & RUN TEST p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 35.2676 s, 304 MB/s 0.02user 19.40system 0:35.29elapsed 55%CPU (0avgtext+0avgdata 7312maxresident)k 0inputs+0outputs (3major+493minor)pagefaults 0swaps p63:/r1# === SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED) p63:/r1# filefrag -v /r1/bigfile Filesystem type is: ef53 File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096) ext logical physical expected length flags 0 0 34816 32768 1 32768 67584 30720 2 63488 100352 98303 32768 3 96256 133120 30720 4 126976 165888 163839 32768 5 159744 198656 30720 6 190464 231424 229375 32768 7 223232 264192 30720 8 253952 296960 294911 32768 9 286720 329728 32768 10 319488 362496 32768 11 352256 395264 32768 12 385024 428032 32768 13 417792 460800 32768 14 450560 493568 30720 15 481280 557056 524287 32768 16 514048 589824 32768 17 546816 622592 32768 18 579584 655360 32768 19 612352 688128 32768 20 645120 720896 32768 21 677888 753664 32768 22 710656 786432 32768 23 743424 821248 819199 32768 24 776192 854016 30720 25 806912 886784 884735 32768 26 839680 919552 32768 27 872448 952320 32768 28 905216 985088 32768 29 937984 1017856 30720 30 968704 1081344 1048575 32768 31 1001472 1114112 32768 32 1034240 1146880 32768 33 1067008 1179648 32768 34 1099776 1212416 32768 35 1132544 1245184 32768 36 1165312 1277952 32768 37 1198080 1310720 32768 38 1230848 1343488 32768 39 1263616 1376256 32768 40 1296384 1409024 32768 41 1329152 1441792 32768 42 1361920 1474560 32768 43 1394688 1507328 32768 44 1427456 1540096 32768 45 1460224 1607680 1572863 32768 46 1492992 1640448 32768 47 1525760 1673216 32768 48 1558528 1705984 32768 49 1591296 1738752 32768 50 1624064 1771520 32768 51 1656832 
1804288 32768 52 1689600 1837056 32768 53 1722368 1869824 32768 54 1755136 1902592 32768 55 1787904 1935360 32768 56 1820672 1968128 32768 57 1853440 2000896 32768 58 1886208 2033664 32768 59 1918976 2066432 30720 60 1949696 2129920 2097151 32768 61 1982464 2162688 32768 62 2015232 2195456 32768 63 2048000 2228224 32768 64 2080768 2260992 32768 65 2113536 2293760 32768 66 2146304 2326528 32768 67 2179072 2359296 32768 68 2211840 2392064 32768 69 2244608 2424832 32768 70 2277376 2457600 32768 71 2310144 2490368 32768 72 2342912 2523136 32768 73 2375680 2555904 32768 74 2408448 2588672 32768 75 2441216 2656256 2621439 32768 76 2473984 2689024 32768 77 2506752 2721792 32768 78 2539520 2754560 32768 79 2572288 2787328 18432 80 2590720 2818048 2805759 30720 eof /r1/bigfile: 13 extents found p63:/r1# === MOUNT FILESYSTEM WITH NOBARRIER, WRITEBACK & RUN TEST p63:/# mount /dev/md0 -o data=writeback,nobarrier /r1 p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s 0.02user 19.38system 0:34.78elapsed 55%CPU (0avgtext+0avgdata 7280maxresident)k 0inputs+0outputs (3major+491minor)pagefaults 0swaps p63:/r1# === SHOW FILEFRAG OUTPUT (NOBARRIER,WRITEBACK) p63:/r1# filefrag -v /r1/bigfile Filesystem type is: ef53 File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096) ext logical physical expected length flags 0 0 34816 32768 1 32768 67584 30720 2 63488 100352 98303 32768 3 96256 133120 30720 4 126976 165888 163839 32768 5 159744 198656 30720 6 190464 231424 229375 32768 7 223232 264192 30720 8 253952 296960 294911 32768 9 286720 329728 32768 10 319488 362496 32768 11 352256 395264 32768 12 385024 428032 32768 13 417792 460800 32768 14 450560 493568 30720 15 481280 557056 524287 32768 16 514048 589824 32768 17 546816 622592 32768 18 579584 655360 32768 19 612352 688128 32768 20 645120 720896 32768 21 677888 753664 32768 22 710656 786432 32768 23 743424 821248 
819199 32768 24 776192 854016 30720 25 806912 886784 884735 32768 26 839680 919552 32768 27 872448 952320 32768 28 905216 985088 32768 29 937984 1017856 30720 30 968704 1081344 1048575 32768 31 1001472 1114112 32768 32 1034240 1146880 32768 33 1067008 1179648 32768 34 1099776 1212416 32768 35 1132544 1245184 32768 36 1165312 1277952 32768 37 1198080 1310720 32768 38 1230848 1343488 32768 39 1263616 1376256 32768 40 1296384 1409024 32768 41 1329152 1441792 32768 42 1361920 1474560 32768 43 1394688 1507328 32768 44 1427456 1540096 32768 45 1460224 1607680 1572863 32768 46 1492992 1640448 32768 47 1525760 1673216 32768 48 1558528 1705984 32768 49 1591296 1738752 32768 50 1624064 1771520 32768 51 1656832 1804288 32768 52 1689600 1837056 32768 53 1722368 1869824 32768 54 1755136 1902592 32768 55 1787904 1935360 32768 56 1820672 1968128 32768 57 1853440 2000896 32768 58 1886208 2033664 32768 59 1918976 2066432 30720 60 1949696 2129920 2097151 32768 61 1982464 2162688 32768 62 2015232 2195456 32768 63 2048000 2228224 32768 64 2080768 2260992 32768 65 2113536 2293760 32768 66 2146304 2326528 32768 67 2179072 2359296 32768 68 2211840 2392064 32768 69 2244608 2424832 32768 70 2277376 2457600 32768 71 2310144 2490368 32768 72 2342912 2523136 32768 73 2375680 2555904 32768 74 2408448 2588672 32768 75 2441216 2656256 2621439 32768 76 2473984 2689024 32768 77 2506752 2721792 32768 78 2539520 2754560 16384 /r1/bigfile: 12 extents found p63:/r1# === USE OF -o nodelalloc WITH SOFTWARE RAID-0 (SPEED IMPROVEMENT) p63:/r1# mount /dev/md0 -o data=writeback,nobarrier,nodelalloc /r1 p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s 0.02user 28.95system 0:29.56elapsed 98%CPU (0avgtext+0avgdata 7312maxresident)k 0inputs+0outputs (3major+493minor)pagefaults 0swaps p63:/r1# While it does help, I have not been able to get > 400MiB/s, it stops at roughly 350-360MiB/s. 
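One detail worth keeping straight when comparing these runs: dd prints decimal MB/s, while the prose in this thread mostly says MiB/s. The shell sketch below recomputes both from the byte count and elapsed time that dd reports. The helper name dd_rate is made up for this example.

```sh
#!/bin/sh
# Hypothetical helper: dd_rate <bytes> <seconds>
# Prints the rate in decimal MB/s (as dd does) and in binary MiB/s.
dd_rate() {
    awk -v b="$1" -v s="$2" 'BEGIN {
        printf "%.1f MB/s %.1f MiB/s\n", b / s / 1e6, b / s / 1048576
    }'
}

dd_rate 10737418240 29.5299   # the nodelalloc run above
```

For the nodelalloc run this works out to roughly 364 MB/s, or about 347 MiB/s, so the "350-360" ceiling reads as MB/s in dd's units and is a few percent lower when expressed in MiB/s. The relative gap between ext4 and XFS is unaffected, since both were measured by dd in the same units.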
=== FIRST ATTEMPT AT USING BLKTRACE

Following these docs:
http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug
http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html

Options required in the kernel:

Kernel hacking:
    [*] Debug Filesystem

Then the BLK_IO_TRACE (it has moved from where the old docs say to go)

Kernel Hacking:
    [ ] Tracers --->
        [*] Support for tracing block IO actions

Compile new kernel, reboot.

New kernel configuration used (only enabled the options shown above)
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-blktrace.txt

Next step, create a fresh filesystem for the trace event:
p63:~# /usr/bin/time mkfs.ext4 /dev/md0
< .. >
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Reboot to new kernel.

Per: http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug

Mount the debug filesystem/make sure it is mounted:
p63:~# mount -t debugfs debugfs /sys/kernel/debug
mount: debugfs already mounted or /sys/kernel/debug busy
mount: according to mtab, debugfs is already mounted on /sys/kernel/debug
p63:~#

Then follow instructions on page 14 from:
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113

p63:/dev/shm/client# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!

Mount filesystem with -o data=writeback,nobarrier, run test blktrace1.
p63:~# mount -o data=writeback,nobarrier /dev/md0 /r1 p63:~# cd /r1 p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 35.6317 s, 301 MB/s 0.03user 19.41system 0:35.67elapsed 54%CPU (0avgtext+0avgdata 7312maxresident)k 0inputs+0outputs (2major+494minor)pagefaults 0swaps p63:/r1# rm bigfile p63:/r1# sync p63:/r1# cd p63:~# umount /r1 p63:~# SERVER PROCESS: p63:/dev/shm/server# blktrace -l server: waiting for connections... server: connection from 192.168.168.113 server: end of run for 192.168.168.113:md0 === md0 === CPU 0: 1548634 events, 72593 KiB data CPU 1: 1009268 events, 47310 KiB data Total: 2557902 events (dropped 0), 119902 KiB data p63:/dev/shm/server# ls CLIENT PROCESS: # blktrace -h 192.168.168.113 /dev/md0 blktrace: connecting to 192.168.168.113 blktrace: connected! ^C=== md0 === CPU 0: 1548634 events, 72593 KiB data CPU 1: 1009268 events, 47310 KiB data Total: 2557902 events (dropped 0), 119902 KiB data From this test, the following resulted: # du -sh * 56K 192.168.168.113-2010-02-28-13:10:48 118M 192.168.168.113-2010-02-28-13:14:00 Let this trace be called blktrace1. p63:/dev/shm/server# du -sh blktrace1/* 56K blktrace1/192.168.168.113-2010-02-28-13:10:48 118M blktrace1/192.168.168.113-2010-02-28-13:14:00 p63:/dev/shm/server# Mount with -o data=writeback,nobarrier,nodelalloc, run test blktrace2. p63:~# umount /r1 p63:~# mount -o data=writeback,nobarrier,nodelalloc /dev/md0 /r1 p63:~# cd /r1 p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 30.6692 s, 350 MB/s 0.03user 29.55system 0:30.70elapsed 96%CPU (0avgtext+0avgdata 7312maxresident)k 0inputs+0outputs (1major+495minor)pagefaults 0swaps p63:/r1# rm bigfile p63:/r1# sync p63:/r1# cd p63:~# umount /r1 p63:~# SERVER PROCESS: p63:/dev/shm/server# blktrace -l server: waiting for connections... 
server: connection from 192.168.168.113 server: end of run for 192.168.168.113:md0 === md0 === CPU 0: 50056 events, 2347 KiB data CPU 1: 2478242 events, 116168 KiB data Total: 2528298 events (dropped 0), 118515 KiB data CLIENT PROCESS: # blktrace -h 192.168.168.113 /dev/md0 blktrace: connecting to 192.168.168.113 blktrace: connected! ^C=== md0 === CPU 0: 50056 events, 2347 KiB data CPU 1: 2478242 events, 116168 KiB data Total: 2528298 events (dropped 0), 118515 KiB data # p63:/dev/shm/server# du -sh 192.168.168.113-2010-02-28-13\:17\:22/* 2.4M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.0 114M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.1 This is blktrace2. One more time (blktrace3) with ordered. p63:~# mount -o nobarrier /dev/md0 /r1 p63:~# dmesg | tail -n 2 [ 2788.928806] EXT4-fs (md0): barriers disabled [ 2789.340573] EXT4-fs (md0): mounted filesystem with ordered data mode p63:~# cd /r1 p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240 10240+0 records in 10240+0 records out 10737418240 bytes (11 GB) copied, 36.2893 s, 296 MB/s 0.04user 19.29system 0:36.32elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k 0inputs+0outputs (1major+494minor)pagefaults 0swaps p63:/r1# rm bigfile p63:/r1# sync p63:/r1# cd p63:~# umount /r1 p63:~# SERVER PROCESS: p63:/dev/shm/server# blktrace -l server: waiting for connections... server: connection from 192.168.168.113 server: end of run for 192.168.168.113:md0 === md0 === CPU 0: 1587087 events, 74395 KiB data CPU 1: 970979 events, 45515 KiB data Total: 2558066 events (dropped 0), 119910 KiB data p63:/dev/shm/server# CLIENT PROCESS: # blktrace -h 192.168.168.113 /dev/md0 blktrace: connecting to 192.168.168.113 blktrace: connected! 
=== md0 === CPU 0: 1587087 events, 74395 KiB data CPU 1: 970979 events, 45515 KiB data Total: 2558066 events (dropped 0), 119910 KiB data # TRACE OUTPUT TOTAL AND SUMMARY: p63:~/results-20100228# du -sh * 570M blktrace1 => -o data=writeback,nobarrier 570M blktrace1-redo => -o data=writeback,nobarrier 563M blktrace2 => -o data=writeback,nobarrier,nodelalloc 570M blktrace3 => -o data=nobarrier 4.0K script p63:~/results-20100228# USING SCRIPT ON PAGE 24/30: Running post-process.sh for each trace: blktrace{1,2,3}, the script itself from page 24/30: # cat /root/post-process.sh #! /bin/bash blkrawverify md0 # check data for errors blkparse -d md0.bin -i md0 > md0.blkparse # merged binary, parsed btt -i md0.bin --all-data > md0.btt # basic btt report # now the whole enchilada! btt -i md0.bin -o md0x --all-data --easy-parse-avgs \ --iostat=md0x.iostat \ --per-io-dump=md0x.pid \ --q2d-latencies=md0x \ --d2c-latencies=md0x \ --q2c-latencies=md0x \ --dump-blocknos=md0x_dbn \ --active-queue-depth=md0x \ --unplug-hist=md0x_uph \ --seeks=seeks \ --seeks-per-second=sps \ --per-io-trees=md0x_pit \ > md0x.btt # md0x.btt is empty # Before running any tests, backup raw data: p63:/dev/shm# tar cf /root/server.tar server p63:/dev/shm# For each directory, run post-process: blktrace1: (I must have waited too long in between steps here so it made two) p63:/dev/shm/server/blktrace1# ls -1 192.168.168.113-2010-02-28-13:10:48 192.168.168.113-2010-02-28-13:14:00 p63:/dev/shm/server/blktrace1# cd *48 p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# /root/post-process.sh Verifying md0 CPU 0 CPU 1 Wrote output to md0.verify.out p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# cd ../*00 p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00# /root/post-process.sh Verifying md0 CPU 0 CPU 1 Wrote output to md0.verify.out p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00# I will make another blktrace1 and be faster this time so all data 
results are of the same type, it is called blktrace1-redo: blktrace1-redo: p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45# /root/post-process.sh Verifying md0 CPU 0 CPU 1 Wrote output to md0.verify.out p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45# blktrace2: p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22# /root/post-process.sh Verifying md0 CPU 0 CPU 1 Wrote output to md0.verify.out p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22# blktrace3: p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29# /root/post-process.sh Verifying md0 CPU 0 CPU 1 Wrote output to md0.verify.out p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29# ------------ === FINAL RESULTS p63:~/results-20100228# du -sh */* 216K blktrace1/192.168.168.113-2010-02-28-13:10:48 570M blktrace1/192.168.168.113-2010-02-28-13:14:00 570M blktrace1-redo/192.168.168.113-2010-02-28-13:35:45 563M blktrace2/192.168.168.113-2010-02-28-13:17:22 570M blktrace3/192.168.168.113-2010-02-28-13:31:29 4.0K script/post-process.sh p63:~/results-20100228# I used 7zip to compress the results because it offers the best compression ratio of any other utility, including the latest 'xz' utility: http://fixunix.com/kernel/238089-response-kernel-compression-e-mail-few-months-ago.html $ xz -9 linux-2.6.16.17.tar $ du -sk * | sort -n 32392 linux-2.6.16.17.tar.7z 32404 linux-2.6.16.17.tar.xz 33520 linux-2.6.16.17.tar.lzma 33760 linux-2.6.16.17.tar.rar 38064 linux-2.6.16.17.tar.rz 39472 linux-2.6.16.17.tar.szip 39520 linux-2.6.16.17.tar.bz 39936 linux-2.6.16.17.tar.bz2 40000 linux-2.6.16.17.tar.bicom 40656 linux-2.6.16.17.tar.sit 47664 linux-2.6.16.17.tar.lha 49968 linux-2.6.16.17.tar.dzip 50000 linux-2.6.16.17.tar.gz 51344 linux-2.6.16.17.tar.arj 57552 linux-2.6.16.17.tar.lzo 57984 linux-2.6.16.17.tar.F 81136 linux-2.6.16.17.tar.Z 94544 linux-2.6.16.17.tar.zoo 101216 linux-2.6.16.17.tar.arc 228608 linux-2.6.16.17.tar === 
COMPRESSION RESULTS: -rw-r--r-- 1 abc users 155M 2010-02-28 09:02 results-20100228.tar.7z -rw-r--r-- 1 abc users 290M 2010-02-28 08:42 results-20100228.tar.bz2 -rw-r--r-- 1 abc users 2.3G 2010-02-28 08:42 results-20100228.tar === LOCATION: http://liquidswords.org/~war/results-20100228.tar.7z wget http://liquidswords.org/~war/results-20100228.tar.7z === MD5 CHECKSUM: $ md5sum * 1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z 35793b283edf5c0f38738276812aad52 results-20100228.tar === VERIFICATION: MAKE SURE IT WORKS FOR OTHERS: $ wget http://liquidswords.org/~war/results-20100228.tar.7z --2010-02-28 09:48:36-- http://liquidswords.org/~war/results-20100228.tar.7z Resolving liquidswords.org... 71.6.165.232 Connecting to liquidswords.org|71.6.165.232|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 161814574 (154M) [application/x-tar] Saving to: "results-20100228.tar.7z" 100%[======================================>] 161,814,574 2.00M/s in 69s $ $ md5sum *7z 1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z CORRECT $ 7z x results-20100228.tar.7z 7-Zip 4.58 beta Copyright (c) 1999-2008 Igor Pavlov 2008-05-05 p7zip Version 4.58 (locale=en_US,Utf16=on,HugeFiles=on,8 CPUs) Processing archive: results-20100228.tar.7z Extracting results-20100228.tar Everything is Ok Size: 2382561280 Compressed: 161814574 $ md5sum *tar 35793b283edf5c0f38738276812aad52 results-20100228.tar CORRECT Again, the trace information details: p63:~/results-20100228# du -sh * 570M blktrace1 => -o data=writeback,nobarrier 570M blktrace1-redo => -o data=writeback,nobarrier 563M blktrace2 => -o data=writeback,nobarrier,nodelalloc 570M blktrace3 => -o data=nobarrier 4.0K script p63:~/results-20100228# Let me know if you need anything else, thanks. Justin. 
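For anyone repeating the download check above, the comparison can be left to md5sum itself instead of being done by eye. The sketch below wraps "md5sum -c" in a small function; verify_md5 is a name made up for this example, and it assumes GNU coreutils md5sum is installed.

```sh
#!/bin/sh
# Hypothetical helper: verify_md5 <expected-hex-digest> <file>
# Returns success only if the file's MD5 matches the expected digest.
verify_md5() {
    # md5sum -c reads "digest  filename" lines; "-" means read them from stdin.
    printf '%s  %s\n' "$1" "$2" | md5sum -c - >/dev/null 2>&1
}

# Usage with the digest published above:
# verify_md5 1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z && echo CORRECT
```

The exit status makes the check usable in a script, for example before unpacking the archive with 7z.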
From: Dave Chinner on 28 Feb 2010 19:00

On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
> Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
> working with many small files.
>
> EXT4
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+971minor)pagefaults 0swaps
> linux-2.6.33 linux-2.6.33.tar
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+865minor)pagefaults 0swaps
>
> XFS
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+970minor)pagefaults 0swaps
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+864minor)pagefaults 0swaps

Mount XFS with "-o logbsize=262144". Metadata intensive workloads on
XFS are log IO bound, so a larger log buffer size makes a big
difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've been
getting ~21s for the untar, and 8s for the rm -rf with that option set.
Still slower than ext4, but nowhere near as bad.

> So I guess that's the tradeoff, for massive I/O you should use XFS, else,
> use EXT4?

I wouldn't consider writing an 11GB file "massive IO", nor would I
consider 600MB/s massive, either, since you can get that out of
a sub-$10k server these days....

> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Check whether the dd process on ext4 is CPU bound....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
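A first rough answer to the CPU-bound question can be read off the /usr/bin/time lines already posted earlier in this thread. The sketch below recomputes CPU utilization as (user + system) / elapsed for three of those runs. The helper name cpu_pct is invented here, and the figure only covers CPU time charged to the dd process itself; as far as I understand, writeback done by kernel threads is not included.

```sh
#!/bin/sh
# Hypothetical helper: cpu_pct <user-sec> <system-sec> <elapsed-sec>
# Prints the share of one CPU that the process kept busy, in percent.
cpu_pct() {
    awk -v u="$1" -v s="$2" -v e="$3" 'BEGIN { printf "%.1f\n", (u + s) / e * 100 }'
}

cpu_pct 0.02 19.38 34.78   # ext4, -o data=writeback,nobarrier
cpu_pct 0.02 28.95 29.56   # ext4, same options plus nodelalloc
cpu_pct 0.03 16.10 17.99   # XFS baseline, -o nobarrier
```

These come out at about 55.8, 98.0 and 89.7 percent, in line with the 55%, 98% and 89% that time itself printed. On those numbers the nodelalloc run looks very close to saturating one CPU, while the delalloc runs do not, so the two ext4 configurations may well be limited by different things. Something like top or a profiler during the run would be needed to confirm.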