From: Asdo on
Justin Piszcz wrote:
>
>
> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>
>> Justin Piszcz <jpiszcz(a)lucidpixels.com> writes:
>>
>>> Hello,
>>>
>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>> I see about half the performance as XFS for sequential writes.
>>>
>>> I have checked the doc and tried several options, a few of which are
>>> shown
>>> below (I have also tried the commit/journal_async/etc options but
>>> none of
>>> them get the write speeds anywhere near XFS)?
>>>
>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>
>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>> volume.
>>>
>>> How do I 'speed' up ext4? Is it possible?
Hi Justin
sorry for being OT in my reply (I can't answer your question unfortunately)
You can really get 550MiB/sec through a 10gigabit ethernet connection?
I didn't think it was possible. Just a few years ago it seems to me
there were problems in obtaining a full gigabit out of 1Gigabit ethernet
adapters...
Is it running some kind of offloading like TOE, or RDMA or other magic
things? (maybe by default... you can check something with ethtool
--show-offload eth0, but TOE isn't there)
Or really computers became so fast and I missed something...?
Sorry for the stupid question
(pls note: I removed most CC recipients because I went OT)

Thank you
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: tytso on
On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>
> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Can you run "filefrag -v <filename>" on the large file you created
using dd? Part of the problem may be the block allocator simply not
being well optimized super large writes. To be honest, that's not
something we've tried (at all) to optimize, mainly because for most
users of ext4 they're more interested in much more reasonable sized
files, and we only have so many hours in a day to hack on ext4. :-)
XFS in contrast has in the past had plenty of paying customers
interested in writing really large scientific data sets, so this is
something Irix *has* spent time optimizing.

As far as I know none of the ext4 developers' day jobs are currently
focused on really large files using ext4. Some of us do use ext4 to
support really large files, but it's via some kind of cluster or
parallel file system layered on top of ext4 (i.e., Sun/Clusterfs
Lustre File Systems, or Google's GFS) --- and so what gets actually
stored in ext4 isn't a single 10-20 gigabyte file.

I'm saying this not as an excuse; but it's an explanation for why no
one has really noticed this performance problem until you brought it
up. I'd like to see ext4 be a good general purpose file system, which
includes handling the really big files stored in a single system. But
it's just not something we've tried optimizing yet.

So if you can gather some data, such as the filefrag information, that
would be a great first step. Something else that would be useful is
gathering blktrace information, so we can see how we are scheduling
the writes and whether we have something bad going on there. I
wouldn't be surprised if there is some stupidity going on in the
generic FS/MM writeback code which is throttling us, and which XFS has
worked around. Ext4 has worked around some writeback brain-damage
already, but I've been focused on much smaller files (files in the
tens or hundreds megabytes) since that's what I tend to use much more
frequently.

It's great to see that you're really interested in this; if you're
willing to do some investigative work, hopefully it's something we can
address.

Best Regards,

- Ted

P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
in one of your earlier threads. Does using nodelalloc actually speeds
things up? There were a bunch of numbers being thrown around, and in
some configurations I thought you were getting around 300 MB/s without
using nodelalloc? Or am I misunderstanding your numbers and what
configuratoins you used with each test run?

If nodelalloc is actually speeding things up, then we almost certainly
have some kind of writeback problem. So filefrag and blktrace are
definitely the tools we need to look at to understand what is going
on.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Justin Piszcz on


On Sun, 28 Feb 2010, Asdo wrote:

> Justin Piszcz wrote:
>>
>>
>> On Sat, 27 Feb 2010, Dmitry Monakhov wrote:
>>
>>> Justin Piszcz <jpiszcz(a)lucidpixels.com> writes:
>>>
>>>> Hello,
>>>>
>>>> Is it possible to 'optimize' ext4 so it is as fast as XFS for writes?
>>>> I see about half the performance as XFS for sequential writes.
>>>>
>>>> I have checked the doc and tried several options, a few of which are
>>>> shown
>>>> below (I have also tried the commit/journal_async/etc options but none of
>>>> them get the write speeds anywhere near XFS)?
>>>>
>>>> Sure 'dd' is not a real benchmark, etc, etc, but with 10Gbps between 2
>>>> hosts I get 550MiB/s+ on reads from EXT4 but only 100-200MiB/s write.
>>>>
>>>> When it was XFS I used to get 400-600MiB/s for writes for the same RAID
>>>> volume.
>>>>
>>>> How do I 'speed' up ext4? Is it possible?
> Hi Justin
> sorry for being OT in my reply (I can't answer your question unfortunately)
> You can really get 550MiB/sec through a 10gigabit ethernet connection?
Yes, I am capped by the disk I/O, the network card itself card does ~1
gigabyte per second over iperf. If I had two raid systems that
did >= 1Gbyte/sec read+write AND enough PCI-e bandwidth, it is plausible to
see (large-files) transferring at 10Gbps speeds.

> I didn't think it was possible. Just a few years ago it seems to me there
> were problems in obtaining a full gigabit out of 1Gigabit ethernet
> adapters...
I have been running gigabit for awhile now and have been able to saturate
it for some time between Linux hosts. If you are referring to windows and
the transfer rates via samba, their networking stack did not get 'fixed'
until Windows 7, otherwise it seemd like it was 'capped' at 40-60MiB/s,
regardless of the HW. With 7, you always get ~100MiB/s if your HW is fast
enough. A single Intel X25-E SSD can read > 200MiB/s as can many of the
newer SSDs being released (the Micron 6Gbps) pusing 300MiB/s. As SSDs
become more mainstream, gigabit will become more and more of a bottleneck.

> Is it running some kind of offloading like TOE, or RDMA or other magic
> things? (maybe by default... you can check something with ethtool
Yes, check the features here (page 2/4), half way down:
http://www.intel.com/Assets/PDF/prodbrief/318349.pdf

> --show-offload eth0, but TOE isn't there)
> Or really computers became so fast and I missed something...?
PCI-express (for the bandwidth) (not PCI-X), jumbo frames (mtu=9000)
and the 2.6 kernel.

> Sorry for the stupid question
> (pls note: I removed most CC recipients because I went OT)
>
> Thank you
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Justin Piszcz on


On Sun, 28 Feb 2010, tytso(a)mit.edu wrote:

> On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
>>
>> I still would like to know however, why 350MiB/s seems to be the maximum
>> performance I can get from two different md raids (that easily do 600MiB/s
>> with XFS).

> Can you run "filefrag -v <filename>" on the large file you created
> using dd? Part of the problem may be the block allocator simply not
> being well optimized super large writes. To be honest, that's not
> something we've tried (at all) to optimize, mainly because for most
> users of ext4 they're more interested in much more reasonable sized
> files, and we only have so many hours in a day to hack on ext4. :-)
> XFS in contrast has in the past had plenty of paying customers
> interested in writing really large scientific data sets, so this is
> something Irix *has* spent time optimizing.
Yes, this is shown at the bottom of the e-mail both with -o data=ordered
and data=writeback.

[ .. ]

> So if you can gather some data, such as the filefrag information, that
> would be a great first step. Something else that would be useful is
> gathering blktrace information, so we can see how we are scheduling
> the writes and whether we have something bad going on there. I
> wouldn't be surprised if there is some stupidity going on in the
> generic FS/MM writeback code which is throttling us, and which XFS has
> worked around. Ext4 has worked around some writeback brain-damage
> already, but I've been focused on much smaller files (files in the
> tens or hundreds megabytes) since that's what I tend to use much more
> frequently.
>
> It's great to see that you're really interested in this; if you're
> willing to do some investigative work, hopefully it's something we can
> address.

[ .. ]

> P.S. I'm a bit unclear regarding your comment about "-o nodelalloc"
> in one of your earlier threads. Does using nodelalloc actually speeds
> things up? There were a bunch of numbers being thrown around, and in
> some configurations I thought you were getting around 300 MB/s without
> using nodelalloc? Or am I misunderstanding your numbers and what
> configuratoins you used with each test run?
This is more dramatic on the software raid (mdadm) RAID-5 configuration.
Without -o nodelalloc, I see roughly 200MiB/s. With -o nodelalloc, I hit
the same barrier as the RAID-0, 350MiB/s, but its effect on RAID-0 is less
dramatic. The full tests and output appear at the bottom of this e-mail;
however, for brevity, the example below shows 55MiB/s and 132MiB/s
performance increases with RAID-0 and RAID-5 respectively:

For the RAID-0:

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
An increase of 55MiB/s.

For the RAID-5 (from earlier testing):

-o data=writeback,nobarrier:
10737418240 bytes (11 GB) copied, 48.7335 s, 220 MB/s
-o data=writeback,nobarrier,nodelalloc:
10737418240 bytes (11 GB) copied, 30.5425 s, 352 MB/s
An increase of 132MiB/s.

>
> If nodelalloc is actually speeding things up, then we almost certainly
> have some kind of writeback problem. So filefrag and blktrace are
> definitely the tools we need to look at to understand what is going
> on.
>

=== CREATE RAID-0 WITH 11 DISKS

p63:~# mdadm --create -e 0.90 /dev/md0 /dev/sd[b-l]1 --level=0 -n 11 -c 64
mdadm: /dev/sdb1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdc1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdd1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sde1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdf1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdg1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdh1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdi1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdj1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdk1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
mdadm: /dev/sdl1 appears to be part of a raid array:
level=raid0 devices=11 ctime=Sun Feb 28 06:24:58 2010
Continue creating array? y
mdadm: array /dev/md0 started.
p63:~#

=== SHOW MDADM RAID-0

p63:~# mdadm -D /dev/md0
/dev/md0:
Version : 0.90
Creation Time : Sun Feb 28 06:31:41 2010
Raid Level : raid0
Array Size : 5372223296 (5123.35 GiB 5501.16 GB)
Raid Devices : 11
Total Devices : 11
Preferred Minor : 0
Persistence : Superblock is persistent

Update Time : Sun Feb 28 06:31:41 2010
State : clean
Active Devices : 11
Working Devices : 11
Failed Devices : 0
Spare Devices : 0

Chunk Size : 64K

UUID : 077d4d5c:5acbcb29:26614430:c3345183 (local to host p63)
Events : 0.1

Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
2 8 49 2 active sync /dev/sdd1
3 8 65 3 active sync /dev/sde1
4 8 81 4 active sync /dev/sdf1
5 8 97 5 active sync /dev/sdg1
6 8 113 6 active sync /dev/sdh1
7 8 129 7 active sync /dev/sdi1
8 8 145 8 active sync /dev/sdj1
9 8 161 9 active sync /dev/sdk1
10 8 177 10 active sync /dev/sdl1
p63:~#

=== KERNEL CONFIGURATION BASELINE

The following kernel configuration was used:
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-baseline.txt

=== ESTABLISH CONTROL / BASELINE

p63:~# mkfs.xfs /dev/md0 -f
meta-data=/dev/md0 isize=256 agcount=32, agsize=41970496 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=1343055824, imaxpct=5
= sunit=16 swidth=176 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=521728, version=2
= sectsz=512 sunit=16 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
p63:~# mount /dev/md0 /r1 -o nobarrier
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 17.9816 s, 597 MB/s
0.03user 16.10system 0:17.99elapsed 89%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1#

p63:/r1# xfs_bmap -v /r1/bigfile
/r1/bigfile:
EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS
0: [0..20971519]: 671528064..692499583 2 (128..20971647) 20971520 00011
p63:/r1#

=== CREATE EXT4 FILESYSTEM ON ARRAY (note the stripe/width appears to be
irrelevant to to the speed problem as
as the cap is '350MiB/s' whether it is
aligned or not, see the following URL
for those tests)
http://lkml.org/lkml/2010/2/27/77

NOTE: It compares ext2 vs. ext3 vs. ext4
vs. XFS.

NOTE: nobarrier does not seem to be a
factor either, but I will include
it below to ensure it is not
somehow impacting the tests
performed.

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
mke2fs 1.41.10 (10-Feb-2009)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=0 blocks, Stripe width=0 blocks
335765504 inodes, 1343055824 blocks
67152791 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
40987 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848, 512000000, 550731776, 644972544

Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
6.50user 83.89system 2:01.86elapsed 74%CPU (0avgtext+0avgdata 829552maxresident)k
0inputs+0outputs (5major+51889minor)pagefaults 0swaps
p63:~#

=== MOUNT FILESYSTEM WITH NOBARRIER, ORDERED (DEFAULT) & RUN TEST

p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.2676 s, 304 MB/s
0.02user 19.40system 0:35.29elapsed 55%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,ORDERED)

p63:/r1# filefrag -v /r1/bigfile
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 34816 32768
1 32768 67584 30720
2 63488 100352 98303 32768
3 96256 133120 30720
4 126976 165888 163839 32768
5 159744 198656 30720
6 190464 231424 229375 32768
7 223232 264192 30720
8 253952 296960 294911 32768
9 286720 329728 32768
10 319488 362496 32768
11 352256 395264 32768
12 385024 428032 32768
13 417792 460800 32768
14 450560 493568 30720
15 481280 557056 524287 32768
16 514048 589824 32768
17 546816 622592 32768
18 579584 655360 32768
19 612352 688128 32768
20 645120 720896 32768
21 677888 753664 32768
22 710656 786432 32768
23 743424 821248 819199 32768
24 776192 854016 30720
25 806912 886784 884735 32768
26 839680 919552 32768
27 872448 952320 32768
28 905216 985088 32768
29 937984 1017856 30720
30 968704 1081344 1048575 32768
31 1001472 1114112 32768
32 1034240 1146880 32768
33 1067008 1179648 32768
34 1099776 1212416 32768
35 1132544 1245184 32768
36 1165312 1277952 32768
37 1198080 1310720 32768
38 1230848 1343488 32768
39 1263616 1376256 32768
40 1296384 1409024 32768
41 1329152 1441792 32768
42 1361920 1474560 32768
43 1394688 1507328 32768
44 1427456 1540096 32768
45 1460224 1607680 1572863 32768
46 1492992 1640448 32768
47 1525760 1673216 32768
48 1558528 1705984 32768
49 1591296 1738752 32768
50 1624064 1771520 32768
51 1656832 1804288 32768
52 1689600 1837056 32768
53 1722368 1869824 32768
54 1755136 1902592 32768
55 1787904 1935360 32768
56 1820672 1968128 32768
57 1853440 2000896 32768
58 1886208 2033664 32768
59 1918976 2066432 30720
60 1949696 2129920 2097151 32768
61 1982464 2162688 32768
62 2015232 2195456 32768
63 2048000 2228224 32768
64 2080768 2260992 32768
65 2113536 2293760 32768
66 2146304 2326528 32768
67 2179072 2359296 32768
68 2211840 2392064 32768
69 2244608 2424832 32768
70 2277376 2457600 32768
71 2310144 2490368 32768
72 2342912 2523136 32768
73 2375680 2555904 32768
74 2408448 2588672 32768
75 2441216 2656256 2621439 32768
76 2473984 2689024 32768
77 2506752 2721792 32768
78 2539520 2754560 32768
79 2572288 2787328 18432
80 2590720 2818048 2805759 30720 eof
/r1/bigfile: 13 extents found
p63:/r1#

=== MOUNT FILESYSTEM WITH NOBARRIER, WRITEBACK & RUN TEST

p63:/# mount /dev/md0 -o data=writeback,nobarrier /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 34.755 s, 309 MB/s
0.02user 19.38system 0:34.78elapsed 55%CPU (0avgtext+0avgdata 7280maxresident)k
0inputs+0outputs (3major+491minor)pagefaults 0swaps
p63:/r1#

=== SHOW FILEFRAG OUTPUT (NOBARRIER,WRITEBACK)

p63:/r1# filefrag -v /r1/bigfile
Filesystem type is: ef53
File size of /r1/bigfile is 10737418240 (2621440 blocks, blocksize 4096)
ext logical physical expected length flags
0 0 34816 32768
1 32768 67584 30720
2 63488 100352 98303 32768
3 96256 133120 30720
4 126976 165888 163839 32768
5 159744 198656 30720
6 190464 231424 229375 32768
7 223232 264192 30720
8 253952 296960 294911 32768
9 286720 329728 32768
10 319488 362496 32768
11 352256 395264 32768
12 385024 428032 32768
13 417792 460800 32768
14 450560 493568 30720
15 481280 557056 524287 32768
16 514048 589824 32768
17 546816 622592 32768
18 579584 655360 32768
19 612352 688128 32768
20 645120 720896 32768
21 677888 753664 32768
22 710656 786432 32768
23 743424 821248 819199 32768
24 776192 854016 30720
25 806912 886784 884735 32768
26 839680 919552 32768
27 872448 952320 32768
28 905216 985088 32768
29 937984 1017856 30720
30 968704 1081344 1048575 32768
31 1001472 1114112 32768
32 1034240 1146880 32768
33 1067008 1179648 32768
34 1099776 1212416 32768
35 1132544 1245184 32768
36 1165312 1277952 32768
37 1198080 1310720 32768
38 1230848 1343488 32768
39 1263616 1376256 32768
40 1296384 1409024 32768
41 1329152 1441792 32768
42 1361920 1474560 32768
43 1394688 1507328 32768
44 1427456 1540096 32768
45 1460224 1607680 1572863 32768
46 1492992 1640448 32768
47 1525760 1673216 32768
48 1558528 1705984 32768
49 1591296 1738752 32768
50 1624064 1771520 32768
51 1656832 1804288 32768
52 1689600 1837056 32768
53 1722368 1869824 32768
54 1755136 1902592 32768
55 1787904 1935360 32768
56 1820672 1968128 32768
57 1853440 2000896 32768
58 1886208 2033664 32768
59 1918976 2066432 30720
60 1949696 2129920 2097151 32768
61 1982464 2162688 32768
62 2015232 2195456 32768
63 2048000 2228224 32768
64 2080768 2260992 32768
65 2113536 2293760 32768
66 2146304 2326528 32768
67 2179072 2359296 32768
68 2211840 2392064 32768
69 2244608 2424832 32768
70 2277376 2457600 32768
71 2310144 2490368 32768
72 2342912 2523136 32768
73 2375680 2555904 32768
74 2408448 2588672 32768
75 2441216 2656256 2621439 32768
76 2473984 2689024 32768
77 2506752 2721792 32768
78 2539520 2754560 16384
/r1/bigfile: 12 extents found
p63:/r1#

=== USE OF -o nodelalloc WITH SOFTWARE RAID-0 (SPEED IMPROVEMENT)

p63:/r1# mount /dev/md0 -o data=writeback,nobarrier,nodelalloc /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 29.5299 s, 364 MB/s
0.02user 28.95system 0:29.56elapsed 98%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (3major+493minor)pagefaults 0swaps
p63:/r1#

While it does help, I have not been able to get > 400MiB/s, it stops at
roughly 350-360MiB/s.

=== FIRST ATTEMPT AT USING BLKTRACE

Following these docs:
http://git.kernel.org/?p=linux/kernel/git/axboe/blktrace.git;a=blob;f=README
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug
http://www.cse.unsw.edu.au/~aaronc/iosched/doc/blktrace.html

Options required in the kernel:

Kernel hacking:
| | [*] Debug Filesystem | |

Then the BLK_IO_TRACE (it has moved from where the old docs say to go)
Kernel Hacking:
| | [ ] Tracers ---> | |
| | [*] Support for tracing block IO actions | |

Compile new kernel, reboot.

New kernel configuration used (only enabled the options shown above)
http://home.comcast.net/~jpiszcz/20100228/config-2.6.33-blktrace.txt

Next step, create a fresh filesystem for the trace event:

p63:~# /usr/bin/time mkfs.ext4 /dev/md0
< .. >
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done

Reboot to new kernel.

Per:
http://pdfedit.petricek.net/bt/file_download.php?file_id=17&type=bug

Mount the debug filesystem/make sure it iss mounted:

p63:~# mount -t debugfs debugfs /sys/kernel/debug
mount: debugfs already mounted or /sys/kernel/debug busy
mount: according to mtab, debugfs is already mounted on /sys/kernel/debug
p63:~#

Then follow instructions on page 14 from:
http://github.com/znmeb/linux_perf_viz/raw/master/blktrace-howto/blktrace-howto.pdf

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113

p63:/dev/shm/client# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!

Mount filesystem with -o data=writeback,nobarrier, run test blktrace1.

p63:~# mount -o data=writeback,nobarrier /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 35.6317 s, 301 MB/s
0.03user 19.41system 0:35.67elapsed 54%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (2major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 1548634 events, 72593 KiB data
CPU 1: 1009268 events, 47310 KiB data
Total: 2557902 events (dropped 0), 119902 KiB data
p63:/dev/shm/server# ls

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
CPU 0: 1548634 events, 72593 KiB data
CPU 1: 1009268 events, 47310 KiB data
Total: 2557902 events (dropped 0), 119902 KiB data

From this test, the following resulted:
# du -sh *
56K 192.168.168.113-2010-02-28-13:10:48
118M 192.168.168.113-2010-02-28-13:14:00

Let this trace be called blktrace1.

p63:/dev/shm/server# du -sh blktrace1/*
56K blktrace1/192.168.168.113-2010-02-28-13:10:48
118M blktrace1/192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server#

Mount with -o data=writeback,nobarrier,nodelalloc, run test blktrace2.

p63:~# umount /r1
p63:~# mount -o data=writeback,nobarrier,nodelalloc /dev/md0 /r1
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 30.6692 s, 350 MB/s
0.03user 29.55system 0:30.70elapsed 96%CPU (0avgtext+0avgdata 7312maxresident)k
0inputs+0outputs (1major+495minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 50056 events, 2347 KiB data
CPU 1: 2478242 events, 116168 KiB data
Total: 2528298 events (dropped 0), 118515 KiB data

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
^C=== md0 ===
CPU 0: 50056 events, 2347 KiB data
CPU 1: 2478242 events, 116168 KiB data
Total: 2528298 events (dropped 0), 118515 KiB data
#

p63:/dev/shm/server# du -sh 192.168.168.113-2010-02-28-13\:17\:22/*
2.4M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.0
114M 192.168.168.113-2010-02-28-13:17:22/md0.blktrace.1

This is blktrace2.

One more time (blktrace3) with ordered.

p63:~# mount -o nobarrier /dev/md0 /r1
p63:~# dmesg | tail -n 2
[ 2788.928806] EXT4-fs (md0): barriers disabled
[ 2789.340573] EXT4-fs (md0): mounted filesystem with ordered data mode
p63:~# cd /r1
p63:/r1# /usr/bin/time dd if=/dev/zero of=bigfile bs=1M count=10240
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB) copied, 36.2893 s, 296 MB/s
0.04user 19.29system 0:36.32elapsed 53%CPU (0avgtext+0avgdata 7296maxresident)k
0inputs+0outputs (1major+494minor)pagefaults 0swaps
p63:/r1# rm bigfile
p63:/r1# sync
p63:/r1# cd
p63:~# umount /r1
p63:~#

SERVER PROCESS:

p63:/dev/shm/server# blktrace -l
server: waiting for connections...
server: connection from 192.168.168.113
server: end of run for 192.168.168.113:md0
=== md0 ===
CPU 0: 1587087 events, 74395 KiB data
CPU 1: 970979 events, 45515 KiB data
Total: 2558066 events (dropped 0), 119910 KiB data
p63:/dev/shm/server#

CLIENT PROCESS:

# blktrace -h 192.168.168.113 /dev/md0
blktrace: connecting to 192.168.168.113
blktrace: connected!
=== md0 ===
CPU 0: 1587087 events, 74395 KiB data
CPU 1: 970979 events, 45515 KiB data
Total: 2558066 events (dropped 0), 119910 KiB data
#

TRACE OUTPUT TOTAL AND SUMMARY:


p63:~/results-20100228# du -sh *
570M blktrace1 => -o data=writeback,nobarrier
570M blktrace1-redo => -o data=writeback,nobarrier
563M blktrace2 => -o data=writeback,nobarrier,nodelalloc
570M blktrace3 => -o data=nobarrier
4.0K script
p63:~/results-20100228#

USING SCRIPT ON PAGE 24/30:

Running post-process.sh for each trace: blktrace{1,2,3}, the script itself
from page 24/30:

# cat /root/post-process.sh
#! /bin/bash
blkrawverify md0 # check data for errors
blkparse -d md0.bin -i md0 > md0.blkparse # merged binary, parsed
btt -i md0.bin --all-data > md0.btt # basic btt report
# now the whole enchilada!
btt -i md0.bin -o md0x --all-data --easy-parse-avgs \
--iostat=md0x.iostat \
--per-io-dump=md0x.pid \
--q2d-latencies=md0x \
--d2c-latencies=md0x \
--q2c-latencies=md0x \
--dump-blocknos=md0x_dbn \
--active-queue-depth=md0x \
--unplug-hist=md0x_uph \
--seeks=seeks \
--seeks-per-second=sps \
--per-io-trees=md0x_pit \
> md0x.btt # md0x.btt is empty

#

Before running any tests, backup raw data:

p63:/dev/shm# tar cf /root/server.tar server
p63:/dev/shm#

For each directory, run post-process:

blktrace1: (I must have waited too long in between steps here so it made two)

p63:/dev/shm/server/blktrace1# ls -1
192.168.168.113-2010-02-28-13:10:48
192.168.168.113-2010-02-28-13:14:00
p63:/dev/shm/server/blktrace1# cd *48
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:10:48# cd ../*00
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1/192.168.168.113-2010-02-28-13:14:00#

I will make another blktrace1 and be faster this time so all data results
are of the same type, it is called blktrace1-redo:

blktrace1-redo:

p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace1-redo/192.168.168.113-2010-02-28-13:35:45#

blktrace2:

p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace2/192.168.168.113-2010-02-28-13:17:22#

blktrace3:

p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29# /root/post-process.sh
Verifying md0
CPU 0
CPU 1
Wrote output to md0.verify.out
p63:/dev/shm/server/blktrace3/192.168.168.113-2010-02-28-13:31:29#

------------

=== FINAL RESULTS


p63:~/results-20100228# du -sh */*
216K blktrace1/192.168.168.113-2010-02-28-13:10:48
570M blktrace1/192.168.168.113-2010-02-28-13:14:00
570M blktrace1-redo/192.168.168.113-2010-02-28-13:35:45
563M blktrace2/192.168.168.113-2010-02-28-13:17:22
570M blktrace3/192.168.168.113-2010-02-28-13:31:29
4.0K script/post-process.sh
p63:~/results-20100228#

I used 7zip to compress the results because it offers the best compression
ratio of any other utility, including the latest 'xz' utility:
http://fixunix.com/kernel/238089-response-kernel-compression-e-mail-few-months-ago.html

$ xz -9 linux-2.6.16.17.tar

$ du -sk * | sort -n
32392 linux-2.6.16.17.tar.7z
32404 linux-2.6.16.17.tar.xz
33520 linux-2.6.16.17.tar.lzma
33760 linux-2.6.16.17.tar.rar
38064 linux-2.6.16.17.tar.rz
39472 linux-2.6.16.17.tar.szip
39520 linux-2.6.16.17.tar.bz
39936 linux-2.6.16.17.tar.bz2
40000 linux-2.6.16.17.tar.bicom
40656 linux-2.6.16.17.tar.sit
47664 linux-2.6.16.17.tar.lha
49968 linux-2.6.16.17.tar.dzip
50000 linux-2.6.16.17.tar.gz
51344 linux-2.6.16.17.tar.arj
57552 linux-2.6.16.17.tar.lzo
57984 linux-2.6.16.17.tar.F
81136 linux-2.6.16.17.tar.Z
94544 linux-2.6.16.17.tar.zoo
101216 linux-2.6.16.17.tar.arc
228608 linux-2.6.16.17.tar

=== COMPRESSION RESULTS:

-rw-r--r-- 1 abc users 155M 2010-02-28 09:02 results-20100228.tar.7z
-rw-r--r-- 1 abc users 290M 2010-02-28 08:42 results-20100228.tar.bz2
-rw-r--r-- 1 abc users 2.3G 2010-02-28 08:42 results-20100228.tar

=== LOCATION:

http://liquidswords.org/~war/results-20100228.tar.7z
wget http://liquidswords.org/~war/results-20100228.tar.7z

=== MD5 CHECKSUM:

$ md5sum *
1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z
35793b283edf5c0f38738276812aad52 results-20100228.tar

=== VERIFICATION: MAKE SURE IT WORKS FOR OTHERS:

$ wget http://liquidswords.org/~war/results-20100228.tar.7z
--2010-02-28 09:48:36-- http://liquidswords.org/~war/results-20100228.tar.7z
Resolving liquidswords.org... 71.6.165.232
Connecting to liquidswords.org|71.6.165.232|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 161814574 (154M) [application/x-tar]
Saving to: "results-20100228.tar.7z"

100%[======================================>] 161,814,574 2.00M/s in 69s

$

$ md5sum *7z
1db01600ce2700854b4bafcfd68f7028 results-20100228.tar.7z

CORRECT

$ 7z x results-20100228.tar.7z

7-Zip 4.58 beta Copyright (c) 1999-2008 Igor Pavlov 2008-05-05
p7zip Version 4.58 (locale=en_US,Utf16=on,HugeFiles=on,8 CPUs)

Processing archive: results-20100228.tar.7z

Extracting results-20100228.tar

Everything is Ok

Size: 2382561280
Compressed: 161814574

$ md5sum *tar
35793b283edf5c0f38738276812aad52 results-20100228.tar

CORRECT

Again, the trace information details:

p63:~/results-20100228# du -sh *
570M blktrace1 => -o data=writeback,nobarrier
570M blktrace1-redo => -o data=writeback,nobarrier
563M blktrace2 => -o data=writeback,nobarrier,nodelalloc
570M blktrace3 => -o data=nobarrier
4.0K script
p63:~/results-20100228#

Let me know if you need anything else, thanks.

Justin.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Dave Chinner on
On Sat, Feb 27, 2010 at 06:36:37AM -0500, Justin Piszcz wrote:
> Besides large sequential I/O, ext4 seems to be MUCH faster than XFS when
> working with many small files.
>
> EXT4
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.18user 2.43system 0:02.86elapsed 91%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+971minor)pagefaults 0swaps
> linux-2.6.33 linux-2.6.33.tar
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.02user 0.98system 0:01.03elapsed 97%CPU (0avgtext+0avgdata 5216maxresident)k
> 0inputs+0outputs (0major+865minor)pagefaults 0swaps
>
> XFS
>
> p63:/r1# sync; /usr/bin/time bash -c 'tar xf linux-2.6.33.tar; sync'
> 0.20user 2.62system 1:03.90elapsed 4%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+970minor)pagefaults 0swaps
> p63:/r1# sync; /usr/bin/time bash -c 'rm -rf linux-2.6.33; sync'
> 0.03user 2.02system 0:29.04elapsed 7%CPU (0avgtext+0avgdata 5200maxresident)k
> 0inputs+0outputs (0major+864minor)pagefaults 0swaps

Mount XFS with "-o logbsize=262144". Metadata intensive workloads on
XFS are log IO bound, so larger log buffer size makes a big
difference. On 2.6.33 kernels on a single 15krpm SCSI drive I've
been getting ~21s for the untar, and 8s for the rm -rf with that
option set. Still slower than ext4, but nowhere near as bad.

> So I guess that's the tradeoff, for massive I/O you should use XFS, else,
> use EXT4?

I wouldn't consider writing an 11GB file "massive IO", nor would I
consider an 600MB/s massive, either, since you can get that out of a
sub-$10k server these days....

> I still would like to know however, why 350MiB/s seems to be the maximum
> performance I can get from two different md raids (that easily do 600MiB/s
> with XFS).

Check whether the dd process on ext4 is CPU bound....

Cheers,

Dave.
--
Dave Chinner
david(a)fromorbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/