From: Arcady Genkin on
On Mon, Jul 12, 2010 at 14:54, Aaron Toponce <aaron.toponce(a)gmail.com> wrote:
> Can you provide the commands from start to finish when building the volume?
>
> fdisk ...
> mdadm ...
> pvcreate ...
> vgcreate ...
> lvcreate ...

Hi, Aaron: I already provided all of the above commands in earlier
messages (except for fdisk, since we are giving MD the whole disks
rather than partitions). I'll repeat them here for your convenience:

Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
--layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
--chunk=1024 /dev/sdX /dev/sdY /dev/sdZ
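
In case it helps, the whole sequence can be scripted. A rough sketch
(the triplets.txt file and the loop below are hypothetical; the actual
drive-to-triplet assignment is omitted here):

#!/bin/sh
# triplets.txt lists one "sdX sdY sdZ" group per line, in md0..md9 order.
N=0
while read X Y Z; do
    mdadm --create /dev/md$N -v --raid-devices=3 --level=raid10 \
        --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
        --chunk=1024 /dev/$X /dev/$Y /dev/$Z
    N=$((N + 1))
done < triplets.txt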

Then the big stripe:
mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
--metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}
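
A quick sanity check after assembling the stripe (plain mdadm/procfs
queries, nothing specific to our setup):

cat /proc/mdstat
mdadm --detail /dev/md10   # should show level raid0, 10 devices, 1024k chunk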

Then the LVM business:
pvcreate /dev/md10
vgcreate vg0 /dev/md10
lvcreate -l 102389 vg0

Note that no file system has been created on top of LVM at this
point; I ran the test by simply running dd against /dev/vg0/lvol0.
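
The exact dd invocation is not important; for reference, a sequential
read test of the LV looks roughly like this (the direction, block size
and count here are illustrative, not the exact figures I used):

dd if=/dev/vg0/lvol0 of=/dev/null bs=1M count=10240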

> My experience has been that LVM will introduce about a 1-2% performance
> hit compared to not using it

This is what we were expecting, so that's encouraging.

> On a side note, I've never seen any reason to increase or decrease the
> chunk size with software RAID. However, you may want to match your chunk
> size with '-c' for 'lvcreate'.

We have tested a variety of chunk sizes (from 64K to 4MB) with
bonnie++ and found that 1MB chunks worked best for our usage, which is
a general-purpose NFS server, so mainly small random reads. In this
scenario it's best to tune the chunk size to increase the probability
that a small read from the stripe results in only one read from one
disk. If the chunk size is too small, a 1KB read has a fairly high
chance of straddling two chunks and thus requiring two I/Os to service
instead of one (and, most likely, two drive head seeks instead of
one). Modern commodity drives can do only about 100-120 seeks per
second. But this is a side note to your side note. :))
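
To put a rough number on that side note anyway (assuming byte-granular,
uniformly random starting offsets, which is a simplification): a 1KB
read straddles a chunk boundary with probability of roughly
1023/chunk_size, i.e. about 1.6% with 64KB chunks but only about 0.1%
with 1MB chunks, so the larger chunk keeps nearly every small read down
to a single disk and a single seek.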

From the man page for 'lvcreate' it seems that the -c option sets the
chunk size for something snapshot-related, so it should have no
bearing on our performance testing, which involved no snapshots. Am I
misreading the man page?

Thanks!
--
Arcady Genkin


From: Aaron Toponce on
On 7/12/2010 1:45 PM, Arcady Genkin wrote:
> Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
> --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
> --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ
>
> Then the big stripe:
> mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
> --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

I must admit that I haven't seen a software RAID implementation where
you create multiple devices from the same set of disks and then stripe
across those devices. As such, when using LVM, I'm not exactly sure how
the kernel will handle that, mostly whether it will see the appropriate
amount of disk and which physical extents it will use to place the data.
So for me, this is uncharted territory.

But your commands look sound. I might suggest changing the default PE
size from 4MB to 1MB; that might help, and it's worth testing anyway.
The PE size can be set with 'vgcreate -s 1M'.
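
That is, something along these lines, reusing the names from your
earlier commands (untested on your layout, so treat it as a sketch):

vgcreate -s 1M vg0 /dev/md10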

However, do you really want --bitmap with your mdadm command? I
understand the benefits, but using 'internal' does come with a
performance hit.
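
If you want to measure the bitmap's cost, I believe it can be toggled
without rebuilding the array (worth trying on a single triplet first):

mdadm --grow /dev/md0 --bitmap=none       # drop the internal bitmap
mdadm --grow /dev/md0 --bitmap=internal --bitmap-chunk=2048   # put it back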

> From the man page to 'lvcreate' it seems that the -c option sets the
> chunk size for something snapshot-related, so it should have no
> bearing in our performance testing, which involved no snapshots. Am I
> misreading the man page?

Ah yes, you are correct. I should probably pull up the man page before
replying. :)


--
. O . O . O . . O O . . . O .
. . O . O O O . O . O O . . O
O O O . O . . O O O O . O O O

From: Mike Bird on
On Mon July 12 2010 12:45:57 Arcady Genkin wrote:
> Creating the ten 3-way RAID1 triplets - for N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
> --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
> --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

RAID 10 with three devices?

--Mike Bird


From: Aaron Toponce on
On 7/12/2010 4:13 PM, Stan Hoeppner wrote:
> Is that a typo, or are you turning those 3 disk mdadm sets into RAID10 as
> shown above, instead of the 3-way mirror sets you stated previously? RAID 10
> requires a minimum of 4 disks, you have 3. Something isn't right here...

Incorrect. The Linux RAID implementation can do level 10 across 3 disks.
In fact, it can even do it across 2 disks.

http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10
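
For example, a two-disk "far 2" layout is perfectly legal (the device
names here are just placeholders):

mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 \
    /dev/sda1 /dev/sdb1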

--
. O . O . O . . O O . . . O .
. . O . O O O . O . O O . . O
O O O . O . . O O O O . O O O

From: Stan Hoeppner on
Arcady Genkin put forth on 7/12/2010 11:52 AM:
> On Mon, Jul 12, 2010 at 02:05, Stan Hoeppner <stan(a)hardwarefreak.com> wrote:
>
>> lvcreate -i 10 -I [stripe_size] -l 102389 vg0
>>
>> I believe you're losing 10x performance because you have a 10 "disk" mdadm
>> stripe but you didn't inform lvcreate about this fact.
>
> Hi, Stan:
>
> I believe that the -i and -I options are for using *LVM* to do the
> striping, am I wrong?

If this were the case, lvcreate would require the set of physical or pseudo
(mdadm) device IDs to stripe across, wouldn't it? There are no options in
lvcreate to specify physical or pseudo devices. The only input to lvcreate is
a volume group ID. Therefore, lvcreate is ignorant of the physical devices
underlying it, is it not?

> In our case (when LVM sits on top of one RAID0
> MD stripe) the option -i does not seem to make sense:
>
> test4:~# lvcreate -i 10 -I 1024 -l 102380 vg0
> Number of stripes (10) must not exceed number of physical volumes (1)

It makes sense once you accept that lvcreate is ignorant of the
underlying disk device count/configuration. Once you accept that, you
will see that the -i option is what lets you tell lvcreate that there
are, in your case, 10 devices underlying it across which you want the
data striped. I believe the -i option exists merely to educate lvcreate
about the underlying device structure.

> My understanding is that LVM should be agnostic of what's underlying
> it as the physical storage, so it should treat the MD stripe as one
> large disk, and thus let the MD device to handle the load balancing
> (which it seems to be doing fine).

If lvcreate is agnostic of the underlying structure, why does it have stripe
width and stripe size options at all? As a parallel example, filesystems
such as XFS are ignorant of the underlying disk structure as well.
mkfs.xfs has no fewer than 4 sub-options to optimize its performance atop RAID
stripes. One of its options, sw, specifies stripe width, which is the number
of physical or logical devices in the RAID stripe. In your case, if you use
XFS, this would be "-d sw=10". These options in lvcreate serve the same
function as those in mkfs.xfs: to optimize performance atop a RAID stripe.
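
For example, if you eventually put XFS on that LV and keep the 1MB mdadm
chunk, something like this (using your earlier lvol0 name) would tell the
filesystem about the geometry:

mkfs.xfs -d su=1024k,sw=10 /dev/vg0/lvol0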

> Besides, the speed we are getting from the LVM volume is less than
> half of what an individual component of the RAID10 stripe can do. Even
> if we assume that LVM somehow manages to distribute its data so that it
> always hits only one physical disk (a disk triplet in our case), there
> would still be the question of why it is doing it *that* slowly. It's 57
> MB/s vs. the 134 MB/s that an individual triplet can do:

Forget comparing performance to one of your single mdadm mirror sets. What's
key here, and why I suggested "lvcreate -i 10 .." to begin with, is the fact
that your lvm performance is almost exactly 10 times lower than the underlying
mdadm device, which has exactly 10 physical stripes. Isn't that more than
just a bit coincidental? The 10x drop only occurs when talking to the lvm
device. Put on your Sherlock Holmes hat for a minute.

> We are using chunk size of 1024 (i.e. 1MB) with the MD devices. For
> the record, we used the following commands to create the md devices:
>
> For N in 0 through 9:
> mdadm --create /dev/mdN -v --raid-devices=3 --level=raid10 \
> --layout=n3 --metadata=0 --bitmap=internal --bitmap-chunk=2048 \
> --chunk=1024 /dev/sdX /dev/sdY /dev/sdZ

Is that a typo, or are you turning those 3 disk mdadm sets into RAID10 as
shown above, instead of the 3-way mirror sets you stated previously? RAID 10
requires a minimum of 4 disks, you have 3. Something isn't right here...

> Then the big stripe:
> mdadm --create /dev/md10 -v --raid-devices=10 --level=stripe \
> --metadata=1.0 --chunk=1024 /dev/md{0,5,1,6,2,7,3,8,4,9}

And I'm pretty sure this is the stripe lvcreate needs to know about to fix the
10x performance drop issue. Create a new lvm test volume with the lvcreate
options I've mentioned, and see how it performs against the current 400GB test
volume that's running slow.

--
Stan

