From: David Brown on
Rahul wrote:
> David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in
> news:BtOdnakm8taiUsnWnZ2dnUVZ8qOdnZ2d(a)lyse.net:
>
> Thanks David!
>
>> Rahul wrote:
>>
>> LVM is for logical volume management, mdadm is for administering
>> multiple disk setups (i.e., software raid). LVM /can/ do basic
>
>> striping, in that if you have two physical volumes allocated to the
>> same volume group, then a logical volume can be striped across the two
>> physical volumes. As another poster has said, you won't notice a
>> performance difference between striping via LVM or mdadm. But you
>
> Will putting LVM on top of mdadm slow things down? Or does LVM not have a
> significant performance penalty?

LVM does have a performance penalty, but it is not normally significant.
If you have a number of logical volumes which you then grow several
times, you end up with the physical extents backing those volumes
rather scattered across the disk(s), which may impact
performance for streaming or large files. The flexibility you get is
normally worth the slight cost (IMHO).
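
If you ever want to see how scattered a logical volume has become, LVM
can show you the mapping - something like the following (the volume
group and LV names are just examples from an imaginary setup):

  # Show how the extents of one LV map onto physical volumes
  lvdisplay --maps /dev/vg_home/lv_home

  # Or a per-segment summary for every LV in the volume group
  lvs --segments -o lv_name,seg_start,seg_size,devices vg_home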

>> My recommendation is that you use mdadm to create a raid from the raw
>> drives or partitions on the drives, and if you want the volume
>> management features of LVM (I find it very useful), put LVM on top of
>> mdadm raid.
>
> This is exactly what I was trying to do. But LVM asks "stripe" or "no
> stripe". That I wasn't sure about.
>
>
>> As for the type of raid to use, that depends on the number of disks
>> you have and the redundancy you want. raid5 is well-known to be
>> slower for writing, especially for smaller writes, and it can be risky
>> for large disks in critical applications
>
> Maybe if I explain my situation you can have some more comments.
>
> I have 3 physical "storage boxes" (MD-1000's from Dell). Each takes 15
> SAS 15k drives of 300 GB each. i.e. I have a total of 45 drives of 300 GB
> each. Redundancy is important but not critical. Performance was more
> important.
>
> My original plan was to split each box into two RAID5 arrays of 7 disks
> each and leave 1 as a hot spare. Thus I get 6 RAID5 arrays in all. They
> are visible as /dev/sdb /dev/sdc etc. but I want to mount a single /home
> on it. That's where I introduced LVM. But then LVM again introduces a
> striping option. Should I be striping or not?
>

Don't do any striping with LVM - set up your raid arrays (with hardware
raid and/or mdadm) until you have a single "disk", and put LVM on that.
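
As a rough sketch of that layering (device names, sizes and the raid
level are only examples - adjust them to your own hardware):

  # One md array built from the raw disks (here a 4-disk raid10)
  mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[bcde]

  # LVM on top of the single md device
  pvcreate /dev/md0
  vgcreate vg_home /dev/md0
  lvcreate -L 500G -n lv_home vg_home
  mkfs.ext3 /dev/vg_home/lv_home

You can then grow lv_home later, as long as the volume group still has
free space.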

> That's where I am confused about what my best option is. It's hard to
> balance redundancy, performance and disk capacity.
>
>
> Any other creative options that come to mind?
>
>
>
>> (since rebuilding takes so
>> long, and wears the other disks). Mirroring is safer, and mdadm can
>> happily do a raid10 (roughly a stripe of mirrors) on any number of
>> disks for high speed and mirrored redundancy.
>>
>> Booting from raids is complicated, but not as difficult as suggested
>
> Luckily I don't have to go down that path; I have a separate drive to
> boot from.
>
From: David Brown on
Aragorn wrote:
> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
> identifying as Rahul wrote...
>
>> Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj1gta$2hp$5
>> @news.eternal-september.org:
>>
>> Thanks for the great explanation!
>
> Glad you appreciated it. ;-)
>

Unfortunately, there seem to me to be a number of misconceptions in
this post. I freely admit to having more theoretical knowledge from
trawling the net, reading mdadm documentation, etc., than personal
practical experience - so anyone reading this will have to judge for
themselves whether they think I am right, or Aragorn is right. Either
way, I hope to give you some things to think about.

>>> Writing to a RAID 5 is slower than writing to a single disk because
>>> with each write, the parity block must be updated, which means
>>> calculation of the parity data and writing that parity data to the
>>> pertaining disk.
>> This is where I get confused. Is writing to a RAID5 slower than a
>> single disk irrespective of how many disks I throw at the RAID5?
>
> Normally, yes, although it won't be *much* slower. But there is some
> overhead in the calculation of the parity, yes. This is why RAID 6 is
> even slower during writes: it stores *two* parity blocks per data
> segment (and as such, it requires a minimum of 4 disks).
>

Writing to RAID5 (or RAID6) /may/ be slower than writing to a single
disk - or it may be much faster (closer to RAID0 speeds). The actual
parity calculations are negligible with modern hardware, whether it be
the host CPU or a hardware raid card. What takes time is if existing
data has to be read in from the disks in order to calculate the parity -
this causes a definite delay. If you are writing a whole stripe, the
parity can be calculated directly and the write runs at roughly (N-1)
times single-disk speed, as each block in the stripe can be written in
parallel. The same applies if the rest of the stripe is already in the
cache from an earlier access.

Thus random writes are slow on RAID5 (and RAID6), but larger block
writes are full speed.
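
To make that concrete for a small write touching a single data block:
md can use new_parity = old_parity XOR old_data XOR new_data, so it
must first read the old data block and the old parity block, then write
the new data and the new parity - two reads plus two writes (plus the
extra rotational latency) for what is logically one write. A
full-stripe write needs none of those reads.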

There can also be significant differences between the speed of mdadm
software RAID5, and hardware RAID5. With hardware raid, the card can
report a small write as "finished" before it has read in the block and
written out the data and new parity. This is safe for good hardware
with battery backup of its buffers, and gives fast writes (as far as the
host is concerned) even for small writes. Software raid5 cannot do
this. But on the other hand, software raid5 can take advantage of large
system memories for cache, and is thus far more likely to have the
required stripe data already in its cache (especially for metadata and
directory areas of the file system, which are commonly accessed but have
small writes).
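
If you go the mdadm route, the size of that cache is tunable per array
(the md device name below is just an example):

  # Check, then enlarge, the raid5/6 stripe cache (in pages per device)
  cat /sys/block/md0/md/stripe_cache_size
  echo 4096 > /sys/block/md0/md/stripe_cache_size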

This is perhaps also a good time to mention one of the risks of raid5
(and raid6) - the RAID5 Write Hole. When you are writing a stripe to
the disk, the system must write at least two blocks - data and the
updated parity block. These two writes cannot be done atomically - if
you get a system failure at this point, the data and parity may be left
inconsistent and the whole stripe effectively becomes silent garbage.
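
mdadm cannot close the write hole completely, but a write-intent bitmap
at least limits how much has to be resynchronised (and re-checked for
consistency) after a crash, at some cost in write speed. For example:

  # Add an internal write-intent bitmap to an existing array
  mdadm --grow /dev/md0 --bitmap=internal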

>> I currently have a 7-disk RAID5. Will writing to this be slower than a
>> single disk?
>
> A little, yes. But reading from it will be significantly faster.
>

Not necessarily - writing will be slower if you do lots of small random
writes, but much faster if you write large blocks.

Also remember that with a 7-disk array under heavy use, you /will/ see a
disk failure at some point. Degraded performance of raid 5 is very
poor, and rebuilds are slow. Some people believe that the chance of a
second disk failure occurring during a rebuild is so large (rebuilds are
particularly intensive for the other disks) that raid 5 should be
considered unsafe for large arrays. Raid 6 is better since it can
survive a second failure, but mirrored raids are safer still.

>> Isn't the parity calculation a fairly fast process especially if one
>> has a hardware based card?
>

A decent host processor will do the parity calculations /much/ faster
than the raid processor on most hardware cards. But the calculations
themselves are not the cause of the latency, it's the extra reads that
take time.

> Ah, but with a hardware-based RAID things are different. The actual
> writing process will still be somewhat slower than writing to a single
> disk, but considering that everything is taken care of by the hardware
> and that such adapters have a very large cache - often backed by a
> battery - this will not really have a noticeable performance impact.
>
> With hardware RAID, the kernel treats the entire array as a single disk
> and will simply write to the array. As far as the operating system is
> concerned, that's where it ends, and the array takes care of everything
> else from there, in a delayed fashion, but this is not something you
> notice as your actual CPU(s) are freed up again as soon as the data is
> transferred to the memory of the RAID adapter.
>

True, but see above for more information.

> It is however advised if you have a hardware RAID adapter to disable the
> write barriers. Write barriers are where the kernel forces the disk
> drives to flush their caches. Since a hardware RAID adapter must be in
> total control of the disk drives and has cache memory of its own, the
> operating system should never force the disk drives to flush their
> cache.
>

Make sure your raid controller has batteries, and that the whole system
is on a UPS!

>> And then if the write gets split into 6 parts shouldn't that speed up
>> the process since each disk is writing only 1/6th of the chunk?
>
> Yes, but the data has to be split up first - which is of course a lot
> faster on hardware RAID since it is done by a dedicated processor on
> the adapter itself then - and the parity has to be calculated. This is
> overhead which you do not have with a single disk.
>

Nonsense - a host CPU is perfectly capable of splitting a stripe into
its blocks in a fraction of a microsecond. It is also much faster at
doing the parity calculations - the host CPU typically runs at least ten
times as fast as the CPU or ASIC on the raid card. And again, the
splitting and parity calculations are not the bottleneck, it's the
latency of the reads needed to calculate the new parity that takes time.

Where a hardware raid card will win is when IO bandwidth between the
host and the disks is the bottleneck, which can be the case for large,
fast arrays. In particular, if you have a
mirror raid with software raid, then the host CPU has to write out all
the data twice - with hardware raid, it's the raid card that doubles up
the data.

There are times when top-range hardware raid cards will beat software
raid on speed, but not often - especially with a fast multi-core modern
host cpu. It does, however, depend highly on your raid setup and the
type of load you have - there are no set answers here.

Software raid does of course have a reliability weak point - if your OS
crashes in the middle of a write, you have a bigger chance of hitting
the raid 5 write hole than you would with a hardware raid card with a
battery.

>>> In this case, you don't have any redundancy. Writing to the
>>> stripeset is faster than writing to a single disk, and the same
>>> applies for reading. It's not a 2:1 performance boost due to the
>>> overhead for splitting the data for writes and re-assembling it upon
>>> reads, but there is a significant performance improvement, and
>>> especially so if you use more than two disks.
> Why doesn't a similar boost come out of a RAID5 with a large number of
>> disks? Merely because of the parity calculation overhead?
>
> Yes, that is the main difference. Like I said, RAID 6 is even slower
> during writes (and has equal performance during reads).
>

Assuming (again!) that you are doing a small write and the old data and
parity blocks are not in the cache, then you have the latency of the
reads (two reads for a single block write on raid 5, and three reads for
raid 6).

For reading, especially for large reads, raid 5 is approximately like
N-1 raid 0 drives, while raid 6 is like N-2 raid 0.

>>> There are however a few considerations you should take into account
>>> with both of these approaches, i.e. that you should not put the
>>> filesystem which holds the kernels and /initrd/ - and preferably not
>>> the root filesystem either[1] - on a stripe, because the bootloader
>>> recognizes
>> Luckily that is not needed. I have a separate drive to boot from. The
>> RAID is intended only for user /home dirs.
>
> Ah but wait a minute. As I understand it, you have a hardware RAID
> adapter card. In that case - assuming that it is a real hardware RAID
> adapter and not one of those on-board fake-RAID things - it doesn't
> matter, because to the operating system (and even to the BIOS), the
> entire array will be seen as a single disk. So then it is perfectly
> possible to have your bootloader, your "/boot" and your "/" living on
> the RAID array. (I am doing that myself on one of my machines, which
> has two RAID 5 arrays of four disks each.)
>
> And in this case - i.e. if you have a hardware RAID array - then your
> original question regarding software RAID 0 versus striping via LVM is
> also answered, because hardware RAID will always be a bit faster than
> software RAID or striped LVM. Additionally, since you mention seven
> disks, you could even opt for RAID 10 or 51 and even have a "hot spare"
> or "standby spare". (Or you could use the extra disk as an individual,
> standalone disk.)
>
> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to
> another mirror - you could instead also use RAID 01, which is a stripe
> which is mirrored on another stripe. RAID 10 is better than RAID 01
> though - there's a good article on Wikipedia about it. RAID 10 or 01
> require four disks in total. Performance is very good for both reading
> and writing *and* you have redundancy.
>

Yes, wikipedia /does/ have some useful information about raid - it's
worth reading.

One thing you are missing here is that Linux mdadm raid 10 is very much
more flexible than just a "stripe of mirrors", which is the standard
raid 10. In particular, you can use any number of disks (from 2
upwards), you can have more than 2 copies of each block (at the cost of
disk space, obviously) for greater redundancy, and you can have a layout
that optimises the throughput for different loads.

For example, a "f2" md raid 10 layout gives you full raid 0 performance
for large reads while being at least as fast as other raids for writing
and random reads (and much faster than raid 5 for small random writes).
It is normally the fastest raid layout with redundancy - though at a 50%
cost in disk space.

<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>
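
Creating such a layout is straightforward - for example (disk names and
chunk size are only illustrative):

  # 4-disk md raid10 with the "far 2" layout and 512K chunks
  mdadm --create /dev/md0 --level=10 --layout=f2 --chunk=512 \
        --raid-devices=4 /dev/sd[bcde]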

Raid10 performance is also much less affected by a disk failure, and
rebuilds are faster and less stressful on the system. And a single hot
spare will cover all the disks - you don't need a spare per array.


However, while a "f2" md raid 10 is probably the fastest setup for
directly connected drives, this is not what you have. You will also
suffer from bandwidth issues if you try to do all the mirroring of all
45 drives in software. In your case, I would recommend raid 10 on each
box - 7 raid1 pairs striped together with a hot spare (assuming the
hardware supports a common hot spare). Your host then sees these three
disks, which you should stripe together with mdadm raid0 - there is no
need for redundancy here, as that is handled at a lower level. Put your
LVM physical volume on top of this if you want the flexibility of LVM -
if you don't need it, don't bother.
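
Something along these lines, assuming the three hardware raid10 arrays
show up as /dev/sdb, /dev/sdc and /dev/sdd (adjust to whatever your
controller actually presents):

  # Stripe the three hardware arrays together - no extra redundancy
  # is needed at this level
  mdadm --create /dev/md0 --level=0 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd

  # Then either mkfs directly on /dev/md0, or run
  # pvcreate/vgcreate/lvcreate on it if you want LVM in between.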


> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto
> another RAID 5. Or you could use RAID 15, which is a RAID 5 comprised
> of mirrors. RAID 51 and 15 require a minimum of six disks.
> (Similarly, there is RAID 61 and 16, which require a minimum of eight
> disks.)
>

As a minor point, mdadm raid 5 can work on two disks (and raid 6 on three
disks). Such a 2-disk raid 5 is not much use in a working system, but
can be convenient when setting things up or upgrading drives, as you can
add more drives to the mdadm raid 5 later on. It's just an example of
how much more flexible mdadm is than hardware raid solutions.
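
For example (device and partition names are only placeholders):

  # Start with a 2-disk raid5...
  mdadm --create /dev/md1 --level=5 --raid-devices=2 /dev/sdb1 /dev/sdc1

  # ...and later add a third disk and grow the array onto it
  mdadm /dev/md1 --add /dev/sdd1
  mdadm --grow /dev/md1 --raid-devices=3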

> There is of course a trade-off. Except for RAID 0, which isn't really
> RAID because it has no redundancy, all RAID solutions are expensive in
> diskspace, and how expensive exactly depends on the chosen RAID type.
> In RAID 1, RAID 10 or RAID 01 set-up, you lose 50% of your storage
> capacity.
>
> With RAID 5, your storage capacity is reduced by the capacity of one
> disk in the array, and with RAID 6 by the capacity of two disks in the
> array. So, with a single RAID 5 array comprised of seven disks without
> a standby or hot spare, your total storage capacity is that of six
> disks.
>
> And then there's the lost capacity of the hot spare or standby spare - a
> hot spare is spinning but otherwise unused until one of the other disks
> starts to fail, while a standby spare is spun down until one of the
> other disks fails. Upon such failure, the array will be automatically
> rebuilt using the parity blocks to write the missing data to the spare
> disk.
>

I have never heard of a distinction between a "hot spare" that is
spinning, and a "standby spare" that is not spinning. Given that spinup
takes a few seconds, and a rebuild often takes many hours, I can't see
you have much to gain by keeping a spare drive spinning. To my mind, a
"hot spare" is a drive that will be used automatically to replace a dead
drive.

An "offline spare" is an extra drive that is physically attached, but
not in use automatically - in the event of a failure, it can be manually
assigned to a raid set. This makes sense if you have several hardware
raid sets defined and want to share a single spare between them, but
the hardware cannot do that automatically (mdadm, of course, supports
such a setup with a shared hot spare).
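
With mdadm, the sharing is done by putting the arrays in the same
spare-group in mdadm.conf and letting the monitor run - roughly like
this (the UUIDs, array names and group name are placeholders, and the
config path varies between distributions):

  # /etc/mdadm.conf
  ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=shared
  ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=shared

  # mdadm --monitor will then move the spare to whichever array in the
  # group loses a disk
  mdadm --monitor --scan --daemonise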

> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
> during writes, but not really significantly faster during reads, and
> you would have the full storage capacity of all disks in the array, but
> there would be no redundancy at all. So, considering that you have
> seven disks, I think you really should consider building in redundancy.
> After all, with RAID 0, if a single disk in the array fails, then
> you'll have lost all of your data. A RAID 5 would upon failure of a
> single disk run slower, but at least you'd still have access to your
> data.
>
From: Rahul on
David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in
news:NKOdnXtWJIFpt8vWnZ2dnUVZ7radnZ2d(a)lyse.net:

>
> themselves whether they think I am right, or Aragorn is right. Either
> way, I hope to give you some things to think about.

An alternative viewpoint is always good!

> Thus random writes are slow on RAID5 (and RAID6), but larger block
> writes are full speed.

And if I did a RAID10 at hardware level (as you later suggest) I'd get
the speedup on random writes as well? (which are otherwise slow on a
RAID5?) What other way do I have to speed up random writes?

> There can also be significant differences between the speed of mdadm
> software RAID5, and hardware RAID5. With hardware raid, the card can
> report a small write as "finished" before it has read in the block and
> written out the data and new parity. This is safe for good hardware
> with battery backup of its buffers, and gives fast writes (as far as
> the host is concerned) even for small writes. Software raid5 cannot
> do this. But on the other hand, software raid5 can take advantage of
> large system memories for cache, and is thus far more likely to have
> the required stripe data already in its cache (especially for metadata
> and directory areas of the file system, which are commonly accessed
> but have small writes).

Yes, I do have a battery-backed cache on my hardware card. But from
the point you make above there's something to be said about a software
(mdadm or LVM) on top of hardware approach? This way I get the best of
both worlds? LVM / mdadm will serve out from RAM (I've 48 Gigs of it)
and speed up reads. Writes will be sped up by the cache on the
hardware card. Does this make sense?


> This is perhaps also a good time to mention one of the risks of raid5
> (and raid6) - the RAID5 Write Hole.

This risk is reduced by a battery backed-up cache, correct?

>
>>> I currently have a 7-disk RAID5. Will writing to this be slower than
>>> a single disk?
>>
>> A little, yes. But reading from it will be significantly faster.
>
> Not necessarily - writing will be slower if you do lots of small
> random writes, but much faster if you write large blocks.

And will the reads and large-sequential-writes be even faster if I did a
14 disk RAID5 instead of a 7-disk RAID5?

>
> Make sure your raid controller has batteries, and that the whole
> system is on an UPS!

Yes! Both.
>
> For reading, especially for large reads, raid 5 is approximately like
> N-1 raid 0 drives, while raid 6 is like N-2 raid 0.

Problem is I haven't seen a similar formula mentioned for writes. Neither
large nor small writes. What's an approximate design equation to use to
rate options?
>
> However, while a "f2" md raid 10 is probably the fastest setup for
> directly connected drives, this is not what you have. You will also
> suffer from bandwidth issues

Which bandwidth are we talking about? The CPU-to-controller?

>if you try to do all the mirroring of all
> 45 drives in software. In your case, I would recommend raid 10 on
> each box - 7 raid1 pairs striped together with a hot spare (assuming
> the hardware supports a common hot spare). Your host then sees these
> three disks, which you should stripe together with mdadm raid0 - there
> is no need for redundancy here, as that is handled at a lower level.
> Put your LVM physical volume on top of this if you want the
> flexibility of LVM - if you don't need it, don't bother.

Ah! Thanks! That's a creative solution I hadn't thought about.
>
> I have never heard of a distinction between a "hot spare" that is
> spinning, and a "standby spare" that is not spinning.

Me neither.

>
>> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5
>> during writes, but not really significantly faster during reads, and
>> you would have the full storage capacity of all disks in the array,
>> but there would be no redundancy at all. So, considering that you
>> have seven disks, I think you really should consider building in
>> redundancy. After all, with RAID 0, if a single disk in the array
>> fails, then you'll have lost all of your data. A RAID 5 would upon
>> failure of a single disk run slower, but at least you'd still have
>> access to your data.
>>

Or I could do the RAID10 that you suggest and stripe on top of three such
arrays using mdadm. I'm thinking about this very interesting option.
Thanks!


--
Rahul
From: unruh on
On 2010-01-19, Rahul <nospam(a)nospam.invalid> wrote:
> Aragorn <aragorn(a)chatfactory.invalid> wrote in
> news:hj52h6$lr7$2(a)news.eternal-september.org:
>
>>
>> I would personally not use all of them for "/home". You mention three
>> arrays, so I would suggest the following...:
>>
>> * First array:
>> - /boot
>> - /
>> - /usr
>> - /usr/local
>> - /opt
>> - an optional rescue/emergency root filesystem
>>
>> * Second array:
>> - /var
>> - /tmp (Note: you can also make this a /tmpfs/ instead.)
>> - /srv (Note: use at your own discretion.)
>>
>> * Third array:
>> - /home
>>
>
> Sorry, I should have clarified. For /boot, /usr, etc. I have a
> separate mirrored SAS drive. So those are taken care of. Besides, 15x
> 300GB would be too much storage for any of those trees.
>
> I have all 45 drives bought just to provide a high performance /home.
> The question is how best to configure them:
>
> 1. What RAID pattern?

Do you want speed or do you want size or do you want redundancy?

I have just set up raid0 (striped) across two partitions on two
disks (the disks are identical, and the partitioning of them is
identical). They are 500GB WD disks, 7200 rpm SATA. hdparm -t gives
about 82MB/s.

I used mdadm to set up a raid0 (first bringing in the raid0 module)
on two 450GB partitions, one on each of the drives, and mounted the
resultant /dev/md0, after formatting it as ext3, onto /local.
I then ran
cat /dev/zero > /local/a
for 12 sec, and a was then a 2GB file, so writing to that array (assuming
writing all zeros from cat does not produce some sort of sparse file) went
at about 160MB/s, i.e. twice as fast as reading from a single disk.
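
If you want a slightly more controlled version of that test, something
like the following writes real blocks and forces them to disk, so
neither sparse files nor the page cache flatter the numbers (file name
and sizes are arbitrary):

  # Write 2GB and force it out to disk before dd reports the speed
  dd if=/dev/zero of=/local/a bs=1M count=2048 conv=fdatasync

  # Read it back, bypassing the page cache
  dd if=/local/a of=/dev/null bs=1M iflag=direct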

> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1
> /home2 etc. But the overhead of LVM worries me.
> 3. Do I use LVM striping or not? etc.

You want to use lvm why?


>
>
From: Rahul on
unruh <unruh(a)wormhole.physics.ubc.ca> wrote in
news:slrnhlchar.4le.unruh(a)wormhole.physics.ubc.ca:

Thanks unruh!

> Do you want speed or do you want size or do you want redundancy?

Mainly speed. Size and redundancy are good but lesser goals. I guess it's
always a tradeoff between all 3.

>
>> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1
>> /home2 etc. But the overhead of LVM worries me.
>> 3. Do I use LVM striping or not? etc.
>
> You want to use lvm why?
>

Because I have 3 different "storage boxes" with 15 drives each. At best I
see three devices /dev/sda /dev/sdb /dev/sdc after I use the hardware RAID
controllers. Logically I just want to mount /home on them.

At worst (if I do 7-disk RAID5's) I might see 6 devices. Then again
LVM would aggregate them and I could mount /home.

I am open to other suggestions.


--
Rahul