From: Aragorn on 20 Jan 2010 06:58 On Tuesday 19 January 2010 22:57 in comp.os.linux.misc, somebody identifying as David Brown wrote... > Aragorn wrote: > >> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody >> identifying as Rahul wrote... >> >>> Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj1gta$2hp$5 >>> @news.eternal-september.org: >>> >>> Thanks for the great explaination! >> >> Glad you appreciated it. ;-) > > Unfortunately, there seems to me to be a number of misconceptions in > this post. I freely admit to having more theoretical knowledge from > trawling the net, reading mdadm documentation, etc., than personal > practical experience - so anyone reading this will have to judge for > themselves whether they think I am right, or Aragorn is right. Either > way, I hope to give you some things to think about. Having read your reply, I agree with most of it. I obviously made a thinko in assuming that it was the parity calculation that slowed things down, but I was writing my reply in a rather abstracted set of mind. All things considered, you and I are both right. The difference in our view is that you further dissected the slowdown to the reads needed in order to calculate the parity, while in my abstraction, I did not get into this any further. ;-) >>>> Writing to a RAID 5 is slower than writing to a single disk because >>>> with each write, the parity block must be updated, which means >>>> calculation of the parity data and writing that parity data to the >>>> pertaining disk. >>> >>> This is where I get confused. Is writing to a RAID5 slower than a >>> single disk irrespective of how many disks I throw at the RAID5? >> >> Normally, yes, although it won't be *much* slower. But there is some >> overhead in the calculation of the parity, yes. This is why RAID 6 >> is even slower during writes: it stores *two* parity blocks per data >> segment (and as such, it requires a minimum of 4 disks). > > Writing to RAID5 (or RAID6) /may/ be slower than writing to a single > disk - or it may be much faster (closer to RAID0 speeds). The actual > parity calculations are negligible with modern hardware, whether it be > the host CPU or a hardware raid card. What takes time is if existing > data has to be read in from the disks in order to calculate the parity > - this causes a definite delay. If you are writing a whole stripe, > the parity can be calculated directly and the write goes at N-1 speed > as each block in the stripe can be written in parallel. This is also > the case if the parts of the block are already in the cache from > before. > > Thus random writes are slow on RAID5 (and RAID6), but larger block > writes are full speed. I agree with this. > There can also be significant differences between the speed of mdadm > software RAID5, and hardware RAID5. With hardware raid, the card can > report a small write as "finished" before it has read in the block and > written out the data and new parity. This is safe for good hardware > with battery backup of its buffers, and gives fast writes (as far as > the host is concerned) even for small writes. Software raid5 cannot > do this. Correct. > But on the other hand, software raid5 can take advantage of large > system memories for cache, and is thus far more likely to have the > required stripe data already in its cache (especially for metadata > and directory areas of the file system, which are commonly accessed > but have small writes). Also correct, but dependent on available system RAM, of course. 
I'm not sure how much RAM the RAID controllers in my two servers have - I think one has 256 MB and the other one 128 MB - but on a system with, say, 4 GB of system RAM, caching capacity is of course much higher. > This is perhaps also a good time to mention one of the risks of raid5 > (and raid6) - the RAID5 Write Hole. When you are writing a stripe to > the disk, the system must write at least two blocks - data and the > updated parity block. These two writes cannot be done atomically - if > you get a system failure at this point, the blocks may be inconsistent > and the whole stripe is inconsistent and effectively becomes silent > garbage. Indeed, the infamous RAID 5/6 write hole. I left that bit of information out as my advice to the OP was not to use RAID 5 or RAID 6 but to use RAID 10 instead, considering the gigantic amount of drives he has at his disposal. ;-) On the other hand - and purely in theory, as RAID 5/6 is not the right solution for the OP - a battery-backed hardware RAID controller on a machine hooked up to a UPS should be able to avoid the RAID 5/6 writing hole, since the controller itself has its own processor. I even believe that my Adaptec SAS RAID controller runs a Linux kernel of its own. >>> I currently have a 7-disk RAID5. Will writing to this be slower than >>> a single disk? >> >> A little, yes. But reading from it will be significantly faster. > > Not necessarily - writing will be slower if you do lots of small > random writes, but much faster if you write large blocks. Yes, of course, it all depends on the usage. That's why different types of servers use different types of RAID configurations, and the same can be said about the choice of filesystem types. There is no "one size fits all". ;-) > Also remember that with a 7 disk array under heavy use, you /will/ see > a disk failure at some point. Also correct. The risk of disk failure will increase with the amount of disks involved. > Degraded performance of raid 5 is very poor, and rebuilds are slow. Correct. RAID 5 and RAID 6 are trade-offs between redundancy, diskspace consumption and performance. RAID 10 offers the best performance and redundancy, but is the most costly in terms of wasted diskspace. > Some people believe that the chance of a > second disk failure occurring during a rebuild is so large (rebuilds > are particular intensive for the other disks) that raid 5 should be > considered unsafe for large arrays. Raid 6 is better since it can > survive a second failure, but mirrored raids are safer still. > >>> Isn't the parity calculation a fairly fast process especially if one >>> has a hardware based card? > > A decent host processor will do the parity calculations /much/ faster > than the raid processor on most hardware cards. But the calculations > themselves are not the cause of the latency, it's the extra reads that > take time. Correct. >> It is however advised if you have a hardware RAID adapter to disable >> the write barriers. Write barriers are where the kernel forces the >> disks drives to flush their caches. Since a hardware RAID adapter >> must be in total control of the disk drives and has cache memory of >> its own, the operating system should never force the disk drives to >> flush their cache. >> > > Make sure your raid controller has batteries, and that the whole > system is on an UPS! If data integrity is important, then I consider a UPS a necessity, even without RAID. 
;-) >>> And then if the write gets split into 6 parts shouldnt that speed up >>> the process since each disk is writing only 1/6th of the chunk? >> >> Yes, but the data has to be split up first - which is of course a lot >> faster on hardware RAID since it is done by a dedicated processor on >> the adapter itself then - and the parity has to be calculated. This >> is overhead which you do not have with a single disk. > > Nonsense - a host CPU is perfectly capable of splitting a stripe into > its blocks in a fraction of a microsecond. It is also much faster at > doing the parity calculations - the host CPU typically runs at least > ten times as fast as the CPU or ASIC on the raid card. Yes, but the host CPU also has other things to take care off, while the CPU or ASIC on a RAID controller is dedicated to just that one task. > And again, the splitting and parity calculations are not the > bottleneck, it's the latency of the reads needed to calculate the new > parity that takes time. True, but as I stated higher up, I considered the reads to be part of the parity calculation process. You need those reads in order to calculate the parity, so it's a matter of semantics. ;-) > There are times when top-range hardware raid cards will beat software > raid on speed, but not often - especially with a fast multi-core > modern host cpu. It does, however, depend highly on your raid setup > and the type of load you have - there are no set answers here. Considering that many hardware RAID adapters have a battery-backed cache, I'd say that's another argument in favor of true hardware RAID. > Software raid does of course have a reliability weak point - if your > OS crashes in the middle of a write, you have a bigger chance of > hitting the raid 5 write hole than you would with a hardware raid card > with a battery. I think this is an important thing to consider for anyone looking into a RAID 5 solution. >>>> There are however a few considerations you should take into account >>>> with both of these approaches, i.e. that you should not put the >>>> filesystem which holds the kernels and /initrd/ - and preferably >>>> not the root filesystem either[1] - on a stripe, because the >>>> bootloader recognizes [...] >>> >>> Luckily that is not needed. I have a seperate drive to boot from. >>> The RAID is intended only for user /home dirs. >> >> Ah but wait a minute. As I understand it, you have a hardware RAID >> adapter card. In that case - assuming that it is a real hardware >> RAID adapter and not one of those on-board fake-RAID things - it >> doesn't matter, because to the operating system (and even to the >> BIOS), the entire array will be seen as a single disk. So then it is >> perfectly possible to have your bootloader, your "/boot" and your "/" >> living on the RAID array. (I am doing that myself on one of my >> machines, which has two RAID 5 arrays of four disks each.) >> >> And in this case - i.e. if you have a hardware RAID array - then your >> original question regarding software RAID 0 versus striping via LVM >> is also answered, because hardware RAID will always be a bit faster >> than software RAID or striped LVM. Additionally, since you mention >> seven disks, you could even opt for RAID 10 or 51 and even have >> a "hot spare" or "standby spare". (Or you could use the extra disk >> as an individual, standalone disk.) >> >> RAID 10 is where you have a mirror (i.e. 
RAID 1) which is striped to >> another mirror - you could instead also use RAID 01, which is a >> stripe which is mirrored on another stripe. RAID 10 is better than >> RAID 01 though - there's a good article on Wikipedia about it. RAID >> 10 or 01 require four disks in total. Performance is very good for >> both reading and writing *and* you have redundancy. > > Yes, wikipedia /does/ have some useful information about raid - it's > worth reading. You are preaching to the choir. ;-) > One thing you are missing here is that Linux mdadm raid 10 is very > much more flexible than just a "stripe of mirrors", which is the > standard raid 10. In particular, you can use any number of disks > (from 2 upwards), you can have more than 2 copies of each block (at > the cost of disk space, obviously) for greater redundancy, and you can > have a layout that optimises the throughput for different loads. I'm not missing that, but since the OP had confirmed that he has a hardware RAID set-up, I was addressing that aspect only. Linux software RAID is applied on a partition basis and is thus indeed more flexible than hardware RAID, which is applied on an entire disk basis. > Raid10 performance is also much less affected by a disk failure, and > rebuilds are faster and less stressful on the system. And a single > hot spare will cover all the disks - you don't need a spare per I think RAID 10 (at the hardware level) would be ideal for the OP. > [...] Put your LVM physical volume on top of this if you want the > flexibility of LVM - if you don't need it, don't bother. He might want to use LVM in order to combine what the operating system sees as three independent drives - and which are in reality three RAID arrays - into a single "/home" volume, but then again, he could instead also create that "/home" as an /mdadm/ stripe. Either way, he needs to choose between /mdadm/ and LVM for how he wants to set up the "/home" volume from three separate arrays, be it striped or linear - a.k.a. the JBOD approach - but not /mdadm/ and LVM both. That would be unnecessary overhead. >> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto >> another RAID 5. Or you could use RAID 15, which is a RAID 5 >> comprised of mirrors. RAID 51 and 15 require a minimum of six disks. >> (Similarly, there is RAID 61 and 16, which require a minimum of eight >> disks.) > As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on > three disks). Such a 2-disk raid 5 is not much use in a working > system, but can be convenient when setting things up or upgrading > drives, as you can add more drives to the mdadm raid 5 later on. It's > just an example of how much more flexible mdadm is than hardware raid > solutions. True, but considering the enormous number of drives involved and hardware RAID already being present, I think the wisest approach would be to use hardware RAID at the lower levels and only use /mdadm/ at the final level. >> [...] >> With RAID 5, your storage capacity is reduced by the capacity of one >> disk in the array, and with RAID 6 by the capacity of two disks in >> the array. So, with a single RAID 5 array comprised of seven disks >> without a standby or hot spare, your total storage capacity is that >> of six disks. >> >> And then there's the lost capacity of the hot spare or standby spare >> - a hot spare is spinning but otherwise unused until one of the other >> disks starts to fail, while a standby spare is spun down until one of >> the other disks fails.
Upon such failure, the array will be >> automatically rebuilt using the parity blocks to write the missing >> data to the spare disk. > > I have never heard of a distinction between a "hot spare" that is > spinning, and a "standby spare" that is not spinning. This is quite a common distinction, mind you. There is even a "live spare" solution, but to my knowledge this is specific to Adaptec - they call it RAID 5E. In a "live spare" scenario, the spare disk is not used as such but is part of the live array, and both data and parity blocks are being written to it, but with the distinction that each disk in the array will also have empty blocks for the total capacity of a standard spare disk. These empty blocks are thus distributed across all disks in the array and are used for array reconstruction in the event of a disk failure. > Given that spinup takes a few seconds, and a rebuild often takes many > hours, I can't see you have much to gain by keeping a spare drive > spinning. It might be required for some software RAID solutions where the spare disk cannot be spun down via software. For instance in the event of parallel SCSI drives in a software RAID array. > To my mind, a "hot spare" is a drive that will be used automatically > to replace a dead drive. Semantics. ;-) > An "offline spare" is an extra drive that is physically attached, but > not in use automatically - in the event of a failure, it can be > manually assigned to a raid set. This makes sense if you have several > hardware raid sets defined and want to share a single spare, if the > hardware raid cannot support this (mdadm, of course, supports such a > setup with a shared hot spare). Most modern hardware RAID controllers support this. -- *Aragorn* (registered GNU/Linux user #223157)
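As a concrete illustration of the mdadm flexibility and spare handling discussed above, a minimal sketch of a four-disk md RAID 10 in the "far" layout with one hot spare (device names are hypothetical):

    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
          --spare-devices=1 /dev/sd[b-f]1

A single spare can also be shared between several md arrays by giving their ARRAY lines the same spare-group= value in /etc/mdadm.conf and keeping mdadm --monitor running; the monitor then moves the spare to whichever array degrades first.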
From: David Brown on 20 Jan 2010 07:17 Rahul wrote: > David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in > news:NKOdnXtWJIFpt8vWnZ2dnUVZ7radnZ2d(a)lyse.net: > >> themselves whether they think I am right, or Aragorn is right. Either >> way, I hope to give you some things to think about. > > An alternative viewpoint is always good! > >> Thus random writes are slow on RAID5 (and RAID6), but larger block >> writes are full speed. > > And if I did a RAID10 at hardware level (as you later suggest) I'd get > the speedup on random writes as well? (which are otherwise slow on a > RAID5?) What other way do I have to speed up random writes? > Random writes will always be fairly fast with raid10, whether software or hardware - there are no blocks that have to be read. With mirroring (raid1 or raid10), you have to do twice as many writes as with raid0, but they are to different drives and can thus be written in parallel. >> There can also be significant differences between the speed of mdadm >> software RAID5, and hardware RAID5. With hardware raid, the card can >> report a small write as "finished" before it has read in the block and >> written out the data and new parity. This is safe for good hardware >> with battery backup of its buffers, and gives fast writes (as far as >> the host is concerned) even for small writes. Software raid5 cannot >> do this. But on the other hand, software raid5 can take advantage of >> large system memories for cache, and is thus far more likely to have >> the required stripe data already in its cache (especially for metadata >> and directory areas of the file system, which are commonly accessed >> but have small writes). > > Yes, I do have a battery backed up cache on my Hardware card. But from > the point you make above there's something to be said about a software > (mdadm or LVM) on top of hardware approach? This way I get the best of > both worlds? LVM / mdadm will serve out from RAM (I've 48 Gigs of it) > and speed up reads. Writes will be speeded up due to the caches of the > Hardware card. Does this make sense? > The speedup you get with mdadm having large caches is only relevant for raid5 (or raid6). The trouble with small raid5 writes is that you need to read in the old data and parity block before you can write them anew, and here a large cache increases the chance of having these blocks in the cache. Once you are beyond the raid5 level (for example, if you have the raid5 in hardware), caches on the host will not help. If you have your three boxes set up with raid5 in hardware, then you should stripe them (raid0) to form your final "disk". It is unlikely to make any noticeable performance difference doing this in hardware or mdadm, but mdadm is probably more flexible (though there is not much you can do with raid0). The worst choice you could make is to have raid5 on the host (software or hardware). The issue here is that each stripe is going to be very large, since it will cover all 45 disks. Since raid5 writes are slow unless they cover an entire stripe, even fairly large writes are going to be parts of a stripe and therefore slow. > >> This is perhaps also a good time to mention one of the risks of raid5 >> (and raid6) - the RAID5 Write Hole. > > This risk is reduced by a battery backed-up cache, correct? > Yes, if the disks themselves also have battery backup (i.e., an UPS). If the controller is able to complete writes safely even in the event of a power cut or a host OS crash, then the raid5 write hole should not be an issue. 
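A minimal sketch of the layering recommended above, assuming each of the three boxes exports its hardware array to the host as a single device and that those devices appear as /dev/sdb, /dev/sdc and /dev/sdd (names and filesystem choice are only examples):

    # stripe the three hardware arrays into one md device; redundancy is
    # already handled inside each box, so plain raid0 is enough here
    mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=3 \
          /dev/sdb /dev/sdc /dev/sdd
    mkfs.xfs /dev/md0

The chunk size is a tunable, not a recommendation - it should be matched to the stripe geometry of the underlying hardware arrays and the expected I/O sizes.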
>>>> I currently have a 7-disk RAID5. Will writing to this be slower than >>>> a single disk? >>> A little, yes. But reading from it will be significantly faster. >> Not necessarily - writing will be slower if you do lots of small >> random writes, but much faster if you write large blocks. > > And will the reads and large-sequential-writes be even faster if I did a > 14 disk RAID5 instead of a 7-disk RAID5? >

You have to think about what is happening when you are doing different types of access. Let's go through this for some different setups, considering small and large reads and writes with N drives and raid0, raid1, raid10, mdadm "far" raid10, and raid5.

For raid0, you have a layout like this:

    1  2  3  4
    5  6  7  8

A small read will require a single seek on one of the disks, followed by a read - it will give the same performance as a single drive (though you will get better average throughput if you have lots of unrelated reads in parallel). Similarly with a small write. For large reads and writes, you can read or write to all the disks in parallel - theoretically you get N times the throughput.

For raid1, you have this:

    1a 1b
    2a 2b
    3a 3b
    4a 4b

(Numbers are the data blocks, letters "a" and "b" indicate the duplications). With more disks but keeping two copies, this is effectively standard raid10 (and raid01 is identical to raid10 performance-wise):

    1a 1b 2a 2b
    3a 3b 4a 4b
    5a 5b 6a 6b
    7a 7b 8a 8b

Small reads are, as usual, a seek followed by a read. Seeks may be a little faster than for 1 disk, since either half of the mirror can be used - the one with the closest head can be picked. Small writes are similar, though the same data must be written twice. However, since the two copies are on different disks, these are done in parallel. For large reads, you basically have half the disks running sequentially for (N/2) speed - parallel reads from the second copy are only really useful for reads of up to three stripes in size. Bulk writing is at up to N/2 speed.

mdadm "far" raid10 is a little different:

    1a 2a 3a 4a
    5a 6a 7a 8a
    ....
    2b 1b 4b 3b
    6b 5b 8b 7b

As with raid1, small reads are a seek followed by a read. Seeks may be a little faster than for 1 disk, since either half of the mirror can be used - the one with the closest head can be picked. Small writes are similar, though the same data must be written twice in parallel. For large reads - and this is a key difference from standard raid10 - the layout looks like raid0, and runs at full N speed. Bulk writing is again at up to N/2 speed.

raid5 looks like this:

    1  2  3  p123
    4  5  p456  6

Small reads are the same as for a single disk. Large reads are similar to (N-1), but don't quite make it - the parity blocks disrupt the flow of the sequential reads. Large writes - full stripes - are close to (N-1) since the parity can be calculated on the host or controller and written out directly. The killer is small writes - imagine trying to write to block 3 here. The host or controller must also calculate the new p123, either by reading in the other data blocks of the stripe (the cheaper option for a 3-disk raid5, where there is only one other data block) or by reading in the old block 3 and the old p123 (cheaper when there are more than 3 disks). Then it calculates the parity (a quick task), and writes out blocks 3 and p123. Waiting for these reads is what stalls the write process and gives long latency on random writes.

So in answer to your question (which you should be able to see yourself now), large reads and writes scale with the number of disks for raid5.
But the cutoff point for a write to be "large" or "small", i.e., the stripe size, is larger when you have more disks. >> Make sure your raid controller has batteries, and that the whole >> system is on an UPS! > > Yes! Both. >> For reading, especially for large reads, raid 5 is approximately like >> N-1 raid 0 drives, while raid 6 is like N-2 raid 0. > > Problem is I haven't seen a similar formula mentioned for writes. Neither > large nor small writes. What's a approximate design equation to use to > rate options? Large writes are approximately N-1. For small writes, you have longer latency than for a simple single disk. >> However, while a "f2" md raid 10 is probably the fastest setup for >> directly connected drives, this is not what you have. You will also >> suffer from bandwidth issues > > Which bandwidth are we talking about? THe CPU-to-controller? The bandwidth between the host memory, through the DMA controller (and possibly the cpu), to the SAS controller. Simply put, with software mirroring the host has to write the same data twice, using twice the bandwidth. > >> if you try to do all the mirroring of all >> 45 drives in software. In your case, I would recommend raid 10 on >> each box - 7 raid1 pairs striped together with a hot spare (assuming >> the hardware supports a common hot spare). Your host then sees these >> three disks, which you should stripe together with mdadm raid0 - there >> is no need for redundancy here, as that is handled at a lower level. >> Put your LVM physical volume on top of this if you want the >> flexibility of LVM - if you don't need it, don't bother. > > Ah! Thanks! That;s a creative solution I hadn't thought about. >> I have never heard of a distinction between a "hot spare" that is >> spinning, and a "standby spare" that is not spinning. > > Me neither. > >>> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5 >>> during writes, but not really significantly faster during reads, and >>> you would have the full storage capacity of all disks in the array, >>> but there would be no redundancy at all. So, considering that you >>> have seven disks, I think you really should consider building in >>> redundancy. After all, with RAID 0, if a single disk in the array >>> fails, then you'll have lost all of your data. A RAID 5 would upon >>> failure of a single disk run slower, but at least you'd still have >>> access to your data. >>> > > Or I could do the RAID10 that you suggest and stripe on top of three such > arrays using mdadm. I'm thinking about this very interesting option. > Thanks! > >
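A rough way to sanity-check the "large writes scale with roughly N-1" behaviour on a finished array is a streaming dd run with the page cache bypassed (mount point and sizes are hypothetical); small random writes are better measured with a dedicated benchmark such as fio or iozone:

    # large sequential write and read, bypassing the page cache
    dd if=/dev/zero of=/mnt/home/ddtest bs=1M count=4096 oflag=direct
    dd if=/mnt/home/ddtest of=/dev/null bs=1M iflag=direct
    rm /mnt/home/ddtest

These figures only approximate the formulas above - filesystem overhead, chunk alignment and controller caches all shift the numbers.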
From: David Brown on 20 Jan 2010 08:07 Aragorn wrote: > On Tuesday 19 January 2010 22:57 in comp.os.linux.misc, somebody > identifying as David Brown wrote... > >> Aragorn wrote: >> >>> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody >>> identifying as Rahul wrote... >>> >>>> Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj1gta$2hp$5 >>>> @news.eternal-september.org: >>>> >>>> Thanks for the great explaination! >>> Glad you appreciated it. ;-) >> Unfortunately, there seems to me to be a number of misconceptions in >> this post. I freely admit to having more theoretical knowledge from >> trawling the net, reading mdadm documentation, etc., than personal >> practical experience - so anyone reading this will have to judge for >> themselves whether they think I am right, or Aragorn is right. Either >> way, I hope to give you some things to think about. > > Having read your reply, I agree with most of it. I obviously made a > thinko in assuming that it was the parity calculation that slowed > things down, but I was writing my reply in a rather abstracted set of > mind. > > All things considered, you and I are both right. The difference in our > view is that you further dissected the slowdown to the reads needed in > order to calculate the parity, while in my abstraction, I did not get > into this any further. ;-) > It is also the case that calculating parity used to be significant in the timing. When host cpus ran at a few hundred MHz, doing parity calculations in software was slow, and took all of the host cpu capacity. But since then, the host cpus are orders of magnitude faster (and are even better suited to the sorts of streaming calculations needed), while hard disk speeds have only increased a few times. And with multiple cores on the host as standard, most setups can handle the load without noticing. Another point here is that you can improve some things on the host side by using a faster processor and more ram (you can never have too much ram in a file server, assuming you are not using windows). It is often cheaper and easier to boost the host in this way to improve your software raid than it is to change the hardware raid setup. >>>>> Writing to a RAID 5 is slower than writing to a single disk because >>>>> with each write, the parity block must be updated, which means >>>>> calculation of the parity data and writing that parity data to the >>>>> pertaining disk. >>>> This is where I get confused. Is writing to a RAID5 slower than a >>>> single disk irrespective of how many disks I throw at the RAID5? >>> Normally, yes, although it won't be *much* slower. But there is some >>> overhead in the calculation of the parity, yes. This is why RAID 6 >>> is even slower during writes: it stores *two* parity blocks per data >>> segment (and as such, it requires a minimum of 4 disks). >> Writing to RAID5 (or RAID6) /may/ be slower than writing to a single >> disk - or it may be much faster (closer to RAID0 speeds). The actual >> parity calculations are negligible with modern hardware, whether it be >> the host CPU or a hardware raid card. What takes time is if existing >> data has to be read in from the disks in order to calculate the parity >> - this causes a definite delay. If you are writing a whole stripe, >> the parity can be calculated directly and the write goes at N-1 speed >> as each block in the stripe can be written in parallel. This is also >> the case if the parts of the block are already in the cache from >> before. 
>> >> Thus random writes are slow on RAID5 (and RAID6), but larger block >> writes are full speed. > > I agree with this. > >> There can also be significant differences between the speed of mdadm >> software RAID5, and hardware RAID5. With hardware raid, the card can >> report a small write as "finished" before it has read in the block and >> written out the data and new parity. This is safe for good hardware >> with battery backup of its buffers, and gives fast writes (as far as >> the host is concerned) even for small writes. Software raid5 cannot >> do this. > > Correct. > >> But on the other hand, software raid5 can take advantage of large >> system memories for cache, and is thus far more likely to have the >> required stripe data already in its cache (especially for metadata >> and directory areas of the file system, which are commonly accessed >> but have small writes). > > Also correct, but dependent on available system RAM, of course. I'm not > sure how much RAM the RAID controllers in my two servers have - I think > one has 256 MB and the other one 128 MB - but on a system with, say, 4 > GB of system RAM, caching capacity is of course much higher. > You will probably find it is cheaper to add another 4 GB system ram to the host than another 128 MB to the hardware raid controller. And of course more system ram means more data held in the file-cache (as well as the low-level block caches useful for raid5), reducing the reads from the disk. >> This is perhaps also a good time to mention one of the risks of raid5 >> (and raid6) - the RAID5 Write Hole. When you are writing a stripe to >> the disk, the system must write at least two blocks - data and the >> updated parity block. These two writes cannot be done atomically - if >> you get a system failure at this point, the blocks may be inconsistent >> and the whole stripe is inconsistent and effectively becomes silent >> garbage. > > Indeed, the infamous RAID 5/6 write hole. I left that bit of > information out as my advice to the OP was not to use RAID 5 or RAID 6 > but to use RAID 10 instead, considering the gigantic amount of drives > he has at his disposal. ;-) > Agreed. > On the other hand - and purely in theory, as RAID 5/6 is not the right > solution for the OP - a battery-backed hardware RAID controller on a > machine hooked up to a UPS should be able to avoid the RAID 5/6 writing > hole, since the controller itself has its own processor. I even > believe that my Adaptec SAS RAID controller runs a Linux kernel of its > own. > That's often the case these days - "hardware" raid controllers are frequently host processors running Linux (or sometimes other systems) and software raid. This is especially true of SAN boxes. >>>> I currently have a 7-disk RAID5. Will writing to this be slower than >>>> a single disk? >>> A little, yes. But reading from it will be significantly faster. >> Not necessarily - writing will be slower if you do lots of small >> random writes, but much faster if you write large blocks. > > Yes, of course, it all depends on the usage. That's why different types > of servers use different types of RAID configurations, and the same can > be said about the choice of filesystem types. There is no "one size > fits all". ;-) > Yes - perhaps the OP will give more details on his expected usage patterns. There are many other factors we haven't discussed that can affect the "best" setup, such as what requirements he has for future expansion. 
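For completeness: when the RAID 5/6 is handled by md on the host (not the OP's hardware setup), the low-level block cache mentioned above can be tuned directly. The md stripe cache defaults to a fairly small value; enlarging it costs RAM (roughly the entry count times 4 KiB times the number of member disks) but reduces the read-modify-write penalty for small writes. A sketch, assuming the array is /dev/md0:

    cat /sys/block/md0/md/stripe_cache_size
    echo 8192 > /sys/block/md0/md/stripe_cache_size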
>> Also remember that with a 7 disk array under heavy use, you /will/ see >> a disk failure at some point. > > Also correct. The risk of disk failure will increase with the amount of > disks involved. > >> Degraded performance of raid 5 is very poor, and rebuilds are slow. > > Correct. RAID 5 and RAID 6 are trade-offs between redundancy, diskspace > consumption and performance. RAID 10 offers the best performance and > redundancy, but is the most costly in terms of wasted diskspace. > >> Some people believe that the chance of a >> second disk failure occurring during a rebuild is so large (rebuilds >> are particular intensive for the other disks) that raid 5 should be >> considered unsafe for large arrays. Raid 6 is better since it can >> survive a second failure, but mirrored raids are safer still. >> >>>> Isn't the parity calculation a fairly fast process especially if one >>>> has a hardware based card? >> A decent host processor will do the parity calculations /much/ faster >> than the raid processor on most hardware cards. But the calculations >> themselves are not the cause of the latency, it's the extra reads that >> take time. > > Correct. > >>> It is however advised if you have a hardware RAID adapter to disable >>> the write barriers. Write barriers are where the kernel forces the >>> disks drives to flush their caches. Since a hardware RAID adapter >>> must be in total control of the disk drives and has cache memory of >>> its own, the operating system should never force the disk drives to >>> flush their cache. >>> >> Make sure your raid controller has batteries, and that the whole >> system is on an UPS! > > If data integrity is important, then I consider a UPS a necessity, even > without RAID. ;-) > And assuming the data is important, the OP must also think about backup solutions. But that's worth its own thread. >>>> And then if the write gets split into 6 parts shouldnt that speed up >>>> the process since each disk is writing only 1/6th of the chunk? >>> Yes, but the data has to be split up first - which is of course a lot >>> faster on hardware RAID since it is done by a dedicated processor on >>> the adapter itself then - and the parity has to be calculated. This >>> is overhead which you do not have with a single disk. >> Nonsense - a host CPU is perfectly capable of splitting a stripe into >> its blocks in a fraction of a microsecond. It is also much faster at >> doing the parity calculations - the host CPU typically runs at least >> ten times as fast as the CPU or ASIC on the raid card. > > Yes, but the host CPU also has other things to take care off, while the > CPU or ASIC on a RAID controller is dedicated to just that one task. > That's true, but not really relevant for modern CPUs - when you've got 4 or 8 cores running at ten times the speed of the raid controller's chip, you are not talking about a significant load. >> And again, the splitting and parity calculations are not the >> bottleneck, it's the latency of the reads needed to calculate the new >> parity that takes time. > > True, but as I stated higher up, I considered the reads to be part of > the parity calculation process. You need those reads in order to > calculate the parity, so it's a matter of semantics. ;-) > With those definitions, then I agree, of course. >> There are times when top-range hardware raid cards will beat software >> raid on speed, but not often - especially with a fast multi-core >> modern host cpu. 
It does, however, depend highly on your raid setup >> and the type of load you have - there are no set answers here. > > Considering that many hardware RAID adapters have a battery-backed > cache, I'd say that's another argument in favor of true hardware RAID. > >> Software raid does of course have a reliability weak point - if your >> OS crashes in the middle of a write, you have a bigger chance of >> hitting the raid 5 write hole than you would with a hardware raid card >> with a battery. > > I think this is an important thing to consider for anyone looking into a > RAID 5 solution. > >>>>> There are however a few considerations you should take into account >>>>> with both of these approaches, i.e. that you should not put the >>>>> filesystem which holds the kernels and /initrd/ - and preferably >>>>> not the root filesystem either[1] - on a stripe, because the >>>>> bootloader recognizes [...] >>>> Luckily that is not needed. I have a seperate drive to boot from. >>>> The RAID is intended only for user /home dirs. >>> Ah but wait a minute. As I understand it, you have a hardware RAID >>> adapter card. In that case - assuming that it is a real hardware >>> RAID adapter and not one of those on-board fake-RAID things - it >>> doesn't matter, because to the operating system (and even to the >>> BIOS), the entire array will be seen as a single disk. So then it is >>> perfectly possible to have your bootloader, your "/boot" and your "/" >>> living on the RAID array. (I am doing that myself on one of my >>> machines, which has two RAID 5 arrays of four disks each.) >>> >>> And in this case - i.e. if you have a hardware RAID array - then your >>> original question regarding software RAID 0 versus striping via LVM >>> is also answered, because hardware RAID will always be a bit faster >>> than software RAID or striped LVM. Additionally, since you mention >>> seven disks, you could even opt for RAID 10 or 51 and even have >>> a "hot spare" or "standby spare". (Or you could use the extra disk >>> as an individual, standalone disk.) >>> >>> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to >>> another mirror - you could instead also use RAID 01, which is a >>> stripe which is mirrored on another stripe. RAID 10 is better than >>> RAID 01 though - there's a good article on Wikipedia about it. RAID >>> 10 or 01 require four disks in total. Performance is very good for >>> both reading and writing *and* you have redundancy. >> Yes, wikipedia /does/ have some useful information about raid - it's >> worth reading. > > You are preaching to the choir. ;-) > >> One thing you are missing here is that Linux mdadm raid 10 is very >> much more flexible than just a "stripe of mirrors", which is the >> standard raid 10. In particular, you can use any number of disks >> (from 2 upwards), you can have more than 2 copies of each block (at >> the cost of disk space, obviously) for greater redundancy, and you can >> have a layout that optimises the throughput for different loads. > > I'm not missing that, but since the OP had confirmed that he has a > hardware RAID set-up, I was addressing that aspect only. Linux > software RAID is applied on a partition basis and is thus indeed more > flexible than hardware RAID, which is applied on an entire disk basis. 
> I was discussing raid a little more generally, since the OP was asking about mdadm and LVM, while I think you were talking more about hardware raid since he has hardware raid devices already (mdadm raid might be better value for money than hardware raid for most setups - but not if you already have the hardware!). Just a difference of emphasis, really. >> Raid10 performance is also much less affected by a disk failure, and >> rebuilds are faster and less stressful on the system. And a single >> hot spare will cover all the disks - you don't need a spare per > > I think RAID 10 (at the hardware level) would be ideal for the OP. > >> [...] Put your LVM physical volume on top of this if you want the >> flexibility of LVM - if you don't need it, don't bother. > > He might want to use LVM in order to combine what the operating sees as > being three independent drives - and which are in reality three RAID > arrays - into a single "/home" volume, but then again, he could instead > also create that "/home" as an /mdadm/ stripe. > > Either way, he needs to choose between /mdadm/ and LVM for how he wants > to set up the "/home" volume from three separate arrays, be it striped > or linear - alias the JBOD approach - but not /mdadm/ and LVM both. > That would be unnecessary overhead. > I'd recommend combining the three "drives" using mdadm raid0 rather than LVM striping - it's a cleaner solution, and easier to get right (with LVM striping it's all too easy to make a logical partition that is not striped, since the striping must be stated explicitly in the lvcreate command). The point of LVM is for features such as resizing partitions, snapshots, migration of partitions, etc. If you don't need these, don't use LVM. LVM does not have much overhead, but it still leads to some slowdown. LVM is fine for JBOD or linear setups - adding a new drive to your volume group for more disk space. But I believe mdadm does a better job - it is more dedicated to the task. I don't have any numbers, but I have found LVM striping to be slower than expected. >>> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto >>> another RAID 5. Or you could use RAID 15, which is a RAID 5 >>> comprised of mirrors. RAID 51 and 15 require a minimum of six disks. >>> (Similarly, there is RAID 61 and 16, which require a minimum of eight >>> disks.) >> As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on >> three disks). Such a 2-disk raid 5 is not much use in a working >> system, but can be convenient when setting things up or upgrading >> drives, as you can add more drives to the mdadm raid 5 later on. It's >> just an example of how much more flexible mdadm is than hardware raid >> solutions. > > True, but considering the enormous amount of drives involved and > hardware RAID already being present, I think the wisest approach would > be to use hardware RAID at the lower levels and only use /mdadm/ at the > final level. > Indeed - that was another one of my general points, rather than specific advice to the OP. I have perhaps mixed these up a bit in my posts. >>> [...] >>> With RAID 5, your storage capacity is reduced by the capacity of one >>> disk in the array, and with RAID 6 by the capacity of two disks in >>> the array. So, with a single RAID 5 array comprised of seven disks >>> without a standby or hot spare, your total storage capacity is that >>> of six disks. 
>>> >>> And then there's the lost capacity of the hot spare or standby spare >>> - a hot spare is spinning but otherwise unused until one of the other >>> disks starts to fail, while a standby spare is spun down until one of >>> the other disks fails. Upon such failure, the array will be >>> automatically rebuilt using the parity blocks to write the missing >>> data to the spare disk. >> I have never heard of a distinction between a "hot spare" that is >> spinning, and a "standby spare" that is not spinning. > > This is quite a common distinction, mind you. There is even a "live > spare" solution, but to my knowledge this is specific to Adaptec - they > call it RAID 5E. > > In a "live spare" scenario, the spare disk is not used as such but is > part of the live array, and both data and parity blocks are being > written to it, but with the distinction that each disk in the array > will also have empty blocks for the total capacity of a standard spare > disk. These empty blocks are thus distributed across all disks in the > array and are used for array reconstruction in the event of a disk > failure. > Is there any real advantage of such a setup compared to using raid 6 (in which case, the "empty" blocks are second parity blocks)? There would be a slightly greater write overhead (especially for small writes), but that would not be seen by the host if there is enough cache on the controller. >> Given that spinup takes a few seconds, and a rebuild often takes many >> hours, I can't see you have much to gain by keeping a spare drive >> spinning. > > It might be required for some software RAID solutions where the spare > disk cannot be spun down via software. For instance in the event of > parallel SCSI drives in a software RAID array. > Not all drive systems (controller and/or drives) support spin down or idle drives. >> To my mind, a "hot spare" is a drive that will be used automatically >> to replace a dead drive. > > Semantics. ;-) > Yes. >> An "offline spare" is an extra drive that is physically attached, but >> not in use automatically - in the event of a failure, it can be >> manually assigned to a raid set. This makes sense if you have several >> hardware raid sets defined and want to share a single spare, if the >> hardware raid cannot support this (mdadm, of course, supports such a >> setup with a shared hot spare). > > Most modern hardware RAID controllers support this. > OK. It looks like we agree on most things here - we just had a little difference on the areas we wrote about (specific information for the OP, or more general RAID discussions), and a few small differences in terminology. mvh., David
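To make the lvcreate point concrete: striping has to be requested explicitly with -i (number of stripes) and, optionally, -I (stripe size in KiB); without -i the logical volume is simply allocated linearly across the physical volumes. A sketch of the LVM-striping alternative to an md raid0, with hypothetical device names:

    pvcreate /dev/sdb /dev/sdc /dev/sdd
    vgcreate vg_home /dev/sdb /dev/sdc /dev/sdd
    # -i 3 stripes the LV across all three PVs; omit it and the LV is linear
    lvcreate -i 3 -I 256 -l 100%FREE -n home vg_home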
From: Aragorn on 20 Jan 2010 08:44 On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody identifying as David Brown wrote... > Aragorn wrote: > >> [David Brown wrote:] >> >>> But on the other hand, software raid5 can take advantage of large >>> system memories for cache, and is thus far more likely to have the >>> required stripe data already in its cache (especially for metadata >>> and directory areas of the file system, which are commonly accessed >>> but have small writes). >> >> Also correct, but dependent on available system RAM, of course. I'm >> not sure how much RAM the RAID controllers in my two servers have - I >> think one has 256 MB and the other one 128 MB - but on a system with, >> say, 4 GB of system RAM, caching capacity is of course much higher. > > You will probably find it is cheaper to add another 4 GB system ram to > the host than another 128 MB to the hardware raid controller. Well, that all depends... On a system that uses ECC registered RAM - such as a genuine server - the cost of adding more RAM may be quite daunting. On the other hand, I'm not so sure whether a hardware RAID adapter can be retrofitted with more memory than it already has out of the box. >> On the other hand - and purely in theory, as RAID 5/6 is not the >> right solution for the OP - a battery-backed hardware RAID controller >> on a machine hooked up to a UPS should be able to avoid the RAID 5/6 >> writing hole, since the controller itself has its own processor. I >> even believe that my Adaptec SAS RAID controller runs a Linux kernel >> of its own. > > That's often the case these days - "hardware" raid controllers are > frequently host processors running Linux (or sometimes other systems) > and software raid. This is especially true of SAN boxes. Well, the line between hardware RAID and software RAID is rather blurry in the event of a modern hardware RAID controller. Sure, it's all firmware, but there is a software component involved as well, presumably because of certain efficiencies in scheduling with set-ups employing multiple disks, as with the nested RAID solutions. >>> Make sure your raid controller has batteries, and that the whole >>> system is on an UPS! >> >> If data integrity is important, then I consider a UPS a necessity, >> even without RAID. ;-) > > And assuming the data is important, the OP must also think about > backup solutions. But that's worth its own thread. Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou shalt make backups, and lots of them too!" :p >>> [...] a host CPU is perfectly capable of splitting a stripe >>> into its blocks in a fraction of a microsecond. It is also much >>> faster at doing the parity calculations - the host CPU typically >>> runs at least ten times as fast as the CPU or ASIC on the raid card. >> >> Yes, but the host CPU also has other things to take care off, while >> the CPU or ASIC on a RAID controller is dedicated to just that one >> task. > > That's true, but not really relevant for modern CPUs - when you've got > 4 or 8 cores running at ten times the speed of the raid controller's > chip, you are not talking about a significant load. Well, 8 cores might be a bit of a stretch, and not everyone has quadcore CPUs yet, either. My Big Machine for instance has two dualcore Opterons in it, so that makes for four cores in total. (The machine also has a SAS/SATA RAID controller, so the RAID discussion is moot here, but I'm just mentioning it.) 
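The parity-speed point is easy to check on any given host: when the md raid modules load, the kernel benchmarks its available xor and raid6 routines and logs the measured throughput of each, so the host CPU's parity performance can be read straight out of the boot messages:

    dmesg | grep -iE 'raid6:|xor:'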
Another thing which must not be overlooked is that the CPU or ASIC on a hardware RAID controller is typically a RISC chip, and so comparing clock speeds would not really give an accurate impression of its performance versus a mainboard processor chip. For instance, a MIPS or Alpha processor running at 800 MHz still outperforms most (single core) 2+ GHz processors. >>> I have never heard of a distinction between a "hot spare" that is >>> spinning, and a "standby spare" that is not spinning. >> >> This is quite a common distinction, mind you. There is even a "live >> spare" solution, but to my knowledge this is specific to Adaptec - >> they call it RAID 5E. >> >> In a "live spare" scenario, the spare disk is not used as such but is >> part of the live array, and both data and parity blocks are being >> written to it, but with the distinction that each disk in the array >> will also have empty blocks for the total capacity of a standard >> spare disk. These empty blocks are thus distributed across all disks >> in the array and are used for array reconstruction in the event of a >> disk failure. > > Is there any real advantage of such a setup compared to using raid 6 > (in which case, the "empty" blocks are second parity blocks)? There > would be a slightly greater write overhead (especially for small > writes), but that would not be seen by the host if there is enough > cache on the controller. Well, the advantage of this set-up is that you don't need to replace a failing disk, since there is already sufficient diskspace left blank on all disks in the array, and so the array can recreate itself using that extra blank diskspace. This is of course all nice in theory, but in practice one would eventually replace the disk anyway. In terms of performance, it would be similar to RAID 6 for reads - because the empty blocks have to be skipped in sequential reads - but for writing it would be slightly better than RAID 6 since only one set of parity data per stripe needs to be (re)calculated and (re)written. It does of course remain a single-point-of-failure set-up, whereas RAID 6 offers a two-points-of-failure set-up. >>> An "offline spare" is an extra drive that is physically attached, >>> but not in use automatically - in the event of a failure, it can be >>> manually assigned to a raid set. This makes sense if you have >>> several hardware raid sets defined and want to share a single spare, >>> if the hardware raid cannot support this (mdadm, of course, supports >>> such a setup with a shared hot spare). >> >> Most modern hardware RAID controllers support this. > > OK. > > It looks like we agree on most things here - we just had a little > difference on the areas we wrote about (specific information for the > OP, or more general RAID discussions), and a few small differences in > terminology. Well, you've made me reconsider my usage of RAID 5, though. I am not contemplating on using two RAID 10 arrays instead of two RAID 5 arrays, since each of the arrays has four disks. They are both different arrays, though. They're connected to the same RAID controller but the first array is comprised of 147 GB 15k Hitachi SAS disks and the second array is comprised of 1 TB 7.2k Western Digital RAID Edition SATA-2 disks on a hotswap backplane. 
I had always considered RAID 5 to be the best trade-off, given the loss of diskspace involved versus the retail price of the hard disks - especially the SAS disks - but considering that the SAS array will be used to house the main systems in a virtualized set-up (on Xen) and will probably endure the most small and random writes, RAID 10 might actually be a better solution. The cost of the lost diskspace on the SATA-2 disks is smaller, since that type of disk is far less expensive than SAS. See, this is one of the advantages of Usenet. People get to share not only knowledge but also differing views and strategies, and in the end, everyone will have gleaned something useful. ;-) -- *Aragorn* (registered GNU/Linux user #223157)
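For the two four-disk arrays just described, the capacity side of that trade-off works out as follows (usable space only, based on the drive sizes mentioned above):

    4 x 147 GB SAS:     RAID 5 = 3 x 147 = 441 GB    RAID 10 = 2 x 147 = 294 GB
    4 x 1 TB SATA-2:    RAID 5 = 3 TB                RAID 10 = 2 TB

Either way, moving from RAID 5 to RAID 10 gives up one extra disk's worth of capacity: 147 GB on the SAS array, a full terabyte on the SATA-2 array.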
From: Aragorn on 20 Jan 2010 08:49
On Wednesday 20 January 2010 14:44 in comp.os.linux.misc, somebody identifying as Aragorn wrote... > On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody > identifying as David Brown wrote... > > Well, you've made me reconsider my usage of RAID 5, though. I am not ^^^ > contemplating on using two RAID 10 arrays instead of two RAID 5 > arrays, [...] That should read "now" instead of "not". Typo again. ;-) -- *Aragorn* (registered GNU/Linux user #223157) |