From: David Brown on
Aragorn wrote:
> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>
>> Aragorn wrote:
>>
>>> [David Brown wrote:]
>>>
>>>> But on the other hand, software raid5 can take advantage of large
>>>> system memories for cache, and is thus far more likely to have the
>>>> required stripe data already in its cache (especially for metadata
>>>> and directory areas of the file system, which are commonly accessed
>>>> but have small writes).
>>> Also correct, but dependent on available system RAM, of course. I'm
>>> not sure how much RAM the RAID controllers in my two servers have - I
>>> think one has 256 MB and the other one 128 MB - but on a system with,
>>> say, 4 GB of system RAM, caching capacity is of course much higher.
>> You will probably find it is cheaper to add another 4 GB system ram to
>> the host than another 128 MB to the hardware raid controller.
>
> Well, that all depends... On a system that uses ECC registered RAM -
> such as a genuine server - the cost of adding more RAM may be quite
> daunting.
>
> On the other hand, I'm not so sure whether a hardware RAID adapter can
> be retrofitted with more memory than it already has out of the box.
>

"Daunting" is less than "impossible" :-) Of course, it depends on your
setup and your needs - there are no fixed answers (I think that's been
mentioned before...)

>>> On the other hand - and purely in theory, as RAID 5/6 is not the
>>> right solution for the OP - a battery-backed hardware RAID controller
>>> on a machine hooked up to a UPS should be able to avoid the RAID 5/6
>>> write hole, since the controller itself has its own processor.  I
>>> even believe that my Adaptec SAS RAID controller runs a Linux kernel
>>> of its own.
>> That's often the case these days - "hardware" raid controllers are
>> frequently host processors running Linux (or sometimes other systems)
>> and software raid. This is especially true of SAN boxes.
>
> Well, the line between hardware RAID and software RAID is rather blurry
> in the event of a modern hardware RAID controller. Sure, it's all
> firmware, but there is a software component involved as well,
> presumably because of certain efficiencies in scheduling with set-ups
> employing multiple disks, as with the nested RAID solutions.
>
>>>> Make sure your raid controller has batteries, and that the whole
>>>> system is on an UPS!
>>> If data integrity is important, then I consider a UPS a necessity,
>>> even without RAID. ;-)
>> And assuming the data is important, the OP must also think about
>> backup solutions. But that's worth its own thread.
>
> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
> shalt make backups, and lots of them too!" :p
>

The zeroth rule, which is often forgotten (until you learn the hard
way!), is "thou shalt make a plan for restoring from backups, test that
plan, document that plan, and find a way to ensure that all backups are
tested and restoreable in this way". /Then/ you can start making your
actual backups!

And the second rule is "thou shalt make backups of your backups",
followed by "thou shalt have backups of critical hardware". (That's
another bonus of software raid - if your hardware raid card dies, you
may have to replace it with exactly the same type of card to get your
raid working again - with mdadm raid, you can use any PC.)
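
As a rough illustration of that last point (the device names below are
just placeholders), moving the disks to any other Linux box and running
something like the following is usually enough, since mdadm keeps its
metadata on the drives themselves:

  mdadm --examine /dev/sd[bcde]1     # show the md superblocks it finds
  mdadm --assemble --scan            # assemble every array it can detect

No special hardware needed - any machine with enough ports will do.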



>>>> [...] a host CPU is perfectly capable of splitting a stripe
>>>> into its blocks in a fraction of a microsecond. It is also much
>>>> faster at doing the parity calculations - the host CPU typically
>>>> runs at least ten times as fast as the CPU or ASIC on the raid card.
>>> Yes, but the host CPU also has other things to take care of, while
>>> the CPU or ASIC on a RAID controller is dedicated to just that one
>>> task.
>> That's true, but not really relevant for modern CPUs - when you've got
>> 4 or 8 cores running at ten times the speed of the raid controller's
>> chip, you are not talking about a significant load.
>
> Well, 8 cores might be a bit of a stretch, and not everyone has quadcore
> CPUs yet, either. My Big Machine for instance has two dualcore
> Opterons in it, so that makes for four cores in total. (The machine
> also has a SAS/SATA RAID controller, so the RAID discussion is moot
> here, but I'm just mentioning it.)
>
> Another thing which must not be overlooked is that the CPU or ASIC on a
> hardware RAID controller is typically a RISC chip, and so comparing
> clock speeds would not really give an accurate impression of its
> performance versus a mainboard processor chip. For instance, a MIPS or
> Alpha processor running at 800 MHz still outperforms most (single core)
> 2+ GHz processors.
>

As already mentioned, "hardware" raid is often done now with a general
purpose processor rather than an ASIC - and MIPS is a particularly
popular core for the job.  But while you get a lot more work out of an
800 MHz RISC chip for a given price, size or power than you do with an
x86, you don't get more for a given clock rate.  Parity calculations are
really just a big stream of "xor"'s, and a modern x86 will chew through
these as fast as memory bandwidth allows.  Internally, x86 assembly is
mostly converted to wide-word RISC-style instructions, so a decently
written parity function will be as efficient per clock on an x86 as it
is on MIPS.
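
As a toy illustration of what that parity amounts to (the byte values
are invented, and real implementations of course work on whole blocks,
not single bytes), in plain shell arithmetic:

  d0=$((0x3A)); d1=$((0xC5)); d2=$((0x1F))
  p=$(( d0 ^ d1 ^ d2 ))                            # parity block: 0xE0
  printf 'parity    = 0x%02X\n' "$p"
  # lose d1, then xor the survivors with the parity to rebuild it:
  printf 'recovered = 0x%02X\n' $(( d0 ^ d2 ^ p )) # prints 0xC5 again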

There are plenty of situations where a slower clock but cleaner
architecture gives more true speed, especially if latency is more
important than throughput, but this isn't one of them.

>>>> I have never heard of a distinction between a "hot spare" that is
>>>> spinning, and a "standby spare" that is not spinning.
>>> This is quite a common distinction, mind you. There is even a "live
>>> spare" solution, but to my knowledge this is specific to Adaptec -
>>> they call it RAID 5E.
>>>
>>> In a "live spare" scenario, the spare disk is not used as such but is
>>> part of the live array, and both data and parity blocks are being
>>> written to it, but with the distinction that each disk in the array
>>> will also have empty blocks for the total capacity of a standard
>>> spare disk. These empty blocks are thus distributed across all disks
>>> in the array and are used for array reconstruction in the event of a
>>> disk failure.
>> Is there any real advantage of such a setup compared to using raid 6
>> (in which case, the "empty" blocks are second parity blocks)? There
>> would be a slightly greater write overhead (especially for small
>> writes), but that would not be seen by the host if there is enough
>> cache on the controller.
>
> Well, the advantage of this set-up is that you don't need to replace a
> failing disk, since there is already sufficient diskspace left blank on
> all disks in the array, and so the array can recreate itself using that
> extra blank diskspace. This is of course all nice in theory, but in
> practice one would eventually replace the disk anyway.
>

The same is true of raid6 - if one disk dies, the degraded raid6 is very
similar to raid5 until you replace the disk.

And I still don't see any significant advantage of spreading the holes
around the drives rather than having them all on the one drive (i.e., a
normal hot spare). The rebuild still has to do as many reads and
writes, and takes as long. The rebuild writes will be spread over all
the disks rather than just on the one disk, but I can't see any
advantage in that.

I suppose read performance, especially for many parallel small reads,
will be slightly higher than for a normal hot spare, since you have more
disks with active data and therefore higher chances of parallelising
these accesses. But you get the same advantage with raid6.

> In terms of performance, it would be similar to RAID 6 for reads -
> because the empty blocks have to be skipped in sequential reads - but
> for writing it would be slightly better than RAID 6 since only one set
> of parity data per stripe needs to be (re)calculated and (re)written.
>
> It does of course remain a single-point-of-failure set-up, whereas RAID
> 6 offers a two-points-of-failure set-up.
>




>>>> An "offline spare" is an extra drive that is physically attached,
>>>> but not in use automatically - in the event of a failure, it can be
>>>> manually assigned to a raid set. This makes sense if you have
>>>> several hardware raid sets defined and want to share a single spare,
>>>> if the hardware raid cannot support this (mdadm, of course, supports
>>>> such a setup with a shared hot spare).
>>> Most modern hardware RAID controllers support this.
>> OK.
>>
>> It looks like we agree on most things here - we just had a little
>> difference on the areas we wrote about (specific information for the
>> OP, or more general RAID discussions), and a few small differences in
>> terminology.
>
> Well, you've made me reconsider my usage of RAID 5, though. I am now
> contemplating on using two RAID 10 arrays instead of two RAID 5 arrays,
> since each of the arrays has four disks. They are both different
> arrays, though. They're connected to the same RAID controller but the
> first array is comprised of 147 GB 15k Hitachi SAS disks and the second
> array is comprised of 1 TB 7.2k Western Digital RAID Edition SATA-2
> disks on a hotswap backplane.
>
> I had always considered RAID 5 to be the best trade-off, considering the
> loss of diskspace involved versus the retail price of the hard disks -
> especially the SAS disks - but considering that the SAS array will be
> used to house the main systems in a virtualized set-up (on Xen) and
> will probably endure the most small and random writes, RAID 10 might
> actually be a better solution. The cost of the lost diskspace on the
> SATA-2 disks is smaller since this type of disks is far less expensive
> than SAS.
>

I gather that raid 10 (hardware or software) is now often considered a
better choice - raid 5 is often viewed as unreliable due to the risks of
a second failure during rebuilds, which are increasingly time-consuming
with larger disks.  Where practical, I think mdadm "far" raid 10 is the
optimal choice if you are happy with losing 50% of your disk space - it
is faster than other redundant setups in many situations, and has a
great deal of flexibility.  If you want more redundancy, you can keep
three copies of each block (33% usable space) and still have full
speed.  If you have the
chance, it would be very nice to try out some different arrangements and
see which is fastest in reality, not just in theory!
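
For reference, the "far" layout is just a matter of the right --layout
flag - a minimal sketch, where the device names are placeholders and the
chunk size is something to tune for the workload:

  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
        /dev/sd[bcde]

(The default "near" layout would be n2 instead of f2.)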

The other option is to go for a file system that handles multiple disks
and redundancy directly - ZFS is the best known, with btrfs the
experimental choice on Linux.
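
On a system where ZFS is native (Solaris/OpenSolaris and friends), the
whole raid/volume-manager/filesystem stack collapses into a couple of
commands - a sketch only, with invented device names:

  zpool create tank mirror c0t0d0 c0t1d0 mirror c0t2d0 c0t3d0
  zfs create tank/home

The pool then handles the mirroring, checksumming and self-healing
itself.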


> See, this is one of the advantages of Usenet. People get to share not
> only knowledge but also differing views and strategies, and in the end,
> everyone will have gleaned something useful. ;-)
>

Absolutely - that's also why it's good to have a general discussion
every now and again, rather than just answering a poster's questions.
Good questions (such as in this thread) inspire an exchange of
information for many people's benefits (I've learned things here too).
From: Rahul on
David Brown <david(a)westcontrol.removethisbit.com> wrote in
news:4b570006$0$6251$8404b019(a)news.wineasy.se:

Thanks both Aragorn and David! This is some of the most comprehensive
advice about RAID issues that I've ever gotten. If you're ever in
Madison, WI, I owe you guys a beer! :)

> Aragorn wrote:
>>
>
> Yes - perhaps the OP will give more details on his expected usage
> patterns. There are many other factors we haven't discussed that can
> affect the "best" setup, such as what requirements he has for future
> expansion.

Sure. More details: It's a mixed bag of I/O, actually. This is part of a
High Performance Compute Cluster, so a wide variety of codes are in use.
We have tracked the I/O patterns: some of them have large sequential
writes, others are dominated by random seeks. Which is why I am not
really fine-tuning my setup for a particular access pattern but going
for the best overall performance. RAID5 and RAID10 fit the bill it
seems, RAID10 even more so. I do have the luxury of excess storage right
now, so I am convinced I ought to do a RAID10 like you guys suggested
(at the HW level).

For combining the 3 RAID10s I am still split between LVM and mdadm. The
performance advantages push me towards mdadm, but the ease of partition
resizing etc. makes LVM attractive.
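
For the LVM route I was picturing something roughly like this (the
device names for the three hardware arrays are only placeholders):

  pvcreate /dev/sdb /dev/sdc /dev/sdd
  vgcreate vg_scratch /dev/sdb /dev/sdc /dev/sdd
  lvcreate --stripes 3 --stripesize 256 --extents 100%FREE \
           --name lv_scratch vg_scratch
  mkfs.xfs /dev/vg_scratch/lv_scratch

Striping the LV across the three arrays should recover most of the raid0
speed while keeping the easy resizing.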


>
> And assuming the data is important, the OP must also think about
> backup solutions. But that's worth its own thread.

Actually I am lucky. The data will *not* be backed up on tape. You might
think this strange but this is meant to be a store for jobs that are
staging or running. So people are expected to remove data to more secure
storage in ~10 days. Worst case we take a 10-day hit, which is OK for our
scientific computing needs.

>
> I was discussing raid a little more generally, since the OP was asking
> about mdadm and LVM, while I think you were talking more about
> hardware raid since he has hardware raid devices already (mdadm raid
> might be better value for money than hardware raid for most setups -
> but not if you already have the hardware!). Just a difference of
> emphasis, really.

Absolutely. I appreciate the improvement in my overall RAID
understanding.


--
Rahul
From: Aragorn on
On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
identifying as David Brown wrote...

> Aragorn wrote:
>
>> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
>> identifying as David Brown wrote...
>>
>>> You will probably find it is cheaper to add another 4 GB system ram
>>> to the host than another 128 MB to the hardware raid controller.
>>
>> Well, that all depends... On a system that uses ECC registered RAM -
>> such as a genuine server - the cost of adding more RAM may be quite
>> daunting.
>>
>> On the other hand, I'm not so sure whether a hardware RAID adapter
>> can be retrofitted with more memory than it already has out of the
>> box.
>
> "Daunting" is less than "impossible" :-) Of course, it depends on
> your setup and your needs - there are no fixed answers (I think that's
> been mentioned before...)

Yeah... Most adapters I know of come with either 128 MB or 256 MB. I'd
have to check the specs for my Adaptec SAS RAID adapter again, but my
U320 RAID adapter - also from Adaptec - has only 128 MB.

The sad news is that the battery packs are often optional, so you need
to pay attention when ordering or buying such an adapter card.

>>> And assuming the data is important, the OP must also think about
>>> backup solutions. But that's worth its own thread.
>>
>> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
>> shalt make backups, and lots of them too!" :p
>
> The zeroth rule, which is often forgotten (until you learn the hard
> way!), is "thou shalt make a plan for restoring from backups, test
> that plan, document that plan, and find a way to ensure that all
> backups are tested and restoreable in this way". /Then/ you can start
> making your actual backups!

Well, so far I've always used the tested and tried approach of tar'ing
in conjunction with bzip2. Can't get any cleaner than that. ;-)
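
Basically a one-liner along these lines (the paths are just an example):

  tar -cjpf /backup/home-$(date +%F).tar.bz2 /home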

> And the second rule is "thou shalt make backups of your backups",
> followed by "thou shalt have backups of critical hardware". (That's
> another bonus of software raid - if your hardware raid card dies, you
> may have to replace it with exactly the same type of card to get your
> raid working again - with mdadm raid, you can use any PC.)

Well, considering that my Big Machine has drained my piggy bank for
about 17'000 Euros worth of hardware, having a duplicate machine is not
really an option. The piggy bank's on a diet now. :-)

I do on the other hand still have a slightly older dual Xeon machine
with 4 GB of RAM and an U320 SCSI RAID 1 (with two 73 GB disks), which
I will be setting up as an emergency replacement server, and to store
additional backups on - I store my other backups on Iomega REV disks.

>> Another thing which must not be overlooked is that the CPU or ASIC on
>> a hardware RAID controller is typically a RISC chip, and so comparing
>> clock speeds would not really give an accurate impression of its
>> performance versus a mainboard processor chip. For instance, a MIPS
>> or Alpha processor running at 800 MHz still outperforms most (single
>> core) 2+ GHz processors.
>
> As already mentioned, "hardware" raid is often done now with a general
> purpose processor rather than an ASIC - and MIPS is a particularly
> popular core for the job.

I'm not sure on the one on my SAS RAID adapter, but I think it's an
Intel RISC processor. It's not a MIPS or an Alpha, that much I am
certain of.

> But while you get a lot more work out of an 800 MHz RISC chip for a given price,
> size or power than you do with an x86, you don't get more for a given
> clock rate. Parity calculations are really just a big stream
> of "xor"'s, and a modern x86 will chew through these as fast as memory
> bandwidth allows. Internally, x86 assembly is mostly converted to
> wide-word RISC-style instructions, so a decently written parity
> function will be as efficient per clock on an x86 as it is on MIPS.

True.

>>>>> I have never heard of a distinction between a "hot spare" that is
>>>>> spinning, and a "standby spare" that is not spinning.
>>>>
>>>> This is quite a common distinction, mind you. There is even a
>>>> "live spare" solution, but to my knowledge this is specific to
>>>> Adaptec - they call it RAID 5E.
>>>>
>>>> In a "live spare" scenario, the spare disk is not used as such but
>>>> is part of the live array, and both data and parity blocks are
>>>> being written to it, but with the distinction that each disk in the
>>>> array will also have empty blocks for the total capacity of a
>>>> standard spare disk. These empty blocks are thus distributed
>>>> across all disks in the array and are used for array reconstruction
>>>> in the event of a disk failure.
>>>
>>> Is there any real advantage of such a setup compared to using raid 6
>>> (in which case, the "empty" blocks are second parity blocks)? There
>>> would be a slightly greater write overhead (especially for small
>>> writes), but that would not be seen by the host if there is enough
>>> cache on the controller.
>>
>> Well, the advantage of this set-up is that you don't need to replace
>> a failing disk, since there is already sufficient diskspace left
>> blank on all disks in the array, and so the array can recreate itself
>> using that extra blank diskspace. This is of course all nice in
>> theory, but in practice one would eventually replace the disk anyway.
>
> The same is true of raid6 - if one disk dies, the degraded raid6 is
> very similar to raid5 until you replace the disk.
>
> And I still don't see any significant advantage of spreading the
>> holes around the drives rather than having them all on the one drive
> (i.e., a normal hot spare). The rebuild still has to do as many reads
> and writes, and takes as long. The rebuild writes will be spread over
> all the disks rather than just on the one disk, but I can't see any
> advantage in that.

Well, the idea is simply to give the spare disk some exercise, i.e. to
use it as part of the live array while still offering the extra
redundancy of a spare. So in the event of a failure, the array can be
fully rebuilt without the need to replace the broken drive, rather than
staying in degraded mode until the broken drive is replaced.

> I suppose read performance, especially for many parallel small reads,
> will be slightly higher than for a normal hot spare, since you have
> more disks with active data and therefore higher chances of
> parallelising these accesses. But you get the same advantage with
> raid6.

Yes, but RAID 6 would be slower for small writes, and if one of the
drives fails, the array stays in degraded mode (since it considers
itself to be a RAID 6, not a RAID 5E).

>>> It looks like we agree on most things here - we just had a little
>>> difference on the areas we wrote about (specific information for the
>>> OP, or more general RAID discussions), and a few small differences
>>> in terminology.
>>
>> Well, you've made me reconsider my usage of RAID 5, though. I am now
>> contemplating on using two RAID 10 arrays instead of two RAID 5
>> arrays, since each of the arrays has four disks. They are both
>> different arrays, though. They're connected to the same RAID
>> controller but the first array is comprised of 147 GB 15k Hitachi SAS
>> disks and the second array is comprised of 1 TB 7.2k Western Digital
>> RAID Edition SATA-2 disks on a hotswap backplane.
>>
>> I had always considered RAID 5 to be the best trade-off, considering
>> the loss of diskspace involved versus the retail price of the hard
>> disks - especially the SAS disks - but considering that the SAS array
>> will be used to house the main systems in a virtualized set-up (on
>> Xen) and will probably endure the most small and random writes, RAID
>> 10 might actually be a better solution. The cost of the lost
>> diskspace on the SATA-2 disks is smaller since this type of disks is
>> far less expensive than SAS.
>
> I gather that raid 10 (hardware or software) is now often considered a
> better choice - raid 5 is often viewed as unreliable due to the risks
> of a second failure during rebuilds, which are increasingly
> time-consuming with larger disks. Where practical, I think
> mdadm "far" raid 10 is the optimal if you are happy with losing 50% of
> your disk space - it is faster than other redundant setups in many
> situations, and has a great deal of flexibility.

Well, 50% is the minimum storage capacity one loses when using any kind
of mirroring, be it RAID 1, RAID 10, RAID 0+1, RAID 50 or whatever.

> If you want more redundancy, you can use double mirrors for 33% disk
> space and still have full speed.

Yes, but that's a set-up which, due to understandable financial
considerations, would be reserved only for the corporate world. Many
people already consider me certifiably insane for having spent that
much money - 17'000 Euro, as I wrote higher up - on a privately owned
computer system. But then again, for the intended purposes, I need
fast and reliable hardware and a lot of horsepower. :-)

In the case of the OP, on the other hand, 45 SAS disks of 300 GB each
and three SAS RAID storage enclosures don't exactly seem like an
affordable buy, so I take it he intends to use it for a business.

That, or he's a maniac like me. :p

> If you have the chance, it would be very nice to try out some
> different arrangements and see which is fastest in reality, not just
> in theory!

Ahh, but whole books have been written about such tests, and it still
always boils down to "What are you planning to do with it?" For
instance, a database server has different needs from a mailserver, and
this has different needs from a fileserver or workstation, etc. ;-)

> The other option is to go for a file system that handles multiple
> disks and redundancy directly - ZFS is the best known, with btrfs the
> experimental choice on Linux.

I don't think Btrfs is considered stable enough yet.  ZFS is of course
a great choice, but its CDDL licence is incompatible with the GPL, so it
cannot be linked into the Linux kernel.  If there is a "filesystem in
userspace" implementation of it, then it would of course be possible to
legally use ZFS on a GNU/Linux system.

I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while,
which uses ZFS, albeit that ZFS was not my reason for being interested
in the project. I was more interested in the fact that it supports
both Solaris Zones - of which the Linux equivalents are OpenVZ and
VServer - and running paravirtualized on top of Xen.

Doing that with OpenVZ requires the use of a 2.6.27 kernel which is
still considered unstable by the OpenVZ developers, and doing that with
Vserver is as good as impossible, since they're still using a 2.6.16
kernel, and you can't apply the (now obsolete) Xen patches to that
because those are for 2.6.18. And thus, running VServer in a Xen
virtual machine would require that you run it via hardware
virtualization rather than paravirtualized.

The big problem with NexentaOS however is that it's based on Ubuntu and
that it uses binary .deb packages, whereas I would rather have a Gentoo
approach, where you can build the whole thing from sources without
having to go "the LFS way".

Oh well, I've delayed the whole thing until the weekend, so I still have
plenty of time to think things over. ;-)

>> See, this is one of the advantages of Usenet. People get to share
>> not only knowledge but also differing views and strategies, and in
>> the end, everyone will have gleaned something useful. ;-)
>
> Absolutely - that's also why it's good to have a general discussion
> every now and again, rather than just answering a poster's questions.
> Good questions (such as in this thread) inspire an exchange of
> information for many people's benefits (I've learned things here too).

Maybe we should invite some politicians over to Usenet. Then *they*
might possibly learn something about the real world as well. :p

--
*Aragorn*
(registered GNU/Linux user #223157)
From: Aragorn on
On Wednesday 20 January 2010 16:25 in comp.os.linux.misc, somebody
identifying as Rahul wrote...

> David Brown <david(a)westcontrol.removethisbit.com> wrote in
> news:4b570006$0$6251$8404b019(a)news.wineasy.se:
>
> Thanks both Aragorn and David! This is some of the most comprehensive
> advice about RAID issues that I've ever gotten. If you're ever in
> Madison, WI, I owe you guys a beer! :)

Unfortunately, that offer will have to remain academic but it is
nevertheless appreciated. ;-)

>> Yes - perhaps the OP will give more details on his expected usage
>> patterns. There are many other factors we haven't discussed that can
>> affect the "best" setup, such as what requirements he has for future
>> expansion.
>
> Sure. More details: It's a mixed bag of I/O, actually. This is part of
> a High Performance Compute Cluster, so a wide variety of codes are in
> use. We have tracked the I/O patterns: some of them have large
> sequential writes, others are dominated by random seeks. Which is why I
> am not really fine-tuning my setup for a particular access pattern but
> going for the best overall performance. RAID5 and RAID10 fit the bill
> it seems, RAID10 even more so. I do have the luxury of excess storage
> right now, so I am convinced I ought to do a RAID10 like you guys
> suggested (at the HW level).
>
> For combining the 3 RAID10's I am still split between LVM and mdadm.
> The performance advantages convince me towards mdadm. But the ease of
> partition resizing etc. make LVM attractive.

Well, if you're only going to be putting "/home" on the array, then LVM
is a moot point. Just set each array up as a RAID 10, possibly with a
spare on each array and format each array with a single partition, and
then you can use /mdadm/ to combine them into a stripeset. ;-)
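
Something along these lines, as a minimal sketch (the device names of
the hardware RAID 10 arrays are placeholders, and the chunk size is
worth experimenting with):

  mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd
  mkfs.xfs /dev/md0
  mount /dev/md0 /home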

--
*Aragorn*
(registered GNU/Linux user #223157)
From: David Brown on
Aragorn wrote:
> On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody
> identifying as David Brown wrote...
>> Aragorn wrote:
>>> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody
>>> identifying as David Brown wrote...

<snip to save a little space>

>>>> And assuming the data is important, the OP must also think about
>>>> backup solutions. But that's worth its own thread.
>>> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou
>>> shalt make backups, and lots of them too!" :p
>> The zeroth rule, which is often forgotten (until you learn the hard
>> way!), is "thou shalt make a plan for restoring from backups, test
>> that plan, document that plan, and find a way to ensure that all
>> backups are tested and restoreable in this way". /Then/ you can start
>> making your actual backups!
>
> Well, so far I've always used the tested and tried approach of tar'ing
> in conjunction with bzip2. Can't get any cleaner than that. ;-)
>

rsync copying is even cleaner - the backup copy is directly accessible.
And when combined with hard link copies in some way (such as
rsnapshot) you can get snapshots.
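
The core of the trick is rsync's --link-dest option - a minimal sketch,
with placeholder paths, of what tools like rsnapshot do after rotating
the snapshot directories:

  rsync -a --delete --link-dest=/backup/daily.1 /home/ /backup/daily.0/

Unchanged files become hard links to yesterday's copy, so each "full"
snapshot only costs the space of the files that actually changed.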

Of course, .tar.bz2 is good too - /if/ you have it automated so that it
is actually done (or you are one of those rare people who can regularly
follow a manual procedure).  It also needs to be saved in a safe and
reliable place - many people have had regular backups saved to tape only
to find later that the tapes were unreadable.  And of course a second
copy needs to be kept in a different place, ideally at a different site.

I know I'm preaching to the choir here, as you said before - but there
may be others in the congregation.

>> And the second rule is "thou shalt make backups of your backups",
>> followed by "thou shalt have backups of critical hardware". (That's
>> another bonus of software raid - if your hardware raid card dies, you
>> may have to replace it with exactly the same type of card to get your
>> raid working again - with mdadm raid, you can use any PC.)
>
> Well, considering that my Big Machine has drained my piggy bank for
> about 17'000 Euros worth of hardware, having a duplicate machine is not
> really an option. The piggy bank's on a diet now. :-)
>

You don't need a duplicate machine - you just need duplicates of any
parts that are important, specific, and may not always be easily
available. There is no need to buy a new machine, but as soon as your
particular choice of hardware raid cards start going out of fashion, buy
a spare. Better still, buy a spare /now/ before the manufacturer
decides to update the firmware in new versions of the card and they
become incompatible with your raid drives. Of course, you can always
restore from backup in an emergency if the worst happens.

> I do on the other hand still have a slightly older dual Xeon machine
> with 4 GB of RAM and an U320 SCSI RAID 1 (with two 73 GB disks), which
> I will be setting up as an emergency replacement server, and to store
> additional backups on - I store my other backups on Iomega REV disks.
>
>>> Another thing which must not be overlooked is that the CPU or ASIC on
>>> a hardware RAID controller is typically a RISC chip, and so comparing
>>> clock speeds would not really give an accurate impression of its
>>> performance versus a mainboard processor chip. For instance, a MIPS
>>> or Alpha processor running at 800 MHz still outperforms most (single
>>> core) 2+ GHz processors.
>> As already mentioned, "hardware" raid is often done now with a general
>> purpose processor rather than an ASIC - and MIPS is a particularly
>> popular core for the job.
>
> I'm not sure on the one on my SAS RAID adapter, but I think it's an
> Intel RISC processor. It's not a MIPS or an Alpha, that much I am
> certain of.
>

Intel haven't made RISC processors for many years (discounting the
Itanium, which is an unlikely choice for a raid processor). They used
to have StrongArms, and long, long ago they had a few other designs, but
I'm pretty certain you don't have an Intel RISC processor on the card.
It also will not be an Alpha - they have not been made for years either
(they were very nice chips until DEC, then HP+Compaq totally screwed
them up, with plenty of encouragement from Intel). Realistic cores
include MIPS in many flavours, PPC, and for more recent designs, perhaps
an ARM of some kind. If the heavy lifting is being done by ASIC logic
rather than the processor core, there is a wider choice of possible cores.

>> But while you get a lot more work out of an 800 MHz RISC chip for a given price,
>> size or power than you do with an x86, you don't get more for a given
>> clock rate. Parity calculations are really just a big stream
>> of "xor"'s, and a modern x86 will chew through these as fast as memory
>> bandwidth allows. Internally, x86 assembly is mostly converted to
>> wide-word RISC-style instructions, so a decently written parity
>> function will be as efficient per clock on an x86 as it is on MIPS.
>
> True.
>
>>>>>> I have never heard of a distinction between a "hot spare" that is
>>>>>> spinning, and a "standby spare" that is not spinning.
>>>>> This is quite a common distinction, mind you. There is even a
>>>>> "live spare" solution, but to my knowledge this is specific to
>>>>> Adaptec - they call it RAID 5E.
>>>>>
>>>>> In a "live spare" scenario, the spare disk is not used as such but
>>>>> is part of the live array, and both data and parity blocks are
>>>>> being written to it, but with the distinction that each disk in the
>>>>> array will also have empty blocks for the total capacity of a
>>>>> standard spare disk. These empty blocks are thus distributed
>>>>> across all disks in the array and are used for array reconstruction
>>>>> in the event of a disk failure.
>>>> Is there any real advantage of such a setup compared to using raid 6
>>>> (in which case, the "empty" blocks are second parity blocks)? There
>>>> would be a slightly greater write overhead (especially for small
>>>> writes), but that would not be seen by the host if there is enough
>>>> cache on the controller.
>>> Well, the advantage of this set-up is that you don't need to replace
>>> a failing disk, since there is already sufficient diskspace left
>>> blank on all disks in the array, and so the array can recreate itself
>>> using that extra blank diskspace. This is of course all nice in
>>> theory, but in practice one would eventually replace the disk anyway.
>> The same is true of raid6 - if one disk dies, the degraded raid6 is
>> very similar to raid5 until you replace the disk.
>>
>> And I still don't see any significant advantage of spreading the
>> holes around the drives rather than having them all on the one drive
>> (i.e., a normal hot spare). The rebuild still has to do as many reads
>> and writes, and takes as long. The rebuild writes will be spread over
>> all the disks rather than just on the one disk, but I can't see any
>> advantage in that.
>
> Well, the idea is simply to give the spare disk some exercise, i.e. to
> use it as part of the live array while still offering the extra
> redundancy of a spare. So in the event of a failure, the array can be
> fully rebuilt without the need to replace the broken drive, rather
> than staying in degraded mode until the broken drive is replaced.
>

The array will be in degraded mode while the rebuild is being done, just
as it would be with raid5 and a hot spare - and it will be equally slow
during the rebuild. So no points there.

In fact, according to Wikipedia, the controller will "compact" the
degraded raid set into a normal raid5, and when you replace the broken
drive it will "uncompact" it into the raid 5E arrangement again. The
"compact" and "uncompact" operations take much longer than a standard
raid5 rebuild.

So all you get here is a marginal increase in the parallelisation of
multiple simultaneous small reads, which you could get anyway with raid6
rather than raid5 with a spare.

>> I suppose read performance, especially for many parallel small reads,
>> will be slightly higher than for a normal hot spare, since you have
>> more disks with active data and therefore higher chances of
>> parallelising these accesses. But you get the same advantage with
>> raid6.
>
> Yes, but RAID 6 would be slower for small writes, and if one of the
> drives fails, the array stays in degraded mode (since it considers
> itself to be a RAID 6, not a RAID 5E).
>

Degraded raid5 and raid6 have varying speeds, depending on whether the
data you access is available directly or must be calculated from the
rest of the stripe and the parity. The same applies to a degraded raid
5E with a broken drive.

You are right that small writes to raid 6 would be slower than to a raid 5E.

>>>> It looks like we agree on most things here - we just had a little
>>>> difference on the areas we wrote about (specific information for the
>>>> OP, or more general RAID discussions), and a few small differences
>>>> in terminology.
>>> Well, you've made me reconsider my usage of RAID 5, though. I am now
>>> contemplating on using two RAID 10 arrays instead of two RAID 5
>>> arrays, since each of the arrays has four disks. They are both
>>> different arrays, though. They're connected to the same RAID
>>> controller but the first array is comprised of 147 GB 15k Hitachi SAS
>>> disks and the second array is comprised of 1 TB 7.2k Western Digital
>>> RAID Edition SATA-2 disks on a hotswap backplane.
>>>
>>> I had always considered RAID 5 to be the best trade-off, considering
>>> the loss of diskspace involved versus the retail price of the hard
>>> disks - especially the SAS disks - but considering that the SAS array
>>> will be used to house the main systems in a virtualized set-up (on
>>> Xen) and will probably endure the most small and random writes, RAID
>>> 10 might actually be a better solution. The cost of the lost
>>> diskspace on the SATA-2 disks is smaller since this type of disks is
>>> far less expensive than SAS.
>> I gather that raid 10 (hardware or software) is now often considered a
>> better choice - raid 5 is often viewed as unreliable due to the risks
>> of a second failure during rebuilds, which are increasingly
>> time-consuming with larger disks. Where practical, I think
>> mdadm "far" raid 10 is the optimal if you are happy with losing 50% of
>> your disk space - it is faster than other redundant setups in many
>> situations, and has a great deal of flexibility.
>
> Well, 50% is the minimum storage capacity one loses when using any kind
> of mirroring, be it RAID 1, RAID 10, RAID 0+1, RAID 50 or whatever.
>
>> If you want more redundancy, you can use double mirrors for 33% disk
>> space and still have full speed.
>
> Yes, but that's a set-up which, due to understandable financial
> considerations, would be reserved only for the corporate world. Many
> people already consider me certifiably insane for having spent that
> much money - 17'000 Euro, as I wrote higher up - on a privately owned
> computer system. But then again, for the intended purposes, I need
> fast and reliable hardware and a lot of horsepower. :-)
>

I'm curious - what is the intended purpose? I think I would have a hard
job spending more than about three or four thousand Euros on a single
system.

> In the event of the OP on the other hand, 45 SAS disks of 300 GB each
> and three SAS RAID storage enclosures also doesn't seem like quite an
> affordable buy, so I take it he intends to use it for a business.
>

It also does not strike me as a high value-for-money system - I can't
help feeling that this is way more bandwidth than you could actually
make use of in the rest of the system, so it would be better to have
fewer larger drives and fewer layers to reduce the latencies.  Spend
the cash saved on even more RAM :-)

45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say
3 GBps since some are hot spares. Ultimately, being a server, this is
going to be pumped out on Ethernet links.  That's a lot of bandwidth -
roughly 24 Gbit/s, enough to saturate a couple of 10 Gbit links.

I have absolutely no real-world experience with these sorts of systems,
and could therefore be totally wrong, but my gut feeling is that the
theoretical numbers will not scale with so many drives - something like
fifteen 1 TB SATA drives would be similar in speed in practice.

> That, or he's a maniac like me. :p
>
>> If you have the chance, it would be very nice to try out some
>> different arrangements and see which is fastest in reality, not just
>> in theory!
>
> Ahh, but whole books have been written about such tests, and it still
> always boils down to "What are you planning to do with it?" For
> instance, a database server has different needs from a mailserver, and
> this has different needs from a fileserver or workstation, etc. ;-)
>

It would still be fun!

>> The other option is to go for a file system that handles multiple
>> disks and redundancy directly - ZFS is the best known, with btrfs the
>> experimental choice on Linux.
>
> I don't think Btrfs is considered stable enough yet.  ZFS is of course
> a great choice, but its CDDL licence is incompatible with the GPL, so
> it cannot be linked into the Linux kernel.  If there is a "filesystem
> in userspace" implementation of it, then it would of course be possible
> to legally use ZFS on a GNU/Linux system.
>

There /is/ a "filesystem in userspace" implementation of ZFS (using
fuse). But it is not feature complete, and not particularly fast.

btrfs is still a risk, and is still missing some features (such as
elegant handling of low free space...), but the potential is there.

> I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while,
> which uses ZFS, albeit that ZFS was not my reason for being interested
> in the project. I was more interested in the fact that it supports
> both Solaris Zones - of which the Linux equivalents are OpenVZ and
> VServer - and running paravirtualized on top of Xen.
>
> Doing that with OpenVZ requires the use of a 2.6.27 kernel which is
> still considered unstable by the OpenVZ developers, and doing that with
> Vserver is as good as impossible, since they're still using a 2.6.16
> kernel, and you can't apply the (now obsolete) Xen patches to that
> because those are for 2.6.18. And thus, running VServer in a Xen
> virtual machine would require that you run it via hardware
> virtualization rather than paravirtualized.
>
> The big problem with NexentaOS however is that it's based on Ubuntu and
> that it uses binary .deb packages, whereas I would rather have a Gentoo
> approach, where you can build the whole thing from sources without
> having to go "the LFS way".
>

Why is it always so hard to get /everything/ you want when building a
system :-(

> Oh well, I've delayed the whole thing until the weekend, so I still have
> plenty of time to think things over. ;-)
>
>>> See, this is one of the advantages of Usenet. People get to share
>>> not only knowledge but also differing views and strategies, and in
>>> the end, everyone will have gleaned something useful. ;-)
>> Absolutely - that's also why it's good to have a general discussion
>> every now and again, rather than just answering a poster's questions.
>> Good questions (such as in this thread) inspire an exchange of
>> information for many people's benefits (I've learned things here too).
>
> Maybe we should invite some politicians over to Usenet. Then *they*
> might possibly learn something about the real world as well. :p
>