From: David Brown on 20 Jan 2010 09:48 Aragorn wrote: > On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody > identifying as David Brown wrote... > >> Aragorn wrote: >> >>> [David Brown wrote:] >>> >>>> But on the other hand, software raid5 can take advantage of large >>>> system memories for cache, and is thus far more likely to have the >>>> required stripe data already in its cache (especially for metadata >>>> and directory areas of the file system, which are commonly accessed >>>> but have small writes). >>> Also correct, but dependent on available system RAM, of course. I'm >>> not sure how much RAM the RAID controllers in my two servers have - I >>> think one has 256 MB and the other one 128 MB - but on a system with, >>> say, 4 GB of system RAM, caching capacity is of course much higher. >> You will probably find it is cheaper to add another 4 GB system ram to >> the host than another 128 MB to the hardware raid controller. > > Well, that all depends... On a system that uses ECC registered RAM - > such as a genuine server - the cost of adding more RAM may be quite > daunting. > > On the other hand, I'm not so sure whether a hardware RAID adapter can > be retrofitted with more memory than it already has out of the box. > "Daunting" is less than "impossible" :-) Of course, it depends on your setup and your needs - there are no fixed answers (I think that's been mentioned before...) >>> On the other hand - and purely in theory, as RAID 5/6 is not the >>> right solution for the OP - a battery-backed hardware RAID controller >>> on a machine hooked up to a UPS should be able to avoid the RAID 5/6 >>> writing hole, since the controller itself has its own processor. I >>> even believe that my Adaptec SAS RAID controller runs a Linux kernel >>> of its own. >> That's often the case these days - "hardware" raid controllers are >> frequently host processors running Linux (or sometimes other systems) >> and software raid. This is especially true of SAN boxes. > > Well, the line between hardware RAID and software RAID is rather blurry > in the event of a modern hardware RAID controller. Sure, it's all > firmware, but there is a software component involved as well, > presumably because of certain efficiencies in scheduling with set-ups > employing multiple disks, as with the nested RAID solutions. > >>>> Make sure your raid controller has batteries, and that the whole >>>> system is on an UPS! >>> If data integrity is important, then I consider a UPS a necessity, >>> even without RAID. ;-) >> And assuming the data is important, the OP must also think about >> backup solutions. But that's worth its own thread. > > Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou > shalt make backups, and lots of them too!" :p > The zeroth rule, which is often forgotten (until you learn the hard way!), is "thou shalt make a plan for restoring from backups, test that plan, document that plan, and find a way to ensure that all backups are tested and restoreable in this way". /Then/ you can start making your actual backups! And the second rule is "thou shalt make backups of your backups", followed by "thou shalt have backups of critical hardware". (That's another bonus of software raid - if your hardware raid card dies, you may have to replace it with exactly the same type of card to get your raid working again - with mdadm raid, you can use any PC.) >>>> [...] a host CPU is perfectly capable of splitting a stripe >>>> into its blocks in a fraction of a microsecond. 
It is also much >>>> faster at doing the parity calculations - the host CPU typically >>>> runs at least ten times as fast as the CPU or ASIC on the raid card. >>> Yes, but the host CPU also has other things to take care off, while >>> the CPU or ASIC on a RAID controller is dedicated to just that one >>> task. >> That's true, but not really relevant for modern CPUs - when you've got >> 4 or 8 cores running at ten times the speed of the raid controller's >> chip, you are not talking about a significant load. > > Well, 8 cores might be a bit of a stretch, and not everyone has quadcore > CPUs yet, either. My Big Machine for instance has two dualcore > Opterons in it, so that makes for four cores in total. (The machine > also has a SAS/SATA RAID controller, so the RAID discussion is moot > here, but I'm just mentioning it.) > > Another thing which must not be overlooked is that the CPU or ASIC on a > hardware RAID controller is typically a RISC chip, and so comparing > clock speeds would not really give an accurate impression of its > performance versus a mainboard processor chip. For instance, a MIPS or > Alpha processor running at 800 MHz still outperforms most (single core) > 2+ GHz processors. > As already mentioned, "hardware" raid is often done now with a general purpose processor rather than an ASIC - and MIPS is a particularly popular core for the job. But while you get a lot more work out of an 800 MHz for a given price, size or power than you do with an x86, you don't get more for a given clock rate. Parity calculations are really just a big stream of "xor"'s, and a modern x86 will chew through these as fast as memory bandwidth allows. Internally, x86 assembly is mostly converted to wide-word RISC-style instructions, so a decently written parity function will be as efficient per clock on an x86 as it is on MIPS. There are plenty of situations where a slower clock but cleaner architecture gives more true speed, especially if latency is more important than throughput, but this isn't one of them. >>>> I have never heard of a distinction between a "hot spare" that is >>>> spinning, and a "standby spare" that is not spinning. >>> This is quite a common distinction, mind you. There is even a "live >>> spare" solution, but to my knowledge this is specific to Adaptec - >>> they call it RAID 5E. >>> >>> In a "live spare" scenario, the spare disk is not used as such but is >>> part of the live array, and both data and parity blocks are being >>> written to it, but with the distinction that each disk in the array >>> will also have empty blocks for the total capacity of a standard >>> spare disk. These empty blocks are thus distributed across all disks >>> in the array and are used for array reconstruction in the event of a >>> disk failure. >> Is there any real advantage of such a setup compared to using raid 6 >> (in which case, the "empty" blocks are second parity blocks)? There >> would be a slightly greater write overhead (especially for small >> writes), but that would not be seen by the host if there is enough >> cache on the controller. > > Well, the advantage of this set-up is that you don't need to replace a > failing disk, since there is already sufficient diskspace left blank on > all disks in the array, and so the array can recreate itself using that > extra blank diskspace. This is of course all nice in theory, but in > practice one would eventually replace the disk anyway. 
> The same is true of raid6 - if one disk dies, the degraded raid6 is very similar to raid5 until you replace the disk. And I still don't see any significant advantage of spreading the holes around the drives rather than having them all on the one drive (i.e., a normal hot spare). The rebuild still has to do as many reads and writes, and takes as long. The rebuild writes will be spread over all the disks rather than just on the one disk, but I can't see any advantage in that. I suppose read performance, especially for many parallel small reads, will be slightly higher than for a normal hot spare, since you have more disks with active data and therefore higher chances of parallelising these accesses. But you get the same advantage with raid6. > In terms of performance, it would be similar to RAID 6 for reads - > because the empty blocks have to be skipped in sequential reads - but > for writing it would be slightly better than RAID 6 since only one set > of parity data per stripe needs to be (re)calculated and (re)written. > > It does of course remain a single-point-of-failure set-up, whereas RAID > 6 offers a two-points-of-failure set-up. > >>>> An "offline spare" is an extra drive that is physically attached, >>>> but not in use automatically - in the event of a failure, it can be >>>> manually assigned to a raid set. This makes sense if you have >>>> several hardware raid sets defined and want to share a single spare, >>>> if the hardware raid cannot support this (mdadm, of course, supports >>>> such a setup with a shared hot spare). >>> Most modern hardware RAID controllers support this. >> OK. >> >> It looks like we agree on most things here - we just had a little >> difference on the areas we wrote about (specific information for the >> OP, or more general RAID discussions), and a few small differences in >> terminology. > > Well, you've made me reconsider my usage of RAID 5, though. I am now > contemplating on using two RAID 10 arrays instead of two RAID 5 arrays, > since each of the arrays has four disks. They are both different > arrays, though. They're connected to the same RAID controller but the > first array is comprised of 147 GB 15k Hitachi SAS disks and the second > array is comprised of 1 TB 7.2k Western Digital RAID Edition SATA-2 > disks on a hotswap backplane. > > I had always considered RAID 5 to be the best trade-off, considering the > loss of diskspace involved versus the retail price of the hard disks - > especially the SAS disks - but considering that the SAS array will be > used to house the main systems in a virtualized set-up (on Xen) and > will probably endure the most small and random writes, RAID 10 might > actually be a better solution. The cost of the lost diskspace on the > SATA-2 disks is smaller since this type of disks is far less expensive > than SAS. > I gather that raid 10 (hardware or software) is now often considered a better choice - raid 5 is often viewed as unreliable due to the risks of a second failure during rebuilds, which are increasingly time-consuming with larger disks. Where practical, I think mdadm "far" raid 10 is the optimal choice if you are happy with losing 50% of your disk space - it is faster than other redundant setups in many situations, and has a great deal of flexibility. If you want more redundancy, you can use double mirrors for 33% disk space and still have full speed. If you have the chance, it would be very nice to try out some different arrangements and see which is fastest in reality, not just in theory!
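If you do get that chance, a quick trial needn't be much more than the following - a sketch only, with the device names purely as placeholders for whatever your disks actually show up as:

   # Four-disk mdadm raid10 using the "far" layout with two copies (f2)
   mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 \
         /dev/sdb /dev/sdc /dev/sdd /dev/sde

   # Watch the initial resync finish, then confirm the layout
   cat /proc/mdstat
   mdadm --detail /dev/md0

   # Very crude sequential-read check - real testing should use your own workload
   hdparm -t /dev/md0

Then repeat with --layout=n2 (the "near" layout), or a plain raid5 on the same disks, and compare under a realistic load rather than just hdparm.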
The other option is to go for a file system that handles multiple disks and redundancy directly - ZFS is the best known, with btrfs the experimental choice on Linux. > See, this is one of the advantages of Usenet. People get to share not > only knowledge but also differing views and strategies, and in the end, > everyone will have gleaned something useful. ;-) > Absolutely - that's also why it's good to have a general discussion every now and again, rather than just answering a poster's questions. Good questions (such as in this thread) inspire an exchange of information for many people's benefits (I've learned things here too).
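(To put that last filesystem option in concrete terms - this is only a sketch, the pool name and device names are made up, and on Linux at this point it means the fuse port of ZFS or a test btrfs volume rather than anything production-ready:)

   # ZFS: a pool of two mirrored pairs - roughly the raid10 idea, plus checksums
   zpool create tank mirror /dev/sdb /dev/sdc mirror /dev/sdd /dev/sde
   zfs create tank/home

   # btrfs: mirrored data and metadata across two disks (still experimental)
   mkfs.btrfs -m raid1 -d raid1 /dev/sdf /dev/sdg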
From: Rahul on 20 Jan 2010 10:25 David Brown <david(a)westcontrol.removethisbit.com> wrote in news:4b570006$0$6251$8404b019(a)news.wineasy.se: Thanks both Aragorn and David! This is some of the most comprehensive advice about RAID issues that I have ever got. If you are ever in Madison, WI, I owe you guys a beer! :) > Aragorn wrote: >> > > Yes - perhaps the OP will give more details on his expected usage > patterns. There are many other factors we haven't discussed that can > affect the "best" setup, such as what requirements he has for future > expansion. Sure. More details: It's a mixed bag of I/O actually. This is a part of a High Performance Compute Cluster. So a wide variety of codes are in use. We have tracked the I/O nature. Some of them have large sequential writes. Others are dominated by random seeks. Which is why I am not really fine-tuning my setup for a particular access pattern but going for the best overall performance. RAID5 and RAID10 fit the bill it seems. RAID10 even more so. I do have the luxury of excess storage right now so I am convinced I ought to do a RAID10 like you guys suggested (at the HW level). For combining the 3 RAID10's I am still split between LVM and mdadm. The performance advantages convince me towards mdadm. But the ease of partition resizing etc. makes LVM attractive. > > And assuming the data is important, the OP must also think about > backup solutions. But that's worth its own thread. Actually I am lucky. The data will *not* be backed up on tape. You might think this strange but this is meant to be a store for jobs that are staging or running. So people are expected to remove data to more secure storage in ~10 days. Worst case we take a 10-day hit, which is OK for our scientific computing needs. > > I was discussing raid a little more generally, since the OP was asking > about mdadm and LVM, while I think you were talking more about > hardware raid since he has hardware raid devices already (mdadm raid > might be better value for money than hardware raid for most setups - > but not if you already have the hardware!). Just a difference of > emphasis, really. Absolutely. I appreciate the improvement in my overall RAID understanding. -- Rahul
From: Aragorn on 20 Jan 2010 11:23 On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody identifying as David Brown wrote... > Aragorn wrote: > >> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody >> identifying as David Brown wrote... >> >>> You will probably find it is cheaper to add another 4 GB system ram >>> to the host than another 128 MB to the hardware raid controller. >> >> Well, that all depends... On a system that uses ECC registered RAM - >> such as a genuine server - the cost of adding more RAM may be quite >> daunting. >> >> On the other hand, I'm not so sure whether a hardware RAID adapter >> can be retrofitted with more memory than it already has out of the >> box. > > "Daunting" is less than "impossible" :-) Of course, it depends on > your setup and your needs - there are no fixed answers (I think that's > been mentioned before...) Yeah... Most adapters I know of come with either 128 MB or 256 MB. I'd have to check the specs for my Adaptec SAS RAID adapter again, but my U320 RAID adapter - also from Adaptec - has only 128 MB. The sad news is that the battery packs are often optional, so you need to pay attention when ordering or buying such an adapter card. >>> And assuming the data is important, the OP must also think about >>> backup solutions. But that's worth its own thread. >> >> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou >> shalt make backups, and lots of them too!" :p > > The zeroth rule, which is often forgotten (until you learn the hard > way!), is "thou shalt make a plan for restoring from backups, test > that plan, document that plan, and find a way to ensure that all > backups are tested and restoreable in this way". /Then/ you can start > making your actual backups! Well, so far I've always used the tested and tried approach of tar'ing in conjunction with bzip2. Can't get any cleaner than that. ;-) > And the second rule is "thou shalt make backups of your backups", > followed by "thou shalt have backups of critical hardware". (That's > another bonus of software raid - if your hardware raid card dies, you > may have to replace it with exactly the same type of card to get your > raid working again - with mdadm raid, you can use any PC.) Well, considering that my Big Machine has drained my piggy bank for about 17'000 Euros worth of hardware, having a duplicate machine is not really an option. The piggy bank's on a diet now. :-) I do on the other hand still have a slightly older dual Xeon machine with 4 GB of RAM and an U320 SCSI RAID 1 (with two 73 GB disks), which I will be setting up as an emergency replacement server, and to store additional backups on - I store my other backups on Iomega REV disks. >> Another thing which must not be overlooked is that the CPU or ASIC on >> a hardware RAID controller is typically a RISC chip, and so comparing >> clock speeds would not really give an accurate impression of its >> performance versus a mainboard processor chip. For instance, a MIPS >> or Alpha processor running at 800 MHz still outperforms most (single >> core) 2+ GHz processors. > > As already mentioned, "hardware" raid is often done now with a general > purpose processor rather than an ASIC - and MIPS is a particularly > popular core for the job. I'm not sure on the one on my SAS RAID adapter, but I think it's an Intel RISC processor. It's not a MIPS or an Alpha, that much I am certain of. 
> But while you get a lot more work out of an 800 MHz for a given price, > size or power than you do with an x86, you don't get more for a given > clock rate. Parity calculations are really just a big stream > of "xor"'s, and a modern x86 will chew through these as fast as memory > bandwidth allows. Internally, x86 assembly is mostly converted to > wide-word RISC-style instructions, so a decently written parity > function will be as efficient per clock on an x86 as it is on MIPS. True. >>>>> I have never heard of a distinction between a "hot spare" that is >>>>> spinning, and a "standby spare" that is not spinning. >>>> >>>> This is quite a common distinction, mind you. There is even a >>>> "live spare" solution, but to my knowledge this is specific to >>>> Adaptec - they call it RAID 5E. >>>> >>>> In a "live spare" scenario, the spare disk is not used as such but >>>> is part of the live array, and both data and parity blocks are >>>> being written to it, but with the distinction that each disk in the >>>> array will also have empty blocks for the total capacity of a >>>> standard spare disk. These empty blocks are thus distributed >>>> across all disks in the array and are used for array reconstruction >>>> in the event of a disk failure. >>> >>> Is there any real advantage of such a setup compared to using raid 6 >>> (in which case, the "empty" blocks are second parity blocks)? There >>> would be a slightly greater write overhead (especially for small >>> writes), but that would not be seen by the host if there is enough >>> cache on the controller. >> >> Well, the advantage of this set-up is that you don't need to replace >> a failing disk, since there is already sufficient diskspace left >> blank on all disks in the array, and so the array can recreate itself >> using that extra blank diskspace. This is of course all nice in >> theory, but in practice one would eventually replace the disk anyway. > > The same is true of raid6 - if one disk dies, the degraded raid6 is > very similar to raid5 until you replace the disk. > > And I still don't see any significant advantage of spreading the > wholes around the drives rather than having them all on the one drive > (i.e., a normal hot spare). The rebuild still has to do as many reads > and writes, and takes as long. The rebuild writes will be spread over > all the disks rather than just on the one disk, but I can't see any > advantage in that. Well, the idea is simply to give the spare disk some exercise, i.e. to use it as part of the live array while still offering the extra redundancy of a spare. So in the event of a failure, the array can be fully rebuilt without the need to replace the broken drive, as opposed to that the array would stay in degraded mode until the broken drive is replaced. > I suppose read performance, especially for many parallel small reads, > will be slightly higher than for a normal hot spare, since you have > more disks with active data and therefore higher chances of > parallelising these accesses. But you get the same advantage with > raid6. Yes, but RAID 6 would be slower for small writes, and if one of the drives fails, the array stays in degraded mode (since it considers itself to be a RAID 6, not a RAID 5E). >>> It looks like we agree on most things here - we just had a little >>> difference on the areas we wrote about (specific information for the >>> OP, or more general RAID discussions), and a few small differences >>> in terminology. >> >> Well, you've made me reconsider my usage of RAID 5, though. 
I am now >> contemplating on using two RAID 10 arrays instead of two RAID 5 >> arrays, since each of the arrays has four disks. They are both >> different arrays, though. They're connected to the same RAID >> controller but the first array is comprised of 147 GB 15k Hitachi SAS >> disks and the second array is comprised of 1 TB 7.2k Western Digital >> RAID Edition SATA-2 disks on a hotswap backplane. >> >> I had always considered RAID 5 to be the best trade-off, considering >> the loss of diskspace involved versus the retail price of the hard >> disks - especially the SAS disks - but considering that the SAS array >> will be used to house the main systems in a virtualized set-up (on >> Xen) and will probably endure the most small and random writes, RAID >> 10 might actually be a better solution. The cost of the lost >> diskspace on the SATA-2 disks is smaller since this type of disks is >> far less expensive than SAS. > > I gather that raid 10 (hardware or software) is now often considered a > better choice - raid 5 is often viewed as unreliable due to the risks > of a second failure during rebuilds, which are increasingly > time-consuming with larger disks. Where practical, I think > mdadm "far" raid 10 is the optimal if you are happy with losing 50% of > your disk space - it is faster than other redundant setups in many > situations, and has a great deal of flexibility. Well, 50% is the minimum storage capacity one loses when using any kind of mirroring, be it RAID 1, RAID 10, RAID 0+1, RAID 50 or whatever. > If you want more redundancy, you can use double mirrors for 33% disk > space and still have full speed. Yes, but that's a set-up which, due to understandable financial considerations, would be reserved only for the corporate world. Many people already consider me certifiably insane for having spent that much money - 17'000 Euro, as I wrote higher up - on a privately owned computer system. But then again, for the intended purposes, I need fast and reliable hardware and a lot of horsepower. :-) In the event of the OP on the other hand, 45 SAS disks of 300 GB each and three SAS RAID storage enclosures also doesn't seem like quite an affordable buy, so I take it he intends to use it for a business. That, or he's a maniac like me. :p > If you have the chance, it would be very nice to try out some > different arrangements and see which is fastest in reality, not just > in theory! Ahh, but whole books have been written about such tests, and it still always boils down to "What are you planning to do with it?" For instance, a database server has different needs from a mailserver, and this has different needs from a fileserver or workstation, etc. ;-) > The other option is to go for a file system that handles multiple > disks and redundancy directly - ZFS is the best known, with btrfs the > experimental choice on Linux. I don't think Btrfs is already considered stable enough. ZFS is of course a great choice, but the GPL forbids linking ZFS into the Linux kernel. If there is a "filesystem in userspace" implementation of it, then it would of course be possible to legally use ZFS on a GNU/Linux system. I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while, which uses ZFS, albeit that ZFS was not my reason for being interested in the project. I was more interested in the fact that it supports both Solaris Zones - of which the Linux equivalents are OpenVZ and VServer - and running paravirtualized on top of Xen. 
Doing that with OpenVZ requires the use of a 2.6.27 kernel which is still considered unstable by the OpenVZ developers, and doing that with Vserver is as good as impossible, since they're still using a 2.6.16 kernel, and you can't apply the (now obsolete) Xen patches to that because those are for 2.6.18. And thus, running VServer in a Xen virtual machine would require that you run it via hardware virtualization rather than paravirtualized. The big problem with NexentaOS however is that it's based on Ubuntu and that it uses binary .deb packages, whereas I would rather have a Gentoo approach, where you can build the whole thing from sources without having to go "the LFS way". Oh well, I've relayed the whole thing for the weekend, so I still have plenty of time to think things over. ;-) >> See, this is one of the advantages of Usenet. People get to share >> not only knowledge but also differing views and strategies, and in >> the end, everyone will have gleaned something useful. ;-) > > Absolutely - that's also why it's good to have a general discussion > every now and again, rather than just answering a poster's questions. > Good questions (such as in this thread) inspire an exchange of > information for many people's benefits (I've learned things here too). Maybe we should invite some politicians over to Usenet. Then *they* might possibly learn something about the real world as well. :p -- *Aragorn* (registered GNU/Linux user #223157)
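P.S.: The tar-and-bzip2 routine I mentioned above amounts to nothing fancier than, say, the following - the paths and naming scheme are just an example, not what I actually use:

   f=/backup/home-$(date +%Y%m%d).tar.bz2
   tar -cjpf "$f" /home        # bzip2-compressed archive, permissions preserved
   tar -tjf "$f" > /dev/null   # cheap sanity check: list the whole archive back out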
From: Aragorn on 20 Jan 2010 13:23 On Wednesday 20 January 2010 16:25 in comp.os.linux.misc, somebody identifying as Rahul wrote... > David Brown <david(a)westcontrol.removethisbit.com> wrote in > news:4b570006$0$6251$8404b019(a)news.wineasy.se: > > Thanks both Aragorn and David! This is one of the most comprehensive > advice about RAID issues that I ever got. If you were ever in Madison, > WI I owe you guys a beer! :) Unfortunately, that offer will have to remain academic but it is nevertheless appreciated. ;-) >> Yes - perhaps the OP will give more details on his expected usage >> patterns. There are many other factors we haven't discussed that can >> affect the "best" setup, such as what requirements he has for future >> expansion. > > Sure. More details: Its a mixed bad of I/O actually. This is a part of > a High Performance Compute Cluster. So a wide variety of codes are in > use. We have tracked I/O nature. Some of them have large sequential > writes. Others are dominated by random seeks. Which is why I am not > really fine tuning my setup for a particular access pattern but going > for the best overall performance. RAID5 and RAID10 fit the bill it > seems. RAID10 even more so. I do have the luxary of excess storage > right now so I am convinced I ought to do a RAID10 like you guys > suggested (at the HW level). > > For combining the 3 RAID10's I am still split between LVM and mdadm. > The performance advantages convince me towards mdadm. But the ease of > partition resizing etc. make LVM attractive. Well, if you're only going to be putting "/home" on the array, then LVM is a moot point. Just set each array up as a RAID 10, possibly with a spare on each array and format each array with a single partition, and then you can use /mdadm/ to combine them into a stripeset. ;-) -- *Aragorn* (registered GNU/Linux user #223157)
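For the record, the two routes being weighed up would look roughly like this - a sketch only, and the /dev/sdb, /dev/sdc and /dev/sdd names are pure assumptions for however the three hardware RAID 10 arrays actually appear on the host:

   # Route 1: an mdadm stripeset (raid0) across the three arrays
   mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=3 \
         /dev/sdb /dev/sdc /dev/sdd
   mkfs.xfs /dev/md0                # or ext3/ext4, whatever suits the workload

   # Route 2: the LVM alternative - a striped logical volume, easier to resize later
   pvcreate /dev/sdb /dev/sdc /dev/sdd
   vgcreate vg_scratch /dev/sdb /dev/sdc /dev/sdd
   lvcreate -i 3 -I 256 -l 100%FREE -n home vg_scratch
   mkfs.xfs /dev/vg_scratch/home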
From: David Brown on 20 Jan 2010 17:59
Aragorn wrote: > On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody > identifying as David Brown wrote... >> Aragorn wrote: >>> On Wednesday 20 January 2010 14:07 in comp.os.linux.misc, somebody >>> identifying as David Brown wrote... <snip to save a little space> >>>> And assuming the data is important, the OP must also think about >>>> backup solutions. But that's worth its own thread. >>> Ahh, but that is the First Rule in the Bible of any sysadmin: "Thou >>> shalt make backups, and lots of them too!" :p >> The zeroth rule, which is often forgotten (until you learn the hard >> way!), is "thou shalt make a plan for restoring from backups, test >> that plan, document that plan, and find a way to ensure that all >> backups are tested and restoreable in this way". /Then/ you can start >> making your actual backups! > > Well, so far I've always used the tested and tried approach of tar'ing > in conjunction with bzip2. Can't get any cleaner than that. ;-) > rsync copying is even cleaner - the backup copy is directly accessible. And when combined with hard link copies in some way (such as rsnapshot) you can get snapshots. Of course, .tar.bz2 is good too - /if/ you have it automated so that it is actually done (or you are one of these rare people that can regularly follow a manual procedure). It also needs to be saved in a safe and reliable place - many people have had regular backups saved to tape only to find later that the tapes were unreadable. And of course it needs to be saved again, in a different place and stored at a different site. I know I'm preaching to the choir here, as you said before - but there may be others in the congregation. >> And the second rule is "thou shalt make backups of your backups", >> followed by "thou shalt have backups of critical hardware". (That's >> another bonus of software raid - if your hardware raid card dies, you >> may have to replace it with exactly the same type of card to get your >> raid working again - with mdadm raid, you can use any PC.) > > Well, considering that my Big Machine has drained my piggy bank for > about 17'000 Euros worth of hardware, having a duplicate machine is not > really an option. The piggy bank's on a diet now. :-) > You don't need a duplicate machine - you just need duplicates of any parts that are important, specific, and may not always been easily available. There is no need to buy a new machine, but as soon as your particular choice of hardware raid cards start going out of fashion, buy a spare. Better still, buy a spare /now/ before the manufacturer decides to update the firmware in new versions of the card and they become incompatible with your raid drives. Of course, you can always restore from backup in an emergency if the worst happens. > I do on the other hand still have a slightly older dual Xeon machine > with 4 GB of RAM and an U320 SCSI RAID 1 (with two 73 GB disks), which > I will be setting up as an emergency replacement server, and to store > additional backups on - I store my other backups on Iomega REV disks. > >>> Another thing which must not be overlooked is that the CPU or ASIC on >>> a hardware RAID controller is typically a RISC chip, and so comparing >>> clock speeds would not really give an accurate impression of its >>> performance versus a mainboard processor chip. For instance, a MIPS >>> or Alpha processor running at 800 MHz still outperforms most (single >>> core) 2+ GHz processors. 
>> As already mentioned, "hardware" raid is often done now with a general >> purpose processor rather than an ASIC - and MIPS is a particularly >> popular core for the job. > > I'm not sure on the one on my SAS RAID adapter, but I think it's an > Intel RISC processor. It's not a MIPS or an Alpha, that much I am > certain of. > Intel haven't made RISC processors for many years (discounting the Itanium, which is an unlikely choice for a raid processor). They used to have StrongArms, and long, long ago they had a few other designs, but I'm pretty certain you don't have an Intel RISC processor on the card. It also will not be an Alpha - they have not been made for years either (they were very nice chips until DEC, then HP+Compaq totally screwed them up, with plenty of encouragement from Intel). Realistic cores include MIPS in many flavours, PPC, and for more recent designs, perhaps an ARM of some kind. If the heavy lifting is being done by ASIC logic rather than the processor core, there is a wider choice of possible cores. >> But while you get a lot more work out of an 800 MHz for a given price, >> size or power than you do with an x86, you don't get more for a given >> clock rate. Parity calculations are really just a big stream >> of "xor"'s, and a modern x86 will chew through these as fast as memory >> bandwidth allows. Internally, x86 assembly is mostly converted to >> wide-word RISC-style instructions, so a decently written parity >> function will be as efficient per clock on an x86 as it is on MIPS. > > True. > >>>>>> I have never heard of a distinction between a "hot spare" that is >>>>>> spinning, and a "standby spare" that is not spinning. >>>>> This is quite a common distinction, mind you. There is even a >>>>> "live spare" solution, but to my knowledge this is specific to >>>>> Adaptec - they call it RAID 5E. >>>>> >>>>> In a "live spare" scenario, the spare disk is not used as such but >>>>> is part of the live array, and both data and parity blocks are >>>>> being written to it, but with the distinction that each disk in the >>>>> array will also have empty blocks for the total capacity of a >>>>> standard spare disk. These empty blocks are thus distributed >>>>> across all disks in the array and are used for array reconstruction >>>>> in the event of a disk failure. >>>> Is there any real advantage of such a setup compared to using raid 6 >>>> (in which case, the "empty" blocks are second parity blocks)? There >>>> would be a slightly greater write overhead (especially for small >>>> writes), but that would not be seen by the host if there is enough >>>> cache on the controller. >>> Well, the advantage of this set-up is that you don't need to replace >>> a failing disk, since there is already sufficient diskspace left >>> blank on all disks in the array, and so the array can recreate itself >>> using that extra blank diskspace. This is of course all nice in >>> theory, but in practice one would eventually replace the disk anyway. >> The same is true of raid6 - if one disk dies, the degraded raid6 is >> very similar to raid5 until you replace the disk. >> >> And I still don't see any significant advantage of spreading the >> wholes around the drives rather than having them all on the one drive >> (i.e., a normal hot spare). The rebuild still has to do as many reads >> and writes, and takes as long. The rebuild writes will be spread over >> all the disks rather than just on the one disk, but I can't see any >> advantage in that. 
> > Well, the idea is simply to give the spare disk some exercise, i.e. to > use it as part of the live array while still offering the extra > redundancy of a spare. So in the event of a failure, the array can be > fully rebuilt without the need to replace the broken drive, as opposed > to that the array would stay in degraded mode until the broken drive is > replaced. > The array will be in degraded mode while the rebuild is being done, just like if it were raid5 with a hot spare - and it will be equally slow during the rebuild. So no points there. In fact, according to wikipedia, the controller will "compact" the degraded raid set into a normal raid5, and when you replace the broken drive it will "uncompact" it into the raid 5E arrangement again. The "compact" and "uncompact" operations take much longer than a standard raid5 rebuild. So all you get here is a marginal increase in the parallelisation of multiple simultaneous small reads, which you could get anyway with raid6 rather than raid5 with a spare. >> I suppose read performance, especially for many parallel small reads, >> will be slightly higher than for a normal hot spare, since you have >> more disks with active data and therefore higher chances of >> parallelising these accesses. But you get the same advantage with >> raid6. > > Yes, but RAID 6 would be slower for small writes, and if one of the > drives fails, the array stays in degraded mode (since it considers > itself to be a RAID 6, not a RAID 5E). > Degraded raid5 and raid6 have varying speeds, depending on whether the data you access is available directly or must be calculated from the rest of the stripe and the parity. The same applies to a degraded raid 5E with a broken drive. You are right that small writes to raid 6 would be slower than to a raid 5E. >>>> It looks like we agree on most things here - we just had a little >>>> difference on the areas we wrote about (specific information for the >>>> OP, or more general RAID discussions), and a few small differences >>>> in terminology. >>> Well, you've made me reconsider my usage of RAID 5, though. I am now >>> contemplating on using two RAID 10 arrays instead of two RAID 5 >>> arrays, since each of the arrays has four disks. They are both >>> different arrays, though. They're connected to the same RAID >>> controller but the first array is comprised of 147 GB 15k Hitachi SAS >>> disks and the second array is comprised of 1 TB 7.2k Western Digital >>> RAID Edition SATA-2 disks on a hotswap backplane. >>> >>> I had always considered RAID 5 to be the best trade-off, considering >>> the loss of diskspace involved versus the retail price of the hard >>> disks - especially the SAS disks - but considering that the SAS array >>> will be used to house the main systems in a virtualized set-up (on >>> Xen) and will probably endure the most small and random writes, RAID >>> 10 might actually be a better solution. The cost of the lost >>> diskspace on the SATA-2 disks is smaller since this type of disks is >>> far less expensive than SAS. >> I gather that raid 10 (hardware or software) is now often considered a >> better choice - raid 5 is often viewed as unreliable due to the risks >> of a second failure during rebuilds, which are increasingly >> time-consuming with larger disks. Where practical, I think >> mdadm "far" raid 10 is the optimal if you are happy with losing 50% of >> your disk space - it is faster than other redundant setups in many >> situations, and has a great deal of flexibility. 
> Well, 50% is the minimum storage capacity one loses when using any kind > of mirroring, be it RAID 1, RAID 10, RAID 0+1, RAID 50 or whatever. > >> If you want more redundancy, you can use double mirrors for 33% disk >> space and still have full speed. > > Yes, but that's a set-up which, due to understandable financial > considerations, would be reserved only for the corporate world. Many > people already consider me certifiably insane for having spent that > much money - 17'000 Euro, as I wrote higher up - on a privately owned > computer system. But then again, for the intended purposes, I need > fast and reliable hardware and a lot of horsepower. :-) > I'm curious - what is the intended purpose? I think I would have a hard job spending more than about three or four thousand Euros on a single system. > In the event of the OP on the other hand, 45 SAS disks of 300 GB each > and three SAS RAID storage enclosures also doesn't seem like quite an > affordable buy, so I take it he intends to use it for a business. > It also does not strike me as a high value-for-money system - I can't help feeling that this is way more bandwidth than you could actually make use of in the rest of the system, so it would be better to have fewer, larger drives and fewer layers to reduce the latencies. Spend the cash saved on even more RAM :-) 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say 3 GBps since some are hot spares. Ultimately, being a server, this is going to be pumped out on Ethernet links. That's a lot of bandwidth - it would effectively saturate two or three 10 Gbit links. I have absolutely no real-world experience with these sorts of systems, and could therefore be totally wrong, but my gut feeling is that the theoretical numbers will not scale with so many drives - something like fifteen 1 TB SATA drives would be similar in speed in practice. > That, or he's a maniac like me. :p > >> If you have the chance, it would be very nice to try out some >> different arrangements and see which is fastest in reality, not just >> in theory! > > Ahh, but whole books have been written about such tests, and it still > always boils down to "What are you planning to do with it?" For > instance, a database server has different needs from a mailserver, and > this has different needs from a fileserver or workstation, etc. ;-) > It would still be fun! >> The other option is to go for a file system that handles multiple >> disks and redundancy directly - ZFS is the best known, with btrfs the >> experimental choice on Linux. > > I don't think Btrfs is already considered stable enough. ZFS is of > course a great choice, but the GPL forbids linking ZFS into the Linux > kernel. If there is a "filesystem in userspace" implementation of it, > then it would of course be possible to legally use ZFS on a GNU/Linux > system. > There /is/ a "filesystem in userspace" implementation of ZFS (using fuse). But it is not feature complete, and not particularly fast. btrfs is still a risk, and is still missing some features (such as elegant handling of low free space...), but the potential is there. > I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a while, > which uses ZFS, albeit that ZFS was not my reason for being interested > in the project. I was more interested in the fact that it supports > both Solaris Zones - of which the Linux equivalents are OpenVZ and > VServer - and running paravirtualized on top of Xen.
> Doing that with OpenVZ requires the use of a 2.6.27 kernel which is > still considered unstable by the OpenVZ developers, and doing that with > Vserver is as good as impossible, since they're still using a 2.6.16 > kernel, and you can't apply the (now obsolete) Xen patches to that > because those are for 2.6.18. And thus, running VServer in a Xen > virtual machine would require that you run it via hardware > virtualization rather than paravirtualized. > > The big problem with NexentaOS however is that it's based on Ubuntu and > that it uses binary .deb packages, whereas I would rather have a Gentoo > approach, where you can build the whole thing from sources without > having to go "the LFS way". > Why is it always so hard to get /everything/ you want when building a system? :-( > Oh well, I've relayed the whole thing for the weekend, so I still have > plenty of time to think things over. ;-) > >>> See, this is one of the advantages of Usenet. People get to share >>> not only knowledge but also differing views and strategies, and in >>> the end, everyone will have gleaned something useful. ;-) >> Absolutely - that's also why it's good to have a general discussion >> every now and again, rather than just answering a poster's questions. >> Good questions (such as in this thread) inspire an exchange of >> information for many people's benefits (I've learned things here too). > > Maybe we should invite some politicians over to Usenet. Then *they* > might possibly learn something about the real world as well. :p
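(Footnote, to make the rsync-and-hard-links idea from earlier in this post concrete - a minimal sketch with made-up paths, and essentially what rsnapshot automates for you:)

   src=/home
   dst=/backup/home
   today=$(date +%Y-%m-%d)
   # Unchanged files are hard-linked against the previous snapshot, so they cost
   # almost no extra space; changed files are copied in full.
   # (On the very first run there is no "latest" yet - rsync just warns and copies everything.)
   rsync -a --delete --link-dest="$dst/latest" "$src/" "$dst/$today/"
   rm -f "$dst/latest"
   ln -s "$today" "$dst/latest"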