From: David Brown on 19 Jan 2010 15:37

Rahul wrote:
> David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in
> news:BtOdnakm8taiUsnWnZ2dnUVZ8qOdnZ2d(a)lyse.net:
>
> Thanks David!
>
>> Rahul wrote:
>>
>> LVM is for logical volume management, mdadm is for administering multiple disk setups (i.e., software raid). LVM /can/ do basic striping, in that if you have two physical volumes allocated to the same volume group, then a logical volume can be striped across the two physical volumes. As another poster has said, you won't notice a performance difference between striping via LVM or mdadm. But you
>
> Will putting LVM on top of mdadm slow things down? Or does LVM not have a significant performance penalty?

LVM does have a performance penalty, but it is not normally significant. If you have a number of logical partitions which you then grow a number of times, you end up with the actual physical blocks of the partitions rather scattered across the disk(s), which may impact performance for streaming or large files. The flexibility you get is normally worth the slight cost (IMHO).

>> My recommendation is that you use mdadm to create a raid from the raw drives or partitions on the drives, and if you want the volume management features of LVM (I find it very useful), put LVM on top of mdadm raid.
>
> This is exactly what I was trying to do. But LVM asks "stripe" or "no stripe". That I wasn't sure about.
>
>> As for the type of raid to use, that depends on the number of disks you have and the redundancy you want. raid5 is well-known to be slower for writing, especially for smaller writes, and it can be risky for large disks in critical applications
>
> Maybe if I explain my situation you can have some more comments.
>
> I have 3 physical "storage boxes" (MD-1000's from Dell). Each takes 15 SAS 15k drives of 300 GB each, i.e. I have a total of 45 drives of 300 GB each. Redundancy is important but not critical. Performance is more important.
>
> My original plan was to split each box into two RAID5 arrays of 7 disks each and leave 1 as a hot spare. Thus I get 6 RAID5 arrays in all. They are visible as /dev/sdb, /dev/sdc etc., but I want to mount a single /home on it. That's where I introduced LVM. But then LVM again introduces a striping option. Should I be striping or not?

Don't do any striping with LVM - set up your raid arrays (with hardware raid and/or mdadm) until you have a single "disk", and put LVM on that.

> That's where I am confused about what my best option is. It's hard to balance redundancy, performance and disk capacity.
>
> Any other creative options that come to mind?
>
>> (since rebuilding takes so long, and wears the other disks). Mirroring is safer, and mdadm can happily do a raid10 (roughly a stripe of mirrors) on any number of disks for high speed and mirrored redundancy.
>>
>> Booting from raids is complicated, but not as difficult as suggested
>
> Luckily I don't have to go down that path; I have a separate drive to boot from.
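To illustrate the "mdadm underneath, LVM on top" approach, a rough sketch of the commands might look something like this (the device names, volume group name and sizes are just placeholders, not anything specific to your setup):

  # Software raid across raw drives (here a simple 2-disk mirror):
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc

  # LVM layered on top of the md device:
  pvcreate /dev/md0
  vgcreate vg_home /dev/md0
  lvcreate -L 200G -n home vg_home

  # Filesystem and mount:
  mkfs.ext3 /dev/vg_home/home
  mount /dev/vg_home/home /home

The logical volume can be grown later with lvextend (followed by resize2fs), which is where the LVM layer really earns its keep.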
From: David Brown on 19 Jan 2010 16:57

Aragorn wrote:
> On Tuesday 19 January 2010 08:37 in comp.os.linux.misc, somebody
> identifying as Rahul wrote...
>
>> Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj1gta$2hp$5
>> @news.eternal-september.org:
>>
>> Thanks for the great explanation!
>
> Glad you appreciated it. ;-)

Unfortunately, there seem to me to be a number of misconceptions in this post. I freely admit to having more theoretical knowledge from trawling the net, reading mdadm documentation, etc., than personal practical experience - so anyone reading this will have to judge for themselves whether they think I am right, or Aragorn is right. Either way, I hope to give you some things to think about.

>>> Writing to a RAID 5 is slower than writing to a single disk because with each write, the parity block must be updated, which means calculation of the parity data and writing that parity data to the pertaining disk.
>>
>> This is where I get confused. Is writing to a RAID5 slower than a single disk irrespective of how many disks I throw at the RAID5?
>
> Normally, yes, although it won't be *much* slower. But there is some overhead in the calculation of the parity, yes. This is why RAID 6 is even slower during writes: it stores *two* parity blocks per data segment (and as such, it requires a minimum of 4 disks).

Writing to RAID5 (or RAID6) /may/ be slower than writing to a single disk - or it may be much faster (closer to RAID0 speeds). The actual parity calculations are negligible with modern hardware, whether it be the host CPU or a hardware raid card. What takes time is if existing data has to be read in from the disks in order to calculate the parity - this causes a definite delay. If you are writing a whole stripe, the parity can be calculated directly and the write goes at N-1 speed, as each block in the stripe can be written in parallel. The same applies if the rest of the stripe is already in the cache from before. Thus random writes are slow on RAID5 (and RAID6), but larger block writes run at full speed.

There can also be significant differences between the speed of mdadm software RAID5 and hardware RAID5. With hardware raid, the card can report a small write as "finished" before it has read in the block and written out the data and new parity. This is safe for good hardware with battery backup of its buffers, and gives fast writes (as far as the host is concerned) even for small writes. Software raid5 cannot do this. But on the other hand, software raid5 can take advantage of large system memories for cache, and is thus far more likely to have the required stripe data already in its cache (especially for metadata and directory areas of the file system, which are commonly accessed but have small writes).

This is perhaps also a good time to mention one of the risks of raid5 (and raid6) - the RAID5 write hole. When you are writing a stripe to the disk, the system must write at least two blocks - the data and the updated parity block. These two writes cannot be done atomically, so if you get a system failure at that point, the data and parity may no longer match and the whole stripe effectively becomes silent garbage.

>> I currently have a 7-disk RAID5. Will writing to this be slower than a single disk?
>
> A little, yes. But reading from it will be significantly faster.

Not necessarily - writing will be slower if you do lots of small random writes, but much faster if you write large blocks.
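To make the "whole stripe" point concrete: on a 7-disk RAID5 with, say, 64 KiB chunks (an assumed figure for illustration), a full stripe is 6 data chunks = 384 KiB, and writes of at least that size, aligned to stripe boundaries, avoid the read-modify-write penalty entirely. You can also tell the filesystem about the geometry so it tries to keep its allocations aligned, e.g. for ext3:

  # 64 KiB chunk, 4 KiB filesystem blocks:
  #   stride       = 64 KiB / 4 KiB    = 16 blocks per chunk
  #   stripe-width = 16 * 6 data disks = 96 blocks per full stripe
  mkfs.ext3 -E stride=16,stripe-width=96 /dev/md0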
Also remember that with a 7-disk array under heavy use, you /will/ see a disk failure at some point. Degraded performance of raid 5 is very poor, and rebuilds are slow. Some people believe that the chance of a second disk failure occurring during a rebuild is so large (rebuilds are particularly intensive for the other disks) that raid 5 should be considered unsafe for large arrays. Raid 6 is better since it can survive a second failure, but mirrored raids are safer still.

>> Isn't the parity calculation a fairly fast process especially if one has a hardware based card?

A decent host processor will do the parity calculations /much/ faster than the raid processor on most hardware cards. But the calculations themselves are not the cause of the latency - it's the extra reads that take time.

> Ah, but with a hardware-based RAID things are different. The actual writing process will still be somewhat slower than writing to a single disk, but considering that everything is taken care of by the hardware and that such adapters have a very large cache - often backed by a battery - this will not really have a noticeable performance impact.
>
> With hardware RAID, the kernel treats the entire array as a single disk and will simply write to the array. As far as the operating system is concerned, that's where it ends, and the array takes care of everything else from there, in a delayed fashion, but this is not something you notice as your actual CPU(s) are freed up again as soon as the data is transferred to the memory of the RAID adapter.

True, but see above for more information.

> It is however advised if you have a hardware RAID adapter to disable the write barriers. Write barriers are where the kernel forces the disk drives to flush their caches. Since a hardware RAID adapter must be in total control of the disk drives and has cache memory of its own, the operating system should never force the disk drives to flush their cache.

Make sure your raid controller has batteries, and that the whole system is on a UPS!

>> And then if the write gets split into 6 parts shouldn't that speed up the process since each disk is writing only 1/6th of the chunk?
>
> Yes, but the data has to be split up first - which is of course a lot faster on hardware RAID since it is done by a dedicated processor on the adapter itself - and the parity has to be calculated. This is overhead which you do not have with a single disk.

Nonsense - a host CPU is perfectly capable of splitting a stripe into its blocks in a fraction of a microsecond. It is also much faster at doing the parity calculations - the host CPU typically runs at least ten times as fast as the CPU or ASIC on the raid card. And again, the splitting and parity calculations are not the bottleneck; it's the latency of the reads needed to calculate the new parity that takes time.

Where a hardware raid card will win is if your IO bandwidth is the bottleneck, which can be the case for large, fast arrays. In particular, if you have a mirror raid with software raid, then the host CPU has to write out all the data twice - with hardware raid, it's the raid card that doubles up the data. There are times when top-range hardware raid cards will beat software raid on speed, but not often - especially with a fast, multi-core modern host CPU. It does, however, depend highly on your raid setup and the type of load you have - there are no set answers here.
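For reference, disabling barriers is a per-filesystem mount option rather than anything raid-specific. A couple of typical examples (only sensible when the controller cache really is battery backed - the device names are placeholders):

  # ext3/ext4: barrier=0 disables write barriers, barrier=1 enables them
  mount -o barrier=0 /dev/vg_home/home /home

  # XFS spells the same thing differently:
  mount -o nobarrier /dev/vg_home/home /home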
Software raid does of course have a reliability weak point - if your OS crashes in the middle of a write, you have a bigger chance of hitting the raid 5 write hole than you would with a hardware raid card with a battery.

>>> In this case, you don't have any redundancy. Writing to the stripeset is faster than writing to a single disk, and the same applies for reading. It's not a 2:1 performance boost due to the overhead for splitting the data for writes and re-assembling it upon reads, but there is a significant performance improvement, and especially so if you use more than two disks.
>>
>> Why doesn't a similar boost come out of a RAID5 with a large number of disks? Merely because of the parity calculation overhead?
>
> Yes, that is the main difference. Like I said, RAID 6 is even slower during writes (and has equal performance during reads).

Assuming (again!) that you are doing a small write and the old data and parity blocks are not in the cache, then you have the latency of the reads (two reads for a single block write on raid 5, and three reads for raid 6). For reading, especially for large reads, raid 5 is approximately like N-1 raid 0 drives, while raid 6 is like N-2 raid 0.

>>> There are however a few considerations you should take into account with both of these approaches, i.e. that you should not put the filesystem which holds the kernels and /initrd/ - and preferably not the root filesystem either[1] - on a stripe, because the bootloader recognizes
>>
>> Luckily that is not needed. I have a separate drive to boot from. The RAID is intended only for user /home dirs.
>
> Ah but wait a minute. As I understand it, you have a hardware RAID adapter card. In that case - assuming that it is a real hardware RAID adapter and not one of those on-board fake-RAID things - it doesn't matter, because to the operating system (and even to the BIOS), the entire array will be seen as a single disk. So then it is perfectly possible to have your bootloader, your "/boot" and your "/" living on the RAID array. (I am doing that myself on one of my machines, which has two RAID 5 arrays of four disks each.)
>
> And in this case - i.e. if you have a hardware RAID array - then your original question regarding software RAID 0 versus striping via LVM is also answered, because hardware RAID will always be a bit faster than software RAID or striped LVM. Additionally, since you mention seven disks, you could even opt for RAID 10 or 51 and even have a "hot spare" or "standby spare". (Or you could use the extra disk as an individual, standalone disk.)
>
> RAID 10 is where you have a mirror (i.e. RAID 1) which is striped to another mirror - you could instead also use RAID 01, which is a stripe which is mirrored on another stripe. RAID 10 is better than RAID 01 though - there's a good article on Wikipedia about it. RAID 10 or 01 require four disks in total. Performance is very good for both reading and writing *and* you have redundancy.

Yes, Wikipedia /does/ have some useful information about raid - it's worth reading.

One thing you are missing here is that Linux mdadm raid 10 is very much more flexible than just a "stripe of mirrors", which is the standard raid 10. In particular, you can use any number of disks (from 2 upwards), you can have more than 2 copies of each block (at the cost of disk space, obviously) for greater redundancy, and you can have a layout that optimises the throughput for different loads.
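A rough sketch of that flexibility in mdadm terms (the device names and disk counts here are made up purely for illustration):

  # Standard "near" layout, 2 copies - works even with an odd number of disks:
  mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=3 /dev/sd[bcd]

  # "Far" layout, 2 copies - much better large sequential read performance:
  mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=4 /dev/sd[b-e]

  # Three copies of every block for extra redundancy (at 1/3 usable capacity):
  mdadm --create /dev/md0 --level=10 --layout=n3 --raid-devices=6 /dev/sd[b-g]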
For example, an "f2" md raid 10 layout gives you full raid 0 performance for large reads while being at least as fast as other raids for writing and random reads (and much faster than raid 5 for small random writes). It is normally the fastest raid layout with redundancy - though at a 50% cost in disk space.

<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>

Raid10 performance is also much less affected by a disk failure, and rebuilds are faster and less stressful on the system. And a single hot spare will cover all the disks - you don't need a spare per array.

However, while an "f2" md raid 10 is probably the fastest setup for directly connected drives, this is not what you have. You will also suffer from bandwidth issues if you try to do all the mirroring of all 45 drives in software. In your case, I would recommend raid 10 on each box - 7 raid1 pairs striped together, with a hot spare (assuming the hardware supports a common hot spare). Your host then sees these three "disks", which you should stripe together with mdadm raid0 - there is no need for redundancy here, as that is handled at a lower level. Put your LVM physical volume on top of this if you want the flexibility of LVM - if you don't need it, don't bother.

> Similarly, RAID 51 is where you have a RAID 5 which is mirrored onto another RAID 5. Or you could use RAID 15, which is a RAID 5 comprised of mirrors. RAID 51 and 15 require a minimum of six disks. (Similarly, there is RAID 61 and 16, which require a minimum of eight disks.)

As a minor point, mdadm raid 5 can work on 2 disks (and raid 6 on three disks). Such a 2-disk raid 5 is not much use in a working system, but it can be convenient when setting things up or upgrading drives, as you can add more drives to the mdadm raid 5 later on. It's just an example of how much more flexible mdadm is than hardware raid solutions.

> There is of course a trade-off. Except for RAID 0, which isn't really RAID because it has no redundancy, all RAID solutions are expensive in diskspace, and how expensive exactly depends on the chosen RAID type. In a RAID 1, RAID 10 or RAID 01 set-up, you lose 50% of your storage capacity.
>
> With RAID 5, your storage capacity is reduced by the capacity of one disk in the array, and with RAID 6 by the capacity of two disks in the array. So, with a single RAID 5 array comprised of seven disks without a standby or hot spare, your total storage capacity is that of six disks.
>
> And then there's the lost capacity of the hot spare or standby spare - a hot spare is spinning but otherwise unused until one of the other disks starts to fail, while a standby spare is spun down until one of the other disks fails. Upon such failure, the array will be automatically rebuilt using the parity blocks to write the missing data to the spare disk.

I have never heard of a distinction between a "hot spare" that is spinning and a "standby spare" that is not spinning. Given that spinup takes a few seconds, and a rebuild often takes many hours, I can't see that you have much to gain by keeping a spare drive spinning. To my mind, a "hot spare" is a drive that will be used automatically to replace a dead drive. An "offline spare" is an extra drive that is physically attached, but not in use automatically - in the event of a failure, it can be manually assigned to a raid set.
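For the setup recommended above (hardware raid10 within each box, mdadm raid0 across the three resulting "disks", LVM on top), the host-side commands might look roughly like this - assuming the three hardware arrays show up as /dev/sdb, /dev/sdc and /dev/sdd (the names and chunk size are just illustrative assumptions):

  # Stripe the three hardware arrays together.  No redundancy is needed at
  # this level, since each box already has its own mirrors and hot spare:
  mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=3 \
        /dev/sdb /dev/sdc /dev/sdd

  # Optional LVM layer for flexibility:
  pvcreate /dev/md0
  vgcreate vg_home /dev/md0
  lvcreate -l 100%FREE -n home vg_home

  mkfs.ext3 /dev/vg_home/home
  mount /dev/vg_home/home /home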
This makes sense if you have several hardware raid sets defined and want to share a single spare between them, but the hardware cannot support that (mdadm, of course, supports such a setup with a shared hot spare).

> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5 during writes, but not really significantly faster during reads, and you would have the full storage capacity of all disks in the array, but there would be no redundancy at all. So, considering that you have seven disks, I think you really should consider building in redundancy. After all, with RAID 0, if a single disk in the array fails, then you'll have lost all of your data. A RAID 5 would upon failure of a single disk run slower, but at least you'd still have access to your data.
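For reference, the shared hot spare in mdadm is handled with a "spare-group" in mdadm.conf plus mdadm running in monitor mode, along these lines (the UUIDs and group name are placeholders):

  # /etc/mdadm.conf - arrays in the same spare-group share their spares:
  ARRAY /dev/md0 UUID=<uuid-of-md0> spare-group=homegroup
  ARRAY /dev/md1 UUID=<uuid-of-md1> spare-group=homegroup

  # The monitor moves a spare to whichever array in the group degrades:
  mdadm --monitor --scan --daemonise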
From: Rahul on 19 Jan 2010 18:32

David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in news:NKOdnXtWJIFpt8vWnZ2dnUVZ7radnZ2d(a)lyse.net:

> themselves whether they think I am right, or Aragorn is right. Either way, I hope to give you some things to think about.

An alternative viewpoint is always good!

> Thus random writes are slow on RAID5 (and RAID6), but larger block writes are full speed.

And if I did a RAID10 at the hardware level (as you later suggest) I'd get the speedup on random writes as well (which are otherwise slow on a RAID5)? What other way do I have to speed up random writes?

> There can also be significant differences between the speed of mdadm software RAID5, and hardware RAID5. With hardware raid, the card can report a small write as "finished" before it has read in the block and written out the data and new parity. This is safe for good hardware with battery backup of its buffers, and gives fast writes (as far as the host is concerned) even for small writes. Software raid5 cannot do this. But on the other hand, software raid5 can take advantage of large system memories for cache, and is thus far more likely to have the required stripe data already in its cache (especially for metadata and directory areas of the file system, which are commonly accessed but have small writes).

Yes, I do have a battery-backed cache on my hardware card. But from the point you make above, there's something to be said for a software (mdadm or LVM) on top of hardware approach? This way I get the best of both worlds? LVM / mdadm will serve out of RAM (I've 48 Gigs of it) and speed up reads. Writes will be sped up by the cache on the hardware card. Does this make sense?

> This is perhaps also a good time to mention one of the risks of raid5 (and raid6) - the RAID5 write hole.

This risk is reduced by a battery backed-up cache, correct?

>>> I currently have a 7-disk RAID5. Will writing to this be slower than a single disk?
>>
>> A little, yes. But reading from it will be significantly faster.
>
> Not necessarily - writing will be slower if you do lots of small random writes, but much faster if you write large blocks.

And will the reads and large sequential writes be even faster if I did a 14-disk RAID5 instead of a 7-disk RAID5?

> Make sure your raid controller has batteries, and that the whole system is on a UPS!

Yes! Both.

> For reading, especially for large reads, raid 5 is approximately like N-1 raid 0 drives, while raid 6 is like N-2 raid 0.

Problem is, I haven't seen a similar formula mentioned for writes - neither large nor small writes. What's an approximate design equation to use to rate the options?

> However, while a "f2" md raid 10 is probably the fastest setup for directly connected drives, this is not what you have. You will also suffer from bandwidth issues

Which bandwidth are we talking about? The CPU-to-controller bandwidth?

> if you try to do all the mirroring of all 45 drives in software. In your case, I would recommend raid 10 on each box - 7 raid1 pairs striped together with a hot spare (assuming the hardware supports a common hot spare). Your host then sees these three disks, which you should stripe together with mdadm raid0 - there is no need for redundancy here, as that is handled at a lower level. Put your LVM physical volume on top of this if you want the flexibility of LVM - if you don't need it, don't bother.

Ah! Thanks! That's a creative solution I hadn't thought about.
> I have never heard of a distinction between a "hot spare" that is spinning, and a "standby spare" that is not spinning.

Me neither.

>> The bottom line...: A seven-disk RAID 0 would be faster than a RAID 5 during writes, but not really significantly faster during reads, and you would have the full storage capacity of all disks in the array, but there would be no redundancy at all. So, considering that you have seven disks, I think you really should consider building in redundancy. After all, with RAID 0, if a single disk in the array fails, then you'll have lost all of your data. A RAID 5 would upon failure of a single disk run slower, but at least you'd still have access to your data.

Or I could do the RAID10 that you suggest and stripe on top of three such arrays using mdadm. I'm thinking about this very interesting option. Thanks!

--
Rahul
From: unruh on 19 Jan 2010 18:50

On 2010-01-19, Rahul <nospam(a)nospam.invalid> wrote:
> Aragorn <aragorn(a)chatfactory.invalid> wrote in
> news:hj52h6$lr7$2(a)news.eternal-september.org:
>
>> I would personally not use all of them for "/home". You mention three arrays, so I would suggest the following...:
>>
>> * First array:
>>   - /boot
>>   - /
>>   - /usr
>>   - /usr/local
>>   - /opt
>>   - an optional rescue/emergency root filesystem
>>
>> * Second array:
>>   - /var
>>   - /tmp   (Note: you can also make this a /tmpfs/ instead.)
>>   - /srv   (Note: use at your own discretion.)
>>
>> * Third array:
>>   - /home
>
> Sorry, I should have clarified. For /boot, /usr etc. I have a separate mirrored SAS drive. So those are taken care of. Besides, 15x 300 GB would be too much storage for any of those trees.
>
> I have all 45 drives bought just to provide a high-performance /home. The question is how best to configure them:
>
> 1. What RAID pattern?

Do you want speed or do you want size or do you want redundancy?

I have just instituted raid0 (striped) across two partitions on two disks (the disks are identical, and the partitioning of them is identical). They are 500 GB WD disks, 7200 rpm SATA. hdparm -t gives about 82 MB/s.

I used mdadm to set up a raid0 (first bringing in the raid0 module) on two 450 GB partitions, one on each of the drives, and mounted the resulting /dev/md0, formatted as ext3, onto /local. I then ran "cat /dev/zero > /local/a" for 12 seconds, and a was then a 2 GB file, so writing to that array (assuming that writing all zeros from cat does not produce some sort of sparse file) went at about 160 MB/s, i.e. twice as fast as reading from a single disk.

> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1, /home2 etc. But the overhead of LVM worries me.
> 3. Do I use LVM striping or not? etc.

You want to use lvm why?
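A somewhat more controlled way to run that kind of test, so that the page cache does not flatter the numbers (the file name and sizes here are arbitrary):

  # Sequential write test - conv=fdatasync makes dd include the final
  # flush to disk in its timing:
  dd if=/dev/zero of=/local/testfile bs=1M count=2048 conv=fdatasync

  # Sequential read test - drop the page cache first so the data really
  # comes off the disks:
  echo 3 > /proc/sys/vm/drop_caches
  dd if=/local/testfile of=/dev/null bs=1M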
From: Rahul on 19 Jan 2010 19:23
unruh <unruh(a)wormhole.physics.ubc.ca> wrote in news:slrnhlchar.4le.unruh(a)wormhole.physics.ubc.ca:

Thanks unruh!

> Do you want speed or do you want size or do you want redundancy?

Mainly speed. Size and redundancy are good but lesser goals. I guess it's always a tradeoff between all 3.

>> 2. Do I add LVM on top? This is cleaner than arbitrarily mounting /home1, /home2 etc. But the overhead of LVM worries me.
>> 3. Do I use LVM striping or not? etc.
>
> You want to use lvm why?

Because I have 3 different "storage boxes" with 15 drives each. At best I see three devices /dev/sda /dev/sdb /dev/sdc after I use the hardware RAID controllers. Logically I just want to mount a single /home on them. At worst (if I do 7-disk RAID5s) I might see 6 devices. Even then, LVM would aggregate them and I could mount /home.

I am open to other suggestions.

--
Rahul
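Concretely, once the three arrays are physical volumes in one volume group, the choice I'm weighing looks something like this (the volume group name is just an example):

  # (a) A plain linear logical volume - LVM simply concatenates the arrays:
  lvcreate -l 100%FREE -n home vg_home

  # (b) An LVM-striped logical volume across the three arrays
  #     (-i = number of stripes, -I = stripe size in KiB):
  lvcreate -i 3 -I 256 -l 100%FREE -n home vg_home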