From: David Brown on 21 Jan 2010 18:21 Aragorn wrote: > On Wednesday 20 January 2010 23:59 in comp.os.linux.misc, somebody > identifying as David Brown wrote... > >> Aragorn wrote: >> >>> On Wednesday 20 January 2010 15:48 in comp.os.linux.misc, somebody >>> identifying as David Brown wrote... >> <snip to save a little space> > > Yeah, these posts themselves are getting quite long, but at least, it's > one of those rare threads in which the conversation continues > on-topic. :-) > > Quite honestly, I'm enjoying this thread, because I get to hear > interesting feedback - and I think you do to, from your point of view - > and I have a feeling that Rahul, the OP, is sitting there enjoying > himself over all the valid arguments being discussed here in the debate > over various RAID types. ;-) > > This is a good thread, and I recommend that any lurking newbies would > save the posts for later reference in the event that they are faced > with the decision on whether and how to implement RAID on one of their > machines. Newbies, heads up! :p > Yes, lots of interesting things are turning up here. >>>> The zeroth rule, which is often forgotten (until you learn the hard >>>> way!), is "thou shalt make a plan for restoring from backups, test >>>> that plan, document that plan, and find a way to ensure that all >>>> backups are tested and restoreable in this way". /Then/ you can >>>> start making your actual backups! >>> Well, so far I've always used the tested and tried approach of >>> tar'ing in conjunction with bzip2. Can't get any cleaner than >>> that. ;-) >> rsync copying is even cleaner - the backup copy is directly >> accessible. And when combined with hard link copies in some way (such >> as rsnapshot) you can get snapshots. > > I have seen this method being discussed before, but to be honest I've > never even looked into "rsnapshot". I do intend to explore it for the > future, since the ability to make incremental backups seems very > interesting. > Another poster has given you a pretty good explanation of how rsync snapshot backups work. I'll just give a few more points here. rsync is designed to make a copy of a directory as efficiently as possible. It will only copy over files that have changed or been added, and even for changed files it can often copy over just the changes rather than the whole file. And if you are doing the rsync over a slow network, you can compress the transfers. There are additional flags to delete files in the destination that are no longer present in the source, and to omit certain files from the copy (amongst many other flags). For snapshots, this is combined with the "cp -al" command that copies a tree but hard-links files rather than copying them. So you do something like this: rsync the source tree to a "current copy", then "cp -al" the "current copy" to a dated backup snapshot directory. The next day, you repeat the process - only changes from the source to the "current copy" are transferred, and any files left untouched will be hardlinked each time - you only ever have one real copy of each file, with hardlinks in each snapshot. It's not perfect - a file rename will cause a new transfer, for example (the "--fuzzy" flag can be used to avoid the transfer, but not the file duplication). And if you have partial transfers you can end up breaking the hard-link chaining and end up with extra copies of the files (and thus extra disk space). rsnapshot and dirvish are two higher level backup systems built on this technique.
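To make that concrete, here is a minimal sketch of the rsync-plus-hard-links scheme as a shell script. The paths and the snapshot naming are just placeholder assumptions - rsnapshot does essentially the same thing for you, driven from a config file:

    #!/bin/sh
    # Hypothetical paths - adjust to taste.
    SRC=/home/                        # tree to back up (trailing slash matters)
    DEST=/backup/current              # rolling "current copy"
    SNAP=/backup/$(date +%Y-%m-%d)    # today's dated snapshot

    mkdir -p "$DEST"

    # 1. Bring the current copy up to date. Only new or changed files
    #    (or changed parts of files) are transferred; --delete removes
    #    files that have disappeared from the source.
    rsync -a --delete "$SRC" "$DEST"

    # 2. Freeze it as today's snapshot. Directories are real copies,
    #    but every file is a hard link, so unchanged files take up
    #    (almost) no extra space across snapshots.
    cp -al "$DEST" "$SNAP"

Because rsync normally writes a changed file out as a new file rather than overwriting it in place, yesterday's snapshot quietly keeps the old version while "current" and today's snapshot share the new one.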
Another option, which is a bit more efficient for the transfer and can help avoid duplicates if you have occasional hiccups, is to use the "--link-dest" option to provide the source of your links. This avoids the extra "cp -al" step for greater efficiency, and also lets you specify a number of old snapshots - helpful if some of these were incomplete. Remember also that rsync is aimed at network transfers - you want to keep your backups on a different machine (although it's nice with local snapshots of /etc). At the very least, keep them on a different file system partition than the original - that way you have protection against file system disasters. Obviously you want to avoid making any changes to the files in the snapshots, though deleting them works perfectly (files disappear only once all the hard links are gone). It is also a good idea to hide the backup tree from locate/updatedb, if you have it - your 100 daily snapshots may not take much more disk space than a single copy, but it does take 100 times as many files and directories. > So far I have always made either data backups only - and on occasion, > backups of important directories such as "/etc" - or complete > filesystem backups, but never incremental backups. For IRC logs - I > run an IRC server (which is currently inactive - see farther down) and > I log the channels I'm in - I normally use "zip" every month, and then > erase the logs themselves. This is not an incremental approach, of > course. > > My reason for using "zip" rather than "tar" for IRC logs is that my > colleagues run Windoze and so their options are limited. ;-) > Tell your Windoze colleagues to get a proper zipper program, instead of relying on windows own half-hearted "zip folders" or illegal unregistered copies of WinZip. 7zip is totally free, and vastly better - amongst other features, it supports .tar.bz2 without problems. >> Of course, .tar.bz2 is good too - /if/ you have it automated so that >> it is actually done (or you are one of these rare people that can >> regularly follow a manual procedure). > > To be honest, so far I've been doing that manually, but like I said, my > approach is rather amateuristic, in the sense that it's not a > systematic approach. But then again, so far the risk was rather > limited because I only needed to save my own files. > > On the hosting server we used - which is now no longer operational as > such - the hosting software itself made regular backups of the domains, > but using the ".tar.bz2" approach. I'm not sure whether there was > anything incremental about the backups as it was my colleague who > occupied himself with the management of that machine - it was located > at his home. > >> It also needs to be saved in a safe and reliable place - many people >> have had regular backups saved to tape only to find later that the >> tapes were unreadable. > > That is always a risk, just as it was with the old music cassette tapes. > Magnetic storage is actually not advised for backups. > I agree - I dislike tapes for backup systems. But I also dislike optical storage - disks are typically not big enough for complete backups, so you have to have a really messy system with multiple disks for a backup set, or even worse, incremental backup patch sets. Even if you can fit everything on a single disk, it requires manual intervention to handle the disks and store them safely off-site, and you need to test them regularly (that's important!). Hard disk space is cheap, especially if you are not bothered about performance. 
It is, IMHO, the best backup medium these days. Make sure you have two independent copies in case of disk crashes, of course. At my office I have an onsite backup server (it's ideal for when someone tells me they deleted an important folder a few weeks ago - I can browse to the right dated backup snapshot, and copy out the data directly), and I have an offsite backup with copies over the Internet at night. >> And of course it needs to be saved again, in a different place and >> stored at a different site. > > That would indeed be the best approach. Like I said in my previous > post, I use Iomega REV disks for backups to which I want to have > immediate access, but I also forgot to mention that I back up stuff to > DVDs, and I use DVD+RW media for that, since they tend to be of higher > quality than DVD-/+R - likewise I prefer CD-RW over CD-R - and the > advantage of optical storage is that it is the better choice in the > event of magnetic corruption, which you *can* and eventually *do* get > on tape drives. > > Hard disks are relatively cheap these days - at least, if we're talking > about consumergrade SATA disks - and they are magnetically better than > tapes, in the sense that the magnetic coating on the platters is more > time-resilient than with tape drives. On the other hand, hard disks > contain lots of moving components and if a hard disk fails - and here > we go again - you lose all your data, unless you have a RAID set-up. > > So one can use hard disks for backups - it's fast, reasonably affordable > and reasonably reliable, but it's not the final solution. If one > stores one's backups on hard disks, then one needs to make backups of > those backups on another kind of media. > > My advice would therefore be to make redundant backups on different > types of media. Optical media are ideal in terms of the fact that they > are not susceptible to electromagnetic interference, but they might in > turn have other issues - especially older CDs and DVDs - since storage > there is in fact mechanical, i.e. the data is stored via physical > indentations in a kind of resin, made by a fairly high-powered laser. > And some readers will not accept media that were burned using other > CD/DVD writers. This is becoming more rare these days, but the problem > still exists. > >> I know I'm preaching to the choir here, as you said before - but there >> may be others in the congregation. > > Indeed, and people tend "not to care" until they burn their fingers. So > we can't stress this enough. > >>>> And the second rule is "thou shalt make backups of your backups", >>>> followed by "thou shalt have backups of critical hardware". (That's >>>> another bonus of software raid - if your hardware raid card dies, >>>> you may have to replace it with exactly the same type of card to get >>>> your raid working again - with mdadm raid, you can use any PC.) >>> Well, considering that my Big Machine has drained my piggy bank for >>> about 17'000 Euros worth of hardware, having a duplicate machine is >>> not really an option. The piggy bank's on a diet now. :-) >>> >> You don't need a duplicate machine - you just need duplicates of any >> parts that are important, specific, and may not always been easily >> available. > > Well, just about everything in that machine is very expensive. 
And on > the other hand, I did have another server here - which was > malfunctioning but which has been repaired now - so I might as well put > that one to use as a back-up machine in the event that my main machine > would fail somehow - something which I am not looking forward to, of > course! ;-) > > I also can't use the Xen live migration approach, because I intend to > set up my main machine with 64-bit software, while the other server is > a strictly 32-bit machine. But redundancy - i.e. a duplicate set-up of > the main servers - should be adequate enough for my purposes. > Can't Xen cope with mixes of 64-bit and 32-bit machines? I've never used it - my servers use OpenVZ (no problem with mixes of 32-bit and 64-bit virtual machines with a 64-bit host), and on desktops I use Virtual Box (you can mix 32-bit and 64-bit hosts and guests in any combination). > The other machine uses Ultra 320 SCSI drives, and I have a small stack > of those lying around, as well as a couple of Ultra 160s, which can > also be hooked up to the same RAID card. > >> There is no need to buy a new machine, but as soon as your particular >> choice of hardware raid cards start going out of fashion, buy >> a spare. Better still, buy a spare /now/ before the manufacturer >> decides to update the firmware in new versions of the card and they >> become incompatible with your raid drives. Of course, you can always >> restore from backup in an emergency if the worst happens. > > Well, considering that this is an entirely private project and that > there is no real risk involved in downtime - not that I don't care > about downtime - I think I've got it all sufficiently covered. > >>> I'm not sure on the one on my SAS RAID adapter, but I think it's an >>> Intel RISC processor. It's not a MIPS or an Alpha, that much I am >>> certain of. >> Intel haven't made RISC processors for many years (discounting the >> Itanium, which is an unlikely choice for a raid processor). > > The Itanium is not a RISC processor, it's a CISC. It's just not an > x86. ;-) > The Itanium is not CISC (though it certainly is complex!) - it is technically a VLIW processor (very long instruction word). But VLIW is a subtype of RISC - one in which more than one RISC instruction is given in each instruction group for explicit parallelisation. It's an interesting theory, relies on non-existent super-compilers, inefficiently implemented and hopeless in practice for most types of software. >> They used to have StrongArms, and long, long ago they had a few other >> designs, but I'm pretty certain you don't have an Intel RISC processor >> on the card. It also will not be an Alpha - they have not been made >> for years either (they were very nice chips until DEC, then HP+Compaq >> totally screwed them up, with plenty of encouragement from Intel). >> Realistic cores include MIPS in many flavours, PPC, and for more >> recent designs, perhaps an ARM of some kind. If the heavy lifting is >> being done by ASIC logic rather than the processor core, there is a >> wider choice of possible cores. > > Apparently it's an Intel 80333 processor, clocked at 800 MHz. Hmm, I > don't know whether that's a RISC processor; I've never heard of it > before, actually. > After a bit of web searching, I see that the 80333 is a dedicated RAID system-on-a-chip, not a general purpose processor. The core is indeed RISC - it is an XScale processor, which is approximately the same thing as the StrongARM and is basically a modification of an older ARM core. > This is my RAID adapter card... 
> > http://www.adaptec.com/en-US/products/Controllers/Hardware/sas/value/SAS-31205/ > >>>>>>> This is quite a common distinction, mind you. There is even a >>>>>>> "live spare" solution, but to my knowledge this is specific to >>>>>>> Adaptec - they call it RAID 5E. >>>>>>> >>>>>>> In a "live spare" scenario, the spare disk is not used as such >>>>>>> but is part of the live array, and both data and parity blocks >>>>>>> are being written to it, but with the distinction that each disk >>>>>>> in the array will also have empty blocks for the total capacity >>>>>>> of a standard spare disk. These empty blocks are thus >>>>>>> distributed across all disks in the array and are used for array >>>>>>> reconstruction in the event of a disk failure. >>>>>> Is there any real advantage of such a setup compared to using raid >>>>>> 6 (in which case, the "empty" blocks are second parity blocks)? >>>>>> There would be a slightly greater write overhead (especially for >>>>>> small writes), but that would not be seen by the host if there is >>>>>> enough cache on the controller. >>>>> Well, the advantage of this set-up is that you don't need to >>>>> replace a failing disk, since there is already sufficient diskspace >>>>> left blank on all disks in the array, and so the array can recreate >>>>> itself using that extra blank diskspace. This is of course all >>>>> nice in theory, but in practice one would eventually replace the >>>>> disk anyway. >>>> The same is true of raid6 - if one disk dies, the degraded raid6 is >>>> very similar to raid5 until you replace the disk. >>>> >>>> And I still don't see any significant advantage of spreading the >>>> wholes around the drives rather than having them all on the one >>>> drive (i.e., a normal hot spare). The rebuild still has to do as >>>> many reads and writes, and takes as long. The rebuild writes will >>>> be spread over all the disks rather than just on the one disk, but I >>>> can't see any advantage in that. >>> Well, the idea is simply to give the spare disk some exercise, i.e. >>> to use it as part of the live array while still offering the extra >>> redundancy of a spare. So in the event of a failure, the array can >>> be fully rebuilt without the need to replace the broken drive, as >>> opposed to that the array would stay in degraded mode until the >>> broken drive is replaced. >> The array will be in degraded mode while the rebuild is being done, >> just like if it were raid5 with a hot spare - and it will be equally >> slow during the rebuild. So no points there. > > Well, it's not really something that - at least, in my impression - is > advised as "a particular RAID solution", but rather as "a nice > extension to RAID 5". > I have to conclude it is more like "an inefficient and proprietary extension to raid 5 that looked good on the marketing brochure - who cares about reality?" :-) >> In fact, according to wikipedia, the controller will "compact" the >> degraded raid set into a normal raid5, and when you replace the broken >> drive it will "uncompact" it into the raid 5E arrangement again. The >> "compact" and "uncompact" operations take much longer than a standard >> raid5 rebuild. >> >> So all you get here is a marginal increase in the parallelisation of >> multiple simultaneous small reads, which you could get anyway with >> raid6 rather than raid5 with a spare. > > Well, yes, but the idea of RAID 5E is merely that you can have a RAID 5 > with the extra disk being part of the array so as to spread the wear. 
> I know it's not of much use, but we began speaking of this with regards > to the terms "standby spare", "hot spare" and "live spare". ;-) > >>>> If you want more redundancy, you can use double mirrors for 33% disk >>>> space and still have full speed. >>> Yes, but that's a set-up which, due to understandable financial >>> considerations, would be reserved only for the corporate world. Many >>> people already consider me certifiably insane for having spent that >>> much money - 17'000 Euro, as I wrote higher up - on a privately owned >>> computer system. But then again, for the intended purposes, I need >>> fast and reliable hardware and a lot of horsepower. :-) >> I'm curious - what is the intended purpose? I think I would have a >> hard job spending more than about three or four thousand Euros on a >> single system. > > Well, okay, here goes... It's intended to be a kind of "mainframe" - > which is what I call it on occasion when referring to that machine > among the other machines I own. > > I have had this machine over at my place for two years already, but I > still needed a few extra hardware components - I want things pristine > before I begin my set-up so as to exclude nasty surprises with changes > to the hardware afterwards - and the person who was supposed to deliver > this hardware to me pulled a no-show on me. At first he kept on > stonewalling me - and, oh irony, I've been there before with another > hardware vendor - and eventually he wouldn't even return my phone calls > (to his voicemail) or my e-mails. > > So eventually I directly contacted the people who had actually built the > machine, and for whom the other person was the mediator. These people > also needed a lot of time to get all the extra components, but > eventually they did, and the machine was delivered at my home again two > days ago now, so I can begin the installation over the weekend. > > As for the hardware, it's a Tyan Thunder n6650W (S2915) motherboard - > the original one, not the revised one - which is a twin-socket ccNUMA > board for AMD Opterons. There are two 2218HE Opterons installed - > dualcore, 68 Watt, 2.6 GHz. The motherboard has eight DIMM sockets (as > two nodes of four DIMM sockets each), all of which are populated with > ATP 4 GB ECC registered DDR-2 pc5300 modules, making for a total of 32 > GB of RAM, or if you will, two 16 GB ccNUMA nodes. > > I've already shown you what RAID adapter card is installed, and this > adapter connects to eight hard disks, four of which are 147 GB 15k > Hitachi disks mounted in a "hidden" drive cage and to be used for the > main system, and the four others being 1 TB 7.2k Western Digital RAID > Edition SATA-2 disks, mounted in an IcyDock hotswap backplane drive > cage. There is a Plextor PX810-SA SATA double layer DVD writer and no > floppy drive. The motherboad also has a non-RAID on-board SAS > controller (which I've disabled in the BIOS) and a Firewire controller. > > The original PSU was a CoolerMaster EPS12V 800 Watt, but considering the > extra drives and certain negative reviews of that CoolerMaster PSU > under heavy load, I have had it replaced now with a Zippy 1200 Watt > EPS12V PSU. The chassis is a CoolerMaster CM832 Stacker, which is not > the more commonly known Stacker but a model that now still only exists > as the black-and-green "nVidia Edition" model. Mine is completely > black, however. > > There are two videocards installed. 
One is an older GeCube PCI Radeon > 9250 card (with 256 MB), connected to the second channel on one of my > two SGI 21" CRT monitors. The other one is an Asus PCIe GeForce 8800 > GTS (with 640 MB), connected to the first channel on both SGI monitors. > > There are also two keyboards and one mouse. One keyboard is connected > via PS/2, the other one (and the mouse) via USB. So far the > hardware. ;-) > > Now, as for my intended purposes, I am going to set up this machine with > Xen, as I have mentioned earlier. There will be three primary XenLinux > virtual machines running on this system, all of which will be Gentoo > installations. > > The three main virtual machines will be set up as follows: > > (1) The Xen dom0 virtual machine. For those not familiar with Xen, it > is a hypervisor that itself normally runs on the bare metal > (although it can be nested if the hardware has virtualization > extensions) but unlike the more familiar virtual machine monitors > like VMWare Workstation/Player or VirtualBox which are commonly used > on desktops and laptops, Xen does not have a "host" system. Instead > Xen has a "privileged guest", and this is called "dom0", or "domain > 0". This virtual machine is privileged because it is from there > that one starts and stops the other Xen guests. It is also the > system that has direct access to the hardware - i.e. "the driver > domain". > > On my machine, this is the virtual machine that will be using the > PCI Radeon card for video output and the PS/2 keyboard for input. > It will however not have full access to all the hardware, because > - and Xen allows this - the PCIe GeForce card, the soundchip on the > motherboard and all USB hubs will be hidden from Xen and from dom0. > > (2) A workstation virtual machine. This is an unprivileged guest - > which in a Xen context is called "domU" - but it will also be a > driver domain, i.e. it will have direct access to the GeForce, the > soundchip and the USB hubs. It'll boot up to runlevel 3, but it'll > have KDE 4.x installed, along with loads of applications. As it has > direct access to the USB hubs, it'll also be running a CUPS server > for my USB-connected multifunctional device, a Brother MFC-9880. > It'll also be running an NFS server for multimedia files. > > (3) A server virtual machine which I intend to set up - if possible - > with an OpenVZ kernel. Again for those who are not familiar with > it, OpenVZ is a modified Linux kernel which offers operating system > level virtualization. This means that you have one common kernel > running multiple, virtualized userspaces, each with their own > filesystems and user accounts, and their own "init" set up. I am > not sure yet whether I will be hiding the second Gbit Ethernet > adapter on the motherboard from dom0 and have this server domU > access it directly, or whether I will simply have this domU connect > to the dom0's Ethernet bridge. > > The OpenVZ system will be running several isolated userspaces - > which are called "zones", just as in (Open)Solaris - one of which > I intend to set up as the sole system from which /ssh/ login from > the internet is allowed, and doing nothing else. The idea is that > access to any other machine in the network - physical or virtual - > must pass through this one virtual machine, making it harder for > an eventual black hat to do any damage. Then, there will also be > a generic "zone" for a DNS server and one, possibly two websites, > and one, possibly two mailservers. 
Lastly, another "zone" will be > running an IRC server and an IRC services package, possibly also > with a few eggdrops. > > Systems (1) and (2) will be installed on the SAS disks, which are > currently set up as a RAID 5, but which I am now going to set up as a > RAID 10. System (3) itself will be installed on the same array as well > whereas the privileged userspace and the "ssh honeypot" are concerned. > The other "zones" will be installed on the SATA-2 array - currently > also set up as RAID 5 but also to be converted to RAID 10 - together > with the NFS share exported by system (2) and an additional volume for > backups. These backups will then be backed up themselves to the other > physical server - i.e. the 32-bit dual Xeon machine - as well as to > DVDs and REV disks. > This sounds like a fun system! However, I would have split this into two distinct machines - a server and a workstation. You are mixing two very different types of use on a single machine, giving you something that is bound to be a compromise (or a lot more expensive than necessary). A good workstation will have a processor aimed at high peak performance for single threads, with reasonable performance for up to 4 threads (more if you do a lot of compiling or other highly parallel tasks). Memory should be optimised for latency (this is more important than fast bandwidth), as should the disks (main disk(s) should be SSD, possibly with harddisks for bulk storage). You want good graphics and sound. For software, you want your host OS to be the main working OS - put guest OS's under Virtual Box if you want. For a server, your processor is aimed at high throughput on multiple threads, and memory should be large - even if that means slow. Disks should be large and reliable (raid). Graphics can be integrated on the motherboard - you need a console keyboard and screen for the initial installation, and they are disconnected afterwards (except possibly for disaster recovery). These days you want your host OS to be minimal, and have the real work done in virtual machines. Go for OpenVZ as much as possible - OpenVZ machines are very light, and very fast to set up (on the server at work, I can set up a new OpenVZ virtual machine in a couple of minutes). Use Xen or KVM if you need more complete virtualisation. Of course, when you already have the hardware, you use what you have. > As for the IRC part, I'll try to cut a very, very long story short... A > number of years ago - in July 2002, to be precise - I was part of a > group of people who started a new and small IRC network. Actually, it > all started when we decided to take over an existing but dying IRC > network in order to save it, but that's a whole other story. > > Over the years, people came and went in our team - and as our team was > quite large, there were also a number of intrigues and hidden agendas > going on, resulting in some people getting fired on the spot - and we > also experienced a number of difficulties with hosting solutions - > primarily, having to pay too much money for too poor a service - and so > a little over three years ago, the remaining team members decided > jointly that it would be more cost-effective if we started self-hosting > our domain. We obtained a few second-hand servers and regular > consumergrade PCs via eBay and some SCSI disks, and we set the whole > thing up on an entry-level professional ADSL connection, all housed at > the home of one guy of our team, who was and still is living at his > parents' house. 
We also made up a contract that each of us would pay a > monthly contribution for the rent of the ADSL connection and > electricity, with a small margin for unexpected expenses. > > So far so good, but already right from the beginning, one of us squirmed > his way out of having to pay monthly contributions, and then some ego > clashes occurred within the team - both the guy at whose home the > servers are set-up and another team member who was his best buddy are > what you could consider "socially dysfunctional" - resulting in the > loss of virtually all our users. To cut /that/ story short as well, > the guy who was running the servers at his parents' home set up a > shadow network behind my back (and on a machine of his own to which I > had no /ssh/ access) and moved over all our users to that other domain. > I only found out about it because one of our users was confused over > the two different domains and came to ask me why we had two IRC > networks which were not linked to one another. > > The guy who set up that shadow network did however stay true to the > contract and kept the servers up and running, contributed financially > to the costs for the domain, and even still offered some technical > support for when things went bad - it's old hardware, and every once in > a while something breaks down and needs to be replaced. He also > meticulously kept the accounting up to date in terms of contributions > and expenses. > > Then, as our contract was drawn up for an effective term of three > years - since that was the minimum rental term for the "businessgrade" > ADSL connection - and as this contract was about to end (on November > 1st 2009), the guy sent an e-mail to our mailing list - sufficiently in > advance - that he had decided to step out of the IRC team at the expiry > date of the contract, but that he would help those who were still > interested in moving the domain over, and that he would still keep the > servers running until that day. So far he's still keeping the IRC > server up until I've set everything up myself, but the mail- and > webservers are down. > > So at present, the IRC network that we had jointly started in 2002 is > now in suspended animation, with only one or two users (apart from the > other guy and myself) still regularly connecting, and a bunch of people > who seek to leech MP3s and pr0n - both of which are not to be found on > our network because for legal reasons we have decided to ban public > filesharing. The fines for copyright infringement or illegal "warez" > distribution over here are quite high, and I'm not prepared to go to > jail over something that stupid. > > I'm not sure how I am going to revive the IRC network again - and it > will be a network again (as opposed to a single server) because one of > our old users and a girl who was on my team have both offered to set up > a server and link it to my new server - but I feel that it would be a > shame to give up on something that I have co-founded now eight years > ago and of which I have all that time been the chairman. (I was > elected chairman from the start and when someone challenged my position > and demanded re-elections one year later - as he wanted a shot at the > position - I was, with the exception of by that one person, unanimously > re-elected as chairman.) > > So there will eventually be three servers on the new network (plus the > IRC services, which are considered a separate server by the IRCd > software). 
My now ex-colleage at whose place the main server is at > present still running did however overdo it a bit in terms of the > required set-up, hardwarewise. As I wrote higher up, it was an > entry-level businessgrade ADSL connection with eight public IP > addresses. Way too much, but the guy's an IT maniac and even more so > than I am. He's also a lot younger and still lacks some wisdom in > terms of spending. > > So I am simply going to convert my residential cable internet connection > to what they call an "Office Line" over here, i.e. a single static IP > address via cable, requiring no extra hardware (as the cable modem can > handle the higher speeds) and a larger threshold for the traffic > volume, with (non-guaranteed) down/up speeds of 20 Mb/sec and 10 Mb/sec > respectively. I have a simple Linksys WRT45GL router now with the > standard firmware - which is Linux, by the way ;-) - and it'll do well > enough to do port forwarding to the respective virtual machines. > Additional firewalling can be done via /iptables/ on the respective > virtual machines. > I like to install OpenWRT on WRT54GL devices - it makes them far more flexible than the original firmware. Of course, if the original firmware does all you need, then that's fine. > So there you have it. Not quite as short a description as I had > announced higher up, but then again, you wanted to know. :-) > Not /quite/ as short as it sounded at the start... but yes, an interesting history. And it explains where you got this collection of hardware. >>> In the event of the OP on the other hand, 45 SAS disks of 300 GB each >>> and three SAS RAID storage enclosures also doesn't seem like quite an >>> affordable buy, so I take it he intends to use it for a business. >> It also does not strike me as a high value-for-money system - I can't >> help feeling that this is way more bandwidth than you could actually >> make use of in the rest of the system, so it would be better to have >> fewer larger drives and less layers to reduce the latencies. Spent >> the cash saved on even more ram :-) > > Well, what I personally find overkill in this is that he intends to use > the entire array only for the "/home" filesystem. That seems like an > awful waste of some great resources that I personally would put to use > more efficiently - e.g. you could have the entire "/var" tree on it, > and an additional "/srv" tree. > > Of course, a lot depends on the software. As I have come to experience > myself, lots of hosting software parks all the domains under "/home" > instead of under "/var" or "/srv". In fact, one could say that on a > general note, the implementation of "/srv" in just about every > GNU/Linux distribution is abominable. Some distros create a "/srv" dir > at install time but that's about as far as it goes. All the packages > are still configured to use "/var" for websites and FTP repositories - > which I suppose you could circumvent through symlinks - but like I > said, most hosting software typically parks everything under "/home". > The /srv directory has no specified standard function, so it is perfectly reasonable for it to be empty. As for what goes in /var, and where you put it, that also varies a lot. Mail server data, databases, and web server data are often there - but not necessarily. Logs are almost invariably under /var. But if the OP's server is mainly a file server, it is not unreasonable for all the relevant files to be in /home. 
He could even mount /var/log on a tmpfs filesystem for speed (obviously it will be lost on reboot). Personally, on file servers I generally have a /data directory for shared data, and I have my OpenVZ directories under /vz. >> 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - >> say 3 GBps since some are hot spares. Ultimately, being a server, >> this is going to be pumped out on Ethernet links. That's a lot of >> bandwidth - it would effectively saturate four 10 Gbit links. > > Well, since he talks of a high performance computing set-up, I would > imagine that he has plenty of 10 Gbit links at his disposal, or > possibly something a lot faster still. ;-) > >> I have absolutely no real-world experience with these sorts of >> systems, and could therefore be totally wrong, but my gut feeling is >> that the theoretical numbers will not scale with so many drives - >> something like 15 1 TB SATA drives would be similar in speed in >> practice. > > No real world experience with that sort of thing here either, but like I > said, in my opinion using 45 disks - or perhaps 42 if he keeps three > hot spares - for a single "/home" filesystem does seem like overkill to > me, and yes, there is the bandwidth issue too. > >>> I have been looking into NexentaOS (i.e. GNU/kOpenSolaris) for a >>> while, which uses ZFS, albeit that ZFS was not my reason for being >>> interested in the project. I was more interested in the fact that it >>> supports both Solaris Zones - of which the Linux equivalents are >>> OpenVZ and VServer - and running paravirtualized on top of Xen. >>> >>> [...] >>> The big problem with NexentaOS however is that it's based on Ubuntu >>> and that it uses binary .deb packages, whereas I would rather have a >>> Gentoo approach, where you can build the whole thing from sources >>> without having to go "the LFS way". >> Why is it always so hard to get /everything/ you want when building a >> system :-( > > True... Putting a binary "one size fits all"-optimized distribution on > an unimportant PC or laptop is okay by me, but for a system so > specialized and geared for performance as the one I have, I want > everything to be optimized for the underlying hardware, and I also > don't need or want all those typical "Windoze-style desktop > optimizations" most distribution vendors now build into their systems. > > Gentoo is far from ideal - given some issues over at the Gentoo > Foundation itself and the fact that the developers seem mostly occupied > with discussing how cool they think they are, rather than to actually > do something sensible, and they've also started to implement a few > defaults of which they themselves say that these are not the best > choices but that they are the choices of which they think most users > will opt for them - but at least the basic premise is still there, i.e. > you do build it from sources, and as such you have more control over > how the resulting system will be set-up, both in terms of hardware > optimizations and in terms of software interoperability. > Have you looked at Sabayon Linux? It's originally based on Gentoo, but you might find the developer community more to your liking.
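Going back to the /var/log-on-tmpfs idea a little further up: a minimal sketch of the /etc/fstab line would be something like the following (the 512 MB size is only an example - size it to however much logging the machine actually does, and remember the contents vanish on every reboot):

    # /etc/fstab - keep /var/log in RAM (example size)
    tmpfs   /var/log   tmpfs   defaults,size=512m,mode=0755   0 0

One caveat: some daemons expect their own subdirectories under /var/log to exist at start-up, so you may need a small boot script that recreates them after each reboot.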
From: Rahul on 21 Jan 2010 22:55 Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj7amh$i8q$1@news.eternal-september.org: > > In the event of the OP on the other hand, 45 SAS disks of 300 GB each > and three SAS RAID storage enclosures also doesn't seem like quite an > affordable buy, so I take it he intends to use it for a business. > > That, or he's a maniac like me. :p > Probably the latter! ;) This is an academic setting with a scientific computing cluster. -- Rahul
From: Rahul on 21 Jan 2010 23:19 David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d(a)lyse.net: > It also does not strike me as a high value-for-money system - I can't > help feeling that this is way more bandwidth than you could actually > make use of in the rest of the system, so it would be better to have > fewer larger drives and less layers to reduce the latencies. Interesting! I actually spent money getting more but smaller drives. It would have been easier getting larger drives, but I thought the more spindles the better, especially for random I/O. Which latency could have been reduced had I used larger drives? There are hardly any layers I can see that are extraneous. I have a storage box, RAID controller and then LVM / mdadm on top. So long as I used more than one box, having the LVM / mdadm layer seemed pretty necessary anyway. And just 15 drives seemed too few for IOPS. Besides, the larger drives have bad seek times. >Spent the > cash saved on even more ram :-) I already thought I maxed out on my RAM. I've 48 Gigs. Do you think that's enough? I guess I can always add more later if necessary. > 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say > 3 GBps since some are hot spares. Ultimately, being a server, this is > going to be pumped out on Ethernet links. That's a lot of bandwidth - > it would effectively saturate four 10 Gbit links. Well, I only have two 10 Gbit links. But my calculations had shown that I'd be maxing out the two RAID cards I have before that happens. But I could be wrong. On the other hand, bandwidth was just half the story as I saw it. I do have a fair share of apps doing random I/O and seeks. Here I wanted to maximise my IOPS. Splitting the I/O over more independent spindles should hopefully boost my performance in that respect. > > I have absolutely no real-world experience with these sorts of systems, > and could therefore be totally wrong, but my gut feeling is that the > theoretical numbers will not scale with so many drives - something like > 15 1 TB SATA drives would be similar in speed in practice. I almost went with that option. The point was that I was worried about the IOPS expectations and there was no real way to test at full load. So I specced it out generously. By way of application: this storage is supposed to be the NFS server that will serve out NFS mounts to ~275 servers, each with 8 cores. Being an HPC environment, there's pretty much full load 24x7. -- Rahul
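For anyone wanting to sanity-check the IOPS side of such an array before the real load arrives, a synthetic load generator such as fio gives at least a rough picture. A minimal sketch, with the mount point, job count and file sizes as placeholder assumptions to adjust:

    # Simulate many parallel clients doing random 4k reads against the array
    # (assumes it is mounted at /mnt/array - a made-up path).
    fio --name=randread --directory=/mnt/array \
        --rw=randread --bs=4k --direct=1 --ioengine=libaio \
        --numjobs=32 --iodepth=16 --size=2G \
        --runtime=120 --time_based --group_reporting

The reported IOPS number is what the extra spindles are buying; re-running with --rw=read gives the sequential figure to compare against the RAID-card and network limits.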
From: Rahul on 21 Jan 2010 23:37 Aragorn <aragorn(a)chatfactory.invalid> wrote in news:hj9ftp$7p4$1(a)news.eternal-september.org: > Quite honestly, I'm enjoying this thread, because I get to hear > interesting feedback - and I think you do to, from your point of view > - and I have a feeling that Rahul, the OP, is sitting there enjoying > himself over all the valid arguments being discussed here in the > debate over various RAID types. ;-) Absolutely. I am still lurking around. After all, I'm the one to blame, being the OP who started us off on this interesting RAID discussion. > I have seen this method being discussed before, but to be honest I've > never even looked into "rsnapshot". I do intend to explore it for the > future, since the ability to make incremental backups seems very > interesting. Jumping in on the rsync topic: I've been using rsync+rsnapshot backups for the last 6 months to keep around 1 Terabyte of user data safe. It works like a charm. Originally there was some convincing required, since the bosses thought "it wasn't backup unless it was tape". But we have a compromise now in that we do a full tape backup about once every 6 months. The incremental daily, weekly and monthly backups are all disk-to-disk. > So far I have always made either data backups only - and on occasion, > backups of important directories such as "/etc" - I'd also add something like bazaar or mercurial to the backup equation. I've found it invaluable for versioning my config files and protecting myself against sys-admin blunders. Besides, it gives me new-found confidence when making system changes, knowing I can do an instant rollback to pretty much any point in time if there is trouble. >> It also needs to be saved in a safe and reliable place - many people >> have had regular backups saved to tape only to find later that the >> tapes were unreadable. I'm moving away from tapes to disk-to-disk backups. In these days of cheap disks, it's starting to make so much more sense. > > Well, what I personally find overkill in this is that he intends to > use the entire array only for the "/home" filesystem. That seems like > an awful waste of some great resources that I personally would put to > use more efficiently - e.g. you could have the entire "/var" tree on > it, and an additional "/srv" tree. Does it seem a waste when you consider that /home is being served out via NFS to ~275 servers? Those are the "compute nodes" doing most of the I/O, so that's where I need high-performance storage. Well, I already have a fast disk for /var etc., but those are merely local to the server connected to the storage. This server is supposed to do just one thing and do it well: serve out the central storage via NFS. > Well, since he talks of a high performance computing set-up, I would > imagine that he has plenty of 10 Gbit links at his disposal, or > possibly something a lot faster still. ;-) I have twin 10 Gbit links right now, with the option of two more at a later date. -- Rahul
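For anyone who wants to try the same thing, a minimal sketch of putting /etc under Mercurial looks something like this (the commands are the standard hg ones; the Bazaar equivalents are nearly identical, and etckeeper packages up much the same idea):

    # One-off setup (as root); the repository lives in /etc/.hg,
    # so make sure the regular backups pick that directory up too.
    cd /etc
    hg init
    hg add
    hg commit -m "Baseline of /etc"

    # After a configuration change:
    hg status                    # what changed?
    hg diff exports              # review a particular file, e.g. /etc/exports
    hg addremove                 # pick up newly created or deleted files
    hg commit -m "Describe the change"

    # If a change turns out to be a blunder, roll the files back to an
    # earlier revision (here revision 5) and record that as a new commit:
    hg revert --all --rev 5
    hg commit -m "Rolled back to revision 5"

One limitation worth knowing: Mercurial tracks file contents and the executable bit, but not ownership or other permissions, so it complements the backups rather than replacing them.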
From: David Brown on 22 Jan 2010 04:06
Rahul wrote: > David Brown <david.brown(a)hesbynett.removethisbit.no> wrote in > news:8KednbJOTsi3FsrWnZ2dnUVZ8i2dnZ2d(a)lyse.net: > >> It also does not strike me as a high value-for-money system - I can't >> help feeling that this is way more bandwidth than you could actually >> make use of in the rest of the system, so it would be better to have >> fewer larger drives and less layers to reduce the latencies. > > Interesting! I actually spent money getting more but smaller drives. It > would have been easier getting larger drives but I thought more spindles > the better especially for random I/O. > It is very difficult to judge these things - it is so dependent on the load. Don't rate my gut feeling above /your/ gut feeling! The trouble is, the only way to be sure is to try out both combinations and see, which is a little impractical. If you are able, you could do testing with only half the disks attached to see if it makes a measurable difference. More spindles /may/ reduce the seek time for random IO - it will be especially effective if there are a lot of random reads in parallel. > Which latency could have been reduced had I used larger drives? There's > hardly any layers I can see that are extranous. I have a storage box, > RAID controller and then LVM / mdadm on top. > I have been imagining two layers of controllers here - your disks are connected to one controller on a storage box, and that controller is then connected to the controller in the host. With fewer disks, you can cut out the middle man and connect the disks directly to raid controllers on the host. Theoretically, that will reduce latency - though I don't know if there will be a difference in real life. > So long as I used more than one box having the LVM / mdadm layer seemed > pretty necessary anyways. And just 15 drives seemed too few for IOPS. > Besides the larger drives have bad seek times. > Have you considered using a clustered file system such as Lustre or GFS? You then have a central server for metadata, which is easy to get fast (everything will be in the server's ram cache), and the actual data is spread around and duplicated on different servers. >> Spent the >> cash saved on even more ram :-) > > I already thought I maxed out on my RAM. I've 48 Gigs. Do you think > that's enough? I guess I can always add more later if necessary. > 640 KB ram is enough for anybody :-) 48 GB is actually quite a lot. Whether it is enough or not is hard to say. Run the system you have got - if people complain that it is slow, do some monitoring to find the bottlenecks. If they don't complain, then 48 GB is enough! >> 45 disks at a throughput of say 75 MBps each gives about 3.3 GBps - say >> 3 GBps since some are hot spares. Ultimately, being a server, this is >> going to be pumped out on Ethernet links. That's a lot of bandwidth - >> it would effectively saturate four 10 Gbit links. > > Well, I only have two 10 Gbit links. But my calculations had shown that > I'd be maxing out the two RAID cards I have before that happens. But I > could be wrong. > You know the details of the system, and you've probably done the calculations on paper rather than just in your head, so your numbers are a better guess. How does the bandwidth of the raid cards compare to the theoretical bandwidth of the disks? > On the other hand bandwidth was just half the story as I saw it. I did > have a fair share of apps doing random I/O and seeks. Here I wanted to > maximise my IOPS. 
Splitting over more independant spindles should > hopefully boost my performance in that respect. > >> I have absolutely no real-world experience with these sorts of systems, >> and could therefore be totally wrong, but my gut feeling is that the >> theoretical numbers will not scale with so many drives - something like >> 15 1 TB SATA drives would be similar in speed in practice. > > I almost did that option. Point was that I was scared with the IOPS > expectations and there was no real way to test on full load. SO I speced > it out genorously. > > By way of application: This storage is supposed to be the NFS server that > will serve out NFS mounts to ~275 servers each 8 core. Being a HPC > environment there's pretty much full load 24x7. > > >
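On the question of whether the RAID cards, the spindles or the 10 Gbit links give out first, the surest answer comes from watching the server once the cluster is actually loading it. A minimal sketch using standard tools (iostat and sar come from the sysstat package, nfsstat from nfs-utils; no special setup assumed):

    iostat -x 5              # per-disk utilisation, queue depth and await times
    sar -n DEV 5             # per-interface throughput on the 10 Gbit links
    nfsstat -s               # server-side NFS/RPC operation counts
    cat /proc/net/rpc/nfsd   # raw nfsd counters, including thread usage

If the disks sit near 100% utilised while the NICs are loafing, the spindles are the limit; if the NICs are flat out first, the extra IOPS headroom was cheap insurance.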