From: Dan Magenheimer on 3 May 2010 11:10

> > My analogy only requires some statistical bad luck: Multiple guests
> > with peaks and valleys of memory requirements happen to have their
> > peaks align.
>
> Not sure I understand. Virtualization is all about statistical
> multiplexing of fixed resources.

If all guests demand a resource simultaneously, that is peak alignment ==
"bad luck". (But, honestly, I don't even remember the point either of us
was trying to make here :-)

> > Or maybe not... when a guest is in the middle of a live migration,
> > I believe (in Xen), the entire guest memory allocation (possibly
> > excluding ballooned-out pages) must be simultaneously in RAM briefly
> > in BOTH the host and target machine. That is, live migration is
> > not "pipelined". Is this also true of KVM?
>
> No. The entire guest address space can be swapped out on the source and
> target, less the pages being copied to or from the wire, and pages
> actively accessed by the guest. Of course performance will suck if all
> memory is swapped out.

Will it suck to the point of eventually causing the live migration to
fail? Or will swap-storms effectively cause denial-of-service for other
guests? Anyway, if live migration works fine with mostly-swapped-out
guests on KVM, that's great.

> > Choosing the _optimal_ overcommit ratio is impossible without
> > prescient knowledge of the workload in each guest. Hoping memory
> > will be available is certainly not a good solution, but if memory
> > is not available guest swapping is much better than host swapping.
>
> You cannot rely on guest swapping.

Frontswap only relies on the guest having an existing swap device,
defined in /etc/fstab like any normal Linux swap device. If this is
"relying on guest swapping", yes, frontswap relies on guest swapping.
Or if you are referring to your "host can't force guest to reclaim
pages" argument, see the other thread.

> > And making RAM usage as dynamic as possible and live migration
> > as easy as possible are keys to maximizing the benefits (and
> > limiting the problems) of virtualization.
>
> That is why you need overcommit. You make things dynamic with page
> sharing and ballooning and live migration, but at some point you need a
> failsafe fallback. The only failsafe fallback I can see (where the
> host doesn't rely on guests) is swapping.

No fallback is required if the overcommitment is done intelligently.

> As far as I can tell, frontswap+tmem increases the problem. You loan
> the guest some memory without the means to take it back, which
> increases memory pressure on the host. The result is that if you want
> to avoid swapping (or are unable to) you need to undercommit host
> resources. Instead of sum(guest mem) + reserve < (host mem), you need
> sum(guest mem + committed tmem) + reserve < (host mem). You need more
> host memory, or fewer guests, or to be prepared to swap if the worst
> happens.

Your argument might make sense from a KVM perspective but is not true of
frontswap with Xen+tmem. With KVM, the host's swap disk(s) can all be
used as "slow RAM". With Xen, there is no host swap disk. So, yes, the
degree of potential memory overcommitment is smaller with Xen+tmem than
with KVM. In order to avoid all the host problems with host-swapping,
frontswap+Xen+tmem intentionally limits the degree of memory
overcommitment... but this is just memory overcommitment done
intelligently.
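For reference, the "existing swap device" mentioned above is nothing more
than the ordinary swap entry a guest already has; frontswap adds no
configuration of its own. A minimal example (the device name is purely
illustrative):

    # /etc/fstab -- an ordinary guest swap device is all frontswap relies on
    /dev/xvdb1   none   swap   sw   0   0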
From: Dan Magenheimer on 3 May 2010 12:10

> > Simple policies must exist and must be enforced by the hypervisor to
> > ensure this doesn't happen. Xen+tmem provides these policies and
> > enforces them. And it enforces them very _dynamically_ to constantly
> > optimize RAM utilization across multiple guests, each with dynamically
> > varying RAM usage. Frontswap fits nicely into this framework.
>
> Can you explain what "enforcing" means in this context? You loaned the
> guest some pages, can you enforce their return?

We're getting into hypervisor policy issues, but given that probably
nobody else is listening by now, I guess that's OK. ;-)

The enforcement is on the "put" side. The page is not loaned, it is
freely given, but only if the guest is within its contractual limitations
(e.g. within its predefined "maxmem"). If the guest chooses to never
remove the pages from frontswap, that's the guest's option, but that part
of the guest's memory allocation can never be used for anything else, so
it is in the guest's self-interest to "get" or "flush" the pages from
frontswap.

> > Huge performance hits that are completely inexplicable to a user
> > give virtualization a bad reputation. If the user (i.e. guest,
> > not host, administrator) can at least see "Hmmm... I'm doing a lot
> > of swapping, guess I'd better pay for more (virtual) RAM", then
> > the user objections are greatly reduced.
>
> What you're saying is "don't overcommit".

Not at all. I am saying "overcommit, but do it intelligently".

> That's a good policy for some scenarios but not for others. Note it
> applies equally well for cpu as well as memory.

Perhaps, but CPU overcommit has been a well-understood part of computing
for a very long time, and users, admins, and hosting providers all know
how to recognize it and deal with it. Not so with overcommitment of
memory; the only exposure to memory limitations is "my disk light is
flashing a lot, I'd better buy more RAM". Obviously, this doesn't
translate to virtualization very well. And, as for your interrupt latency
analogy, let's revisit that if/when Xen or KVM support CPU overcommitment
for real-time-sensitive guests. Until then, your analogy is misleading.

> frontswap+tmem is not overcommit, it's undercommit. You have spare
> memory, and you give it away. It isn't a replacement. However, without
> the means to reclaim this spare memory, it can result in overcommit.

But you are missing part of the magic: Once the memory page is no longer
directly addressable (AND this implies not directly writable) by the
guest, the hypervisor can do interesting things with it, such as
compression and deduplication. As a result, the sum of pages used by all
the guests exceeds the total pages of RAM in the system. Thus
overcommitment. I agree that the degree of overcommitment is less than
what is possible with host-swapping, but none of the evil issues of
host-swapping happen. Again, this is "intelligent overcommitment". Other
existing forms are "overcommit and cross your fingers that bad things
don't happen."

> > Xen+tmem uses the SAME internal kernel interface. The Xen-specific
> > code which performs the Xen-specific stuff (hypercalls) is only in
> > the Xen-specific directory.
>
> This makes it an external interface.
> :
> Something completely internal to the guest can be replaced by something
> completely different. Something that talks to a hypervisor will need
> those hooks forever to avoid regressions.

Uh, no.
As I've said, everything about frontswap is entirely optional, both at
compile-time and run-time. A frontswap-enabled guest is fully compatible
with a hypervisor with no frontswap; a frontswap-enabled hypervisor is
fully compatible with a guest with no frontswap. The only thing that is
reserved forever is a hypervisor-specific "hypercall number", which is
not exposed in the Linux kernel except in Xen-specific code. And, for
Xen, frontswap shares the same hypercall number with cleancache. So,
IMHO, you are being alarmist. This is not an "API maintenance" problem
for Linux.

> Exactly as large as the swap space which the guest would have in the
> frontswap+tmem case.
> :
> Not needed, though I expect it is already supported (SAN volumes do
> grow).
> :
> If block layer overhead is a problem, go ahead and optimize it instead
> of adding new interfaces to bypass it. Though I expect it wouldn't be
> needed, and if any optimization needs to be done it is in the swap
> layer. Optimizing swap has the additional benefit of improving
> performance on flash-backed swap.
> :
> What happens when no tmem is available? You swap to a volume. That's
> the disk size needed.
> :
> Your dynamic swap is limited too. And no, no guest modifications.

You keep saying you are going to implement all of the dynamic features
of frontswap with no changes to the guest, no copying, and no
host-swapping. You are being disingenuous. VMware has had a lot of
people working on virtualization a lot longer than you or I have. Don't
you think they would have done this by now? Frontswap exists today and
is even shipping in real released products. If you can work your magic
(in Xen... I am not trying to claim frontswap should work with KVM),
please show us the code.

> So, you take a synchronous copyful interface, add another copy to make
> it into an asynchronous interface, instead of using the original
> asynchronous copyless interface.

"Add another copy" is not required any more than it is with the other
examples you cited. The "original asynchronous copyless interface" works
because DMA for devices has been around for >40 years and has been
greatly refined. We're not talking about DMA to a device here, we're
talking about DMA from one place in RAM to another (i.e. from guest RAM
to hypervisor RAM). Do you have examples of DMA engines that do
page-size-ish RAM-to-RAM transfers more efficiently than copying?

> The networking stack seems to think 4096 bytes is a good size for dma
> (see net/core/user_dma.c, NET_DMA_DEFAULT_COPYBREAK).

Networking is device-to-RAM, not RAM-to-RAM.

> When swapping out, Linux already batches pages in the block device's
> request queue. Swapping out is inherently asynchronous and batched,
> you're swapping out those pages _because_ you don't need them, and
> you're never interested in swapping out a single page. Linux already
> reserves memory for use during swapout. There's no need to re-solve
> solved problems.

Swapping out is inherently asynchronous and batched because it was
designed for swapping to a device, while you are claiming that the same
_unchanged_ interface is suitable for swap-to-hypervisor-RAM and at the
same time saying that the block layer might need to be "optimized"
(apparently without code changes). I'm not trying to re-solve a solved
problem; frontswap solves a NEW problem, with very little impact on
existing code.

> Swapping in is less simple, it is mostly synchronous (in some cases it
> isn't: with many threads, or with the preswap patches (IIRC unmerged)).
> You can always choose to copy if you don't have enough to justify dma.

Do you have a pointer to these preswap patches?
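To make the put-side semantics described earlier in this message
concrete, here is a minimal, self-contained sketch. Every name in it is
an illustrative placeholder (this is not the actual frontswap or tmem
API); it only shows the two points in code form: the hypervisor enforces
its policy by refusing a put, and the guest's own swap device remains the
failsafe fallback.

/*
 * Sketch only: all identifiers are hypothetical placeholders, not the
 * real frontswap/tmem interface.
 */
#include <stdbool.h>
#include <stdio.h>

struct page { char data[4096]; };

/* --- hypothetical hypervisor side ----------------------------------- */

static unsigned long tmem_pages_used;           /* pages this guest holds */
static const unsigned long guest_maxmem = 1024; /* contractual limit      */

/* Accept the page only while the guest is within its maxmem contract. */
static bool hypervisor_tmem_put(const struct page *page)
{
        (void)page;                     /* a real put would copy/compress */
        if (tmem_pages_used >= guest_maxmem)
                return false;           /* enforcement: put refused       */
        tmem_pages_used++;
        return true;
}

/* --- hypothetical guest side ----------------------------------------- */

/* Stand-in for the normal block-device swapout path. */
static int swap_writepage_to_device(const struct page *page)
{
        (void)page;
        return 0;                       /* pretend the write was queued   */
}

static int swap_out_one_page(const struct page *page, bool frontswap_enabled)
{
        if (frontswap_enabled && hypervisor_tmem_put(page))
                return 0;               /* page now lives in host RAM     */

        /* Failsafe fallback: the guest's existing swap device.           */
        return swap_writepage_to_device(page);
}

int main(void)
{
        struct page p = { { 0 } };
        printf("swapped out: %d\n", swap_out_one_page(&p, true));
        return 0;
}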
From: Pavel Machek on 3 May 2010 15:40

> > If block layer overhead is a problem, go ahead and optimize it instead
> > of adding new interfaces to bypass it. Though I expect it wouldn't be
> > needed, and if any optimization needs to be done it is in the swap
> > layer. Optimizing swap has the additional benefit of improving
> > performance on flash-backed swap.
> > :
> > What happens when no tmem is available? You swap to a volume. That's
> > the disk size needed.
> > :
> > Your dynamic swap is limited too. And no, no guest modifications.
>
> You keep saying you are going to implement all of the dynamic features
> of frontswap with no changes to the guest, no copying, and no
> host-swapping. You are being disingenuous. VMware has had a lot

I don't see why "no copying" is a requirement. I believe the requirement
should be "it is fast enough".

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
From: Martin Schwidefsky on 10 May 2010 12:10
On Fri, 30 Apr 2010 09:08:00 -0700
Dave Hansen <dave@linux.vnet.ibm.com> wrote:

> On Fri, 2010-04-30 at 08:59 -0700, Dan Magenheimer wrote:
> > Dave or others can correct me if I am wrong, but I think CMM2 also
> > handles dirty pages that must be retained by the hypervisor. The
> > difference between CMM2 (for dirty pages) and frontswap is that
> > CMM2 sets hints that can be handled asynchronously while frontswap
> > provides explicit hooks that synchronously succeed/fail.
>
> Once pages were dirtied (or I guess just slightly before), they became
> volatile, and I don't think the hypervisor could do anything with them.
> It could still swap them out like usual, but none of the CMM-specific
> optimizations could be performed.

Well, almost correct :-)

A dirty page (or one that is about to become dirty) can be in one of two
CMMA states:

1) stable
   This is the case for pages where the kernel is doing some operation on
   the page that will make it dirty, e.g. I/O. Before the kernel can
   allow the operation, the page has to be made stable. If the state
   conversion to stable fails because the hypervisor removed the page,
   the page needs to be deleted from the page cache and recreated from
   scratch.

2) potentially volatile
   This state is used for page cache pages for which a writable mapping
   exists. The page can be removed by the hypervisor as long as the
   physical per-page dirty bit is not set. As soon as the bit is set, the
   page is considered stable although the CMMA state still is potentially
   volatile.

In both cases the only thing the hypervisor can do with a dirty page is
to swap it as usual.

--
blue skies,
   Martin.

"Reality continues to ruin my life." - Calvin.
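The two cases Martin lays out above boil down to a small state machine.
Below is a rough, self-contained sketch of those rules; every identifier
(cmma_set_state, recreate_from_scratch, and so on) is a hypothetical
placeholder for illustration, not the real s390 CMM code.

/*
 * Sketch of the two CMMA cases described above; names are hypothetical.
 */
#include <stdbool.h>
#include <stdlib.h>

enum cmma_state { CMMA_STABLE, CMMA_POTENTIALLY_VOLATILE };

struct page {
        enum cmma_state state;
        bool discarded_by_hypervisor;   /* page was removed while volatile */
        bool dirty_bit;                 /* physical per-page dirty bit     */
};

/* Hypothetical state-change primitive: converting a page that the
 * hypervisor already discarded to stable fails.                          */
static bool cmma_set_state(struct page *p, enum cmma_state new_state)
{
        if (new_state == CMMA_STABLE && p->discarded_by_hypervisor)
                return false;
        p->state = new_state;
        return true;
}

/* Hypothetical page-cache helper: drop the stale page and read a fresh,
 * stable copy back in from the backing store.                            */
static struct page *recreate_from_scratch(struct page *old)
{
        struct page *p = calloc(1, sizeof(*p));
        p->state = CMMA_STABLE;
        free(old);
        return p;
}

/* Case 1: before an operation that will dirty the page (e.g. I/O), the
 * page must be made stable; on failure it is recreated from scratch.     */
static struct page *prepare_page_for_io(struct page *p)
{
        if (!cmma_set_state(p, CMMA_STABLE))
                p = recreate_from_scratch(p);
        return p;
}

/* Case 2: the hypervisor may discard a potentially-volatile page only
 * while the physical dirty bit is clear; once set, the page behaves as
 * stable even though its CMMA state is unchanged.                        */
static bool hypervisor_may_discard(const struct page *p)
{
        return p->state == CMMA_POTENTIALLY_VOLATILE && !p->dirty_bit;
}

int main(void)
{
        struct page *p = calloc(1, sizeof(*p));
        p->state = CMMA_POTENTIALLY_VOLATILE;
        p->discarded_by_hypervisor = true;      /* simulate a discard */
        p = prepare_page_for_io(p);             /* forces a re-read   */
        (void)hypervisor_may_discard(p);
        free(p);
        return 0;
}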