From: Avi Kivity on 16 Sep 2009 17:10

On 09/16/2009 10:22 PM, Gregory Haskins wrote:
> Avi Kivity wrote:
>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>>> If kvm can do it, others can.
>>>
>>> The problem is that you seem to either hand-wave over details like this,
>>> or you give details that are pretty much exactly what vbus does already.
>>> My point is that I've already sat down and thought about these issues
>>> and solved them in a freely available GPL'ed software package.
>>
>> In the kernel. IMO that's the wrong place for it.
>
> 3) "in-kernel": You can do something like virtio-net to vhost to
> potentially meet some of the requirements, but not all.
>
> In order to fully meet (3), you would need to do some of that stuff you
> mentioned in the last reply with muxing device-nr/reg-nr. In addition,
> we need to have a facility for mapping eventfds and establishing a
> signaling mechanism (like PIO+qid), etc. KVM does this with
> IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
> invented.

irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

> To meet performance, this stuff has to be in kernel and there has to be
> a way to manage it.

and management belongs in userspace.

> Since vbus was designed to do exactly that, this is
> what I would advocate. You could also reinvent these concepts and put
> your own mux and mapping code in place, in addition to all the other
> stuff that vbus does. But I am not clear why anyone would want to.

Maybe they like their backward compatibility and Windows support.

> So no, the kernel is not the wrong place for it. It's the _only_ place
> for it. Otherwise, just use (1) and be done with it.

I'm talking about the config stuff, not the data path.

>> Further, if we adopt vbus, we drop compatibility with existing guests
>> or have to support both vbus and virtio-pci.
>
> We already need to support both (at least to support Ira). virtio-pci
> doesn't work here. Something else (vbus, or vbus-like) is needed.

virtio-ira.

>>> So the question is: is your position that vbus is all wrong and you wish
>>> to create a new bus-like thing to solve the problem?
>>
>> I don't intend to create anything new, I am satisfied with virtio. If
>> it works for Ira, excellent. If not, too bad.
>
> I think that about sums it up, then.

Yes. I'm all for reusing virtio, but I'm not going to switch to vbus or
support both for this esoteric use case.

>>> If so, how is it different from what I've already done? More
>>> importantly, what specific objections do you have to what I've done,
>>> as perhaps they can be fixed instead of starting over?
>>
>> The two biggest objections are:
>> - the host side is in the kernel
>
> As it needs to be.

vhost-net somehow manages to work without the config stuff in the kernel.

> With all due respect, based on all of your comments in aggregate I
> really do not think you are truly grasping what I am actually building here.

Thanks.

>>> Bingo. So now it's a question of do you want to write this layer from
>>> scratch, or re-use my framework.
>>
>> You will have to implement a connector or whatever for vbus as well.
>> vbus has more layers so it's probably smaller for vbus.
>
> Bingo!

(addictive, isn't it)

> That is precisely the point.
>
> All the stuff for how to map eventfds, handle signal mitigation, demux
> device/function pointers, isolation, etc., are built in. All the
> connector has to do is transport the 4-6 verbs and provide a memory
> mapping/copy function, and the rest is reusable. The device models
> would then work in all environments unmodified, and likewise the
> connectors could use all device-models unmodified.

Well, virtio has a similar abstraction on the guest side. The host-side
abstraction is limited to signalling since all configuration is in
userspace. vhost-net ought to work for lguest and s390 without change.
>> It was already implemented three times for virtio, so apparently that's
>> extensible too.
>
> And to my point, I'm trying to commoditize as much of that process as
> possible on both the front and backends (at least for cases where
> performance matters) so that you don't need to reinvent the wheel for
> each one.

Since you're interested in any-to-any connectors it makes sense to you.
I'm only interested in kvm-host-to-kvm-guest, so reducing the already
minor effort to implement a new virtio binding has little appeal to me.

>> You mean, if the x86 board was able to access the disks and dma into the
>> ppc board's memory? You'd run vhost-blk on x86 and virtio-net on ppc.
>
> But as we discussed, vhost doesn't work well if you try to run it on the
> x86 side due to its assumptions about pageable "guest" memory, right? So
> is that even an option? And even still, you would still need to solve
> the aggregation problem so that multiple devices can coexist.

I don't know. Maybe it can be made to work and maybe it cannot. It
probably can with some determined hacking.

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Gregory Haskins on 16 Sep 2009 23:20

Avi Kivity wrote:
> On 09/16/2009 10:22 PM, Gregory Haskins wrote:
>> Avi Kivity wrote:
>>> On 09/16/2009 05:10 PM, Gregory Haskins wrote:
>>>>> If kvm can do it, others can.
>>>>
>>>> The problem is that you seem to either hand-wave over details like this,
>>>> or you give details that are pretty much exactly what vbus does already.
>>>> My point is that I've already sat down and thought about these issues
>>>> and solved them in a freely available GPL'ed software package.
>>>
>>> In the kernel. IMO that's the wrong place for it.
>>
>> 3) "in-kernel": You can do something like virtio-net to vhost to
>> potentially meet some of the requirements, but not all.
>>
>> In order to fully meet (3), you would need to do some of that stuff you
>> mentioned in the last reply with muxing device-nr/reg-nr. In addition,
>> we need to have a facility for mapping eventfds and establishing a
>> signaling mechanism (like PIO+qid), etc. KVM does this with
>> IRQFD/IOEVENTFD, but we don't have KVM in this case so it needs to be
>> invented.
>
> irqfd/eventfd is the abstraction layer, it doesn't need to be reabstracted.

Not per se, but it needs to be interfaced. How do I register that
eventfd with the fastpath in Ira's rig? How do I signal the eventfd
(x86->ppc, and ppc->x86)?

To take it to the next level, how do I organize that mechanism so that
it works for more than one IO-stream (e.g. address the various queues
within ethernet or a different device like the console)? KVM has
IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not
have the luxury of an established IO paradigm.

Is vbus the only way to implement a solution? No. But it is _a_ way,
and it's one that was specifically designed to solve this very problem
(as well as others).
(As an aside, note that you generally will want an abstraction on top of
irqfd/eventfd, like shm-signal or virtqueues, to do shared-memory based
event mitigation, but I digress. That is a separate topic.)

>> To meet performance, this stuff has to be in kernel and there has to be
>> a way to manage it.
>
> and management belongs in userspace.

vbus does not dictate where the management must be. It's an extensible
framework, governed by what you plug into it (a la connectors and
devices).

For instance, the vbus-kvm connector in alacrityvm chooses to put DEVADD
and DEVDROP hotswap events into the interrupt stream, because they are
simple and we already needed the interrupt stream anyway for fast-path.

As another example: venet chose to put ->call(MACQUERY) "config-space"
into its call namespace because it's simple, and we already need
->call()s for fastpath. It therefore exports an attribute to sysfs that
allows the management app to set it.

I could likewise have designed the connector or device-model differently
so as to keep the mac-address and hotswap-events somewhere else (QEMU/PCI
userspace), but this seems silly to me when they are so trivial, so I
didn't.

>> Since vbus was designed to do exactly that, this is
>> what I would advocate. You could also reinvent these concepts and put
>> your own mux and mapping code in place, in addition to all the other
>> stuff that vbus does. But I am not clear why anyone would want to.
>
> Maybe they like their backward compatibility and Windows support.

This is really not relevant to this thread, since we are talking about
Ira's hardware. But if you must bring this up, then I will reiterate
that you just design the connector to interface with QEMU+PCI and you
have that too, if that was important to you.
But on that topic: since you could consider KVM a "motherboard
manufacturer" of sorts (it just happens to be virtual hardware), I don't
know why KVM seems to consider itself the only motherboard manufacturer
in the world that has to make everything look legacy. If a company like
ASUS wants to add some cutting-edge IO controller/bus, they simply do it.
Pretty much every product release may contain a different array of
devices, many of which are not backwards compatible with any prior
silicon. The guy/gal installing Windows on that system may see a "?" in
device-manager until they load a driver that supports the new chip, and
subsequently it works. It is certainly not a requirement to make said
chip somehow work with existing drivers/facilities on bare metal, per se.
Why should virtual systems be different?

So, yeah, the current design of the vbus-kvm connector means I have to
provide a driver. This is understood, and I have no problem with that.

The only thing that I would agree has to be backwards compatible is the
BIOS/boot function. If you can't support running an image like the
Windows installer, you are hosed. If you can't use your ethernet until
you get a chance to install a driver after the install completes, it's
just like most other systems in existence. IOW: it's not a big deal.

For cases where the IO system is needed as part of the boot/install, you
provide BIOS and/or install-disk support for it.

>> So no, the kernel is not the wrong place for it. It's the _only_ place
>> for it. Otherwise, just use (1) and be done with it.
>
> I'm talking about the config stuff, not the data path.

As stated above, where config stuff lives is a function of what you
interface to vbus. Data-path stuff must be in the kernel for performance
reasons, and this is what I was referring to. I think we are generally
both in agreement here.

What I was getting at is that you can't just hand-wave the datapath
stuff.
We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do device
discovery/addressing with PCI. Neither of those are available here in
Ira's case, yet the general concepts are needed. Therefore, we have to
come up with something else.

>>> Further, if we adopt vbus, we drop compatibility with existing guests
>>> or have to support both vbus and virtio-pci.
>>
>> We already need to support both (at least to support Ira). virtio-pci
>> doesn't work here. Something else (vbus, or vbus-like) is needed.
>
> virtio-ira.

Sure, virtio-ira, and he is on his own to make a bus-model under that,
or virtio-vbus + vbus-ira-connector to use the vbus framework. Either
model can work, I agree.

>>>> So the question is: is your position that vbus is all wrong and you
>>>> wish to create a new bus-like thing to solve the problem?
>>>
>>> I don't intend to create anything new, I am satisfied with virtio. If
>>> it works for Ira, excellent. If not, too bad.
>>
>> I think that about sums it up, then.
>
> Yes. I'm all for reusing virtio, but I'm not going to switch to vbus or
> support both for this esoteric use case.

With all due respect, no one asked you to. This sub-thread was
originally about using vhost in Ira's rig. When problems surfaced in
that proposed model, I highlighted that I had already addressed that
problem in vbus, and here we are.

>>>> If so, how is it different from what I've already done? More
>>>> importantly, what specific objections do you have to what I've done,
>>>> as perhaps they can be fixed instead of starting over?
>>>
>>> The two biggest objections are:
>>> - the host side is in the kernel
>>
>> As it needs to be.
>
> vhost-net somehow manages to work without the config stuff in the kernel.

I was referring to data-path stuff, like signal and memory
configuration/routing.
As an aside, it should be noted that vhost under KVM has
IRQFD/IOEVENTFD, PCI-emulation, QEMU, etc. to complement it and fill in
some of the pieces one needs for a complete solution. Not all
environments have all of those pieces (nor should they), and those
pieces need to come from somewhere.

It should also be noted that what remains (config/management) after the
data-path stuff is laid out is actually quite simple. It consists of
pretty much an enumerated list of device-ids within a container,
DEVADD(id) and DEVDROP(id) events, and some sysfs attributes as defined
on a per-device basis (many of which are often needed regardless of
whether the "config-space" operation is handled in-kernel or not).

Therefore, the configuration aspect of the system does not necessitate a
complicated (e.g. full PCI emulation) or external (e.g. userspace)
component per se. The parts of vbus that could be construed as
"management" are (afaict) built using accepted best practices for
managing arbitrary kernel subsystems (sysfs, configfs, ioctls, etc.), so
there is nothing new or reasonably controversial there.

It is for this reason that I think the objection to "in-kernel config"
is unfounded. Disagreements on this point may be settled by the
connector design, while still utilizing vbus, and thus retaining most of
the other benefits of using the vbus framework. The connector ultimately
dictates how and what is exposed to the "guest".

>> With all due respect, based on all of your comments in aggregate I
>> really do not think you are truly grasping what I am actually building
>> here.
>
> Thanks.

>>>> Bingo. So now it's a question of do you want to write this layer from
>>>> scratch, or re-use my framework.
>>>
>>> You will have to implement a connector or whatever for vbus as well.
>>> vbus has more layers so it's probably smaller for vbus.
>>
>> Bingo!
>
> (addictive, isn't it)

Apparently.

>> That is precisely the point.
>>
>> All the stuff for how to map eventfds, handle signal mitigation, demux
>> device/function pointers, isolation, etc., are built in. All the
>> connector has to do is transport the 4-6 verbs and provide a memory
>> mapping/copy function, and the rest is reusable. The device models
>> would then work in all environments unmodified, and likewise the
>> connectors could use all device-models unmodified.
>
> Well, virtio has a similar abstraction on the guest side. The host-side
> abstraction is limited to signalling since all configuration is in
> userspace. vhost-net ought to work for lguest and s390 without change.

But IIUC that is primarily because the revectoring work is already in
QEMU for virtio-u and it rides on that, right? Not knocking that; that's
nice and a distinct advantage. It should just be noted that it's based
on sunk cost, and not truly free. It's just already paid for, which is
different. It also means it only works in environments based on QEMU,
which not all are (as evident by this sub-thread).

>>> It was already implemented three times for virtio, so apparently that's
>>> extensible too.
>>
>> And to my point, I'm trying to commoditize as much of that process as
>> possible on both the front and backends (at least for cases where
>> performance matters) so that you don't need to reinvent the wheel for
>> each one.
>
> Since you're interested in any-to-any connectors it makes sense to you.
> I'm only interested in kvm-host-to-kvm-guest, so reducing the already
> minor effort to implement a new virtio binding has little appeal to me.

Fair enough.

>>> You mean, if the x86 board was able to access the disks and dma into
>>> the ppc board's memory? You'd run vhost-blk on x86 and virtio-net on
>>> ppc.
>>
>> But as we discussed, vhost doesn't work well if you try to run it on the
>> x86 side due to its assumptions about pageable "guest" memory, right? So
>> is that even an option?
>> And even still, you would still need to solve
>> the aggregation problem so that multiple devices can coexist.
>
> I don't know. Maybe it can be made to work and maybe it cannot. It
> probably can with some determined hacking.

I guess you can say the same for any of the solutions.

Kind Regards,
-Greg
From: Michael S. Tsirkin on 17 Sep 2009 00:00

On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:
>> There is no role reversal.
>
> So if I have virtio-blk driver running on the x86 and vhost-blk device
> running on the ppc board, I can use the ppc board as a block-device.
> What if I really wanted to go the other way?

It seems ppc is the only one that can initiate DMA to an arbitrary
address, so you can't do this really; or you can, by tunneling each
request back to ppc or doing an extra data copy, but it's unlikely to
work well.

The limitation comes from hardware, not from the API we use.
From: Gregory Haskins on 17 Sep 2009 00:20

Michael S. Tsirkin wrote:
> On Wed, Sep 16, 2009 at 10:10:55AM -0400, Gregory Haskins wrote:
>>> There is no role reversal.
>>
>> So if I have virtio-blk driver running on the x86 and vhost-blk device
>> running on the ppc board, I can use the ppc board as a block-device.
>> What if I really wanted to go the other way?
>
> It seems ppc is the only one that can initiate DMA to an arbitrary
> address, so you can't do this really, or you can by tunneling each
> request back to ppc, or doing an extra data copy, but it's unlikely to
> work well.
>
> The limitation comes from hardware, not from the API we use.

Understood, but presumably it can be exposed as a sub-function of the
ppc board's register file as a DMA-controller service to the x86. This
would fall into the "tunnel requests back" category you mention above,
though I think "tunnel" implies a heavier protocol than it would
actually require. This would look more like a PIO cycle to a DMA
controller than some higher-layer protocol. You would then utilize that
DMA service inside the memctx, and the rest of vbus would work
transparently with the existing devices/drivers.

I do agree it would require some benchmarking to determine its
feasibility, which is why I was careful to say things like "may work"
;). I also do not even know if it's possible to expose the service this
way on his system. If this design is not possible or performs poorly, I
admit vbus is just as hosed as vhost in regard to the "role correction"
benefit.

Kind Regards,
-Greg
From: Avi Kivity on 17 Sep 2009 04:00
On 09/17/2009 06:11 AM, Gregory Haskins wrote:
>> irqfd/eventfd is the abstraction layer, it doesn't need to be
>> reabstracted.
>
> Not per se, but it needs to be interfaced. How do I register that
> eventfd with the fastpath in Ira's rig? How do I signal the eventfd
> (x86->ppc, and ppc->x86)?

You write a userspace or kernel module to do it. It's a few dozen lines
of code.

> To take it to the next level, how do I organize that mechanism so that
> it works for more than one IO-stream (e.g. address the various queues
> within ethernet or a different device like the console)? KVM has
> IOEVENTFD and IRQFD managed with MSI and PIO. This new rig does not
> have the luxury of an established IO paradigm.
>
> Is vbus the only way to implement a solution? No. But it is _a_ way,
> and it's one that was specifically designed to solve this very problem
> (as well as others).

virtio assumes that the number of transports will be limited and that
the interesting growth is in the number of device classes and drivers.
So we have support for just three transports, but 6 device classes (9p,
rng, balloon, console, blk, net) and 8 drivers (the preceding 6 for
Linux, plus blk/net for Windows). It would have been nice to be able to
write a new binding in Visual Basic, but it's hardly a killer feature.

>>> Since vbus was designed to do exactly that, this is
>>> what I would advocate. You could also reinvent these concepts and put
>>> your own mux and mapping code in place, in addition to all the other
>>> stuff that vbus does. But I am not clear why anyone would want to.
>>
>> Maybe they like their backward compatibility and Windows support.
>
> This is really not relevant to this thread, since we are talking about
> Ira's hardware. But if you must bring this up, then I will reiterate
> that you just design the connector to interface with QEMU+PCI and you
> have that too, if that was important to you.

Well, for Ira the major issue is probably inclusion in the upstream
kernel.
> But on that topic: since you could consider KVM a "motherboard
> manufacturer" of sorts (it just happens to be virtual hardware), I don't
> know why KVM seems to consider itself the only motherboard manufacturer
> in the world that has to make everything look legacy. If a company like
> ASUS wants to add some cutting-edge IO controller/bus, they simply do it.

No, they don't. New buses are added through industry consortiums these
days. No one adds a bus that is only available with their machine, not
even Apple.

> Pretty much every product release may contain a different array of
> devices, many of which are not backwards compatible with any prior
> silicon. The guy/gal installing Windows on that system may see a "?" in
> device-manager until they load a driver that supports the new chip, and
> subsequently it works. It is certainly not a requirement to make said
> chip somehow work with existing drivers/facilities on bare metal, per
> se. Why should virtual systems be different?

Devices/drivers are a different matter, and if you have a virtio-net
device you'll get the same "?" until you load the driver. That's how
people and the OS vendors expect things to work.

> What I was getting at is that you can't just hand-wave the datapath
> stuff. We do fast path in KVM with IRQFD/IOEVENTFD+PIO, and we do
> device discovery/addressing with PCI.

That's not datapath stuff.

> Neither of those are available here in Ira's case, yet the general
> concepts are needed. Therefore, we have to come up with something else.

Ira has to implement virtio's ->kick() function and come up with
something for discovery. It's a lot less lines of code than there are
messages in this thread.

>> Yes. I'm all for reusing virtio, but I'm not going to switch to vbus or
>> support both for this esoteric use case.
>
> With all due respect, no one asked you to. This sub-thread was
> originally about using vhost in Ira's rig.
> When problems surfaced in that proposed model, I highlighted that I had
> already addressed that problem in vbus, and here we are.

Ah, okay. I have no interest in Ira choosing either virtio or vbus.

>> vhost-net somehow manages to work without the config stuff in the kernel.
>
> I was referring to data-path stuff, like signal and memory
> configuration/routing.

Signal and memory configuration/routing are not data-path stuff.

>> Well, virtio has a similar abstraction on the guest side. The host-side
>> abstraction is limited to signalling since all configuration is in
>> userspace. vhost-net ought to work for lguest and s390 without change.
>
> But IIUC that is primarily because the revectoring work is already in
> QEMU for virtio-u and it rides on that, right? Not knocking that; that's
> nice and a distinct advantage. It should just be noted that it's based
> on sunk cost, and not truly free. It's just already paid for, which is
> different. It also means it only works in environments based on QEMU,
> which not all are (as evident by this sub-thread).

No. We expose a mix of emulated-in-userspace and emulated-in-the-kernel
devices on one bus. Devices emulated in userspace only lose by having
the bus emulated in the kernel. Devices in the kernel gain nothing from
having the bus emulated in the kernel. It's a complete slow path, so it
belongs in userspace where state is easy to get at, development is
faster, and bugs are cheaper to fix.

--
error compiling committee.c: too many arguments to function