Unexpected splice "always copy" behavior observed [Kernel]


From: Steven Rostedt on 18 May 2010 12:20

On Tue, 2010-05-18 at 11:53 -0400, Steven Rostedt wrote:

> I'm currently looking at the network code to see if it is better.

The network code seems to do the right thing. It sends the actual page
directly to the network.

Hopefully we can find a way to avoid the copy to file. But the splice
code was created to avoid the copy to and from userspace, it did not
guarantee no copy within the kernel itself.

-- Steve

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Nick Piggin on 18 May 2010 12:20

On Wed, May 19, 2010 at 02:00:51AM +1000, Nick Piggin wrote:
> On Tue, May 18, 2010 at 10:56:24AM -0500, Christoph Lameter wrote:
> > On Wed, 19 May 2010, Nick Piggin wrote:
> >
> > > What would be needed is to have filesystem maintainers go through and
> > > enable it on a case by case basis. It's trivial for tmpfs/ramfs type
> > > filesystems and I have a patch for those, but I never posted it yet.
> > > Even basic buffer head filesystems IIRC get a little more complex --
> > > but we may get some mileage just out of invalidating the existing
> > > pagecache rather than getting fancy and trying to move buffers over
> > > to the new page.
> >
> > There is a "migration" address space operation for moving pages. Page
> > migration requires that in order to be able to move dirty pages. Can
> > splice use that?
>
> Hmm yes I didn't think of that, it probably could.

It's not the only requirement, of course, just that it could
potentially reuse some of the code.

The big difference is that the source page is already dirty, and
the destination page might not exist, might exist and be partially
uptodate, not have blocks allocated, might be past i_size, fully
uptodate, etc.

So it's more than a matter of just a simple copy to another page
and taking over exactly the same filesystem state as the old page.

From: Linus Torvalds on 18 May 2010 12:30

On Tue, 18 May 2010, Steven Rostedt wrote:
>
> Hopefully we can find a way to avoid the copy to file. But the splice
> code was created to avoid the copy to and from userspace, it did not
> guarantee no copy within the kernel itself.

Well, we always _wanted_ to splice directly to a file, but it's just not
been done properly. It's not entirely trivial, since you need to worry
about preexisting pages and generally just do the right thing wrt the
filesystem.

And no, it should NOT use migration code. I suspect you could do something
fairly simple like:

- get the inode semaphore.
- check if the splice is a pure "extend size" operation for that page
- if so, just create the page cache entry and mark it dirty
- otherwise, fall back to copying.

because the "extend file" case is the easiest one, and is likely the only
one that matters in practice (if you are overwriting an existing file,
things get _way_ hairier, and why the hell would anybody expect that to be
fast anyway?)

But somebody needs to write the code..

Linus
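
The four steps in the message above can be modelled in plain userspace C.
This is only a sketch of the decision, not kernel code: SKETCH_PAGE_SIZE,
sketch_page_align() and sketch_choose_path() are hypothetical names, the
4096-byte page size is an assumption, and the caller is assumed to already
hold the inode lock. It treats a write as a pure "extend size" operation
for its page when the destination page starts at or beyond the
page-aligned end of file, so that page cannot hold existing file data.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* The two outcomes from the list: insert the page or fall back to copying. */
enum splice_path { SPLICE_PATH_INSERT, SPLICE_PATH_COPY };

/* Round a byte count up to the next page boundary. */
static uint64_t sketch_page_align(uint64_t n)
{
    return (n + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE * SKETCH_PAGE_SIZE;
}

/*
 * Hypothetical model of the check made under the inode lock.
 * i_size is the current file size, pos the file offset being written.
 * Insert only when the write starts on a page boundary and that page
 * lies entirely past the existing data; otherwise copy.
 */
static enum splice_path sketch_choose_path(uint64_t i_size, uint64_t pos)
{
    bool page_start = (pos % SKETCH_PAGE_SIZE) == 0;
    bool beyond_eof = pos >= sketch_page_align(i_size);

    return (page_start && beyond_eof) ? SPLICE_PATH_INSERT : SPLICE_PATH_COPY;
}
```

With a 5000-byte file, for example, a write at offset 4096 lands in a page
that already holds data and takes the copy path, while a write at offset
8192 starts a fresh page and could take the insert path.
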

From: Nick Piggin on 19 May 2010 02:40

On Tue, May 18, 2010 at 09:25:05AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 18 May 2010, Steven Rostedt wrote:
> >
> > Hopefully we can find a way to avoid the copy to file. But the splice
> > code was created to avoid the copy to and from userspace, it did not
> > guarantee no copy within the kernel itself.
>
> Well, we always _wanted_ to splice directly to a file, but it's just not
> been done properly. It's not entirely trivial, since you need to worry
> about preexisting pages and generally just do the right thing wrt the
> filesystem.
>
> And no, it should NOT use migration code. I suspect you could do something
> fairly simple like:

I was thinking it could possibly reuse some of the migration code for
swapping filesystem state to the new page. But I agree it gets hairy and
is probably better to just insert new pages.

>
> - get the inode semaphore.
> - check if the splice is a pure "extend size" operation for that page
> - if so, just create the page cache entry and mark it dirty
> - otherwise, fall back to copying.
>
> because the "extend file" case is the easiest one, and is likely the only
> one that matters in practice (if you are overwriting an existing file,
> things get _way_ hairier, and why the hell would anybody expect that to be
> fast anyway?)
>
> But somebody needs to write the code..

We can possibly do an attempt to invalidate existing pagecache and
then try to install the new page. The filesystem still needs a look
over to ensure error handling will work properly, and that it does
not make incorrect assumptions about the contents of the page being
passed in.

This still isn't ideal because we drop the filesystem state (eg buffer
heads) on a page which, by definition, will need to be written out soon.
But something smarter could be added if it turns out to be important.

Big if, because I don't like adding complex code without having a
really good reason. I do like having the splice flag there, though.
The more the app can tell the kernel the better. Hopefully people use
it and we can get a better idea of whether these fancy optimisations
will be worth it.

From: Linus Torvalds on 19 May 2010 10:50

On Wed, 19 May 2010, Nick Piggin wrote:
>
> We can possibly do an attempt to invalidate existing pagecache and
> then try to install the new page.

Yes, but that's going to be rather hairier. We need to make sure that the
filesystem doesn't have some kind of dirty pointers to the old page etc.
Although I guess that should always show up in the page counters, so I
guess we can always handle the case of page_count() being 1 (only page
cache) and the page being unlocked.

So I'd much rather just handle the "append to the end".

The real limitation is likely always going to be the fact that it has to
be page-aligned and a full page. For a lot of splice inputs, that simply
won't be the case, and you'll end up copying for alignment reasons anyway.

Linus
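
The alignment limitation in the message above can be illustrated with a
small userspace sketch. Nothing here is kernel code: struct sketch_buf is a
hypothetical stand-in that keeps only an offset and a length for each
spliced buffer, SKETCH_PAGE_SIZE assumes 4096-byte pages, and both helper
functions are invented for illustration. A buffer counts as movable only if
it starts at offset 0 in its page and fills the whole page; anything else
would still have to be copied.

```c
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */

/* Hypothetical description of one spliced buffer within its page. */
struct sketch_buf {
    unsigned int offset;  /* where the data starts inside the page */
    unsigned int len;     /* number of valid bytes */
};

/* A buffer can be moved only if it is page-aligned and a full page. */
static int sketch_buf_movable(const struct sketch_buf *b)
{
    return b->offset == 0 && b->len == SKETCH_PAGE_SIZE;
}

/* Count how many buffers in a batch could avoid the copy. */
static size_t sketch_count_movable(const struct sketch_buf *bufs, size_t n)
{
    size_t movable = 0;

    for (size_t i = 0; i < n; i++) {
        if (sketch_buf_movable(&bufs[i]))
            movable++;
    }
    return movable;
}
```

Input that arrives in odd-sized chunks (network payloads, for instance)
produces buffers with a nonzero offset or a short length, so under this
model most of them would fall back to copying.
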