Faster CREATE DATABASE by delaying fsync (was 8.4.1 ubuntu karmic slow createdb) [PgSql]

Prev: Make TOAST_TUPLES_PER_PAGE configurable per table.
Next: use of dblink_build_sql_insert() induces a server crash

From: Robert Haas on 2 Feb 2010 12:36

On Fri, Jan 29, 2010 at 1:56 PM, Greg Stark <gsstark(a)mit.edu> wrote:
> On Tue, Jan 19, 2010 at 3:25 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
>> That function *seriously* needs documentation, in particular the fact
>> that it's a no-op on machines without the right kernel call. The name
>> you've chosen is very bad for those semantics. I'd pick something
>> else myself. Maybe "pg_start_data_flush" or something like that?
>>
>
> I would like to make one token argument in favour of the name I
> picked. If it doesn't convince I'll change it since we can always
> revisit the API down the road.
>
> I envision having two function calls, pg_fsync_start() and
> pg_fsync_finish(). The latter will wait until the data synced in the
> first call is actually synced. The fall-back if there's no
> implementation of this would be for fsync_start() to be a noop (or
> something unreliable like posix_fadvise) and fsync_finish() to just be
> a regular fsync.
>
> I think we can accomplish this with sync_file_range() but I need to
> read up on how it actually works a bit more. In this case it doesn't
> make a difference since when we call fsync_finish() it's going to be
> for the entire file and nothing else will have been writing to these
> files. But for wal writing and checkpointing it might have very
> different performance characteristics.
>
> The big objection to this is that then we don't really have an api for
> FADV_DONT_NEED which is more about cache policy than about syncing to
> disk. So for example a sequential scan might want to indicate that it
> isn't planning on reading the buffers it's churning through but
> doesn't want to force them to be written sooner than otherwise and is
> never going to call fsync_finish().

I took a look at this patch today and I agree with Tom that
pg_fsync_start() is a very confusing name. I don't know what the
right name is, but this doesn't fsync so I don't think it shuld have
fsync in the name. Maybe something like pg_advise_abandon() or
pg_abandon_cache(). The current name is really wishful thinking:
you're hoping that it will make the kernel start the fsync, but it
might not. I think pg_start_data_flush() is similarly optimistic.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Robert Haas on 3 Feb 2010 08:42

On Wed, Feb 3, 2010 at 6:53 AM, Greg Stark <gsstark(a)mit.edu> wrote:
> On Tue, Feb 2, 2010 at 7:45 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> I think you're probably right, but it's not clear what the new name
>> should be until we have a comment explaining what the function is
>> responsible for.
>
> So I wrote some comments but wasn't going to repost the patch with the
> unchanged name without explanation... But I think you're right though
> I was looking at it the other way around. I want to have an API for a
> two-stage sync and of course if I do that I'll comment it to explain
> that clearly.
>
> The gist of the comments was that the function is preparing to fsync
> to initiate the i/o early and allow the later fsync to fast -- but
> also at the same time have the beneficial side-effect of avoiding
> cache poisoning. It's not clear that the two are necessarily linked
> though. Perhaps we need two separate apis, though it'll be hard to
> keep them separate on all platforms.

Well, maybe we should start with a discussion of what kernel calls
you're aware of on different platforms and then we could try to put an
API around it. I mean, right now all you've got is
POSIX_FADV_DONTNEED, so given just that I feel like the API could
simply be pg_dontneed() or something. It's hard to design a general
framework based on one example.

....Robert

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on 8 Feb 2010 14:29

On Monday 08 February 2010 19:34:01 Greg Stark wrote:
> On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> > On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
> >
> >> Yeah, it seems there are two patches here -- one is the addition of
> >> fsync_fname() and the other is the fsync_prepare stuff.
>
> Sorry, I'm just catching up on my mail from FOSDEM this past weekend.
>
> I had come to the same conclusion as Greg that I might as well just
> commit it with Tom's "pg_flush_data()" name and we can decide later if
> and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
> keeping two apis or not.
>
> So I was just going to commit it like that but I discovered last week
> that I don't have cvs write access set up yet. I'll commit it as soon
> as I generate a new ssh key and Dave installs it, etc. I intentionally
> picked a small simple patch that nobody was waiting on because I knew
> there was a risk of delays like this and the paperwork. I'm nearly
> there.
Do you still want me to split the patches into two or do you want to do it
yourself?
One in multiple versions for the directory fsync and another one for 9.0?

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on 7 Feb 2010 20:31

On Sunday 07 February 2010 19:27:02 Andres Freund wrote:
> On Sunday 07 February 2010 19:23:10 Robert Haas wrote:
> > On Sun, Feb 7, 2010 at 11:24 AM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> > > Greg Smith <greg(a)2ndquadrant.com> writes:
> > >> This is turning into yet another one of those situations where
> > >> something simple and useful is being killed by trying to generalize
> > >> it way more than it needs to be, given its current goals and its lack
> > >> of external interfaces. There's no catversion bump or API breakage
> > >> to hinder future refactoring if this isn't optimally designed
> > >> internally from day one.
> > >
> > > I agree that it's too late in the cycle for any major redesign of the
> > > patch. But is it too much to ask to use a less confusing name for the
> > > function?
> >
> > +1. Let's just rename the thing, add some comments, and call it good.
>
> Will post a updated patch in the next hours unless somebody beats me too
> it.
Here we go.

I left the name at my suggestion pg_fsync_prepare instead of Tom's
prepare_for_fsync because it seemed more consistend with the naming in the
rest of the file. Obviously feel free to adjust.

I personally think the fsync on the directory should be added to the stable
branches - other opinions?
If wanted I can prepare patches for that.

Andres

From: Greg Stark on 8 Feb 2010 13:34

On Mon, Feb 8, 2010 at 4:53 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Sun, Feb 7, 2010 at 10:09 PM, Alvaro Herrera
>> Yeah, it seems there are two patches here -- one is the addition of
>> fsync_fname() and the other is the fsync_prepare stuff.

Sorry, I'm just catching up on my mail from FOSDEM this past weekend.

I had come to the same conclusion as Greg that I might as well just
commit it with Tom's "pg_flush_data()" name and we can decide later if
and when we have pg_fsync_start()/pg_fsync_finish() whether it's worth
keeping two apis or not.

So I was just going to commit it like that but I discovered last week
that I don't have cvs write access set up yet. I'll commit it as soon
as I generate a new ssh key and Dave installs it, etc. I intentionally
picked a small simple patch that nobody was waiting on because I knew
there was a risk of delays like this and the paperwork. I'm nearly
there.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

| Next | Last
Pages: 1 2 3 4
Prev: Make TOAST_TUPLES_PER_PAGE configurable per table.
Next: use of dblink_build_sql_insert() induces a server crash