From: Alvaro Herrera
Tom Lane wrote:
> I wrote:
> > Anyway it's only a guess. It could well be that that machine was simply
> > so heavily loaded that the stats collector couldn't respond fast enough.
> > I'm just wondering whether there's an unrecognized bug lurking here.
>
> Still meditating on this ... and it strikes me that the pgstat.c code
> is really uncommunicative about problems. In particular,
> pgstat_read_statsfile_timestamp and pgstat_read_statsfile don't complain
> at all about being unable to read a stats file.

Yeah, I had the same thought.

> Lastly, backend_read_statsfile is designed to send an inquiry message
> every time through the loop, ie, every 10 msec. This is said to be in
> case the stats collector drops one. But is this enough to flood the
> collector and make things worse? I wonder if there should be some
> backoff there.

I also think the autovacuum worker's minimum timestamp may be playing
games with the retry logic. Maybe a worker is requesting a new file
continuously because pgstat is not able to provide one before the
deadline has passed, and is thus overloading it. I still think that
500ms is too much for a worker, but backing off all the way to 10ms
seems too much. Maybe it should just be, say, 100ms.
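
Something like this is the shape I have in mind (untested sketch;
send_inquiry() and statsfile_is_fresh() are invented names, not the
real pgstat.c routines, and min_ts stands for whatever cutoff the
caller computed):

    /*
     * Untested sketch: poll with backoff instead of a fixed 10ms,
     * capping the interval at the 100ms suggested above.
     */
    int     delay_ms = 10;                  /* current poll interval */

    for (;;)
    {
        send_inquiry(min_ts);               /* ask collector for a new file */
        if (statsfile_is_fresh(min_ts))
            break;
        pg_usleep(delay_ms * 1000L);
        delay_ms = Min(delay_ms * 2, 100);  /* back off, cap at 100ms */
    }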

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Tom Lane wrote:
>> Still meditating on this ... and it strikes me that the pgstat.c code
>> is really uncommunicative about problems. In particular,
>> pgstat_read_statsfile_timestamp and pgstat_read_statsfile don't complain
>> at all about being unable to read a stats file.

> Yeah, I had the same thought.

OK, I'll add some logging.
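
Roughly along these lines (just a sketch of the idea inside
pgstat_read_statsfile(), not the final patch):

    /*
     * Sketch: complain, instead of failing silently, when the stats
     * file can't be opened.  A missing file is expected at startup,
     * so ENOENT stays quiet.
     */
    FILE       *fpin;
    const char *statfile = PGSTAT_STAT_FILENAME;

    if ((fpin = AllocateFile(statfile, PG_BINARY_R)) == NULL)
    {
        if (errno != ENOENT)
            ereport(LOG,
                    (errcode_for_file_access(),
                     errmsg("could not open statistics file \"%s\": %m",
                            statfile)));
        return;                     /* fall back to empty stats */
    }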

>> Lastly, backend_read_statsfile is designed to send an inquiry message
>> every time through the loop, ie, every 10 msec. This is said to be in
>> case the stats collector drops one. But is this enough to flood the
>> collector and make things worse? I wonder if there should be some
>> backoff there.

> I also think the autovacuum worker's minimum timestamp may be playing
> games with the retry logic. Maybe a worker is requesting a new file
> continuously because pgstat is not able to provide one before the
> deadline has passed, and is thus overloading it. I still think that
> 500ms is too much for a worker, but backing off all the way to 10ms
> seems too much. Maybe it should just be, say, 100ms.

But we don't advance the deadline within the wait loop, so (in theory)
a single requestor shouldn't be able to trigger more than one stats file
update. I wonder though if an autovac worker could make many such
requests over its lifespan ...
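
That is, the loop is shaped like this (paraphrased with invented names,
MAX_TRIES, send_inquiry(), and statsfile_timestamp() included, so don't
read it as the literal code):

    /*
     * The cutoff is computed once, before the loop, and never advanced
     * inside it, so the first file newer than min_ts ends the loop.
     */
    TimestampTz min_ts = GetCurrentTimestamp();
    int         tries;

    for (tries = 0; tries < MAX_TRIES; tries++)
    {
        send_inquiry(min_ts);               /* same cutoff every time */
        if (statsfile_timestamp() >= min_ts)
            break;                          /* one update satisfies us */
        pg_usleep(10 * 1000L);              /* 10 msec between polls */
    }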

regards, tom lane

From: Alvaro Herrera
Tom Lane wrote:
> Alvaro Herrera <alvherre@commandprompt.com> writes:

> > I also think the autovacuum worker's minimum timestamp may be playing
> > games with the retry logic. Maybe a worker is requesting a new file
> > continuously because pgstat is not able to provide one before the
> > deadline has passed, and is thus overloading it. I still think that
> > 500ms is too much for a worker, but backing off all the way to 10ms
> > seems too much. Maybe it should just be, say, 100ms.
>
> But we don't advance the deadline within the wait loop, so (in theory)
> a single requestor shouldn't be able to trigger more than one stats file
> update.

Hmm, yeah.

> I wonder though if an autovac worker could make many such
> requests over its lifespan ...

Well, yes, but it will request fresh stats only for the recheck logic
before each table, so there will be one intervening vacuum (or none,
actually, if the table was vacuumed by some other autovac worker;
though given the default naptime of 1 min I find it unlikely that the
regression database will ever see more than one worker).
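
To spell out the pattern I mean (pseudocode; the names here are
invented, not the real autovacuum.c ones):

    /*
     * Rough shape of a worker's lifespan: fresh stats are requested
     * once per table, just before the recheck, so requests are
     * normally separated by a vacuum.
     */
    ListCell   *cell;

    foreach(cell, tables_to_process)
    {
        Oid     relid = lfirst_oid(cell);

        refresh_stats_snapshot();       /* asks pgstat for fresh stats */
        if (!still_needs_vacuum(relid))
            continue;                   /* another worker got it first */
        vacuum_one_table(relid);
    }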

Since the warning comes from the launcher and not the worker, I wonder
if this is a red herring.

--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support

From: Tom Lane
Alvaro Herrera <alvherre@commandprompt.com> writes:
> Since the warning comes from the launcher and not the worker, I wonder
> if this is a red herring.

It's all speculation at the moment. So far there's not really enough
evidence to refute the idea that the system was just under heavy load
at that point --- except that even under heavy load it shouldn't take
the stats collector 5 seconds to write the stats file for the regression
database, ISTM.

I wonder if there is any practical way for the buildfarm client script
to report the system's load average, or some other gauge of how much
is going on on the buildfarm machine besides the regression tests.
One thought is just to log how long it takes to run the regression
tests. A longer-than-usual run for a particular animal would be evidence
of a load spike; if we could correlate that with failures of this sort
it would be easier to write them off as heavy load.
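
Capturing the load average itself would not be hard where getloadavg(3)
exists, though it isn't portable everywhere, so take this only as a
sketch:

    #include <stdio.h>
    #include <stdlib.h>

    /* Print the 1-, 5-, and 15-minute load averages, if available. */
    int
    main(void)
    {
        double  load[3];

        if (getloadavg(load, 3) == 3)
            printf("load average: %.2f %.2f %.2f\n",
                   load[0], load[1], load[2]);
        else
            printf("load average unavailable\n");
        return 0;
    }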

regards, tom lane
