From: Simon Riggs on
On Wed, 2010-05-12 at 18:05 +0100, Greg Stark wrote:

> I'm not sure who to blame for the shouting match over whose commit
> introduced the bug -- it doesn't seem like a relevant or useful thing
> to argue about, please both stop.

I haven't blamed Robert's code, merely asked him to consider that it is
something other than HS, since we have no evidence either way at present
because the issue is sporadic and has not been replicated as yet, with
no specific detail leading to any section of code.

> > there is no need
> > to comment on every thread.
>
> This is out of line.

Quoted out of context, it is. My full comment is "Please don't continue
to make wild speculations about things related to HS and recovery, so
that issues do not become confused; there is no need to comment on every
thread." ... by which I mean threads related to HS and recovery. I
respect everybody's right to free speech here, but I would say the same
to anyone if they do it repeatedly. I'm not the first to make such a
comment on hackers either.

--
Simon Riggs www.2ndQuadrant.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: "Joshua D. Drake" on
On Wed, 2010-05-12 at 17:49 +0100, Simon Riggs wrote:
> On Wed, 2010-05-12 at 12:04 -0400, Robert Haas wrote:

> Normal shutdown didn't work on a standby before HS was committed and it
> didn't work afterwards either. Use all the capitals you like but if you
> use poor arguments and combine that with no evidence then we'll not get
> very far, either in working together or in solving the actual bugs.
> Please don't continue to make wild speculations about things related to
> HS and recovery, so that issues do not become confused; there is no need
> to comment on every thread.
>

Simon,

People are very passionate about this feature. This feature has the
ability to show us as moving forward in a fashion that will allow us to
directly compete with the "big boys" in the "big installs", although we
are still probably 2-3 releases from that.

It also has the ability to make us look like a bunch of yahoos (no pun
intended) who are better served beating up on that database that Oracle
just bought, versus Oracle itself.

Patience is a virtue for all when it comes to this feature.

Joshua D. Drake


--
PostgreSQL.org Major Contributor
Command Prompt, Inc: http://www.commandprompt.com/ - 503.667.4564
Consulting, Training, Support, Custom Development, Engineering




From: Robert Haas on
On Wed, May 12, 2010 at 1:21 PM, Simon Riggs <simon(a)2ndquadrant.com> wrote:
> On Wed, 2010-05-12 at 18:05 +0100, Greg Stark wrote:
>
>> I'm not sure who to blame for the shouting match over whose commit
>> introduced the bug -- it doesn't seem like a relevant or useful thing
>> to argue about, please both stop.
>
> I haven't blamed Robert's code, merely asked him to consider that it is
> something other than HS, since we have no evidence either way at present
> because the issue is sporadic and has not been replicated as yet, with
> no specific detail leading to any section of code.

I'm not really sure what we're arguing about here. I feel like I'm
being accused either of (a) introducing the bug (which is possible) or
(b) saying that Simon introduced the bug (which presumably is also
possible, although it's not really my point). I ventured an
uninformed guess at what the problem might be; Simon thinks my guess
is wrong, and it may well be: but either way there's a bug buried in
here somewhere and it would be nice to fix it. I thought that it
would be a good idea for Simon to look at it because, on the surface,
it APPEARS to have something to do with Hot Standby, since that's what
Stefan was testing when he found it. Sure, the investigation might
lead somewhere else; I completely admit that.

Now, Simon just said he HAS looked at it and can't reproduce the
problem. So now I'm even less sure what we're arguing about. I'm
glad he looked at it. It's interesting that he wasn't able to
reproduce the problem. I hope that he or someone else will find
something that helps us move forward. I am having difficulty
reproducing Stefan's test environment and perhaps for that reason I
can't reproduce it either, though I've encountered several other
problems about which, I suppose, I will post separate emails.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Stefan Kaltenbrunner on
On 05/12/2010 05:28 PM, Simon Riggs wrote:
> On Wed, 2010-05-12 at 14:18 +0100, Simon Riggs wrote:
>> On Wed, 2010-05-12 at 08:52 -0400, Robert Haas wrote:
>>> On Wed, May 12, 2010 at 7:26 AM, Simon Riggs<simon(a)2ndquadrant.com> wrote:
>>>> On Wed, 2010-05-12 at 07:10 -0400, Robert Haas wrote:
>>>>
>>>>> I'm not sure what to make of this. Sometimes not shutting down
>>>>> doesn't sound like a feature to me.
>>>>
>>>> It acts exactly the same in recovery as in normal running. It is not a
>>>> special feature of recovery at all, bug or otherwise.
>>>
>>> Simon, that doesn't make any sense. We are talking about a backend
>>> getting stuck forever on an exclusive lock that is held by the startup
>>> process and which will never be released (for example, because the
>>> master has shut down and no more WAL can be obtained for replay). The
>>> startup process does not hold locks in normal operation.
>>
>> When I test it, startup process holding a lock does not prevent shutdown
>> of a standby.
>>
>> I'd be happy to see your test case showing a bug exists and that the
>> behaviour differs from normal running.
>
> Let me put this differently: I accept that Stefan has reported a
> problem. Neither Tom nor myself can reproduce the problem. I've re-run
> Stefan's test case and restarted the server more than 400 times now
> without any issue.
>
> I re-read your post where you gave what you yourself called "uninformed
> speculation". There's no real polite way to say it, but yes your
> speculation does appear to be uninformed, since it is incorrect. Reasons
> would be not least that Stefan's tests don't actually send any locks to
> the standby anyway (!), but even if they did your speculation as to the
> cause is still all wrong, as explained.
>
> There is no evidence to link this behaviour with HS, as yet, and you
> should be considering the possibility the problem lies elsewhere,
> especially since it could be code you committed that is at fault.

Well I'm not sure why people seem to have that hard a time reproducing
that issue - it seems that I can provoke it really trivially (in this
case no loops, no pgbench, no tricks). A few minutes ago I logged into
my test standby (which is idle except for the odd connect to template1
caused by nagios - the master is idle as well and has been for days):

postgres@soldata005:~$ psql
psql (9.0beta1)
Type "help" for help.

postgres=# select 1;
 ?column?
----------
        1
(1 row)

postgres=# \q
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down.... done
server stopped
server starting
postgres@soldata005:~$ pg_ctl -D /var/lib/postgresql/9.0b1/main/ restart
waiting for server to shut down............................................................... failed
pg_ctl: server does not shut down


the server log for that is as follows:

<2010-05-12 20:36:18.166 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:18.167 CEST,,,> FATAL: terminating walreceiver process due to administrator command
<2010-05-12 20:36:18.174 CEST,,,> LOG: shutting down
<2010-05-12 20:36:18.251 CEST,,,> LOG: database system is shut down
<2010-05-12 20:36:19.706 CEST,,,> LOG: database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:19.706 CEST,,,> HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
<2010-05-12 20:36:19.706 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:19.721 CEST,,,> LOG: consistent recovery state reached at 1/12000078
<2010-05-12 20:36:19.721 CEST,,,> LOG: invalid record length at 1/12000078
<2010-05-12 20:36:19.723 CEST,,,> LOG: database system is ready to accept read only connections
<2010-05-12 20:36:19.737 CEST,,,> LOG: streaming replication successfully connected to primary
<2010-05-12 20:36:19.918 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:19.919 CEST,,,> FATAL: terminating walreceiver process due to administrator command
<2010-05-12 20:36:19.922 CEST,,,> LOG: shutting down
<2010-05-12 20:36:19.937 CEST,,,> LOG: database system is shut down
<2010-05-12 20:36:21.433 CEST,,,> LOG: database system was interrupted while in recovery at log time 2010-05-06 17:36:05 CEST
<2010-05-12 20:36:21.433 CEST,,,> HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
<2010-05-12 20:36:21.433 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:21.482 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:21.504 CEST,,,> LOG: consistent recovery state reached at 1/12000078
<2010-05-12 20:36:21.504 CEST,,,> LOG: invalid record length at 1/12000078
<2010-05-12 20:36:21.505 CEST,,,> LOG: database system is ready to accept read only connections
<2010-05-12 20:36:21.516 CEST,,,> LOG: streaming replication successfully connected to primary

so it restarted two times successfully - however if one looks at the
third time one can see that it received the smart shutdown request
BEFORE it reached a consistent recovery state - yet it continued to
enable HS and reenabled SR as well.
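
The ordering described above can also be checked mechanically against a
server log. The following is only a minimal sketch, assuming the log line
prefix used in the excerpt ("<timestamp,,,> LEVEL: message"); the function
name and the regular expression are hypothetical helpers, not part of
PostgreSQL, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as:
# <2010-05-12 20:36:21.482 CEST,,,> LOG: received smart shutdown request
# (assumption: the log_line_prefix is the one shown in the excerpt)
LINE_RE = re.compile(r"^<(?P<ts>[^,>]+),[^>]*>\s+(?P<level>[A-Z]+):\s+(?P<msg>.*)$")


def shutdown_before_consistency(log_text):
    """Return timestamps of smart shutdown requests that were logged after
    'entering standby mode' but before 'consistent recovery state reached'."""
    hits = []
    in_startup = False  # True between entering standby mode and consistency
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # continuation of a wrapped message, or unrelated text
        msg = m.group("msg")
        if msg.startswith("entering standby mode"):
            in_startup = True
        elif msg.startswith("consistent recovery state"):
            in_startup = False
        elif msg.startswith("received smart shutdown request") and in_startup:
            hits.append(m.group("ts"))
    return hits


# Second and third restart from the log excerpt, abbreviated.
sample = """\
<2010-05-12 20:36:19.706 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:19.721 CEST,,,> LOG: consistent recovery state reached at 1/12000078
<2010-05-12 20:36:19.918 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:21.433 CEST,,,> LOG: entering standby mode
<2010-05-12 20:36:21.482 CEST,,,> LOG: received smart shutdown request
<2010-05-12 20:36:21.504 CEST,,,> LOG: consistent recovery state reached at 1/12000078
"""

print(shutdown_before_consistency(sample))
```

On the sample, only the 20:36:21.482 request is reported, which is the
third restart.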

The database is now sitting there doing nothing and is more or less
broken because you cannot connect to it in the current state:

~$ psql
psql: FATAL: the database system is shutting down

the startup process has the following backtrace:

(gdb) bt
#0 0x00007fbe24cb2c83 in select () from /lib/libc.so.6
#1 0x00000000006e811a in pg_usleep ()
#2 0x000000000048c333 in XLogPageRead ()
#3 0x000000000048c967 in ReadRecord ()
#4 0x0000000000493ab6 in StartupXLOG ()
#5 0x0000000000495a88 in StartupProcessMain ()
#6 0x00000000004ab25e in AuxiliaryProcessMain ()
#7 0x00000000005d4a7d in StartChildProcess ()
#8 0x00000000005d70c2 in PostmasterMain ()
#9 0x000000000057d898 in main ()


Stefan


From: Simon Riggs on
On Wed, 2010-05-12 at 21:10 +0200, Stefan Kaltenbrunner wrote:

> > There is no evidence to link this behaviour with HS, as yet, and you
> > should be considering the possibility the problem lies elsewhere,
> > especially since it could be code you committed that is at fault.
>
> Well I'm not sure why people seem to have that hard a time reproducing
> that issue - it seems that I can provoke it really trivially (in this
> case no loops, no pgbench, no tricks). A few minutes ago I logged into
> my test standby (which is idle except for the odd connect to template1
> caused by nagios - the master is idle as well and has been for days):

Thanks, good report.

> so it restarted two times successfully - however if one looks at the
> third time one can see that it received the smart shutdown request
> BEFORE it reached a consistent recovery state - yet it continued to
> enable HS and reenabled SR as well.
>
> The database is now sitting there doing nothing and is more or less
> broken because you cannot connect to it in the current state:
>
> ~$ psql
> psql: FATAL: the database system is shutting down
>
> the startup process has the following backtrace:
>
> (gdb) bt
> #0 0x00007fbe24cb2c83 in select () from /lib/libc.so.6
> #1 0x00000000006e811a in pg_usleep ()
> #2 0x000000000048c333 in XLogPageRead ()
> #3 0x000000000048c967 in ReadRecord ()
> #4 0x0000000000493ab6 in StartupXLOG ()
> #5 0x0000000000495a88 in StartupProcessMain ()
> #6 0x00000000004ab25e in AuxiliaryProcessMain ()
> #7 0x00000000005d4a7d in StartChildProcess ()
> #8 0x00000000005d70c2 in PostmasterMain ()
> #9 0x000000000057d898 in main ()

Well, it's waiting for new info from the primary. Nothing to do with
locking, but that's not an indication that it's an SR issue either. ;-)

I'll put some waits into that part of the code and see if I can induce
the failure. Maybe it's just a simple lack of a CHECK_FOR_INTERRUPTS().
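
To illustrate the hypothesis only (this is not PostgreSQL source, and the
actual cause is not yet confirmed): a loop that sleeps while waiting for
input will ignore a shutdown request unless it checks for one on each
pass. All names below are hypothetical, and max_polls exists only so that
the demonstration terminates.

```python
import threading
import time


def wait_for_wal(wal_available, shutdown_requested, check_interrupts,
                 max_polls=50, poll_interval=0.001):
    """Poll for new input, optionally honouring a shutdown request.

    wal_available      -- callable returning True once input has arrived
    shutdown_requested -- threading.Event set when shutdown is requested
    check_interrupts   -- whether the loop looks at the shutdown flag
    """
    for _ in range(max_polls):
        if wal_available():
            return "replayed"
        if check_interrupts and shutdown_requested.is_set():
            return "shutdown"
        time.sleep(poll_interval)  # stand-in for the sleep in the wait loop
    return "still waiting"  # a real loop would keep polling indefinitely


shutdown = threading.Event()
shutdown.set()  # shutdown was requested, but no new input ever arrives

print(wait_for_wal(lambda: False, shutdown, check_interrupts=True))
print(wait_for_wal(lambda: False, shutdown, check_interrupts=False))
```

With the check, the loop returns as soon as it sees the request; without
it, the loop keeps sleeping, which would be consistent with a process
sitting in a sleep call while the server reports that it is shutting
down.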

--
Simon Riggs www.2ndQuadrant.com

