From: "Kevin Grittner" on
Hiroyuki Yamada <yamada(a)kokolink.net> wrote:

> 4. Startup process tries to redo XLOG_HEAP2_CLEAN record, calls
> LockBufferForCleanup() and freezes until the Xact 1 ends.

I think they word you're searching for is "blocks". Blocking to
protect integrity doesn't sound like a bug to me; perhaps an
opportunity for enhancement.

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on
On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:
> Hot Standby node can freeze when startup process calls LockBufferForCleanup().
> This bug can be reproduced by the following procedure.
>
> 0. start Hot Standby, with one active node(node A) and one standby node(node B)
> 1. create table X and table Y in node A
> 2. insert several rows in table X in node A
> 3. delete one row from table X in node A
> 4. begin xact 1 in node A, execute following commands, and leave xact 1 open
> 4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
> 5. wait until WAL's for above actions are applied in node B
> 6. begin xact 2 in node B, and execute following commands
> 6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
> 6.2 FETCH test_cursor;
> 6.3 SELECT * FROM table Y;
> 7. execute VACUUM FREEZE table A in node A
> 8. commit xact 1 in node A
>
> ...then in node B occurs following "deadlock" situation, which is not detected by deadlock check.
> * startup process waits for xact 2 to release buffers in table X (in LockBufferForCleanup())
> * xact 2 waits for startup process to release ACCESS EXCLUSIVE lock in table Y

Deadlock bug was prevented by stop-gap measure in December commit.

Full resolution patch attached for Startup process waits on buffer pins.

Startup process sets SIGALRM when waiting on a buffer pin. If woken by
alarm we send SIGUSR1 to all backends requesting that they check to see
if they are blocking Startup process. If so, they throw ERROR/FATAL as
for other conflict resolutions. Deadlock stop gap removed.
max_standby_delay = -1 option removed to prevent deadlock.

Reviews welcome, otherwise commit at end of week.

--
Simon Riggs www.2ndQuadrant.com
From: Andres Freund on
On Tuesday 19 January 2010 11:46:24 Simon Riggs wrote:
> On Tue, 2009-12-15 at 20:11 +0900, Hiroyuki Yamada wrote:
> > Hot Standby node can freeze when startup process calls
> > LockBufferForCleanup(). This bug can be reproduced by the following
> > procedure.
> >
> > 0. start Hot Standby, with one active node(node A) and one standby
> > node(node B) 1. create table X and table Y in node A
> > 2. insert several rows in table X in node A
> > 3. delete one row from table X in node A
> > 4. begin xact 1 in node A, execute following commands, and leave xact 1
> > open 4.1 LOCK table Y IN ACCESS EXCLUSIVE MODE
> > 5. wait until WAL's for above actions are applied in node B
> > 6. begin xact 2 in node B, and execute following commands
> > 6.1 DECLARE CURSOR test_cursor FOR SELECT * FROM table X;
> > 6.2 FETCH test_cursor;
> > 6.3 SELECT * FROM table Y;
> > 7. execute VACUUM FREEZE table A in node A
> > 8. commit xact 1 in node A
> >
> > ...then in node B occurs following "deadlock" situation, which is not
> > detected by deadlock check.
> >
> > * startup process waits for xact 2 to release buffers in table X (in
> > LockBufferForCleanup()) * xact 2 waits for startup process to release
> > ACCESS EXCLUSIVE lock in table Y
>
> Deadlock bug was prevented by stop-gap measure in December commit.
>
> Full resolution patch attached for Startup process waits on buffer pins.
>
> Startup process sets SIGALRM when waiting on a buffer pin. If woken by
> alarm we send SIGUSR1 to all backends requesting that they check to see
> if they are blocking Startup process. If so, they throw ERROR/FATAL as
> for other conflict resolutions. Deadlock stop gap removed.
> max_standby_delay = -1 option removed to prevent deadlock.
Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
its likely but...

From HoldingBufferPinThatDelaysRecovery youre calling
GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
acquiring a spinlock which can also get taken at other places. Its not the
most likely scenario, but it would certainly be annoying to debug.

Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
so? If not the locking in GetStartupBufferPinWaitBufId and
SetStartupBufferPinWaitBufId shouldnt be needed?

Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-
>SendRecoveryConflictWithBufferPin->CancelDBBackends

Too tired to look further.

Greetings,
Andres


PS: Sorry for not doing anything on the idle front - I have been burried alive
the last days.

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Tom Lane on
Andres Freund <andres(a)anarazel.de> writes:
> Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
> so?

Er ... what? I believe there are live platforms with sig_atomic_t = char.
If we're assuming more that's a must-fix.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on
On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
> Andres Freund <andres(a)anarazel.de> writes:
> > Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
> > doubt so?
>
> Er ... what? I believe there are live platforms with sig_atomic_t = char.
> If we're assuming more that's a must-fix.
So were assuming genereally that a integer cannot be read/written
atomatically?
Or was that only relating to the actualy signal.h typedef?

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers