An example of bugs for Hot Standby [PgSql]

Prev: idea - new aggregates median, listagg
Next: PATCH: Add hstore_to_json()

From: Simon Riggs on 20 Jan 2010 04:40

On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:
> >
> > Full resolution patch attached for Startup process waits on buffer pins.
> >
> > Startup process sets SIGALRM when waiting on a buffer pin. If woken by
> > alarm we send SIGUSR1 to all backends requesting that they check to see
> > if they are blocking Startup process. If so, they throw ERROR/FATAL as
> > for other conflict resolutions. Deadlock stop gap removed.
> > max_standby_delay = -1 option removed to prevent deadlock.

> Wouldnt it be more foolproof to also loop around sending the FATAL? Not that
> its likely but...

More foolproof and much less accurate. The Startup process doesn't know
who is holding the buffer pin that blocks it, so it could not target a
FATAL.

> From HoldingBufferPinThatDelaysRecovery youre calling
> GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that one is
> acquiring a spinlock which can also get taken at other places. Its not the
> most likely scenario, but it would certainly be annoying to debug.

Spinlock. It isn't held for long in any situation. What problem do you
foresee?

> Is there any supported platform with sizeof(sig_atomic_t) <4 - I would doubt
> so? If not the locking in GetStartupBufferPinWaitBufId and
> SetStartupBufferPinWaitBufId shouldnt be needed?

I prefer spinlocking.

> Same issue issue (and more likely to trigger) exists with CheckStandbyTimeout-
> >SendRecoveryConflictWithBufferPin->CancelDBBackends

I don't see an issue.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on 20 Jan 2010 04:45

On Wednesday 20 January 2010 06:30:28 Tom Lane wrote:
> Andres Freund <andres(a)anarazel.de> writes:
> > Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
> > doubt so?
>
> Er ... what? I believe there are live platforms with sig_atomic_t = char.
> If we're assuming more that's a must-fix.
The reason I have asked is that the code is doing things like:
/*
* Used by backends when they receive a request to check for buffer pin waits.
*/
int
GetStartupBufferPinWaitBufId(void)
{
int bufid;

/* use volatile pointer to prevent code rearrangement */
volatile PROC_HDR *procglobal = ProcGlobal;

SpinLockAcquire(ProcStructLock);

bufid = procglobal->startupBufferPinWaitBufId;

SpinLockRelease(ProcStructLock);

return bufid;
}

or similar things with LWLockAcquire in a signal handler which strikes me as a
not that good idea. As at least on x86 reading an integer is atomic the above
spinlock is pointless. My cross arch experience is barely existing, so...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on 20 Jan 2010 04:49

On Wednesday 20 January 2010 10:40:10 Simon Riggs wrote:
> On Wed, 2010-01-20 at 06:14 +0100, Andres Freund wrote:
> > > Full resolution patch attached for Startup process waits on buffer
> > > pins.
> > >
> > > Startup process sets SIGALRM when waiting on a buffer pin. If woken by
> > > alarm we send SIGUSR1 to all backends requesting that they check to see
> > > if they are blocking Startup process. If so, they throw ERROR/FATAL as
> > > for other conflict resolutions. Deadlock stop gap removed.
> > > max_standby_delay = -1 option removed to prevent deadlock.
> >
> > Wouldnt it be more foolproof to also loop around sending the FATAL? Not
> > that its likely but...
>
> More foolproof and much less accurate. The Startup process doesn't know
> who is holding the buffer pin that blocks it, so it could not target a
> FATAL.
>
> > From HoldingBufferPinThatDelaysRecovery youre calling
> > GetStartupBufferPinWaitBufId - that sounds a bit dangerous because that
> > one is acquiring a spinlock which can also get taken at other places.
> > Its not the most likely scenario, but it would certainly be annoying to
> > debug.
> Spinlock. It isn't held for long in any situation. What problem do you
> foresee?
If any backend is signalled while currently holding the ProcStructLock there
is a basically unrecoverable deadlock - its not likely but possible.

> > Is there any supported platform with sizeof(sig_atomic_t) <4 - I would
> > doubt so? If not the locking in GetStartupBufferPinWaitBufId and
> > SetStartupBufferPinWaitBufId shouldnt be needed?
> I prefer spinlocking.
Well, its deadlock land taking the same lock inside and outside of a signal
handler...

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 20 Jan 2010 04:52

On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
> LWLockAcquire

I'm using spinlocks, not lwlocks.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Andres Freund on 20 Jan 2010 05:04

On Wednesday 20 January 2010 10:52:24 Simon Riggs wrote:
> On Wed, 2010-01-20 at 10:45 +0100, Andres Freund wrote:
> > LWLockAcquire
>
> I'm using spinlocks, not lwlocks.
CancelDBBackends which is used in SendRecoveryConflictWithBufferPin which in
turn used by CheckStandbyTimeout triggered by SIGALRM acquires the lwlock.

Now that case is a bit less dangerous because you would have to interrupt
yourself to trigger a deadlock there because the code sleeps soon after
setting up the handler.
If ever two SIGALRM occur consecutive there is a problem.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

First | Prev | Next | Last
Pages: 1 2 3 4 5 6
Prev: idea - new aggregates median, listagg
Next: PATCH: Add hstore_to_json()