An example of bugs for Hot Standby [PgSql]

Prev: idea - new aggregates median, listagg
Next: PATCH: Add hstore_to_json()

From: Andres Freund on 20 Jan 2010 12:41

Hi Tom, Hi Simon,

On Wednesday 20 January 2010 17:59:36 Tom Lane wrote:
> Andres Freund <andres(a)anarazel.de> writes:
> > I realize its way too late in the cycle for that, but why dont we start
> > using some library for easy cross platform atomic ops?
>
> (1) there probably isn't one that does exactly what we want, works
> everywhere, and has the right license;
> (2) what actual gain would we get? We've already done the work.
That there might be some other instructions were interested in?
Like really atomic increment?

> >> [ grows visibly pale ] *Please* tell me we are not trying to take
> >> locks in a signal handler. What happens if it interrupts code that
> >> is already holding that lock?
> >
> > Yes the patch does that at two places.
>
> That's a must-fix.
Its code intended to fix a existing problem not already comitted code. But
otherwise I definitely agree.

Andres

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 20 Jan 2010 13:47

On Wed, 2010-01-20 at 17:40 +0100, Andres Freund wrote:
> > > or similar things with LWLockAcquire in a signal handler
> >
> > [ grows visibly pale ] *Please* tell me we are not trying to take
> > locks in a signal handler. What happens if it interrupts code that
> > is already holding that lock?

> Yes the patch does that at two places.

I think it would be more sensible to discuss specific code and issues,
rather than have general discussions about various horrors.

You've already pointed out that I need to prevent multiple sigalrm
interrupts using boolean flags; I've already said that I would do that.
The use of locks themselves are clearly not a problem, since the
existing sigalrm handler takes LWlocks for deadlock detection. The
problem is just about being called multiple times.

The code in HoldingBufferPinThatDelaysRecovery() also needs protection
against being interrupted multiple times, but we should note that a
second signal of that type is not going to arrive from anywhere inside
the server and requires an explicit user action. The locking isn't
strictly necessary since the value is only read when the only process
that ever writes that value is sleeping on a semaphore. The single
integer value can always be read atomically anyway.

So I will remove the locking in XXXStartupBufferPinWaitBufId(), add in
the booleans and we're done.

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Hiroyuki Yamada on 21 Jan 2010 04:01

>Deadlock bug was prevented by stop-gap measure in December commit.
>
>Full resolution patch attached for Startup process waits on buffer pins.
>
>Startup process sets SIGALRM when waiting on a buffer pin. If woken by
>alarm we send SIGUSR1 to all backends requesting that they check to see
>if they are blocking Startup process. If so, they throw ERROR/FATAL as
>for other conflict resolutions. Deadlock stop gap removed.
>max_standby_delay = -1 option removed to prevent deadlock.
>
>Reviews welcome, otherwise commit at end of week.
>

I think the patch has two problems.

* disable_standby_sig_alarm() does not clear standby_timeout_active flag
when it succeeds in disabling the alarm.

* Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
with following scenario.

1. Two transactions, xact A and xact B, are running in a HotStandby server.
2. Xact A holds a pin on buffer X.
3. Startup process calls LockBufferForCleanup() for buffer X,
sets ProcGlobal->startupBufferPinWaitBufId = X,
sends PROCSIG_RECOVERY_CONFLICT_BUFFERPIN signal to both transactions,
and sleeps.
4. Xact A handles the signal,
aborts itself,
releases the pin on buffer X,
and awake startup process.
5. Startup process wakes up
and sets ProcGlobal->startupBufferPinWaitBufId = -1.
6. Xact B handles the signal,
checks ProcGlobal->startupBufferPinWaitBufId,
and fails in the assertion check in HoldingBufferPinThatDelaysRecovery().

regards,

--
Hiroyuki YAMADA
Kokolink Corporation
yamada(a)kokolink.net

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 21 Jan 2010 04:11

On Thu, 2010-01-21 at 18:01 +0900, Hiroyuki Yamada wrote:

> I think the patch has two problems.
>
> * disable_standby_sig_alarm() does not clear standby_timeout_active flag
> when it succeeds in disabling the alarm.

Ah, thanks.

> * Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
> with following scenario.

Agreed. Will remove Assert() because of race conditions.

Thanks,

--
Simon Riggs www.2ndQuadrant.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Simon Riggs on 21 Jan 2010 05:05

On Thu, 2010-01-21 at 18:01 +0900, Hiroyuki Yamada wrote:
> >Deadlock bug was prevented by stop-gap measure in December commit.
> >
> >Full resolution patch attached for Startup process waits on buffer pins.
> >
> >Startup process sets SIGALRM when waiting on a buffer pin. If woken by
> >alarm we send SIGUSR1 to all backends requesting that they check to see
> >if they are blocking Startup process. If so, they throw ERROR/FATAL as
> >for other conflict resolutions. Deadlock stop gap removed.
> >max_standby_delay = -1 option removed to prevent deadlock.
> >
> >Reviews welcome, otherwise commit at end of week.
> >
>
> I think the patch has two problems.
>
> * disable_standby_sig_alarm() does not clear standby_timeout_active flag
> when it succeeds in disabling the alarm.
>
> * Assertion check in HoldingBufferPinThatDelaysRecovery() can fail
> with following scenario.

Updated patch, including changes from Andres and Hiroyuki.

--
Simon Riggs www.2ndQuadrant.com

First | Prev | Next | Last
Pages: 1 2 3 4 5 6
Prev: idea - new aggregates median, listagg
Next: PATCH: Add hstore_to_json()