From: Tom Lane on 14 Apr 2010 14:31

I wrote:
> [ theory about cause of Rusty's crash ]

I started to doubt this theory after wondering why the problem hadn't
been exposed by CLOBBER_CACHE_ALWAYS testing, which is done routinely
by the buildfarm.  That setting would surely cause the cache flush to
happen at the troublesome time.  After a good deal more investigation,
I found out why it doesn't crash with that.

The problematic case is for a relation that has rd_newRelfilenodeSubid
nonzero but rd_createSubid zero (ie, it's been truncated in the current
xact).  Given that, RelationFlushRelation will attempt a rebuild but
RelationCacheInvalidate won't exempt the relation from destruction.
However, if you do a TRUNCATE under CLOBBER_CACHE_ALWAYS, the relcache
entry gets blown away immediately at the conclusion of that command,
because we'll do a RelationCacheInvalidate as a consequence of
CLOBBER_CACHE_ALWAYS.  When the relcache entry is rebuilt for later
use, it won't have rd_newRelfilenodeSubid set, so it's not a hazard
anymore.  In order to expose this bug, the relcache entry has to
survive past the TRUNCATE and then a cache flush has to occur while we
are in the process of rebuilding it, not before.

What this suggests is that CLOBBER_CACHE_ALWAYS is actually too strong
to provide a thorough test of cache flush hazards.  Maybe we need an
alternate setting along the lines of CLOBBER_CACHE_SOMETIMES that would
randomly choose whether or not to flush at any given opportunity.  But
if such a setup did produce a crash, it'd be awfully hard to reproduce
for investigation.  Ideas?

There is another slightly odd thing here, which is that the stack trace
Rusty provided clearly shows the crash occurring during processing of a
local relcache invalidation message for the truncated relation.  This
would be expected during execution of the TRUNCATE itself, but at that
point the rel has positive refcnt so there's no problem.  According to
the stack trace the active SQL command is an INSERT ... SELECT, and I
wouldn't expect that to queue any relcache invals.  Are there any
triggers or other unusual things in the real application (not the
watered-down test case) that would be triggered in INSERT/SELECT?

			regards, tom lane
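For concreteness, a minimal sketch of what such a CLOBBER_CACHE_SOMETIMES
hook might look like, modeled on the existing CLOBBER_CACHE_ALWAYS hook
at the end of AcceptInvalidationMessages() in
src/backend/utils/cache/inval.c.  The 1-in-10 flush odds are an
arbitrary assumption for illustration, not proposed code:

    /*
     * Hypothetical CLOBBER_CACHE_SOMETIMES: flush the system caches at
     * a random subset of the opportunities where CLOBBER_CACHE_ALWAYS
     * would flush unconditionally.  The static flag guards against
     * re-entry while InvalidateSystemCaches() is itself causing cache
     * rebuilds.
     */
    #ifdef CLOBBER_CACHE_SOMETIMES
        {
            static bool in_recursion = false;

            if (!in_recursion && random() % 10 == 0)
            {
                in_recursion = true;
                InvalidateSystemCaches();
                in_recursion = false;
            }
        }
    #endif

Because only some opportunities flush, a relcache entry carrying
rd_newRelfilenodeSubid can survive past the TRUNCATE and then be hit by
a flush in mid-rebuild, exactly the window that the ALWAYS setting
closes too early.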
From: "Kevin Grittner" on 14 Apr 2010 14:39

Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> What this suggests is that CLOBBER_CACHE_ALWAYS is actually too
> strong to provide a thorough test of cache flush hazards.  Maybe
> we need an alternate setting along the lines of
> CLOBBER_CACHE_SOMETIMES that would randomly choose whether or not
> to flush at any given opportunity.  But if such a setup did produce
> a crash, it'd be awfully hard to reproduce for investigation.
> Ideas?

Seed the random number generator such that each run of the test gets
the same "random" numbers?  Or *allow* the seed to be set, with the
default being a random seed which is logged so that it can be forced
for a repeat of the run?

-Kevin
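A sketch of how that could look; the function and its behavior are
invented here for illustration, not an existing PostgreSQL API, and it
assumes the usual backend includes plus <time.h> and <unistd.h>:

    /*
     * Hypothetical seed setup for randomized cache clobbering.  If no
     * seed is supplied, pick one and log it, so that a crashing run
     * can be repeated with the same sequence of flush decisions.
     */
    static void
    clobber_cache_init_seed(unsigned int requested_seed)
    {
        unsigned int seed = requested_seed;

        if (seed == 0)          /* 0 means "choose one at random" */
            seed = (unsigned int) time(NULL) ^ (unsigned int) getpid();

        srandom(seed);
        elog(LOG, "cache-clobber random seed: %u", seed);
    }

A buildfarm member that crashed could then report the logged seed, and
the run could be replayed with identical flush decisions.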
From: Rusty Conover on 14 Apr 2010 14:59

On Apr 14, 2010, at 2:31 PM, Tom Lane wrote:
> [...]
> There is another slightly odd thing here, which is that the stack trace
> Rusty provided clearly shows the crash occurring during processing of a
> local relcache invalidation message for the truncated relation.  This
> would be expected during execution of the TRUNCATE itself, but at that
> point the rel has positive refcnt so there's no problem.  According to
> the stack trace the active SQL command is an INSERT ... SELECT, and I
> wouldn't expect that to queue any relcache invals.  Are there any
> triggers or other unusual things in the real application (not the
> watered-down test case) that would be triggered in INSERT/SELECT?

There are no triggers or other unusual things going on in the real
application.  This worked in 8.3.9 but started failing when going to
8.4.3.

The test case program was the smallest thing I could write to reproduce
the problem consistently on my machine, but I couldn't reproduce it
consistently on other machines and architectures.  I'm glad Heikki was
able to also see the crash on his hardware.

I can take Heikki's patch back out and get a new stack trace from the
test program if that would be useful to you.

Thanks,

Rusty

--
Rusty Conover
rconover(a)infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com
From: Tom Lane on 14 Apr 2010 15:30

Rusty Conover <rconover(a)infogears.com> writes:
> On Apr 14, 2010, at 2:31 PM, Tom Lane wrote:
>> There is another slightly odd thing here, which is that the stack
>> trace Rusty provided clearly shows the crash occurring during
>> processing of a local relcache invalidation message for the
>> truncated relation.

> The test case program was the smallest thing I could write to
> reproduce the problem consistently on my machine, but I couldn't
> reproduce it consistently on other machines and architectures.  I'm
> glad Heikki was able to also see the crash on his hardware.  I can
> take Heikki's patch back out and get a new stack trace from the test
> program if that would be useful to you.

Yeah, I'm curious to know if the stack trace is the same for crashes in
the watered-down test case.  AFAICS, relcache invals triggered by the
TRUNCATE itself ought to be serviced before we get out of the TRUNCATE;
and after that, the xact is holding exclusive lock on the table so
there's no way that we should get a remote inval on the specific
relation (not to mention that the trace is clearly about a local inval,
not an incoming remote inval).  So I don't understand why the trace
says that it's happening during a subsequent INSERT.

Mind you, Heikki's patch looks like a good idea in any case; but I'm
not sure that there isn't something else going on here.

			regards, tom lane
From: Rusty Conover on 14 Apr 2010 16:07
On Apr 14, 2010, at 3:30 PM, Tom Lane wrote:
> Yeah, I'm curious to know if the stack trace is the same for crashes
> in the watered-down test case.  AFAICS, relcache invals triggered by
> the TRUNCATE itself ought to be serviced before we get out of the
> TRUNCATE; and after that, the xact is holding exclusive lock on the
> table so there's no way that we should get a remote inval on the
> specific relation (not to mention that the trace is clearly about a
> local inval not an incoming remote inval).  So I don't understand why
> the trace says that it's happening during a subsequent INSERT.
> [...]

Hi Tom,

Here is a backtrace from the test case script (using 8.4.3 as released):

#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966c3b80) at tupdesc.c:311
311         if (tupdesc1->natts != tupdesc2->natts)
(gdb) bt
#0  equalTupleDescs (tupdesc1=0x7f7f7f7f, tupdesc2=0x966c3b80) at tupdesc.c:311
#1  0x0832451b in RelationClearRelation (relation=0x966c7998, rebuild=<value optimized out>) at relcache.c:1901
#2  0x08324b9f in RelationFlushRelation () at relcache.c:1991
#3  RelationCacheInvalidateEntry (relationId=7217555) at relcache.c:2042
#4  0x0831dd89 in LocalExecuteInvalidationMessage (msg=0x89e2b50) at inval.c:510
#5  0x0831d341 in ProcessInvalidationMessages (hdr=0x89d20f4, func=0x831dc50 <LocalExecuteInvalidationMessage>) at inval.c:397
#6  0x0831d3dc in CommandEndInvalidationMessages () at inval.c:1006
#7  0x080c9035 in AtCommit_LocalCache () at xact.c:1009
#8  CommandCounterIncrement () at xact.c:634
#9  0x0826dcc9 in finish_xact_command () at postgres.c:2363
#10 0x0826ed4d in exec_simple_query (query_string=0x89b9904 "insert into foo_dest (id, name, size) select id, name, size from foobar_source") at postgres.c:1022
#11 0x0827042c in PostgresMain (argc=4, argv=0x894d5c8, username=0x894d590 "postgres") at postgres.c:3614
#12 0x0823a4d3 in BackendRun () at postmaster.c:3449
#13 BackendStartup () at postmaster.c:3063
#14 ServerLoop () at postmaster.c:1387
#15 0x0823b503 in PostmasterMain (argc=4, argv=0x894a678) at postmaster.c:1040
#16 0x081dc0a6 in main (argc=4, argv=0x894a678) at main.c:188
(gdb)

Hope that helps,

Rusty

--
Rusty Conover
rconover(a)infogears.com
InfoGears Inc / GearBuyer.com / FootwearBuyer.com
http://www.infogears.com
http://www.gearbuyer.com
http://www.footwearbuyer.com