A thought on Index Organized Tables [PgSql]


From: Gokulakannan Somasundaram on 24 Feb 2010 16:46

> I haven't thought about whether this is sufficient but if it is then
> an initial useful thing to add would be to use it for queries where we
> have a qual that can be checked using the index key even though we
> can't do a range scan. For example if you have a btree index on
> <a,b,c> and you have a WHERE clause like "WHERE c=0"
>
> That would be a much smaller change than IOT but it would still be a
> pretty big project. Usually the hardest part is actually putting the
> logic in the planner to determine whether it's advantageous. I would
> suggest waiting until after 9.0 is out the door to make sure you have
> the attention of Heikki or Tom or someone else who can spend the time
> to check that it will actually work before putting lots of work coding
> it.

I will try that. Thanks ...
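
The quoted suggestion can be made concrete with a toy model. The sketch below is not PostgreSQL code; the helper names and the sample rows are invented for illustration. It treats a btree index on <a,b,c> as a sorted list of (key, tid) entries and evaluates "WHERE c=0" against the index keys alone. Since c is not a leading column, every entry has to be walked. Whether the heap must still be visited to check tuple visibility is the question the rest of this thread is about, and the sketch ignores it.

```python
# Toy model: a btree index on (a, b, c) as a sorted list of (key, tid) entries.
# All names here are illustrative; none of this is PostgreSQL code.

def build_index(heap):
    """Return ((a, b, c), tid) entries sorted by key, as a btree would order them."""
    return sorted(((row["a"], row["b"], row["c"]), tid) for tid, row in enumerate(heap))


def full_index_scan_filter(index, c_value):
    """Walk every index entry and test c against the key itself.

    No range scan is possible because c is not a leading column, but the
    qual can still be checked without fetching the (wider) heap row.
    """
    return [tid for (a, b, c), tid in index if c == c_value]


heap = [
    {"a": 1, "b": 1, "c": 0, "payload": "x" * 100},
    {"a": 1, "b": 2, "c": 5, "payload": "y" * 100},
    {"a": 2, "b": 1, "c": 0, "payload": "z" * 100},
    {"a": 3, "b": 9, "c": 7, "payload": "w" * 100},
]
index = build_index(heap)
matching_tids = full_index_scan_filter(index, 0)
```

The planner question raised in the quote is whether walking the whole index this way is cheaper than a sequential scan of the heap, which this sketch does not try to estimate.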

From: Gokulakannan Somasundaram on 25 Feb 2010 02:19

> The WAL record of the heap insert/update/delete contains a flag
> indicating that the visibility map needs to be updated too. Thus no need
> for a separate WAL record.
>
>
Heikki,
Have you considered these cases?
a) The current WAL architecture makes sure that the WAL log is written
before the actual page flush (I believe). But you are changing that
architecture for visibility maps. A visibility map page might get flushed
out before the corresponding WAL gets written. I think you would then suggest
full-page writes here.
b) Say for a large table, you have multiple buffers of visibility map, then
there is a chance that one buffer gets flushed to the disk and the other
doesn't. If the WAL records are not in place, then this leads to a time
inconsistent visibility map.
c) If you are going to track all the WAL linked with a buffer of visibility
map, then you need to introduce another synchronization in the critical
path.

Maybe I am missing something? I am asking these questions only out of
curiosity.

Thanks,
Gokul.

From: Gokulakannan Somasundaram on 25 Feb 2010 02:39

On Thu, Feb 25, 2010 at 3:16 AM, Gokulakannan Somasundaram <
gokul007(a)gmail.com> wrote:

>> I haven't thought about whether this is sufficient but if it is then
>> an initial useful thing to add would be to use it for queries where we
>> have a qual that can be checked using the index key even though we
>> can't do a range scan. For example if you have a btree index on
>> <a,b,c> and you have a WHERE clause like "WHERE c=0"
>>
>> That would be a much smaller change than IOT but it would still be a
>> pretty big project. Usually the hardest part is actually putting the
>> logic in the planner to determine whether it's advantageous. I would
>> suggest waiting until after 9.0 is out the door to make sure you have
>> the attention of Heikki or Tom or someone else who can spend the time
>> to check that it will actually work before putting lots of work coding
>> it.
>
> I will try that. Thanks ...

Some more ideas popped up. I am just recording them.
a) In place of a block id (which would have to be issued for every
new/recycled block, and which does not exist in Postgres), we could even use
SnapshotNow's transaction id. I just feel the synchronization cost will be
higher here.
b) We can just record the current timestamp in the page. While this needs no
synchronization, it might create problems when we decide to go for
master-master replication and distributed databases. When that happens, the
clocks on the various systems will have to be synchronized.

Gokul.

From: Heikki Linnakangas on 25 Feb 2010 02:59

Gokulakannan Somasundaram wrote:
> a) The current WAL architecture makes sure that the WAL log is written
> before the actual page flush (I believe). But you are changing that
> architecture for Visibility maps. Visibility map might get flushed out
> before the corresponding WAL gets written.

Yes. When a bit is cleared, that's OK, because a cleared bit just means
"you need to check visibility in the heap tuple". When a bit is set,
however, it's important that it doesn't hit the disk before the
corresponding heap page update. That's why visibilitymap_set() sets the
LSN on the page.
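
The role of the page LSN can be shown with a small model. This is a sketch only: the class and function names (Wal, VmPage, vm_set, vm_clear, write_page) are hypothetical and are not the signatures used in visibilitymap.c. It assumes the usual WAL-before-data rule, under which a page cannot be written out until the WAL has been flushed up to that page's LSN.

```python
# Toy model of the WAL-before-data rule. Names are illustrative, not PostgreSQL's API.

class Wal:
    def __init__(self):
        self.next_lsn = 1      # LSN the next record will receive
        self.flushed_lsn = 0   # highest LSN known to be on disk

    def insert(self):
        """Append a record in memory and return its LSN."""
        lsn = self.next_lsn
        self.next_lsn += 1
        return lsn

    def flush(self, upto):
        """Force WAL to disk up to and including `upto`."""
        self.flushed_lsn = max(self.flushed_lsn, upto)


class VmPage:
    def __init__(self):
        self.lsn = 0     # LSN of the last WAL record this page depends on
        self.bits = {}   # heap block number -> all-visible bit


def vm_set(page, heap_blk, heap_record_lsn):
    """Set the bit and stamp the page with the heap record's LSN."""
    page.bits[heap_blk] = True
    page.lsn = max(page.lsn, heap_record_lsn)


def vm_clear(page, heap_blk):
    """Clear the bit. A cleared bit only means "check the heap", so no LSN is needed."""
    page.bits[heap_blk] = False


def write_page(page, wal):
    """Write the page out, flushing WAL up to the page LSN first."""
    if page.lsn > wal.flushed_lsn:
        wal.flush(page.lsn)
    return dict(page.bits)


wal = Wal()
vm = VmPage()
heap_lsn = wal.insert()          # WAL record for the heap page change
vm_set(vm, 7, heap_lsn)
flushed_before = wal.flushed_lsn
on_disk = write_page(vm, wal)    # forces the heap record to disk first
flushed_after = wal.flushed_lsn
```

In this model a set bit can never reach disk ahead of the WAL record it depends on, while a cleared bit carries no such ordering requirement.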

> b) Say for a large table, you have multiple buffers of visibility map, then
> there is a chance that one buffer gets flushed to the disk and the other
> doesn't. If the WAL records are not in place, then this leads to a time
> inconsistent visibility map.

Huh?

> c) If you are going to track all the WAL linked with a buffer of visibility
> map, then you need to introduce another synchronization in the critical
> path.

Double huh?

I'd suggest that you take some time to read the code and comments in
visibilitymap.c and the call sites of those functions, to get a better
picture of how it works.

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com


From: Gokulakannan Somasundaram on 25 Feb 2010 06:02

> Yes. When a bit is cleared, that's OK, because a cleared bit just means
> "you need to check visibility in the heap tuple". When a bit is set,
> however, it's important that it doesn't hit the disk before the
> corresponding heap page update. That's why visibilitymap_set() sets the
> LSN on the page.

OK. Say a session doing the update, which is the first update on the page,
resets PD_ALL_VISIBLE and crashes just before updating the visibility map.
The subsequent inserts/updates/deletes will see the PD_ALL_VISIBLE flag
cleared and never care to update the visibility map, even though they would
have created tuples in the index and table. So won't this return wrong
results?

Again, it is not clear from your documentation how you have handled this
situation.

Thanks,
Gokul.
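
The statement quoted earlier in the thread, that the WAL record of the heap insert/update/delete carries a flag saying the visibility map needs to be updated too, suggests one way a crash between the two steps could be covered during recovery. The sketch below is only a reading of that statement, not a description of the actual recovery code. The record layout, the flag name all_visible_cleared and the replay function are invented for illustration.

```python
# Toy crash-recovery model. Record layout and names are invented for illustration.

def replay(wal_records, heap_all_visible, vm_bits):
    """Redo durable WAL records against the on-disk state.

    A record whose flag says the all-visible status was cleared resets both
    the heap page flag and the visibility map bit, so the two cannot
    disagree after recovery.
    """
    for rec in wal_records:
        if rec["all_visible_cleared"]:
            heap_all_visible[rec["block"]] = False
            vm_bits[rec["block"]] = False
    return heap_all_visible, vm_bits


# State on disk at the moment of the crash: the session logged the heap
# update but never reached the in-memory visibility map update.
disk_heap_all_visible = {3: True}
disk_vm_bits = {3: True}
durable_wal = [{"block": 3, "all_visible_cleared": True}]

heap_after, vm_after = replay(
    durable_wal, dict(disk_heap_all_visible), dict(disk_vm_bits)
)
```

Under this reading, if the WAL record never became durable, the heap change would not be on disk either, so both the page flag and the map bit would still be set and consistent.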
