A thought on Index Organized Tables [PgSql]


From: Greg Stark on 24 Feb 2010 13:12

On Wed, Feb 24, 2010 at 5:46 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> "Kevin Grittner" <Kevin.Grittner(a)wicourts.gov> writes:
>> Greg Stark <gsstark(a)mit.edu> wrote:
>>> That doesn't work because when you split an index page any
>>> sequential scan in progress will either see the same tuples twice
>>> or will miss some tuples depending on where the new page is
>>> allocated. Vacuum has a clever trick for solving this but it
>>> doesn't work for arbitrarily many concurrent scans.
>
>> It sounds like you're asserting that Index Scan nodes are inherently
>> unreliable, so I must be misunderstanding you.
>
> We handle splits in a manner that ensures that concurrent index-order
> scans remain consistent. I'm not sure that it's possible to scale that
> to ensure that both index-order and physical-order scans would remain
> consistent. It might be soluble but it's certainly something to worry
> about.

It might be slightly easier given the assumption that you only want to
scan leaf tuples.
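
The hazard quoted above can be illustrated with a toy model. This is only a sketch under assumptions: the "file" is a Python list of pages, `split_page` and `physical_scan` are hypothetical names, and the split is fired at a fixed point in the scan rather than by a real concurrent backend. It is not PostgreSQL code.

```python
def split_page(pages, src, dst):
    """Move the upper half of pages[src] into slot dst (append if dst is one past the end)."""
    keys = pages[src]
    mid = len(keys) // 2
    pages[src], moved = keys[:mid], keys[mid:]
    if dst == len(pages):
        pages.append(moved)      # new page allocated at the end of the file
    else:
        pages[dst] = moved       # new page placed in a recycled (empty) slot


def physical_scan(pages, split_when, src, dst):
    """Read slots in physical order; a 'concurrent' split fires just before slot split_when is read."""
    seen = []
    slot = 0
    while slot < len(pages):
        if slot == split_when:
            split_page(pages, src, dst)
        seen.extend(pages[slot])
        slot += 1
    return seen


# Case 1: page 0 was already read, then splits into a page ahead of the scan.
duplicated = physical_scan([[1, 2, 3, 4], [5, 6]], split_when=1, src=0, dst=2)

# Case 2: page 2 is not yet read, and splits into a recycled slot behind the scan.
missed = physical_scan([[], [1, 2], [5, 6, 7, 8]], split_when=1, src=2, dst=0)
```

In the first case the scan returns 3 and 4 twice; in the second it never returns 7 and 8. Which failure occurs depends only on where the new page lands relative to the scan position.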

But there's an additional problem I didn't think of before. Currently
we optimize index scans by copying all relevant tuples to local memory
so we don't need to hold an index lock for an extended time or spend a
lot of time relocking and rechecking the index for changes. That
wouldn't be possible if we needed to get visibility info from the page
since we would need up-to-date information.
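
A minimal sketch of that optimization, assuming a hypothetical `LeafPage` class with a per-page lock (these names are illustrative, not taken from the PostgreSQL source). The scan holds the lock only long enough to copy the matching items, so its local batch is a snapshot that can go stale as soon as the lock is released:

```python
import threading
from dataclasses import dataclass, field


@dataclass
class LeafPage:
    """Hypothetical stand-in for an index leaf page guarded by a lock."""
    items: list
    lock: threading.Lock = field(default_factory=threading.Lock)


def read_page_batch(page, lo, hi):
    """Copy every key in [lo, hi] into local memory while holding the lock briefly."""
    with page.lock:
        batch = [k for k in page.items if lo <= k <= hi]
    return batch  # the lock is already released here


page = LeafPage(items=[1, 3, 5, 7, 9])
batch = read_page_batch(page, 3, 7)

# A later change to the page is invisible to the local copy.
with page.lock:
    page.items.remove(5)
```

A stale copy of keys is tolerable because the heap is still consulted for visibility. If the visibility information itself lived on the page, the stale copy would no longer be safe to use, which is the problem described above.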

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Gokulakannan Somasundaram on 24 Feb 2010 13:34

On Wed, Feb 24, 2010 at 10:09 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:

> "Kevin Grittner" <Kevin.Grittner(a)wicourts.gov> writes:
> > So you are essentially proposing that rather than moving the heap
> > data into the leaf tuples of the index in the index file, you will
> > move the leaf index data into the heap tuples? The pages in such a
> > IOT heap file would still need to look a lot like index pages, yes?
>
> > I'm not saying it's a bad idea, but I'm curious what benefits you
> > see to taking that approach.
>
> Isn't that just a variant on Heikki's "grouped index tuples" idea?
>
> regards, tom lane
>

No, Tom. Grouped index tuples don't use the B+ tree data structure to
achieve the sorting, so they will not guarantee 100% clustering of data.

Gokul.

From: Gokulakannan Somasundaram on 24 Feb 2010 13:52

> Yes. The visibility map doesn't need any new WAL records to be written.
>
> We probably will need to add some WAL logging to close the holes with
> crash recovery, required for relying on it for index-only-scans, but
> AFAICS only for VACUUM and probably only one WAL record for a whole
> bunch of heap pages, so it should be pretty insignificant.
>

Hmmm.... So whenever the update transaction changes a page, it will update
the visibility map, but won't write a WAL record for that change.
So after a crash we have a visibility map that has false positives.
Isn't that wrong?

>
> Let me repeat myself: if you think the overhead of a visibility map is
> noticeable or meaningful in any scenario, the onus is on you to show
> what that scenario is. I am not aware of such a scenario, which doesn't
> mean that it doesn't exist, of course, but hand-waving is not helpful.
>

Well, as a DB tuner, I am requesting that it be made an optional feature. If
you and everyone else feel convinced, consider my request.

>
>
> I'm not sure what you mean with "without any page level locking".
> Whenever a visibility map page is read or modified, a lock is taken on
> the buffer.
>
>
OK. I thought you were following some kind of lock-less algorithm there.
Then updaters/deleters of multiple pages might be waiting on the same lock,
and hence there is a chance of contention there, right? Again, correct me
if I am wrong (I might have understood things incorrectly).

Thanks,
Gokul.

From: Gokulakannan Somasundaram on 24 Feb 2010 14:04

Missed the group...

On Thu, Feb 25, 2010 at 12:34 AM, Gokulakannan Somasundaram <
gokul007(a)gmail.com> wrote:

>
>
> On Thu, Feb 25, 2010 at 12:28 AM, Gokulakannan Somasundaram <
> gokul007(a)gmail.com> wrote:
>
>>> That doesn't work because when you split an index page any sequential
>>> scan in progress will either see the same tuples twice or will miss
>>> some tuples depending on where the new page is allocated. Vacuum has a
>>> clever trick for solving this but it doesn't work for arbitrarily many
>>> concurrent scans.
>>
>> Consider how the range scans are working today, while the page split
>> happens.
>>
>> The Seq scan should follow the right sibling to do the seq scan.
>>
>> Gokul.
>>
>>
> Actually, having thought about what you suggested for a while, I think it
> should be possible, because the Oracle Fast Full Index Scan essentially
> scans the index like that. I will try to think of a way of doing that with
> Lehman and Yao...
>
> Gokul.
>
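
The "follow the right sibling" idea can be sketched as follows. This is a simplified illustration in the spirit of Lehman and Yao, not PostgreSQL's nbtree code: `Page`, `split`, and `index_order_scan` are hypothetical names, and the concurrent split is simulated with a callback that runs after a page has been read and before the scan moves right.

```python
class Page:
    """Hypothetical leaf page: sorted keys plus a link to the right sibling."""

    def __init__(self, keys, right=None):
        self.keys = keys
        self.right = right


def split(page):
    """Move the upper half of the page to a new right sibling and relink."""
    mid = len(page.keys) // 2
    new_page = Page(page.keys[mid:], page.right)
    page.keys = page.keys[:mid]
    page.right = new_page
    return new_page


def index_order_scan(first, after_read=None):
    """Walk leaves left to right, skipping keys at or below the last key returned."""
    seen = []
    last = None
    page = first
    while page is not None:
        for key in list(page.keys):
            if last is None or key > last:
                seen.append(key)
                last = key
        if after_read is not None:
            after_read(page)      # simulated concurrent activity
        page = page.right
    return seen


b = Page([5, 6])
a = Page([1, 2, 3, 4], right=b)

# Split page a right after the scan has read it.
result = index_order_scan(a, after_read=lambda p: split(p) if p is a else None)
```

Because a split only ever moves tuples to the right, the scan meets the moved keys again on the new sibling and can discard them by comparing with the last key it returned. A physical-order scan has no such ordering to rely on, which is why the same trick does not carry over directly.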

From: Heikki Linnakangas on 24 Feb 2010 14:13

Gokulakannan Somasundaram wrote:
> Hmmm.... So whenever the update transaction changes a page, it will update
> the visibility map, but won't enter the WAL record.
> So after the crash we have a visibility map, which has false positives.

The WAL record of the heap insert/update/delete contains a flag
indicating that the visibility map needs to be updated too. Thus no need
for a separate WAL record.
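
A rough sketch of how such a flag could work during crash recovery. The record layout and the names `HeapUpdateRecord`, `all_visible_cleared`, and `replay` are assumptions made for illustration and are not copied from the PostgreSQL source:

```python
from dataclasses import dataclass


@dataclass
class HeapUpdateRecord:
    """Hypothetical WAL record for a heap insert/update/delete."""
    heap_block: int
    payload: str
    all_visible_cleared: bool  # set when the change also cleared the page's map bit


def replay(records, heap, vm):
    """Redo heap changes; clear the visibility-map bit whenever the flag is set."""
    for rec in records:
        heap[rec.heap_block] = rec.payload
        if rec.all_visible_cleared:
            vm[rec.heap_block] = False


# State found after a crash: the map still claims block 7 is all-visible.
heap = {7: "old"}
vm = {7: True}
wal = [HeapUpdateRecord(heap_block=7, payload="new", all_visible_cleared=True)]

replay(wal, heap, vm)
```

After replay the stale "all visible" bit for block 7 is gone, even though no separate record was ever written for the visibility map.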

>> Let me repeat myself: if you think the overhead of a visibility map is
>> noticeable or meaningful in any scenario, the onus is on you to show
>> what that scenario is. I am not aware of such a scenario, which doesn't
>> mean that it doesn't exist, of course, but hand-waving is not helpful.
>
> Well as a DB Tuner, i am requesting to make it a optional feature.

There is no point in making something optional if there are no scenarios
where you would want to turn it off.

>> I'm not sure what you mean with "without any page level locking".
>> Whenever a visibility map page is read or modified, a lock is taken on
>> the buffer.
>>
> OK. I thought you are following some kind of lock-less algorithm there.
> Then updaters/deleters of multiple pages might be waiting on the same lock
> and hence there is a chance of a contention there right?

Yeah, there is some potential for contention. But again it doesn't seem
to be a problem in any real-life scenario; I didn't see any in the test
I performed, and IIRC I did try to invoke that case, and there have been
no reports of contention from users.

If it ever becomes a problem, maybe you could indeed switch to a
lock-less algorithm, but there doesn't seem to be any need for that.
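
To get a feel for how many heap pages share one visibility map buffer lock, here is a back-of-the-envelope sketch. The constants are assumptions (8 kB blocks, a 24-byte page header, one bit per heap page), and `vm_page_for` is a hypothetical helper name:

```python
BLOCK_SIZE = 8192          # assumed block size in bytes
PAGE_HEADER = 24           # assumed page header size in bytes
BITS_PER_HEAP_PAGE = 1     # assumed: one "all visible" bit per heap page

# Number of heap pages whose bits live on a single visibility map page.
HEAP_BLOCKS_PER_VM_PAGE = (BLOCK_SIZE - PAGE_HEADER) * 8 // BITS_PER_HEAP_PAGE


def vm_page_for(heap_block):
    """Return the visibility map page that holds the bit for heap_block."""
    return heap_block // HEAP_BLOCKS_PER_VM_PAGE


# Heap size covered by one map page, in MiB.
covered_mib = HEAP_BLOCKS_PER_VM_PAGE * BLOCK_SIZE / (1024 * 1024)
```

Under these assumptions, updaters touching any two heap pages within the same 510 MiB stretch of the table would take the lock on the same map buffer, which is where the potential contention comes from.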

--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com

