A thought on Index Organized Tables [PgSql]

Prev: [COMMITTERS] pgsql: Oops, don't forget to rewind the directory before scanning it to
Next: Time travel on the buildfarm

From: "Kevin Grittner" on 24 Feb 2010 10:46

Tom Lane <tgl(a)sss.pgh.pa.us> wrote:

> The fundamental point IMHO is that indexes are more complex and
> much more fragile than heaps. This is obviously true
> theoretically and we have years of experience that proves it to be
> true in the field as well. Broken comparison functions are just
> one of the possible hazards; there are many others.
>
> Now with standard indexes you can always recover from any problem
> via REINDEX; no matter how badly the index is messed up, your data
> is still there and not damaged. (Well, maybe it will fail a
> unique constraint check or something, but at least it's still
> there.)
>
> With an IOT I don't understand how you get out of index corruption
> without data loss. That's a showstopper for practical use, I
> think.

Having used the IOT implementation ("clustered indexes") in SQL
Server and then Sybase ASE starting with SQL Server 1.0, I can
relate my experiences on that. In about 18 years with over 100
databases we had maybe five or ten times that such damage made it
difficult to recover data -- typically the result of hardware
problems. This implementation had a double linked list of pointers
through the leaf level pages, so normally a query which generated a
full table scan would follow these and work. When said pointers
were damaged we would query through the index tree to see what we
could reach. There were usually other indexes on tables, which
would give us other paths to the data in these leaf pages. It was
sometimes necessary to subdivide a range in which we were getting an
error to find the "edges" of the damaged area.

There were sometimes small areas we could not reach, for which we
had to look to backups or source documents. There's clearly no
database technology which guarantees you will never have to do that
in the face of a hardware failure. To the extent that such a
technique reduces the redundant storage of values, it clearly
affects recovery options.

All in all, I suspect that it would be underrating the talent pool
available for PostgreSQL development to say we can't get to a
feature which SQL server had in version 1.0 and maintains through
their conversion to MVCC. Where it fits in the scheme of priorities
and cost/benefit is certainly a valid question. It does provide
significant benefits for some use cases, but it's certainly not
trivial to implement.

-Kevin

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Greg Stark on 24 Feb 2010 10:52

On Wed, Feb 24, 2010 at 3:12 PM, Tom Lane <tgl(a)sss.pgh.pa.us> wrote:
> With an IOT I don't understand how you get out of index corruption
> without data loss. That's a showstopper for practical use, I think.

That doesn't seem insurmountable to me. You could always allow a
REINDEX to scan the index sequentially, ignoring any index structure,
just using the tuples it finds.

However it seems to me this discussion has several only barely related
issues being covered.

1) transaction information in index

This seems like a lot of bloat in indexes. It also means breaking
a lot of other optimizations such as being able to read the tuples
directly from the heap page without locking. I'm not sure how much
those are worth though. But adding 24 bytes to every index entry seems
pretty unlikely to be a win anyways.

2) Index organized tables

This seems like a non-starter to me. We would lose the option of
doing sequential scans and the ability to have any other indexes on
the table. That would be comparable to Oracle circa 1985. We can do
better with stuff like Heikki's "grouped index tuple" and the
visibility map which don't interfere with other features as much.

3) Depending on refinding keys in the index for basic operatoin

Currently if your index procedure/operator is ill-behaved then your
index searches might fail to return matching keys. But vacuum will
work correctly and you will never have an index pointer pointing to a
dead tuple or a tuple different from the one that was originally
inserted. Things like "retail vacuum" were proposed in the past but
were rejected because the consequences of an incorrect index procedure
become much worse. You could get dangling index pointers pointing to
nonexistent tuples or even pointing to new tuples that have been
inserted into the same slot later.

I don't think these three are actually related. Afaict neither IOT nor
visibility information in indexes depend on refinding keys in the
index. But it's possible I'm missing something. Even so they're still
three separate issues.

--
greg

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Gokulakannan Somasundaram on 24 Feb 2010 11:05

>
>
> I think you're a barking up the wrong tree. AFAIUI, the need for the
> visibility map has not very much to do with whether the table has
> indices, and everything to do with avoiding unnecessary VACUUMs. In
> any event, you've not shown that the visibility map HAS any overhead,
> so talking about skipping it seems entirely premature. Keep in mind
> that the visibility map is quite small.
>

OK! i am not saying to remove the visibility map, if i am misunderstood. All
i am saying here is to remove the index only scan processing of visibility
map. If it is being used only for vacuums, you need not make it crash safe
and no WAL comes into picture.

>
> The point of the visibility map as far as index-only scans are
> concerned is that if all the needed column values can be extracted
> from the index, we still need to read the heap page to check tuple
> visibility - unless, of course, we already know from the visibility
> map that all the tuples on that heap page are guaranteed to be visible
> to all transactions. On a read-only or read-mostly table, this will
> reduce the cost of checking tuple visibility by several orders of
> magnitude.
>
> I understand that. As i suggested above, if you have no indexes for a
table, why do you need to spend the extra effort in making it crash safe for
that table? Hope i am clear.

Thanks,
Gokul.

From: Gokulakannan Somasundaram on 24 Feb 2010 11:12

>
> But adding 24 bytes to every index entry seems
> pretty unlikely to be a win anyways.
>

We actually wanted to make it optional. Not every index will be like that.
More than that we can make it into 16 bytes. Only commands within the same
transaction will not be able to do a index only scan.

> This seems like a non-starter to me. We would lose the option of
> doing sequential scans and the ability to have any other indexes on
> the table. That would be comparable to Oracle circa 1985. We can do
> better with stuff like Heikki's "grouped index tuple" and the
> visibility map which don't interfere with other features as much.
>

Sequential scans can be done on IOTs, just scan through the leaf pages. I
think you are talking about IOTs with overflow regions.
As i said already, this serves a different set of options to the DB
Designer.

>
>
> I don't think these three are actually related. Afaict neither IOT nor
> visibility information in indexes depend on refinding keys in the
> index. But it's possible I'm missing something. Even so they're still
> three separate issues.
>
> If we have visibility information in a heap, we need to goto the same index
tuple, whenever there is a update/delete. Now if there is a broken function,
it won't let us reach the index from the heap tuple . Hope you are able to
get it.

Thanks,
Gokul.

From: Gokulakannan Somasundaram on 24 Feb 2010 11:16

>
>
> With an IOT I don't understand how you get out of index corruption
> without data loss. That's a showstopper for practical use, I think.
>

For simplicity, say we are storing all the non-leaf pages of the index in a
seperate file, then the leaf pages are nothing but the table. So if we can
replicate the table, then we can replicate the non-leaf pages (say by some
modified version of reindex).

Thanks,
Gokul.

First | Prev | Next | Last
Pages: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
Prev: [COMMITTERS] pgsql: Oops, don't forget to rewind the directory before scanning it to
Next: Time travel on the buildfarm