patch: preload dictionary new version [PgSql]


From: Robert Haas on 8 Jul 2010 07:53

On Thu, Jul 8, 2010 at 7:03 AM, Pavel Stehule <pavel.stehule(a)gmail.com> wrote:
> 2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
>> On Wed, Jul 7, 2010 at 10:50 PM, Takahiro Itagaki
>> <itagaki.takahiro(a)oss.ntt.co.jp> wrote:
>>> This patch allocates memory with non-file-based mmap() to preload text search
>>> dictionary files at the server start. Note that dict files are not mmap'ed
>>> directly in the patch; mmap() is used for reallocatable shared memory.
>>
>> I thought someone (Tom?) had previously proposed the idea of writing a
>> dictionary precompiler that would produce a file which could then be
>> mmap()'d into the backend. Has any thought been given to that
>> approach?
>
> The precompiler can save only some time related to parsing, but that
> isn't the main issue. Without simple allocation the data from the
> dictionary takes about 55 MB, with simple allocation about 10 MB. If
> you have 100 sessions, this data can be repeated 100 times in memory -
> about 1 GB (for the Czech dictionary). I think memory can be used
> better.

A precompiler can give you all the same memory management benefits.

> Minimally you have to read these 10MB from disc - maybe from file
> cache - but it takes some time too - but it will be significantly
> better than now.

If you use mmap(), you don't need to do anything of the sort. And the
EXEC_BACKEND case doesn't require as many gymnastics, either. And the
variable can be PGC_SIGHUP or even PGC_USERSET instead of
PGC_POSTMASTER.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Pavel Stehule on 8 Jul 2010 08:20

2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
> On Thu, Jul 8, 2010 at 7:03 AM, Pavel Stehule <pavel.stehule(a)gmail.com> wrote:
>> 2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
>>> On Wed, Jul 7, 2010 at 10:50 PM, Takahiro Itagaki
>>> <itagaki.takahiro(a)oss.ntt.co.jp> wrote:
>>>> This patch allocates memory with non-file-based mmap() to preload text search
>>>> dictionary files at the server start. Note that dict files are not mmap'ed
>>>> directly in the patch; mmap() is used for reallocatable shared memory.
>>>
>>> I thought someone (Tom?) had proposed idea previously of writing a
>>> dictionary precompiler that would produce a file which could then be
>>> mmap()'d into the backend. Has any thought been given to that
>>> approach?
>>
>> The precompiler can save only some time related to parsing. But it
>> isn't main issue. Without simple allocation the data from dictionary
>> takes about 55 MB, with simple allocation about 10 MB. If you have a
>> 100 max_session, then these data can be 100 x repeated in memory -
>> about 1G (for Czech dictionary). I think so memory can be used
>> better.
>
> A precompiler can give you all the same memory management benefits.
>
>> Minimally you have to read these 10MB from disc - maybe from file
>> cache - but it takes some time too - but it will be significantly
>> better than now.
>
> If you use mmap(), you don't need to do anything of the sort. And the
> EXEC_BACKEND case doesn't require as many gymnastics, either. And the
> variable can be PGC_SIGHUP or even PGC_USERSET instead of
> PGC_POSTMASTER.

I use mmap(). And with mmap a precompiler is not necessary. The
dictionary is loaded only once, in the original ispell format. I
think it is much simpler for administration: just copy the ispell
files. There are no possible problems with binary incompatibility,
you don't need to solve serialisation and deserialisation, and you
don't need to copy the TSearch ispell parser code to a client
application. We would probably still want to support non-compiled
ispell dictionaries. Using a precompiler raises new questions for
upgrades!

The real problem is that some other API has to be used on MS Windows,
where mmap doesn't exist.

I think we can divide this problem into three parts:

a) simple allocator - it can be used not only for TSearch dictionaries.
b) sharing the data - it is important for large dictionaries
c) preloading - it decreases the load time of the first TSearch query

Regards

Pavel Stehule

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>


From: Tom Lane on 8 Jul 2010 10:18

Pavel Stehule <pavel.stehule(a)gmail.com> writes:
> 2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
>> A precompiler can give you all the same memory management benefits.

> I use mmap(). And with mmap the precompiler are not necessary.
> Dictionary is loaded only one time - in original ispell format. I
> think, it is much more simple for administration - just copy ispell
> files. There are not some possible problems with binary
> incompatibility, you don't need to solve serialisation,
> deserialiasation, ...you don't need to copy TSearch ispell parser code
> to client application - probably we would to support not compiled
> ispell dictionaries still. Using a precompiler means a new questions
> for upgrade!

You're inventing a bunch of straw men to attack. There's no reason that
a precompiler approach would have to put any new requirements on the
user. For example, the dictionary-load code could automatically execute
the precompile step if it observed that the precompiled copy of the
dictionary was missing or had an older file timestamp than the source.

I like the idea of a precompiler step mainly because it still gives you
most of the benefits of the patch on platforms without mmap. (Instead
of mmap'ing, just open and read() the precompiled file.) In particular,
you would still have a creditable improvement for Windows users without
writing any Windows-specific code.
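
To illustrate the "mmap if you can, otherwise open and read()" idea,
here is a rough POSIX-only sketch. The DictImage struct and both
function names are hypothetical. The use_mmap flag stands in for what
would really be a compile-time check, since a platform without mmap
would not have <sys/mman.h> at all.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle for a loaded precompiled dictionary file. */
typedef struct DictImage
{
    void   *data;       /* start of the file contents */
    size_t  len;        /* size in bytes */
    int     mapped;     /* 1 if data came from mmap(), 0 if from malloc() */
} DictImage;

/* Load the file at path; returns 0 on success, -1 on failure. */
int
dict_image_load(const char *path, int use_mmap, DictImage *img)
{
    struct stat st;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
    {
        close(fd);
        return -1;
    }
    img->data = NULL;
    img->len = (size_t) st.st_size;
    img->mapped = 0;

    if (use_mmap)
    {
        /* Read-only shared mapping: pages can be shared between processes. */
        void *p = mmap(NULL, img->len, PROT_READ, MAP_SHARED, fd, 0);

        if (p != MAP_FAILED)
        {
            img->data = p;
            img->mapped = 1;
        }
    }

    if (img->data == NULL)
    {
        /* Fallback: copy the whole file into private memory with read(). */
        char   *buf = malloc(img->len);
        size_t  done = 0;

        if (buf == NULL)
        {
            close(fd);
            return -1;
        }
        while (done < img->len)
        {
            ssize_t n = read(fd, buf + done, img->len - done);

            if (n <= 0)         /* error or unexpected end of file */
            {
                free(buf);
                close(fd);
                return -1;
            }
            done += (size_t) n;
        }
        img->data = buf;
    }

    close(fd);                  /* the mapping stays valid after close() */
    return 0;
}

/* Release whatever dict_image_load() acquired. */
void
dict_image_release(DictImage *img)
{
    if (img->mapped)
        munmap(img->data, img->len);
    else
        free(img->data);
    img->data = NULL;
    img->len = 0;
}
```

Either way the caller sees the same bytes; only the read() path pays
for a private copy in every process.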

> I think we can divide this problem to three parts

> a) simple allocator - it can be used not only for TSearch dictionaries.

I think that's a waste of time, frankly. There aren't enough potential
use cases.

> b) sharing a data - it is important for large dictionaries

Useful but not really essential.

> c) preloading - it decrease load time of first TSearch query

This is the part that is the make-or-break benefit of the patch.
You need a solution that cuts load time even when mmap isn't
available.

regards, tom lane


From: Pavel Stehule on 9 Jul 2010 02:44

2010/7/8 Tom Lane <tgl(a)sss.pgh.pa.us>:
> Pavel Stehule <pavel.stehule(a)gmail.com> writes:
>> 2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
>>> A precompiler can give you all the same memory management benefits.
>
>> I use mmap(). And with mmap the precompiler are not necessary.
>> Dictionary is loaded only one time - in original ispell format. I
>> think, it is much more simple for administration - just copy ispell
>> files. There are not some possible problems with binary
>> incompatibility, you don't need to solve serialisation,
>> deserialiasation, ...you don't need to copy TSearch ispell parser code
>> to client application - probably we would to support not compiled
>> ispell dictionaries still. Using a precompiler means a new questions
>> for upgrade!
>
> You're inventing a bunch of straw men to attack. There's no reason that
> a precompiler approach would have to put any new requirements on the
> user. For example, the dictionary-load code could automatically execute
> the precompile step if it observed that the precompiled copy of the
> dictionary was missing or had an older file timestamp than the source.

Uff - just the safe activation of a precompiler needs a lot of
low-level code. But maybe I see it wrong, as I don't work directly with
files inside pg. Still, I can't see it as a simple solution.

>
> I like the idea of a precompiler step mainly because it still gives you
> most of the benefits of the patch on platforms without mmap. (Instead
> of mmap'ing, just open and read() the precompiled file.) In particular,
> you would still have a creditable improvement for Windows users without
> writing any Windows-specific code.
>

Loading about 10 MB takes about 30 ms on my computer - it is better
than 90 ms, but it isn't a win.

>> I think we can divide this problem to three parts
>
>> a) simple allocator - it can be used not only for TSearch dictionaries.
>
> I think that's a waste of time, frankly. There aren't enough potential
> use cases.
>
>> b) sharing a data - it is important for large dictionaries
>
> Useful but not really essential.
>
>> c) preloading - it decrease load time of first TSearch query
>
> This is the part that is the make-or-break benefit of the patch.
> You need a solution that cuts load time even when mmap isn't
> available.
>

I am not sure whether such a solution exists, or whether it is
necessary. Probably the main problem is with the Czech language - we
have a few specialities. For the Czech environment, the UNIX and
Windows platforms are the most important; I have no information about
anyone here using Postgres and full-text search on other platforms. So
the solution probably doesn't need to be in core. I am now thinking
about a pgfoundry project, something like an ispell dictionary
preload.

I can send just a simplified version without preloading and sharing,
solving only the memory issue - I think there are no differing
opinions on that.

best regards

Pavel Stehule

> regards, tom lane
>


From: Itagaki Takahiro on 11 Jul 2010 20:47

2010/7/8 Tom Lane <tgl(a)sss.pgh.pa.us>:
> For example, the dictionary-load code could automatically execute
> the precompile step if it observed that the precompiled copy of the
> dictionary was missing or had an older file timestamp than the source.

There might be a problem with an automatic precompiler: where should we
save the result? The OS user running the postgres server normally
doesn't have write permission to $PGSHARE. Instead, we could store the
precompiled result in $PGDATA/pg_dict_cache or so.
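
As a small illustration of that layout, the cache file name could be
derived like this. It is only a sketch: the function name and the
".bin" suffix are assumptions, and only the pg_dict_cache directory
name comes from the suggestion above.

```c
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build "<pgdata>/pg_dict_cache/<dict_name>.bin"
 * in dst. Returns 0 on success, -1 if the result does not fit.
 */
int
dict_cache_path(char *dst, size_t dstlen,
                const char *pgdata, const char *dict_name)
{
    int n = snprintf(dst, dstlen, "%s/pg_dict_cache/%s.bin",
                     pgdata, dict_name);

    /* snprintf reports the length it needed; reject truncated paths. */
    if (n < 0 || (size_t) n >= dstlen)
        return -1;
    return 0;
}
```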

> I like the idea of a precompiler step mainly because it still gives you
> most of the benefits of the patch on platforms without mmap.

I also like the precompiler solution. I think the most important benefit
of the approach is that we don't need to declare dictionaries to be preloaded
in configuration files; we can always use mmap() for all dictionary files.

--
Takahiro Itagaki

