patch: preload dictionary new version [PgSql]


From: Takahiro Itagaki on 7 Jul 2010 22:50

Pavel Stehule <pavel.stehule(a)gmail.com> wrote:

> this version has enhanced AllocSet allocator - it can use a mmap API.

I reviewed your patch and have some comments. However, I don't have
test cases for the patch because there are no large dictionaries in the
default postgres installation. I'd like to ask you to supply test data
for the patch.

This patch allocates memory with non-file-based mmap() to preload text search
dictionary files at server start. Note that dict files are not mmap'ed
directly in the patch; mmap() is used for reallocatable shared memory.

The dictionary loader is also modified a bit to use simple_alloc() instead
of palloc() for the long-lived cache. This reduces calls to AllocSetAlloc(),
which has some overhead in order to support pfree(). Since the cache is never
released, simple_alloc() seems to have better performance than palloc().
Note that the optimization will also work for non-preloaded dicts.

=== Questions ===
- How do backends share the dict cache? You might expect the postmaster's
catalog to be inherited by backends through fork(), but we don't use fork()
on Windows.

- Why are SQL functions dpreloaddict_init() and dpreloaddict_lexize()
defined but not used?

=== Design ===
- You added 3 custom parameters (dict_preload.dictfile/afffile/stopwords),
but I think text search configuration names are better than file names.
However, that requires system catalog access, and we cannot access any
catalog at the moment of preloading. If a config-name-based setting is
difficult, we need to write docs about where to get the names of the dicts
to be preloaded instead. (from \dFd+ ?)

- Do we need to support multiple preloaded dicts? I think dict_preload.*
should accept a list of items to be loaded. GUC_LIST_INPUT will be a help.

- The server doesn't start when I add dict_preload to
shared_preload_libraries without adding any custom parameters:
FATAL: missing AffFile parameter
The server should instead start with no effect, or print a WARNING
message such as "no dicts are preloaded" in that case.

- We could replace simple_alloc() with a new MemoryContextMethods that
doesn't support pfree() but has better performance. It doesn't look
ideal to me to implement simple_alloc() on top of palloc().

=== Implementation ===
I'm sure that your patch is WIP, but I'll note some issues just in case.

- We need a Makefile for contrib/dict_preload.

- mmap() is not always portable. We should check the availability
in configure, and also have an alternative implementation for Win32.

Regards,
---
Takahiro Itagaki
NTT Open Source Software Center

--
Sent via pgsql-hackers mailing list (pgsql-hackers(a)postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

From: Pavel Stehule on 8 Jul 2010 04:52

Hello

2010/7/8 Takahiro Itagaki <itagaki.takahiro(a)oss.ntt.co.jp>:
>
> Pavel Stehule <pavel.stehule(a)gmail.com> wrote:
>
>> this version has enhanced AllocSet allocator - it can use a mmap API.
>
> I reviewed your patch and have some comments. However, I don't have
> test cases for the patch because there are no large dictionaries in the
> default postgres installation. I'd like to ask you to supply test data
> for the patch.

You can use a Czech dictionary - please download it from
http://www.pgsql.cz/data/czech.tar.gz

CREATE TEXT SEARCH DICTIONARY cspell
(template=ispell, dictfile = czech, afffile=czech, stopwords=czech);
CREATE TEXT SEARCH CONFIGURATION cs (copy=english);
ALTER TEXT SEARCH CONFIGURATION cs
ALTER MAPPING FOR word, asciiword WITH cspell, simple;

postgres=# select * from ts_debug('cs','Příliš žluťoučký kůň se napil žluté vody');
   alias   |    description    |   token   |  dictionaries   | dictionary |   lexemes
-----------+-------------------+-----------+-----------------+------------+-------------
 word      | Word, all letters | Příliš    | {cspell,simple} | cspell     | {příliš}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluťoučký | {cspell,simple} | cspell     | {žluťoučký}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | kůň       | {cspell,simple} | cspell     | {kůň}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | se        | {cspell,simple} | cspell     | {}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | napil     | {cspell,simple} | cspell     | {napít}
 blank     | Space symbols     |           | {}              |            |
 word      | Word, all letters | žluté     | {cspell,simple} | cspell     | {žlutý}
 blank     | Space symbols     |           | {}              |            |
 asciiword | Word, all ASCII   | vody      | {cspell,simple} | cspell     | {voda}

>
> This patch allocates memory with non-file-based mmap() to preload text search
> dictionary files at server start. Note that dict files are not mmap'ed
> directly in the patch; mmap() is used for reallocatable shared memory.
>
> The dictionary loader is also modified a bit to use simple_alloc() instead
> of palloc() for the long-lived cache. This reduces calls to AllocSetAlloc(),
> which has some overhead in order to support pfree(). Since the cache is never
> released, simple_alloc() seems to have better performance than palloc().
> Note that the optimization will also work for non-preloaded dicts.

It gives slightly better speed, but the main benefit is a significant
memory reduction - palloc allocation is expensive, because it adds
4 bytes (8 bytes) to every allocation. That is a problem for the
thousands of small blocks that the TSearch ispell dictionary uses.
On 64-bit the overhead is horrible.
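
To make the overhead argument concrete, here is a minimal standalone sketch of a bump allocator in the spirit of simple_alloc(). It is not the patch's code: the names bump_arena, bump_init, bump_alloc and bytes_with_header are hypothetical, and the header size passed to bytes_with_header is an assumed figure for illustration (the real AllocSet overhead also depends on size rounding).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical arena: one big block handed out front to back, never freed piecewise. */
typedef struct bump_arena
{
    char   *base;   /* start of the backing block */
    size_t  used;   /* bytes handed out so far */
    size_t  size;   /* total capacity in bytes */
} bump_arena;

/* Returns 0 on success, -1 if the backing block cannot be allocated. */
int
bump_init(bump_arena *a, size_t size)
{
    assert(a != NULL);
    a->base = malloc(size);
    a->used = 0;
    a->size = size;
    return a->base ? 0 : -1;
}

/* Hands out n bytes rounded up to pointer alignment; no per-chunk header is stored. */
void *
bump_alloc(bump_arena *a, size_t n)
{
    size_t  align = sizeof(void *);
    size_t  need = (n + align - 1) & ~(align - 1);
    void   *p;

    assert(a != NULL && a->base != NULL);
    if (need < n || need > a->size - a->used)
        return NULL;            /* size overflow or arena exhausted */
    p = a->base + a->used;
    a->used += need;
    return p;
}

/* Bytes consumed by `count` chunks of `n` bytes when each chunk carries a header. */
size_t
bytes_with_header(size_t count, size_t n, size_t header)
{
    size_t  align = sizeof(void *);
    size_t  chunk = (n + header + align - 1) & ~(align - 1);

    return count * chunk;
}
```

Under these assumed numbers, 1000 chunks of 8 bytes with an 8-byte header take 16000 bytes, while the same chunks taken from the bump arena take 8000. The price is that individual chunks cannot be released, which is acceptable only because the dictionary cache is never freed.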

>
> === Questions ===
> - How do backends share the dict cache? You might expect the postmaster's
> catalog to be inherited by backends through fork(), but we don't use fork()
> on Windows.
>

I thought about some variants:
a) using shared memory - but it needs a larger shared memory
reservation, maybe some GUC - and this variant was rejected in
discussion.
b) using mmap on Unix and the CreateFileMapping API on Windows - but
that is a bit of a problem for me. I don't have development tools for
MS Windows, and I don't understand the MS Windows platform :(

Magnus, can you give me some tips?

Without MS Windows we don't need to solve shared memory and can rely
on fork alone. If we consider MS Windows too, then we have to count
on some shared-memory-based solution. But that has more
possibilities - a shared dictionary could be loaded at runtime too.
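
The fork-only case can be sketched as follows. This is a standalone illustration, not code from the patch: the function name child_sees_preloaded_data and the 4096-byte size are assumptions, and the parent and child merely stand in for the postmaster and a backend. It fills an anonymous (non-file-based) mapping before fork() and checks that the child can read it without loading anything again.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if the forked child saw the preloaded bytes, 0 if it did not,
 * and -1 if a system call failed.
 */
int
child_sees_preloaded_data(void)
{
    const char  word[] = "voda";
    size_t      len = 4096;
    char       *cache;
    pid_t       pid;
    int         status;

    assert(sizeof(word) <= len);

    /* Anonymous mapping: memory only, not backed by the dictionary file. */
    cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (cache == MAP_FAILED)
        return -1;
    memcpy(cache, word, sizeof(word));      /* "preload" before fork() */

    pid = fork();
    if (pid < 0)
    {
        munmap(cache, len);
        return -1;
    }
    if (pid == 0)
    {
        /* Child: the mapping is inherited; report whether the bytes match. */
        _exit(memcmp(cache, word, sizeof(word)) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) < 0)
    {
        munmap(cache, len);
        return -1;
    }
    munmap(cache, len);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 1 : 0;
}
```

With MAP_PRIVATE the pages are copy-on-write, so they stay physically shared only as long as neither process writes to them. None of this applies on Windows, where there is no fork(), which is why variant b) needs CreateFileMapping there.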

> - Why are SQL functions dpreloaddict_init() and dpreloaddict_lexize()
> defined but not used?

They are used, if I remember correctly. They use the ispell dictionary
API. The usage is simplified - you parametrize the preload dictionary,
and then you use the preloaded dictionary rather than some specific
dictionary. This has two advantages and one disadvantage: + very simple
configuration, + no shared dictionary manager is necessary, - only one
preloaded dictionary can be used.

>
> === Design ===
> - You added 3 custom parameters (dict_preload.dictfile/afffile/stopwords),
> but I think text search configuration names are better than file names.
> However, that requires system catalog access, and we cannot access any
> catalog at the moment of preloading. If a config-name-based setting is
> difficult, we need to write docs about where to get the names of the dicts
> to be preloaded instead. (from \dFd+ ?)
>

Yes, that is a valid argument - it is not possible to access this
data at preload time. I would like to support preloading (and possibly
sharing of session-loaded dictionaries), because it ensures constant
time for TSearch queries every time. Yes, some documentation or some
enhancement of the dictionary list info could be a solution.

> - Do we need to support multiple preloaded dicts? I think dict_preload.*
> should accept a list of items to be loaded. GUC_LIST_INPUT will be a help.
>

Maybe yes. Personally I would rather not complicate the design and
usage, and I don't know of any request for multiple preloaded dicts
right now. The preloaded dictionaries interface is a server-side matter
only, so it can be changed/enhanced later without problems. I have an
idea about enhancing the GUC parser to allow something like

preload_dictionary.patch = ...
preload_dictionary.czech = (template=ispell, dictfile = czech,
afffile=czech, stopwords=czech)
preload_dictionary.japan = (template=.....
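
As a rough illustration of what reading such a value could involve, here is a small standalone sketch that pulls one option out of a string in the proposed "(key=value, ...)" form. The syntax is only the proposal above, not something the GUC parser is known to accept; the function name dict_option_get is hypothetical, and quoting and escapes are deliberately not handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Find `key` in a string such as "(template=ispell, dictfile = czech)"
 * and copy its value into `out`. Returns 1 if found and copied, 0 if the
 * key is absent or the value does not fit in `outlen` bytes.
 */
int
dict_option_get(const char *spec, const char *key, char *out, size_t outlen)
{
    size_t      keylen = strlen(key);
    const char *p = spec;

    assert(outlen > 0);
    while (*p)
    {
        const char *name;
        const char *val;
        size_t      namelen;
        size_t      vallen;

        /* Skip separators in front of the option name. */
        while (*p && (isspace((unsigned char) *p) || *p == '(' || *p == ','))
            p++;
        name = p;
        while (*p && *p != '=' && *p != ',' && *p != ')' &&
               !isspace((unsigned char) *p))
            p++;
        namelen = (size_t) (p - name);

        while (*p && isspace((unsigned char) *p))
            p++;
        if (*p != '=')
        {
            /* Closing paren or malformed item: step over it and go on. */
            if (*p)
                p++;
            continue;
        }
        p++;                    /* skip '=' */
        while (*p && isspace((unsigned char) *p))
            p++;
        val = p;
        while (*p && *p != ',' && *p != ')' && !isspace((unsigned char) *p))
            p++;
        vallen = (size_t) (p - val);

        if (namelen == keylen && strncmp(name, key, keylen) == 0)
        {
            if (vallen >= outlen)
                return 0;       /* value too long for the caller's buffer */
            memcpy(out, val, vallen);
            out[vallen] = '\0';
            return 1;
        }
    }
    return 0;
}
```

A real implementation would more likely reuse the existing option parsing behind CREATE TEXT SEARCH DICTIONARY than hand-roll a scanner like this.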

> - The server doesn't start when I add dict_preload to
> shared_preload_libraries without adding any custom parameters:
> FATAL: missing AffFile parameter
> The server should instead start with no effect, or print a WARNING
> message such as "no dicts are preloaded" in that case.
>
> - We could replace simple_alloc() with a new MemoryContextMethods that
> doesn't support pfree() but has better performance. It doesn't look
> ideal to me to implement simple_alloc() on top of palloc().
>

I don't agree. The palloc API is designed to be general - so I implemented
a new memory context type via MMapAllocSetContextCreate and then I use
the palloc function. There is no reason to design a new API.

> === Implementation ===
> I'm sure that your patch is WIP, but I'll note some issues just in case.
>
> - We need a Makefile for contrib/dict_preload.

sure, sorry

>
> - mmap() is not always portable. We should check the availability
> in configure, and also have an alternative implementation for Win32.

Yes, that has to be the first step. I need an established API for simple
allocation. Maybe split this patch into two independent patches and
solve memory allocation first? Dictionary preloading isn't a complex
or large feature, so it can be handled in any commitfest. Memory
management is more important and can be handled first.

>
>
> Regards,
> ---
> Takahiro Itagaki
> NTT Open Source Software Center
>

Thank you very much for the review

Pavel Stehule


From: Pavel Stehule on 8 Jul 2010 05:11

Hello

I found a page, http://www.genesys-e.org/jwalter//mix4win.htm, which has
a section "Emulation of mmap/munmap". Could that be a solution?

Regards

Pavel Stehule


From: Robert Haas on 8 Jul 2010 05:53

On Wed, Jul 7, 2010 at 10:50 PM, Takahiro Itagaki
<itagaki.takahiro(a)oss.ntt.co.jp> wrote:
> This patch allocates memory with non-file-based mmap() to preload text search
> dictionary files at server start. Note that dict files are not mmap'ed
> directly in the patch; mmap() is used for reallocatable shared memory.

I thought someone (Tom?) had previously proposed the idea of writing a
dictionary precompiler that would produce a file which could then be
mmap()'d into the backend. Has any thought been given to that
approach?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Pavel Stehule on 8 Jul 2010 07:03

2010/7/8 Robert Haas <robertmhaas(a)gmail.com>:
> On Wed, Jul 7, 2010 at 10:50 PM, Takahiro Itagaki
> <itagaki.takahiro(a)oss.ntt.co.jp> wrote:
>> This patch allocates memory with non-file-based mmap() to preload text search
>> dictionary files at server start. Note that dict files are not mmap'ed
>> directly in the patch; mmap() is used for reallocatable shared memory.
>
> I thought someone (Tom?) had previously proposed the idea of writing a
> dictionary precompiler that would produce a file which could then be
> mmap()'d into the backend. Has any thought been given to that
> approach?

The precompiler can only save some time related to parsing, but that
isn't the main issue. Without simple allocation the dictionary data
takes about 55 MB; with simple allocation, about 10 MB. If you have
100 sessions, then this data can be repeated 100 times in memory -
about 1 GB (for the Czech dictionary). I think that memory can be used
better.
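
The arithmetic behind these figures, as a small sketch. The 55 MB, 10 MB and 100-session numbers come from the message above; the function name per_session_total_mb is hypothetical, and the sketch assumes every session holds its own private copy of the dictionary.

```c
#include <assert.h>
#include <stdint.h>

/* Total dictionary memory in MB when each session loads its own copy. */
uint64_t
per_session_total_mb(uint64_t dict_mb, uint64_t sessions)
{
    /* Guard against overflow of the multiplication. */
    assert(sessions == 0 || dict_mb <= UINT64_MAX / sessions);
    return dict_mb * sessions;
}
```

With simple allocation, per_session_total_mb(10, 100) gives 1000 MB, which is the "about 1 GB" above; with the current allocation, per_session_total_mb(55, 100) gives 5500 MB. A single preloaded copy shared by all sessions would stay at roughly the size of one copy.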

At a minimum you have to read these 10 MB from disk - maybe from the
file cache - and that takes some time too, but it will be significantly
better than now.

Regards
Pavel Stehule

>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise Postgres Company
>
