From: Itagaki Takahiro on
Hi, I'm reviewing the "Multibyte character set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about
the coding style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
similar. Can you merge the code? We can always use less_equal_internal()
if the overhead is negligible. Did you compare them?

* There are many "if (!multibyte_encoding)" in levenshtein_internal().
How about splitting the function into two functions, one for single-byte
chars and one for multi-byte chars? (e.g., levenshtein_internal_mb()) Or,
we can always use the multi-byte version if the overhead is small.

* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
numbers to me. Could you name the entries by defining a struct?
/*
* For multibyte encoding we'll also store array of lengths of
* characters and array with character offsets in first string
* in order to avoid great number of pg_mblen calls.
*/
prev = (int *) palloc(4 * m * sizeof(int));
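
Just to illustrate (a sketch only; the field names are suggestions, not
from the patch):

typedef struct
{
    int     prev;       /* distance in the previous row */
    int     curr;       /* distance in the current row */
    int     offset;     /* byte offset of this character in the string */
    int     char_len;   /* byte length of this character */
} LevenshteinCell;

LevenshteinCell *cells;

cells = (LevenshteinCell *) palloc(m * sizeof(LevenshteinCell));

That would document what each of the four slots means.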

* There are some compiler warnings. Avoid them if possible.
fuzzystrmatch.c: In function 'levenshtein_less_equal_internal':
fuzzystrmatch.c:428: warning: 'char_lens' may be used uninitialized in
this function
fuzzystrmatch.c:428: warning: 'offsets' may be used uninitialized in
this function
fuzzystrmatch.c:430: warning: 'curr_right' may be used uninitialized
in this function
fuzzystrmatch.c: In function 'levenshtein_internal':
fuzzystrmatch.c:222: warning: 'char_lens' may be used uninitialized in
this function

* Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
of 'm' is an integer.

* Need to fix the caution in docs.
http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
| Caution: At present, fuzzystrmatch does not work well with
| multi-byte encodings (such as UTF-8).
but now levenshtein supports multi-byte encodings! We should mention
which functions support multi-byte characters so as not to confuse users.

* (Not an issue for the patch, but...)
Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
The versions that avoid unpacking make the code a bit faster.
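
That is, something along these lines (variable names are just for
illustration):

    text       *src = PG_GETARG_TEXT_PP(0);
    const char *s = VARDATA_ANY(src);
    int         s_len = VARSIZE_ANY_EXHDR(src);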

--
Itagaki Takahiro


From: Alexander Korotkov on
Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
> similar. Can you merge the code? We can always use less_equal_internal()
> if the overhead is negligible. Did you compare them?
>
With a big value of max_d the overhead is significant. Here is an example
using the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
sum
---------
1386456
(1 row)

Time: 195,083 ms
test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from
words;
sum
---------
1386456
(1 row)

Time: 317,821 ms


> * There are many "if (!multibyte_encoding)" in levenshtein_internal().
> How about splitting the function into two functions, one for single-byte
> chars and one for multi-byte chars? (e.g., levenshtein_internal_mb()) Or,
> we can always use the multi-byte version if the overhead is small.
>
The multi-byte version was about 4 times slower. But I have rewritten my
CHAR_CMP macro as an inline function, and now it's only about 1.5 times
slower.
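
The idea is roughly the following (a sketch, not the exact code from the
patch):

static inline bool
char_cmp(const char *s, int s_len, const char *t, int t_len)
{
    /* characters of different byte length cannot be equal */
    if (s_len != t_len)
        return false;
    /* fast path for single-byte characters */
    if (s_len == 1)
        return *s == *t;
    return memcmp(s, t, s_len) == 0;
}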

In a database with multi-byte encoding:
test=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:
test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
id | word
-------+----------
69053 | peewee
69781 | pewee
81279 | sequence
88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think that overhead is not negligible. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.


> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like
> magic numbers to me. Could you name the entries by defining a struct?
> /*
> * For multibyte encoding we'll also store array of lengths of
> * characters and array with character offsets in first string
> * in order to avoid great number of pg_mblen calls.
> */
> prev = (int *) palloc(4 * m * sizeof(int));
>
In this line of code the memory is allocated for 4 arrays: prev, curr,
offsets, and char_lens. So I have joined offsets and char_lens into a struct.
But I can't join prev and curr because of this trick:
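/* swap the previous and current rows of the distance matrix */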
temp = curr;
curr = prev;
prev = temp;

> * There are some compiler warnings. Avoid them if possible.
> fuzzystrmatch.c: In function ‘levenshtein_less_equal_internal’:
> fuzzystrmatch.c:428: warning: ‘char_lens’ may be used uninitialized in
> this function
> fuzzystrmatch.c:428: warning: ‘offsets’ may be used uninitialized in
> this function
> fuzzystrmatch.c:430: warning: ‘curr_right’ may be used uninitialized
> in this function
> fuzzystrmatch.c: In function ‘levenshtein_internal’:
> fuzzystrmatch.c:222: warning: ‘char_lens’ may be used uninitialized in
> this function
>
Fixed.

> * Coding style: Use "if (m == 0)" instead of "if (!m)" when the type
> of 'm' is an integer.
>
Fixed.


> * Need to fix the caution in docs.
> http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
> | Caution: At present, fuzzystrmatch does not work well with
> | multi-byte encodings (such as UTF-8).
> but now levenshtein supports multi-byte encodings! We should mention
> which functions support multi-byte characters so as not to confuse users.
>
I've updated this caution. Also, I've added documentation for the
levenshtein_less_equal function.

> * (Not an issue for the patch, but...)
> Could you rewrite PG_GETARG_TEXT_P, VARDATA, and VARSIZE to
> PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
> The versions that avoid unpacking make the code a bit faster.
>
Fixed.

With best regards,
Alexander Korotkov.
From: Itagaki Takahiro on
2010/7/13 Alexander Korotkov <aekorotkov(a)gmail.com>:
> Anyway, I think that overhead is not negligible. That's why I have split
> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
> levenshtein_less_equal_internal_mb.

Thank you for the good measurements! Then it's reasonable to have multiple
implementations. The patch also has documentation. I'll change the status of
the patch to "Ready for Committer".

The patch is good enough except for the argument types of some functions.
For example:
- char* vs. const char*
- text* vs. const char* + length
I hope committers will check whether there are better types for them.

--
Itagaki Takahiro


From: Robert Haas on
On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
<itagaki.takahiro(a)gmail.com> wrote:
> 2010/7/13 Alexander Korotkov <aekorotkov(a)gmail.com>:
>> Anyway, I think that overhead is not negligible. That's why I have split
>> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
>> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
>> levenshtein_less_equal_internal_mb.
>
> Thank you for good measurement! Then, it's reasonable to have multiple
> implementations. It also has documentation. I'll change status of the
> patch to "Ready for Committer".
>
> The patch is good enough except argument types for some functions.
> For example:
> - char* vs. const char*
> - text* vs. const char* + length
> I hope committers would check whether there are better types for them.

This patch still needs some work. It includes a bunch of stylistic
changes that aren't relevant to the purpose of the patch. There's no
reason that I can see to change the existing levenshtein_internal
function to take text arguments instead of char *, or to change !m to
m == 0 in existing code, or to change the whitespace in the comments
of that function. All of those need to be reverted before we can
consider committing this.

There is a huge amount of duplicated code here. I think it makes
sense to have the multibyte version of the function be separate, but
perhaps we can merge the less-than-or-equal-to bits into the main
code, so that we only have two copies instead of four. Perhaps we
can just add a max_d argument to levenshtein_internal: if this
value is >= 0 then it represents the maximum allowable distance,
but if it is < 0 then there is no limit. Sure, that might
slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant
before we go to this much trouble.
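
To be concrete, here is roughly what I have in mind, sketched for the
single-byte case only (untested, using Min() and palloc() from the backend
headers; not the patch's actual code):

static int
levenshtein_internal(const char *s, int m, const char *t, int n,
                     int ins_c, int del_c, int sub_c, int max_d)
{
    int        *prev;
    int        *curr;
    int         i,
                j;

    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /* two rows of the distance matrix, allocated together */
    prev = (int *) palloc(2 * (m + 1) * sizeof(int));
    curr = prev + (m + 1);

    for (i = 0; i <= m; i++)
        prev[i] = i * del_c;

    for (j = 1; j <= n; j++)
    {
        int        *temp;
        int         row_min;

        curr[0] = j * ins_c;
        row_min = curr[0];

        for (i = 1; i <= m; i++)
        {
            int         d = Min(prev[i] + ins_c, curr[i - 1] + del_c);

            d = Min(d, prev[i - 1] + ((s[i - 1] == t[j - 1]) ? 0 : sub_c));
            curr[i] = d;
            row_min = Min(row_min, d);
        }

        /*
         * max_d < 0 means no limit.  Once every cell in a row exceeds
         * max_d, the final distance must too, so bail out early.
         */
        if (max_d >= 0 && row_min > max_d)
            return max_d + 1;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    return prev[m];
}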

The code doesn't follow the project coding style. Braces must be
uncuddled. Comment paragraphs will be reflowed unless they begin and
end with ------. Function definitions should have the type
declaration on one line and the function name at the start of the
next.
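
For reference, the expected shape is roughly (illustrative only):

static int
example_function(int m)
{
    if (m == 0)
        return 0;
    return m;
}

/*------
 * A comment paragraph that begins and ends with dashes, like this one,
 * will not be reflowed by pgindent.
 *------
 */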

Freeing memory with pfree is likely a waste of time; is there any
reason not to just rely on the memory context reset, as the original
coding did?

I think we might need to remove the acknowledgments section from this
code. If everyone who touches this code adds their name, we're
quickly going to have a mess. If we're not going to remove the
acknowledgments section, then please add my name, too, because I've
already patched this code once...

I'm going to set this back to "Waiting on Author".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov on
On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:

> This patch still needs some work. It includes a bunch of stylistic
> changes that aren't relevant to the purpose of the patch. There's no
> reason that I can see to change the existing levenshtein_internal
> function to take text arguments instead of char *, or to change !m to
> m == 0 in existing code, or to change the whitespace in the comments
> of that function. All of those need to be reverted before we can
> consider committing this.
>
I changed the arguments of the function from char * to text * in order to
avoid a text_to_cstring call. The same benefit can be achieved by replacing
char * with char * plus a length.
I changed !m to m == 0 because Itagaki asked me to make it conform to the
coding style. Do you think there is no reason to fix the coding style in
existing code?

> There is a huge amount of duplicated code here. I think it makes
> sense to have the multibyte version of the function be separate, but
> perhaps we can merge the less-than-or-equal-to bits into the main
> code, so that we only have two copies instead of four. Perhaps we
> can just add a max_d argument to levenshtein_internal: if this
> value is >= 0 then it represents the maximum allowable distance,
> but if it is < 0 then there is no limit. Sure, that might
> slow down the existing code a bit, but it might not be significant.
> I'd at least like to see some numbers showing that it is significant
> before we go to this much trouble.
>
In that case we would have to add many checks of max_d in the
levenshtein_internal function, which would make the code more complex.
Actually, we could merge all four functions into one function, but such a
function would have many checks for multibyte encoding and max_d. So I see
four options here:
1) one function, with checks for both multibyte encoding and max_d
2) two functions, with checks for multibyte encoding
3) two functions, with checks for max_d
4) four separate functions
If you prefer option 3, you should argue your position a little more.

> The code doesn't follow the project coding style. Braces must be
> uncuddled. Comment paragraphs will be reflowed unless they begin and
> end with ------. Function definitions should have the type
> declaration on one line and the function name at the start of the
> next.
>
> Freeing memory with pfree is likely a waste of time; is there any
> reason not to just rely on the memory context reset, as the original
> coding did?
>
OK, I'll fix these things.


> I think we might need to remove the acknowledgments section from this
> code. If everyone who touches this code adds their name, we're
> quickly going to have a mess. If we're not going to remove the
> acknowledgments section, then please add my name, too, because I've
> already patched this code once...
>
In that case I think we can leave the original acknowledgments section.

----
With best regards,
Alexander Korotkov.