From: Alexander Korotkov on
I forgot the attribution in the levenshtein.c file.

----
With best regards,
Alexander Korotkov.
From: Robert Haas on
On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov
<aekorotkov(a)gmail.com> wrote:
> Ok, here is the patch for multi-byte characters.
> I changed the arguments of the levenshtein_internal function from text *
> to const char * and int. I think that makes levenshtein_internal more
> reusable. For example, the function can then compute the distance between
> two fragments of strings without having to copy those fragments.
> Actually, I'm not fully satisfied with how I merged the single-byte and
> multi-byte versions of the function. There are "ifdef"s in the function
> body and macros that are not context-free. This is because the multi-byte
> path needs to store the character lengths returned by pg_mblen, while the
> single-byte version doesn't need that. I tried a multi-byte version that
> doesn't store the character lengths, but then the function is about 1.5
> times slower (it has to call pg_mblen n*m times).
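
(A minimal sketch of that length-caching idea, for illustration only; the
helper name and details below are assumptions, not the patch's actual code.)

static int *
precompute_char_lengths(const char *s, int s_bytes, int *nchars)
{
    /* at most s_bytes characters, so this over-allocates slightly */
    int        *lens = (int *) palloc(s_bytes * sizeof(int));
    const char *p = s;
    int         i = 0;

    while (p < s + s_bytes)
    {
        lens[i] = pg_mblen(p);  /* one pg_mblen() call per character */
        p += lens[i];
        i++;
    }
    *nchars = i;                /* character count, as opposed to byte count */
    return lens;
}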

I reviewed this code in a fair amount of detail today and ended up
rewriting it. In general terms, it's best to avoid changing things
that are not relevant to the central purpose of the patch. This patch
randomly adds a whole bunch of whitespace and removes a whole bunch of
comments which I find useful and do not want to see removed. It took
me about an hour to reverse out all of those changes, which then let
me get down to the business of actually analyzing the code. The
biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present.

The attached patch takes the approach of using a fast-path for just
the innermost loop when neither string contains any multi-byte
characters (regardless of whether or not the encoding allows
multi-byte characters). It's still slower than CVS HEAD, but for
strings containing no multi-byte characters it's only marginally, if
at all, slower than previous releases, because 9.0 doesn't contain
your fix to avoid the extra string copy; and it's faster than your
implementation even when multi-byte characters ARE present. It is
slightly slower on single-byte encoding databases (I tested
SQL_ASCII), but still faster than previous releases. It's also quite
a bit less code duplication.
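
(For concreteness, a sketch of the fast-path test being described; the
function and variable names here are illustrative, not taken from the patch.)

/*
 * Sketch only: the fast path is usable when both strings consist solely of
 * single-byte characters, i.e. the byte count equals the character count.
 */
static bool
can_use_fast_path(const char *s_data, int s_bytes,
                  const char *t_data, int t_bytes)
{
    int     m = pg_mbstrlen_with_len(s_data, s_bytes);  /* chars in s */
    int     n = pg_mbstrlen_with_len(t_data, t_bytes);  /* chars in t */

    return (m == s_bytes && n == t_bytes);
}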

Please review and let me know your thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Robert Haas on
2010/8/2 Alexander Korotkov <aekorotkov(a)gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it.  In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch.  This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed.  It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch; I will be more careful next time. But I
> think there is no consensus about some changes, like replacing "!m" with
> "m == 0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast-path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters).  It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present.  It is
>> slightly slower on single-byte encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases.  It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster than
> mine when multi-byte characters are present? My tests show the reverse. For
> example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text not null primary key);
COPY words from '/usr/share/dict/words';

SELECT SUM(levenshtein(a, 'foo')) from words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran about 102 ms, 222 ms,
129 ms. With your patch, I get about 134 ms, 222 ms, 130 ms. Perhaps
this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte
words in it. Can you send me a file?

> I think that the cause of the slowdown is the memcmp function. This
> function is very fast for long stretches of memory, but for just a few
> bytes an inline function is faster, because the compiler has more freedom
> to optimize. After replacing memcmp with an inline function like the
> following in your implementation:

Yeah, that does seem to be slightly better. I've done a version of
this in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Robert Haas on
2010/8/2 Alexander Korotkov <aekorotkov(a)gmail.com>:
> The dump of the table with the Russian dictionary is attached.
>
> I used the following tests:
> SELECT SUM(levenshtein(a, 'foo')) from words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) from words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your latest version of the patch the results are:
> 63ms 94ms 61ms 100ms 121ms 217ms.
>
> But I found another optimization. With this condition:
> if (x[x_char_len-1] == y[y_char_len-1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> the results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> With single-byte characters the results almost didn't change, but they get
> better with multi-byte characters. The improvement comes from the fact that
> the first bytes of different characters are frequently the same (for
> example, almost all Cyrillic characters start with the same byte, and I
> think the situation is similar in other alphabets), whereas a match on the
> last byte is usually just a coincidence.
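
(For reference, a minimal sketch of what the char_same() helper in that
condition might look like; its exact body in the patch may differ.)

/* Return true if the first len bytes of x and y are identical. */
static inline bool
char_same(const char *x, const char *y, int len)
{
    while (len-- > 0)
    {
        if (*x++ != *y++)
            return false;
    }
    return true;
}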

Hey, that's really cool. It helps a lot here, too:

My previous patch, with your 6 test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same 6 test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last character
first, same 6 test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Alexander Korotkov on
On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:

> I reviewed this code in a fair amount of detail today and ended up
> rewriting it. In general terms, it's best to avoid changing things
> that are not relevant to the central purpose of the patch. This patch
> randomly adds a whole bunch of whitespace and removes a whole bunch of
> comments which I find useful and do not want to see removed. It took
> me about an hour to reverse out all of those changes, which then let
> me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more careful next time. But I
think there is no consensus about some changes, like replacing "!m" with
"m == 0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of the assumption that both strings are single-byte is
the assumption that at least one string is not single-byte. With this patch
the levenshtein function can calculate the distance incorrectly:
test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)
The line should be rewritten as:
"if (m != s_bytes || n != t_bytes)"

> The attached patch takes the approach of using a fast-path for just
> the innermost loop when neither string contains any multi-byte
> characters (regardless of whether or not the encoding allows
> multi-byte characters). It's still slower than CVS HEAD, but for
> strings containing no multi-byte characters it's only marginally, if
> at all, slower than previous releases, because 9.0 doesn't contain
> your fix to avoid the extra string copy; and it's faster than your
> implementation even when multi-byte characters ARE present. It is
> slightly slower on single-byte encoding databases (I tested
> SQL_ASCII), but still faster than previous releases. It's also quite
> a bit less code duplication.
>

Can I look at the test which shows that your implementation is faster than
mine when multi-byte characters are present? My tests show the reverse. For
example, with the Russian dictionary from OpenOffice:

With my version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 277,190 ms

With your version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. This function
is very fast for long stretches of memory, but for just a few bytes an
inline function is faster, because the compiler has more freedom to
optimize. After replacing memcmp with an inline function like the following
in your implementation:

/* Return true if the first len bytes of s and d are identical. */
static inline bool
char_cmp(const char *s, const char *d, int len)
{
    while (len--)
    {
        if (*s++ != *d++)
            return false;
    }
    return true;
}

Code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)

Time: 241,272 ms

----
With best regards,
Alexander Korotkov.