From: Robert Haas on
On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
<aekorotkov(a)gmail.com> wrote:
> On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> This patch still needs some work.  It includes a bunch of stylistic
>> changes that aren't relevant to the purpose of the patch.  There's no
>> reason that I can see to change the existing levenshtein_internal
>> function to take text arguments instead of char *, or to change !m to
>> m == 0 in existing code, or to change the whitespace in the comments
>> of that function.  All of those need to be reverted before we can
>> consider committing this.
>
> I changed the arguments of the function from char * to text * in order to
> avoid a text_to_cstring call.

*scratches head* Aren't you just moving the same call to a different place?

> The same benefit can be achieved by replacing char * with char * plus a
> length.
> I changed !m to m == 0 because Itagaki asked me to make it conform to the
> coding style. Do you think there is no reason to fix coding style in
> existing code?

Yeah, we usually try to avoid changing that sort of thing in existing
code, unless there's a very good reason.

>> There is a huge amount of duplicated code here.  I think it makes
>> sense to have the multibyte version of the function be separate, but
>> perhaps we can merge the less-than-or-equal-to bits into the main
>> code, so that we only have two copies instead of four.  Perhaps we
>> can just add a max_d argument to levenshtein_internal; if this value
>> is >= 0 then it represents the maximum allowable distance, but if it
>> is < 0 then there is no limit.  Sure, that might slow down the
>> existing code a bit, but it might not be significant.
>> I'd at least like to see some numbers showing that it is significant
>> before we go to this much trouble.
>
> In that case we would have to add many checks of max_d in the
> levenshtein_internal function, which would make the code more complex.

When you say "many" checks, how many?
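
For concreteness, here's a minimal single-byte sketch of what I mean (the
function name is hypothetical, unit costs are assumed, and inputs are capped
at 255 bytes just to keep the sketch short): max_d >= 0 caps the distance
and max_d < 0 means "no limit"; the row-minimum test is the sort of extra
check the merged function would need.

#include <string.h>

static int
levenshtein_capped(const char *s, const char *t, int max_d)
{
    int         m = (int) strlen(s);
    int         n = (int) strlen(t);
    int         prev[256];
    int         curr[256];
    int         i,
                j;

    if (m > 255 || n > 255)
        return -1;              /* out of scope for this sketch */

    /* Row 0: deleting the first i characters of s costs i. */
    for (i = 0; i <= m; i++)
        prev[i] = i;

    for (j = 1; j <= n; j++)
    {
        int         row_min;

        curr[0] = j;
        row_min = j;
        for (i = 1; i <= m; i++)
        {
            int         del = prev[i] + 1;
            int         ins = curr[i - 1] + 1;
            int         sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

            curr[i] = del < ins ? (del < sub ? del : sub)
                                : (ins < sub ? ins : sub);
            if (curr[i] < row_min)
                row_min = curr[i];
        }

        /*
         * The extra per-row check: once every cell in a row exceeds max_d,
         * the final distance must exceed it too, so bail out early.
         */
        if (max_d >= 0 && row_min > max_d)
            return max_d + 1;

        memcpy(prev, curr, (m + 1) * sizeof(int));
    }
    return prev[m];
}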

> Actually, we can merge all four functions into one function. But such a
> function will have many checks for multibyte encoding and max_d. So, I see
> four cases here:
> 1) one function with checks for multibyte encoding and max_d
> 2) two functions with checks for multibyte encoding
> 3) two functions with checks for max_d
> 4) four separate functions
> If you prefer case number 3, you should argue your position a little more.

I'm somewhat convinced that separating the multibyte case out has a
performance benefit, both by intuition and because you posted some
numbers, but I haven't seen any argument for separating out the other
case, so I'm asking whether you've checked, whether there is an effect,
and whether it's significant.  The default is always to try to avoid
maintaining multiple copies of substantially identical code, due to
the danger that a future patch might fail to update all of them and
thus introduce a bug.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Alexander Korotkov on
On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:

> *scratches head* Aren't you just moving the same call to a different
> place?
>
So, where can you find this different place? :) In this patch
null-terminated strings are not used at all.
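
To make the convention concrete, here is a minimal sketch of a wrapper (the
wrapper itself is hypothetical, and unit costs are assumed): the byte pointer
and length are taken straight from the text datum with the ordinary varlena
macros, so text_to_cstring, with its extra palloc and copy, is never called.

#include "postgres.h"
#include "fmgr.h"

Datum
levenshtein_less_equal(PG_FUNCTION_ARGS)
{
    text       *src = PG_GETARG_TEXT_PP(0);
    text       *dst = PG_GETARG_TEXT_PP(1);
    int         max_d = PG_GETARG_INT32(2);

    /* Pass pointer + length; no NUL-terminated copy is ever made. */
    PG_RETURN_INT32(levenshtein_less_equal_internal_mb(
                        VARDATA_ANY(src), VARDATA_ANY(dst),
                        VARSIZE_ANY_EXHDR(src), VARSIZE_ANY_EXHDR(dst),
                        1, 1, 1, max_d));
}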


> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.
>
Ok.

> > In that case we would have to add many checks of max_d in the
> > levenshtein_internal function, which would make the code more complex.
>
> When you say "many" checks, how many?
>
> > Actually, we can merge all four functions into one function. But such a
> > function will have many checks for multibyte encoding and max_d. So, I
> > see four cases here:
> > 1) one function with checks for multibyte encoding and max_d
> > 2) two functions with checks for multibyte encoding
> > 3) two functions with checks for max_d
> > 4) four separate functions
> > If you prefer case number 3, you should argue your position a little
> > more.
>
> I'm somewhat convinced that separating the multibyte case out has a
> performance benefit, both by intuition and because you posted some
> numbers, but I haven't seen any argument for separating out the other
> case, so I'm asking whether you've checked, whether there is an effect,
> and whether it's significant.  The default is always to try to avoid
> maintaining multiple copies of substantially identical code, due to
> the danger that a future patch might fail to update all of them and
> thus introduce a bug.
>

I've tested it with a big value of max_d, and I thought it was evident that
checking for a negative value of max_d would not produce a significant
benefit. Anyway, I tried adding a check for negative max_d to the
levenshtein_less_equal_internal_mb function:

static int
levenshtein_less_equal_internal_mb(char *s, char *t, int s_len, int t_len,
                                   int ins_c, int del_c, int sub_c, int max_d)
{
    int         m,
                n;
    int        *prev;
    int        *curr;
    int         i,
                j;
    const char *x;
    const char *y;
    CharLengthAndOffset *lengths_and_offsets;
    int         y_char_len;
    int         curr_left,
                curr_right,
                prev_left,
                prev_right,
                d;
    int         delta,
                min_d;

    /*
     * We have to calculate the number of characters for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(s, s_len);
    n = pg_mbstrlen_with_len(t, t_len);

    /*
     * We can transform an empty s into t with n insertions, or a non-empty s
     * into an empty t with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * We can find the minimal possible distance from the difference of the
     * lengths.
     */
    delta = m - n;
    if (delta > 0)
        min_d = delta * del_c;
    else if (delta < 0)
        min_d = -delta * ins_c;
    else
        min_d = 0;
    if (max_d >= 0 && min_d > max_d)
        return max_d + 1;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     *
     * For multibyte encodings we'll also store an array of character lengths
     * and character offsets into the first string, in order to avoid a great
     * number of pg_mblen calls.
     */
    prev = (int *) palloc((2 * sizeof(int) + sizeof(CharLengthAndOffset)) * m);
    curr = prev + m;
    lengths_and_offsets = (CharLengthAndOffset *) (prev + 2 * m);
    lengths_and_offsets[0].offset = 0;
    for (i = 0, x = s; i < m - 1; i++)
    {
        lengths_and_offsets[i].length = pg_mblen(x);
        lengths_and_offsets[i + 1].offset = lengths_and_offsets[i].offset +
            lengths_and_offsets[i].length;
        x += lengths_and_offsets[i].length;
    }
    lengths_and_offsets[i].length = 0;

    /* Initialize the "previous" row to 0..cols */
    curr_left = 1;
    d = min_d;
    for (i = 0; i < delta; i++)
    {
        prev[i] = d;
    }
    curr_right = m;
    for (; i < m; i++)
    {
        prev[i] = d;
        d += (ins_c + del_c);
        if (max_d >= 0 && d > max_d)
        {
            curr_right = i;
            break;
        }
    }

    /*
     * The following optimizations are used:
     *
     * 1) The cells of the matrix actually hold the minimal possible value
     * of the final distance (in the case of all possible matches). When
     * moving towards the diagonal that contains the last cell, a value
     * should be increased by ins_c + del_c; when moving away from that
     * diagonal, it should not be increased.
     *
     * 2) [prev_left, prev_right] is the range of indexes in the previous
     * row where the stored values are not greater than max_d, so only the
     * cells connected with this range need to be calculated.
     *
     * For multibyte encodings we advance the x and y pointers by the result
     * of pg_mblen, and we use char_cmp for character comparison.
     */
    for (y = t, j = 1; j < n; y += y_char_len, j++)
    {
        int        *temp;

        y_char_len = pg_mblen(y);

        if (max_d >= 0)
        {
            prev_left = curr_left;
            prev_right = curr_right;
            curr_left = -1;
        }
        else
        {
            prev_left = 1;
        }

        if (delta >= 0)
            curr[0] = min_d + j * (ins_c + del_c);
        else
        {
            if (j <= -delta)
                curr[0] = min_d;
            else
                curr[0] = min_d + (j + delta) * (ins_c + del_c);
        }

        for (i = prev_left, x = s + lengths_and_offsets[i - 1].offset;
             i < m;
             x += lengths_and_offsets[i - 1].length, i++)
        {
            if (max_d >= 0)
            {
                d = max_d + 1;

                if (i <= prev_right)
                {
                    d = Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0), d);
                }

                if (i == 1 || i > prev_left)
                {
                    d = Min(curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0), d);
                    d = Min(prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                    y, y_char_len) ? 0 : sub_c), d);
                }

                curr[i] = d;

                if (curr_left == -1)
                    curr_left = i;
                curr_right = i;
                if (i > prev_right)
                    break;
            }
            else
            {
                d = Min(Min(prev[i] + ((j + delta > i) ? (ins_c + del_c) : 0),
                            curr[i - 1] + ((i - delta > j) ? (ins_c + del_c) : 0)),
                        prev[i - 1] + (char_cmp(x, lengths_and_offsets[i - 1].length,
                                                y, y_char_len) ? 0 : sub_c));
                curr[i] = d;
            }
        }

        if (curr_left == -1)
            break;

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final value was swapped from the previous row to the
     * current row, that's where we'll find it.
     */
    d = prev[m - 1];

    /*
     * If the last cell wasn't filled, return max_d + 1; otherwise return
     * the final value.
     */
    if (max_d >= 0 && (curr_left == -1 || curr_right < m - 1))
        return max_d + 1;
    else
        return d;
}

I tested it with an american-english dictionary containing 98569 words.

test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
sum
---------
1074376
(1 row)

Time: 131,435 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
words;
sum
---------
1074376
(1 row)

Time: 221,078 ms
test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
words;
sum
---------
1074376
(1 row)

Time: 254,819 ms

The function with a negative value of max_d didn't turn out to be faster
than with just a big value of max_d.

----
With best regards,
Alexander Korotkov.
From: Alvaro Herrera on
Excerpts from Robert Haas's message of Wed Jul 21 14:25:47 -0400 2010:
> On Wed, Jul 21, 2010 at 7:40 AM, Alexander Korotkov
> <aekorotkov(a)gmail.com> wrote:
> > On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:

> > The same benefit can be achieved by replacing char * with char * plus a
> > length.
> > I changed !m to m == 0 because Itagaki asked me to make it conform to the
> > coding style. Do you think there is no reason to fix coding style in
> > existing code?
>
> Yeah, we usually try to avoid changing that sort of thing in existing
> code, unless there's a very good reason.

I think fixing a stylistic issue in code that's being edited for other
purposes is fine, and a good idea going forward. We wouldn't commit a
patch that would *only* fix those, because that would cause a problem
for backpatches for no benefit, but if the patch touches something else,
then a backpatch of another patch is going to need manual intervention
anyway.


From: Robert Haas on
On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
<aekorotkov(a)gmail.com> wrote:
> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>
>> *scratches head* Aren't you just moving the same call to a different
>> place?
>
> So, where can you find this different place? :) In this patch
> null-terminated strings are not used at all.

I can't. You win. :-)

Actually, I wonder if there's enough performance improvement there
that we might think about extracting that part of the patch and applying
it separately. Then we could continue trying to figure out what to do
with the rest. Sometimes it's simpler to deal with one change at a time.

> I tested it with an american-english dictionary containing 98569 words.
>
> test=# select sum(levenshtein(word, 'qwerqwerqwer')) from words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 131,435 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',100)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 221,078 ms
> test=# select sum(levenshtein_less_equal(word, 'qwerqwerqwer',-1)) from
> words;
>    sum
> ---------
>  1074376
> (1 row)
>
> Time: 254,819 ms
>
> The function with a negative value of max_d didn't turn out to be faster
> than with just a big value of max_d.

Ah, I see. That's pretty compelling, I guess. Although it still
seems like a lot of code...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company


From: Robert Haas on
On Thu, Jul 22, 2010 at 3:21 AM, Alexander Korotkov
<aekorotkov(a)gmail.com> wrote:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>
>> Ah, I see.  That's pretty compelling, I guess.  Although it still
>> seems like a lot of code...
>
> I think there is a way to merge the single-byte and multi-byte versions of
> the functions without loss of performance, using macros and includes (as in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

I'm not sure. I'd like to get a second opinion from someone else on
which way to go with this, but unfortunately 2 or 3 of the main
committers are on vacation at the moment.

Does anyone else have an opinion on what to do about this?
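
For anyone who hasn't looked at like.c: the pattern Alexander is describing
compiles one function body twice with different macros, roughly like this
(the file and macro names below are made up for illustration, and the real
body would be the levenshtein routine rather than this toy character
counter):

/* levenshtein_body.c: included twice from levenshtein.c */
static int
LEV_FUNC(const char *s, int s_len)
{
    int         chars = 0;

    while (s_len > 0)
    {
        int         w = LEV_CHAR_LEN(s);    /* 1, or pg_mblen(s) */

        s += w;
        s_len -= w;
        chars++;
    }
    return chars;
}

/* levenshtein.c: instantiate a single-byte and a multibyte version */
#define LEV_FUNC        lev_strlen_sb
#define LEV_CHAR_LEN(p) 1
#include "levenshtein_body.c"
#undef LEV_FUNC
#undef LEV_CHAR_LEN

#define LEV_FUNC        lev_strlen_mb
#define LEV_CHAR_LEN(p) pg_mblen(p)
#include "levenshtein_body.c"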

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
