From: Alexander Korotkov on
Such a version with macros and includes could look like this:

#ifdef MULTIBYTE
#define NEXT_X (x += char_lens[i - 1])
#define NEXT_Y (y += y_char_len)
#define CMP    (char_cmp(x, char_lens[i - 1], y, y_char_len))
#else
#define NEXT_X (x++)
#define NEXT_Y (y++)
#define CMP    (*x == *y)
#endif

static int
levenshtein_internal(text *s, text *t, int ins_c, int del_c, int sub_c)
{
    int         m,
                n,
                d;
    int        *prev;
    int        *curr;
    int         i,
                j;
    const char *x;
    const char *y;
#ifdef MULTIBYTE
    int        *char_lens;
    int         y_char_len;
#endif

    /*
     * We have to count characters, not bytes, for multibyte encodings.
     */
    m = pg_mbstrlen_with_len(VARDATA_ANY(s), VARSIZE_ANY_EXHDR(s));
    n = pg_mbstrlen_with_len(VARDATA_ANY(t), VARSIZE_ANY_EXHDR(t));

    /*
     * We can transform an empty s into t with n insertions, or a non-empty t
     * into an empty s with m deletions.
     */
    if (m == 0)
        return n * ins_c;
    if (n == 0)
        return m * del_c;

    /*
     * For security concerns, restrict excessive CPU+RAM usage. (This
     * implementation uses O(m) memory and has O(mn) complexity.)
     */
    if (m > MAX_LEVENSHTEIN_STRLEN ||
        n > MAX_LEVENSHTEIN_STRLEN)
        ereport(ERROR,
                (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
                 errmsg("argument exceeds the maximum length of %d bytes",
                        MAX_LEVENSHTEIN_STRLEN)));

    /* One more cell for the initialization column and row. */
    ++m;
    ++n;

    /*
     * Instead of building an (m+1)x(n+1) array, we'll use two different
     * arrays of size m+1 for storing accumulated values. At each step one
     * represents the "previous" row and one is the "current" row of the
     * notional large array.
     *
     * For multibyte encodings we also store the byte length of each
     * character of the first string, to avoid a great number of pg_mblen
     * calls.
     */
#ifdef MULTIBYTE
    prev = (int *) palloc(3 * m * sizeof(int));
    curr = prev + m;
    char_lens = prev + 2 * m;
    for (i = 0, x = VARDATA_ANY(s); i < m - 1; i++)
    {
        char_lens[i] = pg_mblen(x);
        x += char_lens[i];
    }
    char_lens[i] = 0;
#else
    prev = (int *) palloc(2 * m * sizeof(int));
    curr = prev + m;
#endif

    /* Initialize the "previous" row to 0..cols */
    for (i = 0; i < m; i++)
        prev[i] = i * del_c;

    /*
     * For multibyte encodings, NEXT_X and NEXT_Y advance the x and y
     * pointers by the character lengths obtained from pg_mblen, and the CMP
     * macro compares whole (possibly multibyte) characters.
     */
    for (y = VARDATA_ANY(t), j = 1; j < n; NEXT_Y, j++)
    {
        int        *temp;

#ifdef MULTIBYTE
        y_char_len = pg_mblen(y);
#endif

        curr[0] = j * ins_c;

        for (x = VARDATA_ANY(s), i = 1; i < m; NEXT_X, i++)
        {
            int         ins;
            int         del;
            int         sub;

            ins = prev[i] + ins_c;
            del = curr[i - 1] + del_c;
            sub = prev[i - 1] + (CMP ? 0 : sub_c);
            curr[i] = Min(ins, del);
            curr[i] = Min(curr[i], sub);
        }

        temp = curr;
        curr = prev;
        prev = temp;
    }

    /*
     * Because the final row was swapped into "prev" by the last exchange
     * above, that's where we'll find the result.
     */
    d = prev[m - 1];

    return d;
}
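
For completeness, here is roughly how a SQL-callable entry point would invoke it with fuzzystrmatch's default costs of 1 for insertion, deletion, and substitution (a sketch only; the actual wrapper in the patch may differ):

PG_FUNCTION_INFO_V1(levenshtein);

Datum
levenshtein(PG_FUNCTION_ARGS)
{
    /* Packed getters pair with the VARDATA_ANY access used above. */
    text       *src = PG_GETARG_TEXT_PP(0);
    text       *dst = PG_GETARG_TEXT_PP(1);

    PG_RETURN_INT32(levenshtein_internal(src, dst, 1, 1, 1));
}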

What do you think about it?

----
With best regards,
Alexander Korotkov.
From: Alexander Korotkov on
On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:

> Ah, I see. That's pretty compelling, I guess. Although it still
> seems like a lot of code...
>
I think there is a way to merge single-byte and multi-byte versions of
functions without loss in performance using macros and includes (like in
'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

----
With best regards,
Alexander Korotkov.
From: Alvaro Herrera on
Excerpts from Alexander Korotkov's message of Thu Jul 22 03:21:57 -0400 2010:
> On Thu, Jul 22, 2010 at 1:59 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>
> > Ah, I see. That's pretty compelling, I guess. Although it still
> > seems like a lot of code...
> >
> I think there is a way to merge single-byte and multi-byte versions of
> functions without loss in performance using macros and includes (like in
> 'backend/utils/adt/like.c'). Do you think it is acceptable in this case?

Using the trick of a single source file, similar to like_match.c, that
implements all the different behaviors, and including that file more than
once with different macro definitions, is probably a good idea.
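
For reference, a minimal sketch of how that trick could look here (the template file name, the function names, and the dispatch helper are hypothetical, following the like.c/like_match.c convention):

/* In the main source file: include the template twice, with different
 * macro definitions each time, to instantiate both behaviors. */

/* Multibyte variant */
#define MULTIBYTE
#define LEVENSHTEIN_INTERNAL levenshtein_internal_mb
#include "levenshtein_internal.c"
#undef MULTIBYTE
#undef LEVENSHTEIN_INTERNAL

/* Single-byte variant */
#define LEVENSHTEIN_INTERNAL levenshtein_internal_sb
#include "levenshtein_internal.c"
#undef LEVENSHTEIN_INTERNAL

/* Callers then dispatch at runtime on the database encoding. */
static int
levenshtein_dispatch(text *s, text *t, int ins_c, int del_c, int sub_c)
{
    if (pg_database_encoding_max_length() > 1)
        return levenshtein_internal_mb(s, t, ins_c, del_c, sub_c);
    else
        return levenshtein_internal_sb(s, t, ins_c, del_c, sub_c);
}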

From: Alexander Korotkov on
Here is a new version of my patch. It contains the following changes:

1) I've merged the single-byte and multibyte versions of levenshtein_internal
and levenshtein_less_equal_internal using macros and includes.
2) I found that levenshtein takes a reasonable time even for long strings.
Here is an example with strings that contain random combinations of words.

test=# select count(*) from words;
count
-------
98569
(1 row)

test=# select * from words limit 10;
  id   |                                        word                                          | next_id
-------+--------------------------------------------------------------------------------------+---------
   200 | Agnew diodes drilled composts Wassermann nonprofit adjusting Chautauqua             |   17370
  4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither                     |   70983
  5418 | Eroses gauchos bowls smellier weeklies tormentors cornmeal punctuality              |   96685
  6103 | G prompt boron overthrew cricking begonia snuffers blazed                           |   34518
  4707 | Djibouti Mari pencilling approves pointing's pashas magpie rams                     |   71200
 10125 | Lyle Edgardo livers gruelled capable cancels gnawing's inhospitable                 |   28018
  9615 | Leos deputy evener yogis assault palatals Octobers surceased                        |   21878
 15640 | Starr's arrests malapropism Koufax's aesthetes Howell rustier Algeria               |   19316
 15776 | Sucre battle's misapprehension's predicating nutmegged electrification bowl planks  |   65739
 16878 | Updike Forster ragout keenly exhalation grayed switch guava's                       |   43013
(10 rows)

Time: 26,125 ms

Time: 137,061 ms
test=# select avg(length(word)) from words;
         avg
---------------------
 74.6052207083362923
(1 row)

Time: 96,415 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2957,426 ms
test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',10)<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 84,755 ms

levenshtein_less_equal takes O(max(m, n) * max_d / min(ins_c, del_c)) time in
the worst case (see the banded-computation sketch after point 4 below). That's
why I changed the limitation of this function to max_d <= 255 * min(ins_c, del_c).

3) I found that there is no need to store character offsets in
levenshtein_less_equal_internal_mb. Instead, it is enough to store the first
position in the string where the distance value was not greater than max_d.

4) I found that there is a theoretical maximum distance based on the string
lengths. It represents the case when no characters match. If the theoretical
maximum distance is less than the given maximum distance, we can use the
theoretical maximum instead (a sketch of this computation follows the timings
below). This improves the behavior of levenshtein_less_equal with high max_d,
but it's still slower than levenshtein:

test=# select * from words where levenshtein_less_equal(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher',1000)<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 4102,731 ms
test=# select * from words where levenshtein(word, 'Dhaka craziness savvies teeae luhs Barentss unwe zher')<=10;
  id  |                              word                               | next_id
------+-----------------------------------------------------------------+---------
 4589 | Dhaka craziness savvies teenager ploughs Barents's unwed zither |   70983
(1 row)

Time: 2983,244 ms
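
To illustrate the complexity claim from point 2: a minimal sketch of the
banded idea, not the patch code itself. It assumes every step off the matrix
diagonal costs at least min(ins_c, del_c), so cells farther than
max_d / min(ins_c, del_c) from the diagonal can never stay within max_d
(variable names follow levenshtein_internal):

/*
 * Each cell (i, j) is at least |i - j| insert/delete steps off the
 * diagonal, each step costing at least min(ins_c, del_c), so cells
 * outside the band can never end up <= max_d and need not be computed.
 * That bounds the work by O(max(m, n) * max_d / min(ins_c, del_c)).
 */
int     band = max_d / Min(ins_c, del_c);

for (j = 1; j < n; j++)
{
    int     i_start = Max(1, j - band);
    int     i_stop = Min(m - 1, j + band);

    for (i = i_start; i <= i_stop; i++)
    {
        /* same ins/del/sub recurrence as in levenshtein_internal */
    }
}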
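
And a sketch of the theoretical maximum distance from point 4: when no
characters match, the cheapest transformation substitutes along the shorter
string and inserts or deletes the overhang. This sketch assumes
sub_c <= ins_c + del_c; otherwise a substitution would itself be cheaper done
as an insert plus a delete:

/*
 * Sketch only: transform s (m chars) into t (n chars) with no matching
 * characters by substituting min(m, n) characters and then inserting or
 * deleting the remaining |m - n|.
 */
int     theor_max_d;

if (m > n)
    theor_max_d = n * sub_c + (m - n) * del_c;
else
    theor_max_d = m * sub_c + (n - m) * ins_c;

/* Clamp the user-supplied bound to what is actually achievable. */
max_d = Min(max_d, theor_max_d);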

----
With best regards,
Alexander Korotkov.
From: Robert Haas on
On Wed, Jul 21, 2010 at 5:59 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> On Wed, Jul 21, 2010 at 2:47 PM, Alexander Korotkov
> <aekorotkov(a)gmail.com> wrote:
>> On Wed, Jul 21, 2010 at 10:25 PM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>>>
>>> *scratches head* Aren't you just moving the same call to a different
>>> place?
>>
>> So, where can you find this different place? :) In this patch
>> null-terminated strings are not used at all.
>
> I can't. You win. :-)
>
> Actually, I wonder if there's enough performance improvement there
> that we might think about extracting that part of the patch and applying
> it separately. Then we could continue trying to figure out what to do
> with the rest. Sometimes it's simpler to deal with one change at a
> time.

I tested this today and the answer was a resounding yes. I ran
sum(levenshtein(t, 'foo')) over a dictionary file with about 2 million
words and got a speedup of around 15% just by eliminating the
text_to_cstring() calls. So I've committed that part of this patch.
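
For context, a hypothetical before/after of that change (simplified, with
illustrative variable names only; the committed code differs in detail):

/* Before: each call pallocs a NUL-terminated copy of every argument */
const char *s_before = text_to_cstring(s);
int         s_len_before = strlen(s_before);

/* After: read the varlena payload in place and track the byte length
 * explicitly, avoiding the palloc and copy on every call */
const char *s_after = VARDATA_ANY(s);
int         s_len_after = VARSIZE_ANY_EXHDR(s);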

I'll try to look at the rest of the patch when I get a chance, but I'm
wondering if it might make sense to split it into two patches -
specifically, one patch to handle multi-byte characters correctly, and
then a second patch for the less-than-or-equal-to functions. I think
that might simplify reviewing a bit.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
