From: Itagaki Takahiro on 12 Jul 2010 00:46

Hi,

I'm reviewing the "Multibyte charater set in levenshtein function" patch.
https://commitfest.postgresql.org/action/patch_view?id=304

The main logic seems to be good, but I have some comments about the coding
style and refactoring.

* levenshtein_internal() and levenshtein_less_equal_internal() are very
  similar. Can you merge the code? We could always use less_equal_internal()
  if the overhead is ignorable. Did you compare them?

* There are many "if (!multibyte_encoding)" checks in levenshtein_internal().
  How about splitting the function into two functions, one for single-byte
  characters and one for multi-byte characters (e.g. levenshtein_internal_mb())?
  Or we could always use the multi-byte version if the overhead is small.

* I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
  numbers to me. Could you name the entries with the definition of a struct?

    /*
     * For multibyte encoding we'll also store array of lengths of
     * characters and array with character offsets in first string
     * in order to avoid great number of pg_mblen calls.
     */
    prev = (int *) palloc(4 * m * sizeof(int));

* There are some compiler warnings. Avoid them if possible.

    fuzzystrmatch.c: In function 'levenshtein_less_equal_internal':
    fuzzystrmatch.c:428: warning: 'char_lens' may be used uninitialized in this function
    fuzzystrmatch.c:428: warning: 'offsets' may be used uninitialized in this function
    fuzzystrmatch.c:430: warning: 'curr_right' may be used uninitialized in this function
    fuzzystrmatch.c: In function 'levenshtein_internal':
    fuzzystrmatch.c:222: warning: 'char_lens' may be used uninitialized in this function

* Coding style: use "if (m == 0)" instead of "if (!m)" when the type of 'm'
  is an integer.

* The caution in the docs needs to be fixed:
  http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
  | Caution: At present, fuzzystrmatch does not work well with
  | multi-byte encodings (such as UTF-8).
  but now levenshtein supports multi-byte encodings! We should mention which
  functions support multibyte characters so as not to confuse users.

* (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P,
  VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
  Unpacking versions make the core a bit faster.

--
Itagaki Takahiro
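(For context: the packed-datum access pattern suggested in the last point
above looks roughly like the sketch below. The wrapper shape and the internal
helper's pointer-plus-length signature are assumptions for illustration, not
the patch's actual code; only the PG_GETARG_TEXT_PP, VARDATA_ANY and
VARSIZE_ANY_EXHDR macros themselves are the real API.)

    #include "postgres.h"
    #include "fmgr.h"

    /* hypothetical helper taking byte pointers plus lengths */
    static int levenshtein_internal(const char *s, int s_bytes,
                                    const char *t, int t_bytes);

    Datum
    levenshtein(PG_FUNCTION_ARGS)
    {
        /* _PP detoasts if needed but leaves short (1-byte) headers packed */
        text       *src = PG_GETARG_TEXT_PP(0);
        text       *dst = PG_GETARG_TEXT_PP(1);

        /* the _ANY accessors work on both packed and unpacked varlenas */
        const char *s = VARDATA_ANY(src);
        const char *t = VARDATA_ANY(dst);
        int         s_bytes = VARSIZE_ANY_EXHDR(src);
        int         t_bytes = VARSIZE_ANY_EXHDR(dst);

        PG_RETURN_INT32(levenshtein_internal(s, s_bytes, t, t_bytes));
    }

The saving comes from skipping the copy that PG_GETARG_TEXT_P would make to
give short-header values a full 4-byte header.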
From: Alexander Korotkov on 13 Jul 2010 03:04

Hi!

> * levenshtein_internal() and levenshtein_less_equal_internal() are very
>   similar. Can you merge the code? We could always use less_equal_internal()
>   if the overhead is ignorable. Did you compare them?

With a big value of max_d the overhead is significant. Here is an example on
the american-english dictionary from OpenOffice.

test=# select sum(levenshtein('qweqweqweqweqwe',word)) from words;
   sum
---------
 1386456
(1 row)

Time: 195,083 ms

test=# select sum(levenshtein_less_equal('qweqweqweqweqwe',word,100)) from words;
   sum
---------
 1386456
(1 row)

Time: 317,821 ms

> * There are many "if (!multibyte_encoding)" checks in levenshtein_internal().
>   How about splitting the function into two functions, one for single-byte
>   characters and one for multi-byte characters (e.g. levenshtein_internal_mb())?
>   Or we could always use the multi-byte version if the overhead is small.

The multi-byte version was about 4 times slower. But I have rewritten my
CHAR_CMP macro as an inline function, and now it's only about 1.5 times
slower.

In a database with multi-byte encoding:

test=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 136,742 ms

In a database with single-byte encoding:

test2=# select * from words where levenshtein('qweqweqwe',word)<=5;
  id   |   word
-------+----------
 69053 | peewee
 69781 | pewee
 81279 | sequence
 88421 | sweetie
(4 rows)

Time: 88,471 ms

Anyway, I think the overhead is not ignorable. That's why I have split
levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
levenshtein_less_equal_internal_mb.

> * I prefer a struct rather than an array. "4 * m" and "3 * m" look like magic
>   numbers to me. Could you name the entries with the definition of a struct?
>
>     /*
>      * For multibyte encoding we'll also store array of lengths of
>      * characters and array with character offsets in first string
>      * in order to avoid great number of pg_mblen calls.
>      */
>     prev = (int *) palloc(4 * m * sizeof(int));

In this line of code the memory is allocated for 4 arrays: prev, curr,
offsets, char_lens. So I have joined offsets and char_lens into a struct.
But I can't join prev and curr because of this trick:

    temp = curr;
    curr = prev;
    prev = temp;

> * There are some compiler warnings. Avoid them if possible.
>
>     fuzzystrmatch.c: In function 'levenshtein_less_equal_internal':
>     fuzzystrmatch.c:428: warning: 'char_lens' may be used uninitialized in this function
>     fuzzystrmatch.c:428: warning: 'offsets' may be used uninitialized in this function
>     fuzzystrmatch.c:430: warning: 'curr_right' may be used uninitialized in this function
>     fuzzystrmatch.c: In function 'levenshtein_internal':
>     fuzzystrmatch.c:222: warning: 'char_lens' may be used uninitialized in this function

Fixed.

> * Coding style: use "if (m == 0)" instead of "if (!m)" when the type of 'm'
>   is an integer.

Fixed.

> * The caution in the docs needs to be fixed:
>   http://developer.postgresql.org/pgdocs/postgres/fuzzystrmatch.html
>   | Caution: At present, fuzzystrmatch does not work well with
>   | multi-byte encodings (such as UTF-8).
>   but now levenshtein supports multi-byte encodings! We should mention which
>   functions support multibyte characters so as not to confuse users.

I've updated this notice. Also, I've added documentation for the
levenshtein_less_equal function.

> * (Not an issue for the patch, but...) Could you rewrite PG_GETARG_TEXT_P,
>   VARDATA, and VARSIZE to PG_GETARG_TEXT_PP, VARDATA_ANY, and VARSIZE_ANY_EXHDR?
>   Unpacking versions make the core a bit faster.

Fixed.

With best regards,
Alexander Korotkov.
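(For readers following the CHAR_CMP point above: the kind of inline helper
described there might look roughly like this. The name and signature are
assumptions for illustration, not taken from the patch.)

    #include <stdbool.h>
    #include <string.h>

    /*
     * Compare two possibly multi-byte characters given their start pointers
     * and byte lengths.  Unlike a macro, a static inline function evaluates
     * its arguments exactly once while still letting the compiler inline it.
     */
    static inline bool
    char_same(const char *a, int a_len, const char *b, int b_len)
    {
        if (a_len != b_len)
            return false;
        return memcmp(a, b, a_len) == 0;
    }

With the character byte lengths cached up front (the pg_mblen results
mentioned in the quoted comment), each cell of the distance matrix can call
this instead of re-scanning the strings.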
From: Itagaki Takahiro on 20 Jul 2010 03:37

2010/7/13 Alexander Korotkov <aekorotkov(a)gmail.com>:
> Anyway, I think the overhead is not ignorable. That's why I have split
> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
> levenshtein_less_equal_internal_mb.

Thank you for the good measurements! Then it's reasonable to have multiple
implementations. It also has documentation. I'll change the status of the
patch to "Ready for Committer".

The patch is good enough except for the argument types of some functions.
For example:
  - char* vs. const char*
  - text* vs. const char* + length
I hope committers will check whether there are better types for them.

--
Itagaki Takahiro
From: Robert Haas on 20 Jul 2010 21:54

On Tue, Jul 20, 2010 at 3:37 AM, Itagaki Takahiro
<itagaki.takahiro(a)gmail.com> wrote:
> 2010/7/13 Alexander Korotkov <aekorotkov(a)gmail.com>:
>> Anyway, I think the overhead is not ignorable. That's why I have split
>> levenshtein_internal into levenshtein_internal and levenshtein_internal_mb,
>> and levenshtein_less_equal_internal into levenshtein_less_equal_internal and
>> levenshtein_less_equal_internal_mb.
>
> Thank you for the good measurements! Then it's reasonable to have multiple
> implementations. It also has documentation. I'll change the status of the
> patch to "Ready for Committer".
>
> The patch is good enough except for the argument types of some functions.
> For example:
>   - char* vs. const char*
>   - text* vs. const char* + length
> I hope committers will check whether there are better types for them.

This patch still needs some work. It includes a bunch of stylistic changes
that aren't relevant to the purpose of the patch. There's no reason that I
can see to change the existing levenshtein_internal function to take text
arguments instead of char *, or to change !m to m == 0 in existing code, or
to change the whitespace in the comments of that function. All of those need
to be reverted before we can consider committing this.

There is a huge amount of duplicated code here. I think it makes sense to
have the multibyte version of the function be separate, but perhaps we can
merge the less-than-or-equal-to bits into the main code, so that we only
have two copies instead of four. Perhaps we can just add a max_d argument to
levenshtein_internal; if this value is >= 0 then it represents the maximum
allowable distance, but if it is < 0 then there is no limit. Sure, that
might slow down the existing code a bit, but it might not be significant.
I'd at least like to see some numbers showing that it is significant before
we go to this much trouble.

The code doesn't follow the project coding style. Braces must be uncuddled.
Comment paragraphs will be reflowed unless they begin and end with ------.
Function definitions should have the type declaration on one line and the
function name at the start of the next.

Freeing memory with pfree is likely a waste of time; is there any reason not
to just rely on the memory context reset, as the original coding did?

I think we might need to remove the acknowledgments section from this code.
If everyone who touches this code adds their name, we're quickly going to
have a mess. If we're not going to remove the acknowledgments section, then
please add my name, too, because I've already patched this code once...

I'm going to set this back to "Waiting on Author".

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
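(A rough sketch of the merge Robert is describing, written as standalone
single-byte C rather than the patch's actual code; the function and variable
names are illustrative, and the real functions' cost parameters and
multibyte handling are omitted. A single internal routine takes a max_d cap,
where max_d < 0 means "no limit".)

    #include <stdlib.h>
    #include <string.h>

    static int
    levenshtein_capped_sketch(const char *s, const char *t, int max_d)
    {
        int     m = (int) strlen(s) + 1;
        int     n = (int) strlen(t) + 1;
        int    *prev = malloc(m * sizeof(int));
        int    *curr = malloc(m * sizeof(int));
        int     i, j, result;

        for (i = 0; i < m; i++)
            prev[i] = i;            /* distance from the empty prefix of t */

        for (j = 1; j < n; j++)
        {
            int    *temp;
            int     row_min;

            curr[0] = j;
            row_min = curr[0];
            for (i = 1; i < m; i++)
            {
                int     ins = prev[i] + 1;
                int     del = curr[i - 1] + 1;
                int     sub = prev[i - 1] + (s[i - 1] == t[j - 1] ? 0 : 1);

                curr[i] = ins < del ? ins : del;
                if (sub < curr[i])
                    curr[i] = sub;
                if (curr[i] < row_min)
                    row_min = curr[i];
            }

            /*
             * Row minima never decrease, so once every cell in this row
             * exceeds the cap the final distance must exceed it too; report
             * "greater than max_d" and stop early.
             */
            if (max_d >= 0 && row_min > max_d)
            {
                free(prev);
                free(curr);
                return max_d + 1;
            }

            /* the row-swap trick: the current row becomes the previous row */
            temp = curr;
            curr = prev;
            prev = temp;
        }

        result = prev[m - 1];       /* after the final swap the answer is in prev */
        free(prev);
        free(curr);
        return result;
    }

In the unlimited case (max_d < 0) the only extra work in this sketch is one
comparison per cell and one test per row, which is exactly the overhead
Robert wants to see measured before accepting the merge.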
From: Alexander Korotkov on 21 Jul 2010 07:40

On Wed, Jul 21, 2010 at 5:54 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> This patch still needs some work. It includes a bunch of stylistic changes
> that aren't relevant to the purpose of the patch. There's no reason that I
> can see to change the existing levenshtein_internal function to take text
> arguments instead of char *, or to change !m to m == 0 in existing code, or
> to change the whitespace in the comments of that function. All of those
> need to be reverted before we can consider committing this.

I changed the arguments of the function from char * to text * in order to
avoid the text_to_cstring call. The same benefit can be achieved by replacing
char * with char * plus a length. I changed !m to m == 0 because Itagaki
asked me to make it conform to the coding style. Do you think there is no
reason to fix the coding style in existing code?

> There is a huge amount of duplicated code here. I think it makes sense to
> have the multibyte version of the function be separate, but perhaps we can
> merge the less-than-or-equal-to bits into the main code, so that we only
> have two copies instead of four. Perhaps we can just add a max_d argument
> to levenshtein_internal; if this value is >= 0 then it represents the
> maximum allowable distance, but if it is < 0 then there is no limit. Sure,
> that might slow down the existing code a bit, but it might not be
> significant. I'd at least like to see some numbers showing that it is
> significant before we go to this much trouble.

In that case we would need to add many checks of max_d to the
levenshtein_internal function, which makes the code more complex. Actually,
we could merge all four functions into one, but such a function would have
many checks for multibyte encoding and max_d. So, I see four cases here:
1) one function with checks for multibyte encoding and max_d
2) two functions with checks for multibyte encoding
3) two functions with checks for max_d
4) four separate functions
If you prefer case number 3, you should argue your position a little more.

> The code doesn't follow the project coding style. Braces must be uncuddled.
> Comment paragraphs will be reflowed unless they begin and end with ------.
> Function definitions should have the type declaration on one line and the
> function name at the start of the next.
>
> Freeing memory with pfree is likely a waste of time; is there any reason
> not to just rely on the memory context reset, as the original coding did?

OK, I'll fix these things.

> I think we might need to remove the acknowledgments section from this code.
> If everyone who touches this code adds their name, we're quickly going to
> have a mess. If we're not going to remove the acknowledgments section, then
> please add my name, too, because I've already patched this code once...

In that case I think we can leave the original acknowledgments section.

----
With best regards,
Alexander Korotkov.