From: Alexander Korotkov on 29 Jul 2010 11:27

I forgot the attribution in the levenshtein.c file.

----
With best regards,
Alexander Korotkov.
From: Robert Haas on 1 Aug 2010 21:20

On Fri, Jul 30, 2010 at 1:14 PM, Alexander Korotkov <aekorotkov(a)gmail.com> wrote:
> Ok, here is the patch for multi-byte characters.
> I changed the arguments of the levenshtein_internal function from text *
> to const char * and int. I think that makes levenshtein_internal more
> reusable. For example, the function can be used to calculate the distance
> between two fragments of strings without having to copy those fragments.
> Actually, I'm not fully satisfied with my solution for merging the
> single-byte and multi-byte versions of the function. There are "ifdef"s
> in the body of the function and non-context-free macros. This is because
> the multi-byte version needs to store the character lengths returned by
> pg_mblen, while the single-byte version doesn't. I tried a multi-byte
> version that doesn't store character lengths, but in that case the
> function is about 1.5 times slower (it has to call pg_mblen n*m times).

I reviewed this code in a fair amount of detail today and ended up
rewriting it. In general terms, it's best to avoid changing things that
are not relevant to the central purpose of the patch. This patch randomly
adds a whole bunch of whitespace and removes a whole bunch of comments
which I find useful and do not want to see removed. It took me about an
hour to reverse out all of those changes, which then let me get down to
the business of actually analyzing the code.

The biggest problem I found is that, as coded, this is more than 50%
slower on UTF-8 databases, because the multi-byte path is really slow,
even if there are actually no multi-byte characters present. The attached
patch takes the approach of using a fast path for just the innermost loop
when neither string contains any multi-byte characters (regardless of
whether or not the encoding allows multi-byte characters). It's still
slower than CVS HEAD, but for strings containing no multi-byte characters
it's only marginally, if at all, slower than previous releases, because
9.0 doesn't contain your fix to avoid the extra string copy; and it's
faster than your implementation even when multi-byte characters ARE
present. It is slightly slower on single-byte-encoding databases (I
tested SQL_ASCII), but still faster than previous releases. It also has
quite a bit less code duplication.

Please review and let me know your thoughts.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
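To make the fast-path idea concrete, here is a minimal, self-contained
sketch of the two-row dynamic-programming Levenshtein computation with a
byte-wise inner loop. It is only an illustration, not the attached patch:
the names min3, all_single_byte, and levenshtein_1byte are invented for
this sketch, all_single_byte stands in for a pg_mblen()-based scan, and
unit costs are assumed throughout.

    #include <stdbool.h>
    #include <stdlib.h>

    /* Smallest of three ints. */
    static int
    min3(int a, int b, int c)
    {
        int m = (a < b) ? a : b;

        return (m < c) ? m : c;
    }

    /*
     * Stand-in for a pg_mblen()-based scan: in UTF-8, any byte with the
     * high bit set belongs to a multi-byte character.
     */
    static bool
    all_single_byte(const char *s, int len)
    {
        for (int i = 0; i < len; i++)
        {
            if ((unsigned char) s[i] >= 0x80)
                return false;
        }
        return true;
    }

    /*
     * Fast path only: valid when all_single_byte() holds for both inputs,
     * so one character is one byte and the inner loop needs no
     * per-character length bookkeeping.
     */
    static int
    levenshtein_1byte(const char *s, int m, const char *t, int n)
    {
        int    *prev = malloc((n + 1) * sizeof(int));
        int    *curr = malloc((n + 1) * sizeof(int));
        int     result;

        for (int j = 0; j <= n; j++)
            prev[j] = j;            /* distance from an empty prefix of s */

        for (int i = 1; i <= m; i++)
        {
            int    *swap;

            curr[0] = i;
            for (int j = 1; j <= n; j++)
                curr[j] = min3(prev[j] + 1,                     /* deletion */
                               curr[j - 1] + 1,                 /* insertion */
                               prev[j - 1] +
                               (s[i - 1] == t[j - 1] ? 0 : 1)); /* substitution */

            swap = prev;
            prev = curr;
            curr = swap;
        }

        result = prev[n];
        free(prev);
        free(curr);
        return result;
    }

A dispatching wrapper would run levenshtein_1byte only when
all_single_byte(s, m) && all_single_byte(t, n), and otherwise fall back to
a general inner loop that compares characters by their stored lengths;
keeping that test outside the inner loop is what avoids the per-iteration
multi-byte overhead the message describes.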
From: Robert Haas on 2 Aug 2010 10:48

2010/8/2 Alexander Korotkov <aekorotkov(a)gmail.com>:
> On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
>> I reviewed this code in a fair amount of detail today and ended up
>> rewriting it. In general terms, it's best to avoid changing things
>> that are not relevant to the central purpose of the patch. This patch
>> randomly adds a whole bunch of whitespace and removes a whole bunch of
>> comments which I find useful and do not want to see removed. It took
>> me about an hour to reverse out all of those changes, which then let
>> me get down to the business of actually analyzing the code.
> I'm sorry. This is my first patch; I will be more careful next time.
> But I think there is no consensus about some changes, like replacing
> "!m" with "m == 0".

Sure; if that were the only such change, I wouldn't have mentioned it.

> I think this line is not correct:
> "if (m != s_bytes && n != t_bytes)"

Good catch, fixed in the attached version.

>> The attached patch takes the approach of using a fast path for just
>> the innermost loop when neither string contains any multi-byte
>> characters (regardless of whether or not the encoding allows
>> multi-byte characters). It's still slower than CVS HEAD, but for
>> strings containing no multi-byte characters it's only marginally, if
>> at all, slower than previous releases, because 9.0 doesn't contain
>> your fix to avoid the extra string copy; and it's faster than your
>> implementation even when multi-byte characters ARE present. It is
>> slightly slower on single-byte-encoding databases (I tested
>> SQL_ASCII), but still faster than previous releases. It's also quite
>> a bit less code duplication.
>
> Can I look at the test which shows that your implementation is faster
> than mine when multi-byte characters are present? My tests show the
> reverse. For example, with the Russian dictionary from OpenOffice:

Hmm, with the correction you mention above, I'm getting about the same
results with multi-byte characters (but faster with only single-byte
characters). I did this:

CREATE TABLE words (a text not null primary key);
COPY words FROM '/usr/share/dict/words';
SELECT SUM(levenshtein(a, 'foo')) FROM words;
SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
SELECT SUM(levenshtein(a, 'ańs')) FROM words;

With the updated patch attached below, these ran in about 102 ms, 222 ms,
and 129 ms. With your patch, I get about 134 ms, 222 ms, and 130 ms.
Perhaps this is because I only have multi-byte characters in one of the
inputs, not both. I don't have a dictionary handy with multi-byte words
in it. Can you send me a file?

> I think that the cause of the slowdown is the memcmp function. It is
> very fast for long stretches of memory, but for just a few bytes an
> inline function is faster, because the compiler has more freedom to
> optimize. After replacing memcmp with an inline function like the
> following in your implementation:

Yeah, that does seem to be slightly better. I've done a version of this
in the attached patch.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
From: Robert Haas on 2 Aug 2010 16:53

2010/8/2 Alexander Korotkov <aekorotkov(a)gmail.com>:
> The dump of the table with the Russian dictionary is in the attachment.
>
> I used the following tests:
> SELECT SUM(levenshtein(a, 'foo')) FROM words;
> SELECT SUM(levenshtein(a, 'Urbański')) FROM words;
> SELECT SUM(levenshtein(a, 'ańs')) FROM words;
> SELECT SUM(levenshtein(a, 'foo')) FROM words2;
> SELECT SUM(levenshtein(a, 'дом')) FROM words2;
> SELECT SUM(levenshtein(a, 'компьютер')) FROM words2;
>
> With your last version of the patch, the results are:
> 63ms 94ms 61ms 100ms 121ms 217ms
>
> But I found another optimization. With this condition:
> if (x[x_char_len - 1] == y[y_char_len - 1] && x_char_len == y_char_len &&
>     (x_char_len == 1 || char_same(x, y, x_char_len - 1)))
> the results become:
> 58ms 96ms 63ms 102ms 107ms 171ms
>
> On single-byte characters the results are almost unchanged, but they get
> better with multi-byte characters. Generally, this improvement works
> because the first bytes of different characters are frequently the same
> (for example, almost all Cyrillic characters start with the same byte,
> and I think the situation is similar in other alphabets), whereas a
> match on the last byte is just a coincidence.

Hey, that's really cool. It helps a lot here, too.

My previous patch, with your six test cases:
Time: 100.556 ms
Time: 205.254 ms
Time: 127.070 ms
Time: 77.734 ms
Time: 90.345 ms
Time: 166.954 ms

Your original patch, same six test cases:
Time: 131.489 ms
Time: 215.048 ms
Time: 125.471 ms
Time: 80.068 ms
Time: 87.110 ms
Time: 166.918 ms

My patch, with your proposed change to compare the last byte first, same
six test cases:
Time: 96.489 ms
Time: 201.497 ms
Time: 121.876 ms
Time: 75.005 ms
Time: 80.219 ms
Time: 142.844 ms

Updated patch attached.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company
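A plausible shape for the char_same() helper referenced in the condition
above — a sketch based on the discussion, not necessarily the exact code
in the patch. The caller has already verified that the two characters
have equal byte length and equal last bytes, so only the leading bytes
remain to be compared.

    #include <stdbool.h>

    /*
     * Compare the first len bytes of two characters whose last bytes
     * have already been found equal, walking from back to front.
     */
    static inline bool
    char_same(const char *x, const char *y, int len)
    {
        while (len-- > 0)
        {
            if (x[len] != y[len])
                return false;
        }
        return true;
    }

The short-circuit order in the quoted condition is what pays off: the
last bytes, which vary most between characters sharing a lead byte
(0xD0/0xD1 for most Cyrillic letters in UTF-8), are compared first, and
char_same is reached only when they already match.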
From: Alexander Korotkov on 2 Aug 2010 06:57

On Mon, Aug 2, 2010 at 5:20 AM, Robert Haas <robertmhaas(a)gmail.com> wrote:
> I reviewed this code in a fair amount of detail today and ended up
> rewriting it. In general terms, it's best to avoid changing things
> that are not relevant to the central purpose of the patch. This patch
> randomly adds a whole bunch of whitespace and removes a whole bunch of
> comments which I find useful and do not want to see removed. It took
> me about an hour to reverse out all of those changes, which then let
> me get down to the business of actually analyzing the code.

I'm sorry. This is my first patch; I will be more careful next time. But
I think there is no consensus about some changes, like replacing "!m"
with "m == 0".

I think this line is not correct:
"if (m != s_bytes && n != t_bytes)"
The correct negation of the assumption that both strings are single-byte
is the assumption that at least one string is not single-byte. With this
patch, the levenshtein function can calculate the distance incorrectly:

test=# select levenshtein('фw', 'ww');
 levenshtein
-------------
           2
(1 row)

The line should be rewritten as:
"if (m != s_bytes || n != t_bytes)"

> The attached patch takes the approach of using a fast path for just
> the innermost loop when neither string contains any multi-byte
> characters (regardless of whether or not the encoding allows
> multi-byte characters). It's still slower than CVS HEAD, but for
> strings containing no multi-byte characters it's only marginally, if
> at all, slower than previous releases, because 9.0 doesn't contain
> your fix to avoid the extra string copy; and it's faster than your
> implementation even when multi-byte characters ARE present. It is
> slightly slower on single-byte-encoding databases (I tested
> SQL_ASCII), but still faster than previous releases. It's also quite
> a bit less code duplication.

Can I look at the test which shows that your implementation is faster
than mine when multi-byte characters are present? My tests show the
reverse. For example, with the Russian dictionary from OpenOffice:

With my version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 277,190 ms

With your version of the patch:
test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 398,011 ms

I think that the cause of the slowdown is the memcmp function. It is very
fast for long stretches of memory, but for just a few bytes an inline
function is faster, because the compiler has more freedom to optimize.
After replacing memcmp with an inline function like the following in your
implementation:

static inline bool
char_cmp(const char *s, const char *d, int len)
{
    /*
     * Byte-at-a-time compare; for short lengths the compiler can inline
     * and unroll this, unlike an out-of-line memcmp call.
     */
    while (len--)
    {
        if (*s++ != *d++)
            return false;
    }
    return true;
}

the code becomes much faster:

test=# select sum(levenshtein(word, 'фывафыва')) from test;
   sum
---------
 1675281
(1 row)
Time: 241,272 ms

----
With best regards,
Alexander Korotkov.
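To spell out why the && form misbehaves, here is a hypothetical sketch of
the corrected test, with the variable meanings taken from the discussion:
m and n are character counts, s_bytes and t_bytes are byte counts, and a
string is pure single-byte exactly when the two counts are equal.

    #include <stdbool.h>

    /*
     * True when at least one input contains a multi-byte character,
     * i.e. (by De Morgan's law) the negation of "both strings are
     * single-byte" — hence ||, not &&.
     */
    static bool
    needs_multibyte_path(int m, int s_bytes, int n, int t_bytes)
    {
        return m != s_bytes || n != t_bytes;
    }

With the buggy && form, 'фw' (two characters, three bytes) against 'ww'
(two characters, two bytes) evaluates to false because n == t_bytes, so
the byte-wise fast path runs on byte strings of unequal length and
returns 2 instead of the correct distance of 1.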