Determining the similarity between a user-supplied short piece of text (between 5 and 15 characters) and a list of similar-length text items. [General]

From: shiplu on 17 Jul 2010 18:29

There is an algorithm called longest common subsequence (LCS).
If you compute the LCS of the given string against each string in the
database and sort by its length, the longest result is the closest
match.
I think this algorithm is probably already implemented and available in
your environment, possibly under a different name.
I am not sure which function in PHP or MySQL serves the purpose.
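
As a rough sketch of the idea only: PHP has no built-in LCS function that I know of (similar_text() and levenshtein() are related but compute different measures), so the length can be computed with a small dynamic-programming table. The helper name lcsLength and the sample words are made up for illustration, and the code assumes single-byte (ASCII) strings.

```php
<?php
// Hypothetical helper: length of the longest common subsequence of two strings.
// Works on bytes, so it assumes single-byte (ASCII) input.
function lcsLength(string $a, string $b): int
{
    $m = strlen($a);
    $n = strlen($b);
    // $prev is the previous row of the DP table, $curr the row being filled.
    $prev = array_fill(0, $n + 1, 0);
    for ($i = 1; $i <= $m; $i++) {
        $curr = [0];
        for ($j = 1; $j <= $n; $j++) {
            $curr[$j] = ($a[$i - 1] === $b[$j - 1])
                ? $prev[$j - 1] + 1
                : max($prev[$j], $curr[$j - 1]);
        }
        $prev = $curr;
    }
    return $prev[$n];
}

// Score made-up candidates against the user's input, longest match first.
$input = 'wrds';
$candidates = ['wars', 'words', 'cards'];
$scores = [];
foreach ($candidates as $candidate) {
    $scores[$candidate] = lcsLength($input, $candidate);
}
arsort($scores); // higher is better for LCS length
print_r($scores);
```

Note that a raw LCS length favours longer candidates, so for items of varying length it may be worth dividing by the candidate's length before sorting.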

Shiplu Mokadd.im
My talks, http://talk.cmyweb.net
Follow me, http://twitter.com/shiplu
SUST Programmers, http://groups.google.com/group/p2psust
Innovation distinguishes bet ... ... (ask Steve Jobs the rest)

From: Andrew Ballard on 19 Jul 2010 18:06

On Mon, Jul 19, 2010 at 2:46 PM, tedd <tedd.sperling(a)gmail.com> wrote:
> At 12:39 PM +0100 7/19/10, Richard Quadling wrote:
>>
>> I'm using MS SQL, not mySQL.
>>
>> Found an extended stored procedure with a UDF.
>>
>> Testing it looks excellent.
>>
>> Searching for a match on 30,000 vehicles takes next to no additional time -
>> a few seconds in total, compared to the over 3 minutes to search using
>> SQL code.
>
> That seems a bit slow.
>
> For example, currently I'm searching over 4,000 records (which contain
> 4,000 paragraphs taken from the text of the King James version of the Bible)
> for matching words, such as %created% and the times are typically around
> 0.009 seconds.
>
> As such, searching ten times that amount should be in the range of tenths of
> a second and not seconds -- so taking a few seconds to search 30,000 records
> seems excessive to me.
>
> Cheers,
>
> tedd

I would be surprised if a Levenshtein or similar_text comparison in a
database were NOT slower than even a wildcard search because of the
calculations that have to be performed on each row in the column being
compared. That, and the fact that user-defined functions in SQL Server
often have a performance penalty of their own.

Just for kicks, you could try loading the values in that column into
an array in PHP and then time iterating the array to calculate the
Levenshtein distances for each value to see how it compares.
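
Something along these lines would do for the timing test. This is only a sketch: the $values array is filled with made-up stand-in strings here (in practice it would be loaded from the column), and $input is a hypothetical user entry. levenshtein(), asort() and microtime() are standard PHP functions.

```php
<?php
// Stand-in data: in practice $values would be loaded from the database column.
$values = [];
for ($i = 0; $i < 30000; $i++) {
    $values[] = 'vehicle' . $i;
}
$input = 'vehicel123'; // hypothetical user input with two letters swapped

$start = microtime(true);
$distances = [];
foreach ($values as $value) {
    // Plain Levenshtein distance; one call per row, as the database would do.
    $distances[$value] = levenshtein($input, $value);
}
asort($distances); // lowest distance (closest match) first
$elapsed = microtime(true) - $start;

printf("Compared %d values in %.4f seconds\n", count($distances), $elapsed);
```

The elapsed time will depend on the machine and the string lengths, so it is only useful as a comparison against the same search run in the database.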

Andrew

From: Richard Quadling on 20 Jul 2010 05:09

Tedd,

I'm not looking for a "word". I'm looking for similar "wrds".

"words" is closer to the misspelled "wrds" than it is to "wars".

SELECT dbo.DamerauLevenshteinDistance('words', 'wars'),
       dbo.DamerauLevenshteinDistance('words', 'wrds')

(No column name)  (No column name)
2                 1

Lower is better.

Also, I have to compare every row in the set and then sort it to find
the lowest values for the Damerau-Levenshtein or the highest for the
Jaro-Winkler distance.

As the value entered is always going to be the unknown, I can't
pre-calculate the distances.

I do an exact match test first.
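
The same flow (exact match first, then score every row and sort) might look roughly like this on the PHP side. It is a sketch under assumptions: closestMatches is a made-up helper name, the candidate list is invented, and PHP's built-in levenshtein() is the plain Levenshtein distance, not the Damerau-Levenshtein variant the UDF computes (the two agree for these sample words, since no transposition is involved).

```php
<?php
// Hypothetical helper: return up to $limit closest candidates for $input,
// as candidate => distance, closest first.
function closestMatches(string $input, array $candidates, int $limit = 3): array
{
    // Exact match test first: no distance calculation needed.
    if (in_array($input, $candidates, true)) {
        return [$input => 0];
    }
    $distances = [];
    foreach ($candidates as $candidate) {
        // The input is unknown in advance, so every row has to be scored.
        $distances[$candidate] = levenshtein($input, $candidate);
    }
    asort($distances); // lower is better
    return array_slice($distances, 0, $limit, true);
}

print_r(closestMatches('words', ['wars', 'wrds', 'birds']));
```

Here "wrds" comes out first with a distance of 1, ahead of "wars" and "birds" at 2, matching the SQL result above.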
