From: Ignoramus12110 on 5 Jul 2010 21:52

On 2010-07-05, Norman Peelman <npeelman(a)cfl.rr.com> wrote:
> Ignoramus12110 wrote:
>> On 2010-07-05, Axel Schwenke <axel.schwenke(a)gmx.de> wrote:
>>> Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>>>> I am hoping that, perhaps, there is some free package that could take
>>>> a few hundred thousand text strings and could provide me with
>>>> "find similar" functionality.
>>>>
>>>> Realizing the potential difficulty of the task, I would be content if
>>>> it worked only moderately well. I just want something along those lines.
>>>>
>>>> Are there any MySQL functions or other software packages or Perl
>>>> modules that provide something of the sort?
>>> CPAN has some packages for approximate string matching. Levenshtein has
>>> been named. And virtually all SQL databases have SOUNDEX(). Another
>>> approach is trigram counting.
>>
>> Thanks. Do you know any package names?
>>
>>> The problem is hard, especially when you look for a solution that runs
>>> faster than O(n). Outside the database you cannot be faster than O(n)
>>> anyway. For "few thousands" candidates it will however be fast enough.
>>
>> Right now I have 208,919 candidates and the number is growing by
>> approx. 200 per day.
>>
>> ::~==>algsql "select count(*) from XXXXXXX where yyyyy = 1"
>> count(*)
>> 208919
>>
>> I agree that it is a hard problem.
>>
>> The Perl Levenshtein module seems to be more single-word oriented.
>>
>> i
>
> So does soundex now that I've tried it a bit.

Yes, soundex is for misspellings.
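On the "single-word oriented" point: edit distance does not have to be computed over characters. Counting insertions, deletions and substitutions of whole words makes the same idea usable on full questions. What follows is only a minimal sketch in Python, not one of the CPAN modules mentioned above; the names `tokens` and `word_edit_distance` are made up for illustration.

```python
import re

def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def word_edit_distance(a, b):
    """Levenshtein distance counted in whole words instead of characters."""
    x, y = tokens(a), tokens(b)
    # prev[j] holds the distance between the words seen so far in x
    # and the first j words of y.
    prev = list(range(len(y) + 1))
    for i, wx in enumerate(x, start=1):
        cur = [i]
        for j, wy in enumerate(y, start=1):
            cost = 0 if wx == wy else 1
            cur.append(min(prev[j] + 1,          # delete a word from x
                           cur[j - 1] + 1,       # insert a word from y
                           prev[j - 1] + cost))  # keep or substitute
        prev = cur
    return prev[-1]
```

Each comparison still costs time proportional to the product of the two word counts, and every stored question has to be compared in turn, so this is the O(n) scan discussed above rather than a way around it.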
From: Jerry Stuckle on 5 Jul 2010 21:56

Ignoramus12110 wrote:
> On 2010-07-05, Norman Peelman <npeelman(a)cfl.rr.com> wrote:
>> So does soundex now that i've tried it a bit.
>
> Yes, soundex is for misspellings.

Agreed. Soundex is not for trying to understand phrases or sentences. It is to find words which "sound" alike - i.e. misspelled words.

It can't, for instance, tell the difference between "here" and "hear" - but it can tell they sound alike.

-- 
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(a)attglobal.net
==================
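To make the "here"/"hear" point concrete, here is a small Python sketch of the classic four-character American Soundex code. It is an illustration only: the function name `soundex` and the table `CODES` are my own, and MySQL's SOUNDEX() is documented to differ in some details (for example, it can return longer strings), so do not expect identical output from the database.

```python
# Consonant groups that share a Soundex digit.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(word):
    """Return the classic 4-character Soundex code of a word."""
    letters = [c for c in word.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    out = letters[0].upper()
    last = CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "hw":
            continue  # h and w do not break a run of equal codes
        code = CODES.get(c, "")
        if code and code != last:
            out += code
        last = code  # a vowel resets `last`, so the same code may repeat after it
    return (out + "000")[:4]

print(soundex("here"), soundex("hear"))  # both print H600
```

Fed a whole question instead of a single word, the function still returns one four-character code, which is why it is of little use for comparing sentences.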
From: Ignoramus12110 on 5 Jul 2010 22:11

On 2010-07-06, Jerry Stuckle <jstucklex(a)attglobal.net> wrote:
> Agreed. Soundex is not for trying to understand phrases or sentences.
> It is to find words which "sound" alike - i.e. misspelled words.
>
> It can't, for instance, tell the difference between "here" and "hear" -
> but it can tell they sound alike.

I actually looked quite a bit, and did not find anything. Maybe my brother-in-law could find something.

i
From: Jerry Stuckle on 6 Jul 2010 07:01

Ignoramus12110 wrote:
> I actually looked quite a bit, and did not find anything. Maybe my
> brother in law could find something.
>
> i

But what you're looking for is to get a computer to be a natural language processor, which is still beyond our current programming capabilities. IBM has recently come up with a test system ("Watson") which does a fair job, but it still has a long way to go. Once we get there, we'll have a Star Trek capability :)

With that said, it doesn't mean all is hopeless. Levenshtein can help, as can trigram matching and the other things mentioned (except Soundex). But it will also require a lot of work on your part to "train" the system as to whether two questions are similar or not.
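Trigram matching, as mentioned above, can be sketched in a few lines of Python: break each question into overlapping three-character pieces and measure how much the two sets overlap. This is a minimal illustration using Jaccard similarity, which is my choice and only one of several possible scores; the names `trigrams` and `trigram_similarity` are hypothetical.

```python
import re

def trigrams(text):
    """Return the set of character trigrams of the normalized text."""
    # Lowercase, drop punctuation, and join the words with single spaces.
    norm = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return {norm[i:i + 3] for i in range(len(norm) - 2)}

def trigram_similarity(a, b):
    """Jaccard similarity of the two trigram sets, from 0.0 to 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    union = ta | tb
    # Texts too short to contain any trigram are scored as not similar.
    return len(ta & tb) / len(union) if union else 0.0

flag1 = ("A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
         "shadow of 2 ft. What is the height of the flag pole?")
flag2 = ("A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a "
         "shadow of 2 ft. Find the height of the flag pole?")
dice = ("two dice are rolled. find the odds that the score on the dice is "
        "either 10 or at most 5")

print(trigram_similarity(flag1, flag2))  # the two flagpole questions
print(trigram_similarity(flag1, dice))   # an unrelated question scores lower
```

Where to put the cutoff between "similar" and "not similar" is exactly the "training" work described above: it would have to be tuned against pairs of questions that someone has judged by hand.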
From: Marc Espie on 6 Jul 2010 07:21
In article <maudnQZR9LURoK_RnZ2dnUVZ_qadnZ2d(a)giganews.com>,
Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote:
>I have a MySQL database of answered algebra questions. Questions are
>stored as text strings.
>
>Examples are
>
>``two dice are rolled. find the odds that the score on the dice is
>either 10 or at most 5''
>``if x is the first of three consecutive even intethe product of twice a
>number and three is the same as the difference''
>``Write the equation of the line with a slope of -1/3 and passing
>through the point (6, -4).''
>``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a
>shadow of 2 ft. What is the height of the flag pole?''
>``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a
>shadow of 2 ft. Find the height of the flag pole?''
>
>When students ask questions, often (if not usually) there is already
>something similar answered in the database. Note that I am not
>defining what is "similar" and I do realize that it is a difficult
>definition to make.

Are you hell-bent on MySQL? Because SQLite has an FTS3 extension that looks like a prime candidate for trying to locate similar questions before using some Perl approximate-matching code to figure out whether it's the same or not...
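The two-stage idea above (a full-text index to shortlist candidates, then approximate matching to rank them) could look roughly like this. It is a sketch under several assumptions: it uses Python's standard `sqlite3` and `difflib` modules instead of Perl, it requires an SQLite build with FTS3 enabled, and the table name `questions`, the helpers `build_index` and `find_similar`, and the "words longer than three characters" filter are all invented for illustration.

```python
import difflib
import re
import sqlite3

def build_index(questions):
    """Load questions into an in-memory FTS3 table (assumes FTS3 is compiled in)."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE VIRTUAL TABLE questions USING fts3(body)")
    db.executemany("INSERT INTO questions(body) VALUES (?)",
                   [(q,) for q in questions])
    return db

def find_similar(db, new_question, limit=5):
    """Shortlist with a full-text OR query, then rank by approximate matching."""
    # Crude keyword choice: every distinct word longer than three characters.
    words = sorted({w for w in re.findall(r"[a-z0-9]+", new_question.lower())
                    if len(w) > 3})
    if not words:
        return []
    rows = db.execute("SELECT body FROM questions WHERE body MATCH ?",
                      (" OR ".join(words),)).fetchall()
    # Second stage: score each shortlisted candidate against the new question.
    scored = [(difflib.SequenceMatcher(None, new_question.lower(),
                                       body.lower()).ratio(), body)
              for (body,) in rows]
    scored.sort(reverse=True)
    return scored[:limit]

stored = [
    "two dice are rolled. find the odds that the score on the dice is "
    "either 10 or at most 5",
    "Write the equation of the line with a slope of -1/3 and passing "
    "through the point (6, -4).",
    "A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a "
    "shadow of 2 ft. What is the height of the flag pole?",
    "A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree casts a "
    "shadow of 2 ft. Find the height of the flag pole?",
]
db = build_index(stored)
results = find_similar(db, "A flagpole casts a shadow of 30 feet. "
                           "How tall is the flag pole?")
for score, body in results:
    print(round(score, 2), body)
```

The point of the first stage is that the expensive comparison only runs on the handful of rows the index returns, not on all 200,000 stored questions.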