Prev: Are there any MySQL queries or software packages for "findingsimilar items"
Next: NYC LOCAL: Tuesday 6 July 2010 NYLUG: The Smalltalk System Pharo
From: Erick T. Barkhuis on 5 Jul 2010 16:38 Ignoramus12110: .... >So... Any suggestion for software to ran strings by similarity and >provide "top 5" or something like that? All I can come up with is Levenshtein (not much experience using it, though). May I suggest you use "levenshtein mysql" or "levenshtein php" as a search phrase? -- Erick
From: Axel Schwenke on 5 Jul 2010 17:33 Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote: > > I am hoping that, perhaps, there is some free package that could take > a few hundreds of thousands of text strings and could provide me with > "find similar" functionality. > > Realizing the potential difficulty of the task, I would be content if > it worked only moderately well. I just want something along the lines. > > Are there any MySQL functions or other software packages or perl > modules that provide something of the sort. CPAN has some packages for approximate string matching. Levenstein has been named. And virtually all SQL databases have SOUNDEX(). Another approach is trigram counting. The problem ist hard, especially when you look for a solution that runs faster than O(n). Outside the database you cannot be faster than O(n) anyway. For "few thousands" candidates it will however be fast enough. XL
From: Axel Schwenke on 6 Jul 2010 01:30 Ignoramus12110 <ignoramus12110(a)NOSPAM.12110.invalid> wrote: > On 2010-07-05, Axel Schwenke <axel.schwenke(a)gmx.de> wrote: >> >> CPAN has some packages for approximate string matching. Levenstein has >> been named. And virtually all SQL databases have SOUNDEX(). Another >> approach is trigram counting. > > Thanks. Do you know any package names? CPAN does: http://search.cpan.org/search?query=levenshtein&mode=all This also lists some non-Levenstein implementations >> The problem ist hard, especially when you look for a solution that runs >> faster than O(n). Outside the database you cannot be faster than O(n) >> anyway. For "few thousands" candidates it will however be fast enough. > > Right now I have 208,919 candidates and the number is growing by > appx. 200 per day. Then non-indexed solutions migh be a little slow. Approximate matching is one facette of fulltext search engines. So you might want to try one of those. MySQL itself comes with a (limited) implementation of a FULLTEXT index. And there is a wealth of fulltext engines: Sphinx, Lucene, Xapian, Swish++, mnogosearch, ... SOUNDEX() I named only for completeness. The nice thing about it is that it is virtually everywhere available. Though its usefulness is quite limited because it is very coarse and works only for English. XL
From: Philipp Pagel on 6 Jul 2010 09:15 In comp.os.linux.misc Ignoramus12110 <ignoramus12110(a)nospam.12110.invalid> wrote: > ``A flagpole casts a shadow of 32 ft, Nearby, a 10-ft tree casts a > shadow of 2 ft. What is the height of the flag pole?'' > ``A flag pole casts a shadow of 32 feet. Nearby, a 10 foot tree > casts a shadow of 2 ft. Find the height of the flag pole?'' > are similar. > > I am hoping that, perhaps, there is some free package that could take > a few hundreds of thousands of text strings and could provide me with > "find similar" functionality. The BLAST software roughly does what you are looking for - but outside of mysql and specifically written for finding similar sequences in DNA or protein databases. That said, it should be possible to tweak it to accept input with arbitrary alphabets. Another programm from the bioinformatics world would be FASTA. At least BLAST can be found as a debian package (and probably for others, too). ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST/ http://faculty.virginia.edu/wrpearson/fasta/ For a few hundred or thousand comparisons and a little patience, you may also consider using the sequence alignment algorithms by Needleman and Wunsch or Smith and Waterman -- depending on your needs. If you would like to explore this route and are willing to do some code tweaking/expanding I could provide a simple implementation written in Python (just email me). And of course you can find tons of other implementations on the web because it's a classic for bioinformatics students. cu Philipp -- Dr. Philipp Pagel Lehrstuhl f. Genomorientierte Bioinformatik Technische Universit�t M�nchen http://webclu.bio.wzw.tum.de/~pagel/
From: ccc31807 on 6 Jul 2010 11:02
On Jul 5, 4:16 pm, Ignoramus12110 <ignoramus12...(a)NOSPAM. 12110.invalid> wrote: > I have a MySQL database of answered algebra questions. Questions are > stored as text strings. > When students ask questions, often (if not usually) there is already > something similar answered in the database. Note that I am not > defining what is "similar" and I do realize that it is a difficult > definition to make. As strong as Perl is at string manipulation, this is the kind of problem domain that Lisp is ideally suited for. At least one introduction, Lisp 3rd (Winston and Horn) devotes the last half of the book to consideration of these kinds of problems, and can be had for $1.68 as a used book on amazon.com, half.com, etc. I don't know what kind of time you have to devote to solving the problem, or the strength of your interest, or your previous experience, but I would strongly suggest that if you have the time, interest, and experience, that you would do well to read through W&H, Lisp 3rd. If you want something meatier, Paradigms of Artificial Intelligence Programming: Case Studies in Common Lisp (Norvig) seems to make the top ten list of everyone's Best Books in computer science. Essentially, what you would want to do is parse the student query for key words, perhaps building a database of common search terms, and match them against your database, perhaps iteratively using random permutations, using the standard Lisp pattern matching techniques. CC. |