From: Allen Kistler on 11 Oct 2009 20:15

As the title states, I'm seeking suggestions for a site search engine to
search wikis, regular web sites, and possibly CVS.

At the highest level, the requirements are:

1. Must be open source & fee-free
2. Must not be Java or C/C++ (not debatable, don't try)
3. Should be Python (I might be able to sell Perl, though)
4. Should have an active development community
5. Should have an API that allows apps to query
6. Should have ability to tweak results administratively (e.g., choose
   which pages get listed for a certain word, even if they don't have
   that word, and which pages don't get listed, even if they do have
   the word)

Most stuff that I've been able to find that meets Req 1 gets killed by
Req 2.

What I've found so far that survives Req 2:

Gonzui - written in Ruby, doesn't appear to be actively maintained
Lucene - although written in Java, it has ports to Perl and Ruby
Namazu - written in Perl
OpenFTS - written in Perl, doesn't appear to be actively maintained

I haven't dug deeply into those above, but are there any others I should
consider? Any experience with those above?
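Req 6 (administrative tweaking) does not have to come from the engine itself; it can sit as a thin layer on top of whatever ranked list the engine returns. What follows is only a minimal sketch under that assumption. The function name `apply_overrides`, the `pinned` and `blocked` tables, and the example paths are all hypothetical and do not belong to any of the engines listed above.

```python
from typing import Dict, Iterable, List, Set


def apply_overrides(
    term: str,
    ranked_hits: Iterable[str],
    pinned: Dict[str, List[str]],
    blocked: Dict[str, Set[str]],
) -> List[str]:
    """Promote admin-pinned pages and drop admin-blocked pages for a term.

    ranked_hits is whatever the search engine returned, best hit first.
    pinned and blocked are hypothetical admin tables keyed by lowercase term.
    """
    key = term.lower()
    banned = blocked.get(key, set())
    results: List[str] = []
    seen: Set[str] = set()
    # Pinned pages go first, whether or not the engine matched them.
    for url in list(pinned.get(key, [])) + list(ranked_hits):
        if url in banned or url in seen:
            continue  # skip blocked pages and duplicates
        seen.add(url)
        results.append(url)
    return results


if __name__ == "__main__":
    hits = ["/wiki/Backup", "/wiki/OldBackupNotes", "/wiki/Restore"]
    pinned = {"backup": ["/wiki/DisasterRecovery"]}
    blocked = {"backup": {"/wiki/OldBackupNotes"}}
    print(apply_overrides("Backup", hits, pinned, blocked))
```

Keeping the overrides outside the index means the admin tables can change without re-indexing, at the cost of one extra pass over each result list.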
From: Keith Keller on 11 Oct 2009 23:08

On 2009-10-12, Allen Kistler <ackistler(a)oohay.moc> wrote:
> As the title states, I'm seeking suggestions for a site search engine to
> search wikis, regular web sites, and possibly CVS.

You didn't say: searching from the front-end (i.e., screenscraping
someone else's site) or the back-end (i.e., indexing your own site)?
I'm assuming the latter.

> Lucene - although written in Java, it has ports to Perl and Ruby

I'm assuming you're talking about KinoSearch here as the Perl port? We
use it for a fairly large chunk of data, and still literally tens of
millions of "documents" fit into an index about 10GB in size, and it's
incredibly fast to return results. But it does require you to write
code to build the index--you can't just throw it at a wiki off the shelf
and hope it works. So you'll also have to figure out how to get the
list of documents to feed to it, and what sort of data to stuff into the
index.

The KinoSearch page is here: http://www.rectangular.com/kinosearch/

If you're talking about Plucene, forget about it, it's dog slow. See
http://www.rectangular.com/kinosearch/benchmarks.html if you're willing
to trust their benchmarks (I've never done the benchmarks myself).

--keith

--
kkeller-usenet(a)wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information
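To make the "you have to write code to build the index" point concrete: with a library-style engine, the caller supplies the list of documents and decides what text goes into the index. The sketch below shows the shape of that work with a toy in-memory inverted index in plain Python. It is not the KinoSearch API; `tokenize`, `build_index`, `search`, and the document ids are all hypothetical names chosen for illustration.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


def build_index(docs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each token to the set of document ids that contain it.

    docs is the "list of documents to feed to it": (doc_id, text) pairs
    that the caller has already gathered from a wiki, web site, or CVS.
    """
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in docs:
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return sorted(matches)


if __name__ == "__main__":
    docs = [
        ("wiki/Install", "How to install the search engine"),
        ("wiki/Backup", "Backup and restore the search index"),
        ("site/about.html", "About this site"),
    ]
    index = build_index(docs)
    print(search(index, "search index"))
```

A real engine adds ranking, stemming, and on-disk storage, but gathering the documents and choosing which fields to index stays with whoever writes the indexing script.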
From: Allen Kistler on 12 Oct 2009 02:50

Keith Keller wrote:
> On 2009-10-12, Allen Kistler <ackistler(a)oohay.moc> wrote:
>> As the title states, I'm seeking suggestions for a site search engine to
>> search wikis, regular web sites, and possibly CVS.
>
> You didn't say: searching from the front-end (i.e., screenscraping
> someone else's site) or the back-end (i.e., indexing your own site)?
> I'm assuming the latter.

Yes, indexing our own site, which would actually be multiple content
sources, so I was expecting there to be some crawling involved. If I
can avoid crawling, that's okay, too.

>> Lucene - although written in Java, it has ports to Perl and Ruby
>
> I'm assuming you're talking about KinoSearch here as the Perl port? We
> use it for a fairly large chunk of data, and still literally tens of
> millions of "documents" fit into an index about 10GB in size, and it's
> incredibly fast to return results. But it does require you to write
> code to build the index--you can't just throw it at a wiki off the shelf
> and hope it works. So you'll also have to figure out how to get the
> list of documents to feed to it, and what sort of data to stuff into the
> index.
>
> The KinoSearch page is here: http://www.rectangular.com/kinosearch/
>
> If you're talking about Plucene, forget about it, it's dog slow. See
> http://www.rectangular.com/kinosearch/benchmarks.html if you're willing
> to trust their benchmarks (I've never done the benchmarks myself).

I was thinking of both. I like the statement in the benchmark:
"Lucene's data structures are almost pathologically ill-matched with Perl."

Thanks for the feedback.