From: Nam Quang Tran on 16 Jan 2010 09:24

I have uploaded a beta version of DocFetcher 1.0.2:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta_portable.zip/download
and I'd like to invite everyone to try it out. Due to time constraints, I won't make another release within the next 4-5 months, so this is your *only* chance to complain some more and make me fix any remaining issues ;-)

What has been fixed in DocFetcher 1.0.2 beta:
- Problems with PDF indexing
- Problems with MS Excel indexing (now supports the old Excel 5.0/7.0 format)
- Search in filenames
- (and some other stuff...)

From: mike on 16 Jan 2010 09:55

Nam Quang Tran wrote:
> It seems I was able to fix the problem with the PDF indexing discussed
> in this thread, thanks to a problematic PDF file Mike sent me.
> So it's very likely that there will be a new release of DocFetcher
> next week (DocFetcher 1.0.2).

Thanks for the quick response. I tried the fix. DocFetcher now indexes the PDF files. It still gives a "parser error" on some of my XL2000 .xls files, but the file is indexed. I don't see that as a show stopper.

I did uncover some serious issues that I think ARE show-stoppers. Maybe it's just operator error. I claim to be the best "dumb user" on the planet. If I can use a program, anybody can.

You can't put a wild card at the beginning of a search term. The file I sent you is a user manual for a TEK CFG280 function generator.
If I search for CFG280, I get a hit.
If I search for CFG, I get no hit.
If I search for CFG*, I get a hit.
If I search for *280, I get a popup about no wild card at beginning.
So, if all I can remember is that it's something280, I'm out of luck. I think this seriously degrades the utility of the program. You gotta fix that.

THERE IS A SERIOUS DESIGN FLAW IN THE USER INTERFACE!!!! I BLEW AWAY ALL MY FILES.

When you right-click various subwindows, you get a context menu. Right-clicking the "Search Scope" window brings up some indexing options. I wanted to test your patch, so I needed to remove the old folder from the index so I could re-index that folder. Hmmmm....what shall I do? Delete folder looks like the best option. I clicked it. "do you really want to remove the following folders...?" Yep, I want to remove the folder from the search scope... I clicked OK AND WATCHED AS IT DELETED ALL MY FILES. And they ain't in the recycle bin.

THIS IS A SERIOUS PROBLEM. You can argue that I'm an idiot and the program did exactly what I asked it to do. And you'd be right. I am an idiot! I wasn't paying attention! But I still have lost all my files. I'm not the only idiot out here in cyberspace.

THIS IS UNACCEPTABLE!!!!! A program designed to index my files should NOT make it easy for me to DELETE all my files. You cannot put a "delete files" option in a context menu for a "Search Scope" box. Search Scope should let me define the scope of the search. It should let me add/delete directories from the scope of the search. It should NOT let me delete the actual files.

I can understand that there may be instances where I might want to delete some files. I already have many programs designed to manipulate my file system. I don't think file deletion should be a part of an indexing program...it's a disaster waiting to happen...but if you have that capability, it MUST be in a SEPARATE file management menu.

Suggest you remove all the file deletion capability until you decide where to put it and RECALL the old versions of the program... to the extent possible. I block all attempts for a program to call home, but you may have update options that let you notify most users.

In my case, I really didn't lose anything. I always test freeware on a throw-away computer. All it cost me was 20 minutes to restore the files to restart testing. Others may not be so lucky. What if I'd used the portable version on someone else's computer and blown away irreplaceable stuff? "Hey buddy, let me show you how to index your files...we really shouldn't have indexed that directory of irreplaceable baby pictures, I'll fix it...oops!!!"

I'll do more indexing and report anything else interesting I find. DocFetcher is shaping up to be a really nice program. But it needs a couple more patches.

mike

From: Nam Quang Tran on 16 Jan 2010 11:24

@ mike:

Thank you very much for your suggestions. I implemented them (as far as it was technically feasible) and uploaded a new beta:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta2_portable.zip/download

Here's what I did:

1) Deletion of files: You're quite right here. In fact, I've been thinking about this myself, but somehow forgot about it. There's a line in the user.properties file (this is where DocFetcher's preferences are stored) to disable all document-modifying operations:
AllowRepositoryModification=true
I'll just set the default value to false, and the "delete folder" option will be gone. If the user wants it back, he has to manually enable it in the user.properties. Anyway, thanks for reminding me of this issue. Should have fixed it earlier.
Btw, no, I can't recall old versions of DocFetcher. It doesn't have a "call home" feature. (And never will, I don't like call home apps either ;-))

2) Leading wildcards: Technically, this is possible, but it's slower than other searches. Instead of disabling leading wildcards, I'll just show a warning message telling the user about this performance issue.

3) The "something280" problem: I hope that with leading wildcards this will become less of an issue. However, I can't give you a hit for "CFG280" if you're searching for "CFG". This has something to do with the way indexing works: After text extraction, the text is split into separate words using a so-called "tokenizer". If I make the tokenizer split "CFG280" into two halves, then you'll be able to search for "CFG" and "280", but not for "CFG280". If the tokenizer doesn't split it, then you can search for "CFG280", but not for "CFG" or "280". I had to pick one tokenizer, so I chose the second one.

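For readers who, like mike, find the tokenizer trade-off in point 3 hard to picture, here is a minimal Python sketch of it. It is a toy model, not DocFetcher's code: the two tokenizer functions, `build_index`, `search`, and the file name `manual.pdf` are all made up for illustration, and the lookup is a plain exact-term match with no query analysis.

```python
import re
from collections import defaultdict

def whole_word_tokenizer(text):
    # Keeps each alphanumeric run intact: "CFG280" stays one token.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def splitting_tokenizer(text):
    # Splits at letter/digit boundaries: "CFG280" becomes "cfg" and "280".
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", text)]

def build_index(docs, tokenizer):
    # Maps every token to the set of documents that contain it.
    index = defaultdict(set)
    for name, text in docs.items():
        for token in tokenizer(text):
            index[token].add(name)
    return index

def search(index, term):
    # Exact-term lookup only: the term must equal a stored token.
    return sorted(index.get(term.lower(), set()))

# Hypothetical one-document collection, modelled on mike's example.
docs = {"manual.pdf": "TEK CFG280 function generator"}
whole = build_index(docs, whole_word_tokenizer)
split = build_index(docs, splitting_tokenizer)

print(search(whole, "CFG280"), search(whole, "CFG"))
print(search(split, "CFG280"), search(split, "CFG"), search(split, "280"))
```

With the non-splitting tokenizer only the full term "CFG280" is stored, so a search for "CFG" has nothing to match. With the splitting tokenizer the situation is reversed. This mirrors the choice described in point 3, under the simplifying assumptions stated above.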
From: mike on 16 Jan 2010 12:21

Nam Quang Tran wrote:
> @ mike:
>
> Thank you very much for your suggestions. I implemented them (as far
> as it was technically feasible) and uploaded a new beta:
> http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta2_portable.zip/download
>
> Here's what I did:
>
> 1) Deletion of files: You're quite right here. In fact, I've been
> thinking about this myself, but somehow forgot about it. There's a
> line in the user.properties file (this is where DocFetcher's
> preferences are stored) to disable all document-modifying operations:
> AllowRepositoryModification=true

Is there a manpage that describes manual configurations?

> I'll just set the default value to false, and the "delete folder"
> option will be gone. If the user wants it back, he has to manually
> enable it in the user.properties. Anyway, thanks for reminding me of
> this issue. Should have fixed it earlier.
> Btw, no, I can't recall old versions of DocFetcher. It doesn't have a
> "call home" feature. (And never will, I don't like call home apps
> either ;-))

Good for you. I hate that too.

>
> 2) Leading wildcards: Technically, this is possible, but it's slower
> than other searches. Instead of disabling leading wildcards, I'll just
> show a warning message telling the user about this performance issue.

Sounds good.

>
> 3) The "something280" problem: I hope that with leading wildcards this
> will become less of an issue.

I think so.

> However, I can't give you a hit for
> "CFG280" if you're searching for "CFG". This has something to do with
> the way indexing works: After text extraction, the text is split into
> separate words using a so-called "tokenizer". If I make the tokenizer
> split "CFG280" into two halves, then you'll be able to search for
> "CFG" and "280", but not for "CFG280". If the tokenizer doesn't split
> it, then you can search for "CFG280", but not for "CFG" or "280". I
> had to pick one tokenizer, so I chose the second one.

I don't quite understand what you're telling me, but if I get a hit on any word that matches *cf* or *280 or cfg*, I'm a happy camper. I'll have to try it to see if the performance issue is a problem.

I've discovered another issue that's a problem for me.

I index the directory,
Search for a term, get a hit.
Delete the target file.
Search for the term, no hit.
Pull the file out of the recycle bin, get a hit.
So the index is still there when the file isn't.
It just won't let me search it.

I have offline repositories of file archives. I leave the external drives turned off most of the time, 'cause I'm clumsy and have a propensity to delete stuff accidentally ;-) And most are on DVDs that are not currently mounted at the time of the search. I need to be able to search the index when the actual target files are not online. Obviously, I can't expect the preview to work, but that's ok. I just want to know if and where the file exists so I can mount the media.

One of the BEST features of DocFetcher is the ability to manually and quickly index any part of the file system without reindexing the whole thing again. It doesn't try to automatically track changes. Or, I didn't think it did.

There's a WatchFS=true in the properties file. That relevant?

Anyway, is there any way to set it up so I can search indexes for currently unmounted files?

I did a bunch more indexing. Doesn't index files inside archives. I use .zip files to organize related stuff. That's more my problem than yours, but would be nice to have in a future release.

Thanks again for the speedy response.

mike

From: Nam Quang Tran on 16 Jan 2010 12:52

On Jan 16, 6:21 pm, mike <spam...(a)go.com> wrote:
> Is there a manpage that describes manual configurations?

Yup. We have a wiki over here:
http://sourceforge.net/apps/mediawiki/docfetcher/index.php?title=Main_Page
In the 'advanced usage' section there's a list of the most useful keys in the user.properties file. I can add some more if requested.

> > However, I can't give you a hit for
> > "CFG280" if you're searching for "CFG". This has something to do with
> > the way indexing works: After text extraction, the text is split into
> > separate words using a so-called "tokenizer". If I make the tokenizer
> > split "CFG280" into two halves, then you'll be able to search for
> > "CFG" and "280", but not for "CFG280". If the tokenizer doesn't split
> > it, then you can search for "CFG280", but not for "CFG" or "280". I
> > had to pick one tokenizer, so I chose the second one.
>
> I don't quite understand what you're telling me, but if I get a hit
> on any word that matches *cf* or *280 or cfg* , I'm a happy camper.

I was trying to explain why you can't find "CFG280" if you search for "CFG". It's a technical limitation of index-based search, which is quite different from the Ctrl+F search in your average text editor. (If DocFetcher's search worked like that of a text editor, searches would be 1000x slower!!)

> I've discovered another issue that's a problem for me.
>
> I index the directory,
> Search for a term, get a hit.
> Delete the target file.
> Search for the term, no hit.
> Pull the file out of the recycle bin, get a hit.
> So the index is still there when the file isn't.
> It just won't let me search it.
>
> I have offline repositories of file archives. I leave the external
> drives turned off most of the time, 'cause I'm clumsy and have a propensity
> to delete stuff accidentally ;-) And most are on DVD's that are
> not currently mounted at the time of the search.
> I need to be able to search the index when the actual target files are not
> online. Obviously, I can't expect the preview to work, but that's ok.
> I just want to know if and where the file exists so I can mount the media.
>
> One of the BEST features of DocFetcher is the ability to manually
> and quickly
> index any part of the file system without reindexing the whole
> thing again. It doesn't try to automatically track changes.
> Or, I didn't think it did.
>
> There's a WatchFS=true in the properties file. That relevant?
>
> Anyway, is there any way to set it up so I can search indexes for
> currently unmounted files?

Yes, the WatchFS=true is relevant, but you can just click on the "Watch indexed folders" checkbox on the preferences dialog. They're one and the same :-) In fact, all of what you see on the preferences dialog is stored somewhere in the user.properties file.

> I did a bunch more indexing. Doesn't index files inside archives.
> I use .zip files to organize related stuff. That's more my problem
> than yours, but would be nice to have in a future release.

On the 'feature request' section of the aforementioned wiki, archive indexing is listed as one of the planned features. The actual implementation, however, isn't as easy as you might think. I'll have to rewrite large portions of the program for this to work. :/

q:-) <= Quang
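
As a footnote to the wildcard discussion in this thread: the following Python sketch shows one plausible reason a leading wildcard such as *280 costs more than a trailing one such as CFG*. It assumes the index keeps its terms in sorted order, which is common for index-based search engines but is not confirmed here for DocFetcher. The term list and both function names are hypothetical.

```python
import bisect
from fnmatch import fnmatchcase

# Hypothetical sorted term dictionary of a small index.
terms = sorted(["cfg280", "function", "generator", "manual", "tek"])

def prefix_lookup(sorted_terms, prefix):
    # Trailing wildcard (e.g. "cfg*"): binary-search to the first
    # candidate, then walk forward only while the prefix still matches.
    start = bisect.bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break
        matches.append(term)
    return matches

def wildcard_scan(sorted_terms, pattern):
    # Leading wildcard (e.g. "*280"): the sorted order gives no starting
    # point, so every term in the dictionary has to be checked.
    return [term for term in sorted_terms if fnmatchcase(term, pattern)]

print(prefix_lookup(terms, "cfg"))
print(wildcard_scan(terms, "*280"))
print(wildcard_scan(terms, "*cf*"))
```

Both approaches find "cfg280" here. The difference is that the prefix lookup touches only a handful of neighbouring terms, while the scan grows with the total number of distinct terms in the index. Even so, scanning the term list is usually far cheaper than a Ctrl+F style pass over the full text of every document.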