From: Nam Quang Tran on
I have uploaded a beta version of DocFetcher 1.0.2:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta_portable.zip/download

and I'd like to invite everyone to try it out. Due to time
constraints, I won't make another release within the next 4-5 months,
so this is your *only* chance to complain some more and make me fix
any remaining issues ;-)

What has been fixed in DocFetcher 1.0.2 beta:
- Problems with PDF indexing
- Problems with MS Excel indexing (now supports the old Excel 5.0/7.0
format)
- Search in filenames
- (and some other stuff...)
From: mike on
Nam Quang Tran wrote:
> It seems I was able to fix the problem with the PDF indexing discussed
> in this thread, thanks to a problematic PDF file Mike sent me.
> So it's very likely that there will be a new release of DocFetcher
> next week (DocFetcher 1.0.2).

Thanks for the quick response.
I tried the fix.
DocFetcher now indexes the PDF files.
It still gives a "parser error" on some of my XL2000 .xls files,
but the file is indexed. I don't see that as a show stopper.

I did uncover some serious issues that I think ARE show-stoppers.
Maybe it's just operator error. I claim to be the best "dumb user"
on the planet. If I can use a program, anybody can.

You can't put a wild card at the beginning of a search term.
The file I sent you is a user manual for a TEK CFG280 function generator.
If I search for CFG280, I get a hit.
If I search for CFG, I get no hit.
If I search for CFG*, I get a hit.
If I search for *280, I get a popup about no wild card at beginning.
So, if all I can remember is that it's something280, I'm out of luck.

I think this seriously degrades the utility of the program.
You gotta fix that.

THERE IS A SERIOUS DESIGN FLAW IN THE USER INTERFACE!!!!

I BLEW AWAY ALL MY FILES.

When you right-click various subwindows, you get a context menu.
Right-clicking the "Search Scope" window brings up some indexing
options.
I wanted to test your patch, so needed to remove the old folder
from the index so I could re-index that folder.
Hmmmm....what shall I do? Delete folder looks like the best option.
I clicked it. "do you really want to remove the following folders...?"
Yep, I want to remove the folder from the search scope...

I clicked OK AND WATCHED AS IT DELETED ALL MY FILES.
And they ain't in the recycle bin.

THIS IS A SERIOUS PROBLEM.

You can argue that I'm an idiot and the program did exactly what I
asked it to do.
And you'd be right.
I am an idiot!
I wasn't paying attention!
But I still have lost all my files.
I'm not the only idiot out here in cyberspace.

THIS IS UNACCEPTABLE!!!!!

A program designed to index my files should NOT make it easy for
me to DELETE all my files.
You cannot put a "delete files" option in a context menu for a
"Search Scope" box. Search Scope should let me define the scope
of the search. It should let me add/delete directories
from the scope of the search. It should NOT let me delete
the actual files.

I can understand that there may be instances where I might want
to delete some files. I already have many programs designed to
manipulate my file system. I don't think file deletion should
be a part of an indexing program...it's a disaster waiting to
happen...but if you have that capability, it MUST be in a SEPARATE
file management menu.

Suggest you remove all the file deletion capability until you
decide where to put it and RECALL the old versions of the program...
to the extent possible. I block all attempts for a program to call
home, but you may have update options that let you notify most users.

In my case, I really didn't lose anything. I always test freeware
on a throw-away computer. All it cost me was 20 minutes to restore the
files to restart testing. Others may not be so lucky.
What if I'd used the portable version on someone else's computer
and blown away irreplaceable stuff? "Hey buddy, let me show you
how to index your files...we really shouldn't have indexed that
directory of irreplaceable baby pictures, I'll fix it...oops!!!"

I'll do more indexing and report anything else interesting I find.

DocFetcher is shaping up to be a really nice program. But it needs
a couple more patches.

mike
From: Nam Quang Tran on
@ mike:

Thank you very much for your suggestions. I implemented them (as far
as it was technically feasible) and uploaded a new beta:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta2_portable.zip/download

Here's what I did:

1) Deletion of files: You're quite right here. In fact, I've been
thinking about this myself, but somehow forgot about it. There's a
line in the user.properties file (this is where DocFetcher's
preferences are stored) to disable all document-modifying operations:
AllowRepositoryModification=true
I'll just set the default value to false, and the "delete folder"
option will be gone. If the user wants it back, he has to manually
enable it in the user.properties. Anyway, thanks for reminding me of
this issue. Should have fixed it earlier.
Btw, no, I can't recall old versions of DocFetcher. It doesn't have a
"call home" feature. (And never will, I don't like call home apps
either ;-))

2) Leading wildcards: Technically, this is possible, but it's slower
than other searches. Instead of disabling leading wildcards, I'll just
show a warning message telling the user about this performance issue.

3) The "something280" problem: I hope that with leading wildcards this
will become less of an issue. However, I can't give you a hit for
"CFG280" if you're searching for "CFG". This has something to do with
the way indexing works: After text extraction, the text is split into
separate words using a so-called "tokenizer". If I make the tokenizer
split "CFG280" into two halves, then you'll be able to search for
"CFG" and "280", but not for "CFG280". If the tokenizer doesn't split
it, then you can search for "CFG280", but not for "CFG" or "280". I
had to pick one tokenizer, so I chose the second one.
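
To make points 2) and 3) more concrete, here is a minimal Python sketch of the trade-off. It is not DocFetcher's actual code: the two toy tokenizers, the function names and the sorted term list are all made up for illustration, and a real index is more elaborate.

```python
import re
from bisect import bisect_left

def tokenize_whole(text):
    """Keep alphanumeric runs together: 'CFG280' stays one token."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def tokenize_split(text):
    """Split letters from digits: 'CFG280' becomes 'cfg' and '280'."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+|[0-9]+", text)]

def build_terms(text, tokenizer):
    """Return the sorted list of distinct terms (a toy term dictionary)."""
    return sorted(set(tokenizer(text)))

def exact(terms, word):
    """Exact lookup: binary search in the sorted term list."""
    word = word.lower()
    i = bisect_left(terms, word)
    return i < len(terms) and terms[i] == word

def prefix(terms, head):
    """Trailing wildcard (cfg*): jump to the first candidate, then walk."""
    head = head.lower()
    i = bisect_left(terms, head)
    hits = []
    while i < len(terms) and terms[i].startswith(head):
        hits.append(terms[i])
        i += 1
    return hits

def suffix(terms, tail):
    """Leading wildcard (*280): no shortcut, every term must be checked."""
    tail = tail.lower()
    return [t for t in terms if t.endswith(tail)]

text = "User manual for the TEK CFG280 function generator"
whole = build_terms(text, tokenize_whole)   # tokenizer that does not split
split = build_terms(text, tokenize_split)   # tokenizer that splits

print(exact(whole, "CFG280"))   # True
print(exact(whole, "CFG"))      # False
print(prefix(whole, "CFG"))     # ['cfg280']
print(suffix(whole, "280"))     # ['cfg280']
print(exact(split, "CFG"))      # True
print(exact(split, "CFG280"))   # False
```

In this toy model the prefix search can use the sort order to skip straight to the matching terms, while the leading-wildcard search has to look at every term, which is why it is the slower of the two.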
From: mike on
Nam Quang Tran wrote:
> @ mike:
>
> Thank you very much for your suggestions. I implemented them (as far
> as it was technically feasible) and uploaded a new beta:
> http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta2_portable.zip/download
>
> Here's what I did:
>
> 1) Deletion of files: You're quite right here. In fact, I've been
> thinking about this myself, but somehow forgot about it. There's a
> line in the user.properties file (this is where DocFetcher's
> preferences are stored) to disable all document-modifying operations:
> AllowRepositoryModification=true

Is there a manpage that describes manual configurations?

> I'll just set the default value to false, and the "delete folder"
> option will be gone. If the user wants it back, he has to manually
> enable it in the user.properties. Anyway, thanks for reminding me of
> this issue. Should have fixed it earlier.
> Btw, no, I can't recall old versions of DocFetcher. It doesn't have a
> "call home" feature. (And never will, I don't like call home apps
> either ;-))
Good for you. I hate that too.
>
> 2) Leading wildcards: Technically, this is possible, but it's slower
> than other searches. Instead of disabling leading wildcards, I'll just
> show a warning message telling the user about this performance issue.
Sounds good.
>
> 3) The "something280" problem: I hope that with leading wildcards this
> will become less of an issue.

I think so.

> However, I can't give you a hit for
> "CFG280" if you're searching for "CFG". This has something to do with
> the way indexing works: After text extraction, the text is split into
> separate words using a so-called "tokenizer". If I make the tokenizer
> split "CFG280" into two halves, then you'll be able to search for
> "CFG" and "280", but not for "CFG280". If the tokenizer doesn't split
> it, then you can search for "CFG280", but not for "CFG" or "280". I
> had to pick one tokenizer, so I chose the second one.

I don't quite understand what you're telling me, but if I get a hit
on any word that matches *cf* or *280 or cfg*, I'm a happy camper.
I'll have to try it to see if the performance issue is a problem.

I've discovered another issue that's a problem for me.

I index the directory,
Search for a term, get a hit.
Delete the target file.
Search for the term, no hit.
Pull the file out of the recycle bin, get a hit.
So the index is still there when the file isn't.
It just won't let me search it.

I have offline repositories of file archives. I leave the external
drives turned off most of the time, 'cause I'm clumsy and have a propensity
to delete stuff accidentally ;-) And most are on DVD's that are
not currently mounted at the time of the search.
I need to be able to search the index when the actual target files are not
online. Obviously, I can't expect the preview to work, but that's ok.
I just want to know if and where the file exists so I can mount the media.

One of the BEST features of DocFetcher is the ability to manually and
quickly index any part of the file system without reindexing the whole
thing again. It doesn't try to automatically track changes.
Or, I didn't think it did.

There's a WatchFS=true in the properties file. That relevant?

Anyway, is there any way to set it up so I can search indexes for
currently unmounted files?

I did a bunch more indexing. Doesn't index files inside archives.
I use .zip files to organize related stuff. That's more my problem
than yours, but would be nice to have in a future release.

Thanks again for the speedy response.
mike


From: Nam Quang Tran on
On Jan 16, 6:21 pm, mike <spam...(a)go.com> wrote:

> Is there a manpage that describes manual configurations?

Yup. We have a wiki over here: http://sourceforge.net/apps/mediawiki/docfetcher/index.php?title=Main_Page
In the 'advanced usage' section there's a list of the most useful keys
in the user.properties file. I can add some more if requested.


> > However, I can't give you a hit for
> > "CFG280" if you're searching for "CFG". This has something to do with
> > the way indexing works: After text extraction, the text is split into
> > separate words using a so-called "tokenizer". If I make the tokenizer
> > split "CFG280" into two halves, then you'll be able to search for
> > "CFG" and "280", but not for "CFG280". If the tokenizer doesn't split
> > it, then you can search for "CFG280", but not for "CFG" or "280". I
> > had to pick one tokenizer, so I chose the second one.
>
> I don't quite understand what you're telling me, but if I get a hit
> on any word that matches *cf* or *280 or cfg* , I'm a happy camper.

I was trying to explain why you can't find "CFG280" if you search for
"CFG". It's a technical limitation of index-based search, which is
quite different from the Ctrl+F search in your average text editor.
(If DocFetcher's search worked like that of a text editor, searches
would be 1000x slower!!)
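
The difference between the two kinds of search can be sketched in a few lines of Python. This is only a toy model and not DocFetcher's code: the file names, the sample texts and the function names are invented, and the real speed difference depends entirely on the data.

```python
from collections import defaultdict

# Hypothetical documents, used only for this illustration.
docs = {
    "cfg280_manual.pdf": "User manual for the TEK CFG280 function generator",
    "notes.txt": "Remember to calibrate the generator before use",
}

def scan_search(docs, needle):
    """Ctrl+F style: read the full text of every document on each search."""
    needle = needle.lower()
    return sorted(name for name, text in docs.items() if needle in text.lower())

def build_index(docs):
    """Done once, up front: map each whole word to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def index_search(index, word):
    """One dictionary lookup; the document text is not read again."""
    return sorted(index.get(word.lower(), set()))

index = build_index(docs)
print(scan_search(docs, "CFG"))          # ['cfg280_manual.pdf']
print(index_search(index, "CFG"))        # []
print(index_search(index, "CFG280"))     # ['cfg280_manual.pdf']
print(index_search(index, "generator"))  # ['cfg280_manual.pdf', 'notes.txt']
```

The scan finds "CFG" inside "CFG280" because it compares raw characters, but it has to go through all the text every time. The index only knows whole words, so "CFG" misses, but each lookup is a single step no matter how much text was indexed.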


> I've discovered another issue that's a problem for me.
>
> I index the directory,
> Search for a term, get a hit.
> Delete the target file.
> Search for the term, no hit.
> Pull the file out of the recycle bin, get a hit.
> So the index is still there when the file isn't.
> It just won't let me search it.
>
> I have offline repositories of file archives.  I leave the external
> drives turned off most of the time, 'cause I'm clumsy and have a propensity
> to delete stuff accidentally ;-)  And most are on DVD's that are
> not currently mounted at the time of the search.
> I need to be able to search the index when the actual target files are not
> online.  Obviously, I can't expect the preview to work, but that's ok.
> I just want to know if and where the file exists so I can mount the media.
>
> One of the BEST features of DocFetcher is the  ability to manually
> and quickly
> index any part of the file system without reindexing the whole
> thing again.  It doesn't try to automatically track changes.
> Or, I didn't think it did.
>
> There's a WatchFS=true in the properties file.  That relevant?
>
> Anyway, is there any way to set it up so I can search indexes for
> currently unmounted files?

Yes, the WatchFS=true is relevant, but you can just click on the
"Watch indexed folders" checkbox in the preferences dialog. They're
one and the same :-) In fact, all of what you see in the preferences
dialog is stored somewhere in the user.properties file.
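
For anyone who wants to check such settings by hand, the Python sketch below shows one way to read key=value lines like these. Only the two key names, AllowRepositoryModification and WatchFS, come from this thread. The sample values, the parsing rules and the helper names are assumptions for illustration, not a description of how DocFetcher itself reads the file.

```python
def parse_properties(text):
    """Parse simple key=value lines; skip blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            props[key.strip()] = value.strip()
    return props

def as_bool(props, key, default=False):
    """Treat 'true' (any case) as True; a missing key falls back to default."""
    return props.get(key, str(default)).lower() == "true"

# Hypothetical excerpt of a user.properties file (values chosen for the example).
sample = """\
# hypothetical excerpt of user.properties
AllowRepositoryModification=false
WatchFS=true
"""

props = parse_properties(sample)
print(as_bool(props, "AllowRepositoryModification"))  # False
print(as_bool(props, "WatchFS"))                      # True
```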


> I did a bunch more indexing.  Doesn't index files inside archives.
> I use .zip files to organize related stuff.  That's more my problem
> than yours, but would be nice to have in a future release.

In the 'feature request' section of the aforementioned wiki, archive
indexing is listed as one of the planned features. The actual
implementation, however, isn't as easy as you might think. I'll have to
rewrite large portions of the program for this to work. :/

q:-) <= Quang