DocFetcher - java-based desktop searcher [Freeware]


From: Nam Quang Tran on 17 Jan 2010 15:38

On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
> I attempted three times to index a whole dvd worth of data.
> Once, it locked up completely, "program not responding". The other
> two got a repeatable non-fatal error with some kind of error dump.
>
> Out of 12000 files, I only had 600 indexable files. I also found
> that I had a lot more stuff inside zip archives than I thought.
> Not having the filenames in the index killed the project. Finding
> the other 95% of the files is way more important to me than indexing
> 5%.
>
> Now, I'm back to looking for a way to index file names inside
> zip archives. Not found anything yet that will let me search
> filenames from an index when the actual files are not mounted.
>
> For now, looks like I'm stuck with mounting the archive media
> and using totalcommander to search it interactively. It's
> a better fit for my current needs.

Some remarks:

1) The 1.0.2 beta version searches in filenames, too.

2) Program freezes at the end of the indexing: I've seen this problem
before, it happens on very large folder structures. Essentially,
DocFetcher freezes because it's busy trying to register "watches" for
each indexed folder. That means, it shouldn't freeze if you disable
the "Watch indexed folders" option before indexing.

3) Could you post the error dumps, please? They contain lots of
valuable debugging info. These error dumps are automatically written
to disk as "stacktrace_XXXXX.txt". If you're using the portable
version, they're in the DocFetcher folder. In the installed version,
they're inside "C:\Program Files\DocFetcher\" if I remember correctly.

4) Low percentage of indexed files: Have you by any chance tried to
index lots of HTML files? DocFetcher has a so-called HTML pairing
feature, i.e. it sees HTML files and all the stuff in the associated
HTML folders as a single file. For example, "foo.html" and everything
in the folder "foo_files" is treated as a single document. This could
partly explain why DocFetcher indexes only 5% of your files. Also, if
you have a lot of files in some obscure file formats currently not
supported by DocFetcher, tell me about it, I'll see what I can do.
Other than that, I have absolutely no idea why the percentage of
indexed files is so low for you. Everything seems to work fine for
most of my users.
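
To illustrate how HTML pairing can shrink the document count, here is a rough Python sketch. It is not DocFetcher's actual code (which is Java); the function name `paired_documents` and the exact "foo.html" plus "foo_files" matching rule are assumptions taken from the example above.

```python
import os

def paired_documents(root):
    """Return the files under root that count as separate documents.

    Assumption: "foo.html" (or "foo.htm") is paired with a sibling folder
    named "foo_files", and everything inside that folder is folded into
    the HTML document instead of being counted on its own.
    """
    documents = []
    for dirpath, dirnames, filenames in os.walk(root):
        html_stems = {
            os.path.splitext(name)[0]
            for name in filenames
            if name.lower().endswith((".html", ".htm"))
        }
        # Prune paired folders in place so os.walk does not descend into them.
        dirnames[:] = [
            d for d in dirnames
            if not (d.endswith("_files") and d[: -len("_files")] in html_stems)
        ]
        documents.extend(os.path.join(dirpath, name) for name in filenames)
    return documents
```

Comparing `len(paired_documents(folder))` with a plain file count for the same folder gives a feel for how much of the gap pairing alone could explain.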

q:-) <= qforce

From: mike on 17 Jan 2010 16:18

Nam Quang Tran wrote:
> On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
>> I attempted three times to index a whole dvd worth of data.
>> Once, it locked up completely, "program not responding". The other
>> two got a repeatable non-fatal error with some kind of error dump.
>>
>> Out of 12000 files, I only had 600 indexable files. I also found
>> that I had a lot more stuff inside zip archives than I thought.
>> Not having the filenames in the index killed the project. Finding
>> the other 95% of the files is way more important to me than indexing
>> 5%.
>>
>> Now, I'm back to looking for a way to index file names inside
>> zip archives. Not found anything yet that will let me search
>> filenames from an index when the actual files are not mounted.
>>
>> For now, looks like I'm stuck with mounting the archive media
>> and using totalcommander to search it interactively. It's
>> a better fit for my current needs.
>
> Some remarks:
>
> 1) The 1.0.2 beta version searches in filenames, too.

All I have is 1.0.1 with the update you pointed me to. I'll have to check
the website.
>
> 2) Program freezes at the end of the indexing:

The program froze after 423 files were indexed. It just stopped running.

The other error was not a lockup. There was a stack trace emitted
at the end of the indexing, but the program seemed to recover.

> I've seen this problem
> before, it happens on very large folder structures. Essentially,
> DocFetcher freezes because it's busy trying to register "watches" for
> each indexed folder. That means, it shouldn't freeze if you disable
> the "Watch indexed folders" option before indexing.

Yes, I disabled watch indexed folders first.

>
> 3) Could you post the error dumps, please?
I got this one twice at the end of the indexing.
It seemed to recover after displaying the error screen:

org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
at net.sourceforge.docfetcher.parse.PDFParser.parse(PDFParser.java:68)
at net.sourceforge.docfetcher.model.FileWrapper.parse(FileWrapper.java:66)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(Unknown Source)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 10 more

This is the one that locked up, with no disk or processor activity:

Exception in thread "Thread-9" org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:396)
at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:401)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1897)
at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1880)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)

> They contain lots of
> valuable debugging info. These error dumps are automatically written
> to disk as "stacktrace_XXXXX.txt". If you're using the portable
> version, they're in the DocFetcher folder. In the installed version,
> they're inside "C:\Program Files\DocFetcher\" if I remember correctly.
>
> 4) Low percentage of indexed files: Have you by any chance tried to
> index lots of HTML files?

This is my archive of downloaded stuff from the web. Many are .exe
program install files that are actually zip archives. Also lots
of drivers that end up as .zip files or .exe files.

> DocFetcher has a so-called HTML pairing
> feature, i.e. it sees HTML files and all the stuff in the associated
> HTML folders as a single file. For example, "foo.html" and everything
> in the folder "foo_files" is treated as a single document. This could
> partly explain why DocFetcher indexes only 5% of your files. Also, if
> you have a lot of files in some obscure file formats currently not
> supported by DocFetcher, tell me about it, I'll see what I can do.
> Other than that, I have absolutely no idea why the percentage of
> indexed files is so low for you. Everything seems to work fine for
> most of my users.

I don't think this is your problem. It's a consequence of me having many
.exe and .zip files in the archive. I also have a few .gz and .rar files.

I currently use an ancient program from win3.1 days called catfish16.
Indexes filenames just fine. Just doesn't index into the zip files
or index contents.

Indexing filenames is a step forward.
>
> q:-) <= qforce

From: mike on 17 Jan 2010 18:10

Nam Quang Tran wrote:
> On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
>> I attempted three times to index a whole dvd worth of data.
>> Once, it locked up completely, "program not responding". The other
>> two got a repeatable non-fatal error with some kind of error dump.
>>
>> Out of 12000 files, I only had 600 indexable files. I also found
>> that I had a lot more stuff inside zip archives than I thought.
>> Not having the filenames in the index killed the project. Finding
>> the other 95% of the files is way more important to me than indexing
>> 5%.
>>
>> Now, I'm back to looking for a way to index file names inside
>> zip archives. Not found anything yet that will let me search
>> filenames from an index when the actual files are not mounted.
>>
>> For now, looks like I'm stuck with mounting the archive media
>> and using totalcommander to search it interactively. It's
>> a better fit for my current needs.
>
> Some remarks:
>
> 1) The 1.0.2 beta version searches in filenames, too.
>
> 2) Program freezes at the end of the indexing: I've seen this problem
> before, it happens on very large folder structures. Essentially,
> DocFetcher freezes because it's busy trying to register "watches" for
> each indexed folder. That means, it shouldn't freeze if you disable
> the "Watch indexed folders" option before indexing.
>
> 3) Could you post the error dumps, please? They contain lots of
> valuable debugging info. These error dumps are automatically written
> to disk as "stacktrace_XXXXX.txt". If you're using the portable
> version, they're in the DocFetcher folder. In the installed version,
> they're inside "C:\Program Files\DocFetcher\" if I remember correctly.
>
> 4) Low percentage of indexed files: Have you by any chance tried to
> index lots of HTML files? DocFetcher has a so-called HTML pairing
> feature, i.e. it sees HTML files and all the stuff in the associated
> HTML folders as a single file. For example, "foo.html" and everything
> in the folder "foo_files" is treated as a single document. This could
> partly explain why DocFetcher indexes only 5% of your files. Also, if
> you have a lot of files in some obscure file formats currently not
> supported by DocFetcher, tell me about it, I'll see what I can do.
> Other than that, I have absolutely no idea why the percentage of
> indexed files is so low for you. Everything seems to work fine for
> most of my users.
>
> q:-) <= qforce

I downloaded DocFetcher v 1.0.2 Beta portable and indexed the dvd.

Got the error while indexing:

org.apache.pdfbox.exceptions.WrappedIOException
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
at net.sourceforge.docfetcher.parse.PDFParser.parse(PDFParser.java:68)
at net.sourceforge.docfetcher.model.FileWrapper.parse(FileWrapper.java:66)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)
Caused by: java.util.NoSuchElementException
at java.util.AbstractList$Itr.next(Unknown Source)
at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
... 10 more

Finished with errors after 612 files in the progress window.
The directory being indexed contains 9340 files.
Using dos dir command to get number of files by extension, there are at
least 800 files that should be indexed.
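
For anyone wanting to repeat the count-by-extension check without the dos dir command, a minimal Python sketch follows. The function name `count_by_extension` is made up for illustration, and which extensions count as "indexable" is left to the reader, since that depends on the DocFetcher version.

```python
import os
from collections import Counter

def count_by_extension(root):
    """Walk root recursively and tally files by lowercased extension."""
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # splitext returns "" for files without an extension.
            ext = os.path.splitext(name)[1].lower()
            counts[ext] += 1
    return counts
```

Summing the counts for the supported extensions (for example `.pdf`, `.doc`, `.xls`) gives the number to compare against the progress window.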

I'm not sure what I'm looking at in the dropdown box for "extensions".
It's dependent on what's being indexed, as if it scans the files and lists
all that it finds. Needs a "check all" box.

Not clear what it means to do a text scan on a zip file.
Need to do scans on specified extensions that are inside the zip file...
or .exe file if it's an archive. Scanning executable code for text
seems like not the right thing to do.

For executable .exe files, I'd just like the filename. I'd like to see
all the filenames inside archives and the contents of selected file
types inside archives. Not sure how to define a user interface that
makes it all work.
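
Leaving the user interface aside, the underlying behaviour can be sketched in a few lines of Python. This is only an illustration, not something DocFetcher does; `list_archive` and its `wanted_exts` parameter are hypothetical names. It relies on the standard `zipfile.is_zipfile` check, which looks at the file's contents rather than its name, so an .exe that is really a zip archive is picked up as well.

```python
import os
import zipfile

def list_archive(path, wanted_exts=(".txt", ".htm", ".html")):
    """Return (names, contents) for a zip-format file, or None otherwise.

    names    -- every member filename in the archive
    contents -- raw bytes of the members whose extension is in wanted_exts
    """
    if not zipfile.is_zipfile(path):
        return None  # ordinary executable or other non-archive file
    names, contents = [], {}
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            names.append(info.filename)
            ext = os.path.splitext(info.filename)[1].lower()
            if ext in wanted_exts:
                contents[info.filename] = zf.read(info)
    return names, contents
```

A file for which this returns None would contribute only its own filename to the index; everything else contributes its member names plus the text of the selected types.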

I scanned a smaller section of the archive. It does seem to list all the
filenames, but I still can't search them.

I misunderstood your earlier
posting that a wildcard at the beginning of a substring would be allowed.
Doesn't seem to be the case.

_________________________________________________________________________________

### skipped: unable to read file.
This is the file: ubuntupocketguide-v1-1.pdf

http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.html;jsessionid=8E46584F020D601C1EEEE01B95678646.dc116

I downloaded it again and it's identical to the file in my archive.

________________________________________________________________________________

Got a parser error on a very vanilla excel 2000 .xls file.

__________________________________________________________________________________

The more I get into this, the more I realize that I mostly want a list
of filenames that I can search. Indexing text inside .pdf files
is valuable, but a completely different thing.

I'm gonna go see if I can learn how to get the filenames out of a zip
archive in visual basic.
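
For comparison, here is what such a filename index might look like in Python rather than Visual Basic. It is only a sketch under assumptions: the function names, the one-path-per-line index file and the "archive.zip!/member" notation are all invented for illustration. The point is that the index is a plain text file, so it can be searched while the DVD is not mounted.

```python
import os
import zipfile

def build_filename_index(root):
    """Return a sorted list of relative paths, including zip member names."""
    index = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            index.append(rel)
            if zipfile.is_zipfile(path):
                try:
                    with zipfile.ZipFile(path) as zf:
                        index.extend(f"{rel}!/{member}" for member in zf.namelist())
                except (zipfile.BadZipFile, OSError):
                    pass  # unreadable archive: keep only its own filename
    return sorted(index)

def save_index(index, index_file):
    """Write one path per line so the index can be kept on the hard disk."""
    with open(index_file, "w", encoding="utf-8") as fh:
        fh.write("\n".join(index))

def search_index(index_file, substring):
    """Case-insensitive substring search; the indexed media is not needed."""
    needle = substring.lower()
    with open(index_file, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh if needle in line.lower()]
```

Usage would be one index file per DVD, for example `save_index(build_filename_index("D:\\"), "dvd01.idx")`, followed later by `search_index("dvd01.idx", "driver")`. The drive letter and index filename are placeholders.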
___________________________________________________________________________________

From: Nam Quang Tran on 17 Jan 2010 19:45

New beta available:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta3_portable.zip/download

The crash should be gone now.

> Finished with errors after 612 files in the progress window.
> The directory being indexed contains 9340 files.
> Using dos dir command to get number of files by extension, there are at
> least 800 files that should be indexed.

I really have no idea why only 612 files out of more than 800 files
are indexed.

> I'm not sure what I'm looking at in the dropdown box for "extensions"
> It's dependent on what's being indexed, as if it scans the files and lists
> all that it finds. Needs a "check all" box.

In the dropdown box for file extensions, there's actually a "Check
All" entry in the context menu. You have to right-click on the list.
Although I don't think it will help, because you can't index ZIP and
EXE files with a text parser. You'd just get some binary junk out of
them.

> For executable .exe files, I'd just like the filename. I'd like to see
> all the filenames inside archives and the contents of selected file
> types inside archives. Not sure how to define a user interface that
> makes it all work.

If you just want to search in filenames, there are other programs out
there specifically written for that purpose. Some people say
'Everything' is the best filename searcher.

> I misunderstood your earlier
> posting that a wildcard at the beginning of a substring would be allowed.
> Doesn't seem to be the case.

Wildcards at the beginning are supported now. (This worked in the
earlier betas as well.)

> ### skipped: unable to read file.
> This is the file: ubuntupocketguide-v1-1.pdf
>
> http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.h...

Doesn't work for me either. This seems to be one of the cases where
the PDF library simply blows up, so there's nothing I can do about it.

Btw, DocFetcher will be less likely to crash if you split your archive
into multiple folders and create separate indexes for each of them.

Oh, and mike, you're one hell of a beta-tester ;-)

From: mike on 17 Jan 2010 22:44

Nam Quang Tran wrote:
> New beta available:
> http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta3_portable.zip/download
>
> The crash should be gone now.

My brain is mush, but I'll give it a try later.
>
>
>> Finished with errors after 612 files in the progress window.
>> The directory being indexed contains 9340 files.
>> Using dos dir command to get number of files by extension, there are at
>> least 800 files that should be indexed.
>
> I really have no idea why only 612 files out of more than 800 files
> are indexed.
>
>
>> I'm not sure what I'm looking at in the dropdown box for "extensions"
>> It's dependent on what's being indexed, as if it scans the files and lists
>> all that it finds. Needs a "check all" box.
>
> In the dropdown box for file extensions, there's actually a "Check
> All" entry in the context menu. You have to right-click on the list.
> Although I don't think it will help, because you can't index ZIP and
> EXE files with a text parser. You'd just get some binary junk out of
> them.
>
>
>> For executable .exe files, I'd just like the filename. I'd like to see
>> all the filenames inside archives and the contents of selected file
>> types inside archives. Not sure how to define a user interface that
>> makes it all work.
>
> If you just want to search in filenames, there are other programs out
> there specifically written for that purpose. Some people say
> 'Everything' is the best filename searcher.

Thanks, I'll take a look at it. I've found many that work fine online,
but can't search the index for offline files.

I spent the afternoon figuring out how to recursively list the contents
of zip files inside zip files with VB6. Was pretty easy, 'cause I just
copied what someone else already figured out, but I've still got all manner
of issues with file permissions and error recovery.
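
The recursive listing described above can be sketched roughly as follows. This is Python, not the VB6 code, and it uses Python's standard `zipfile` module rather than the Belus component; the name `list_nested` and the "!/" separator are assumptions for illustration. Damaged or encrypted members are skipped instead of aborting the run, which is one simple form of error recovery.

```python
import io
import zipfile

def list_nested(zip_source, prefix=""):
    """Yield member names of a zip, descending into zips stored inside it.

    zip_source may be a path or a file-like object. Nested archives are
    read into memory, so this suits modest archive sizes only.
    """
    try:
        with zipfile.ZipFile(zip_source) as zf:
            for info in zf.infolist():
                name = prefix + info.filename
                yield name
                if info.filename.lower().endswith(".zip"):
                    try:
                        inner = io.BytesIO(zf.read(info))
                    except (zipfile.BadZipFile, RuntimeError, OSError):
                        continue  # damaged or encrypted member: skip it
                    yield from list_nested(inner, prefix=name + "!/")
    except zipfile.BadZipFile:
        return  # not a readable zip: nothing to list
```

Calling `list(list_nested("outer.zip"))` on an archive that holds `inner.zip` yields entries of the form `inner.zip!/a.txt` alongside the top-level names.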

Zip Component
License: Freeware
Type: ActiveX dll
Vendor: Belus Technology

This component provides industry-standard Zip archive functionality. It
is designed to be easy to use. You can pack/unpack a file or folder with
a single line of code.

Documentation, API reference and examples are available at
http://xstandard.com
>
>
>> I misunderstood your earlier
>> posting that a wildcard at the beginning of a substring would be allowed.
>> Doesn't seem to be the case.
>
> Wildcards at the beginning are supported now. (This worked in the
> earlier betas as well.)
>
>
>> ### skipped: unable to read file.
>> This is the file: ubuntupocketguide-v1-1.pdf
>>
>> http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.h...
>
> Doesn't work for me either. This seems to be one of the cases where
> the PDF library simply blows up, so there's nothing I can do about it.
>
> Btw, DocFetcher will be less likely to crash if you split your archive
> into multiple folders and create separate indexes for each of them.

Well, I've already split it into 50 DVDs ;-)

I got very annoyed during the testing because I had to navigate to the
test directory every time. Maybe the create archive process could remember
the last place you created...the .ini file is already there.
I didn't try drag/drop. Maybe that solves the problem.
>
> Oh, and mike, you're one hell of a beta-tester ;-)
Thanks, it's a curse...
I used to be the poster-boy for Murphy's Law, but so much went wrong
when I was around that they fired me. Funny how employers
don't like to hear what's wrong. ;-(
