From: Nam Quang Tran on 17 Jan 2010 15:38

On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
> I attempted three times to index a whole dvd worth of data.
> Once, it locked up completely, "program not responding". The other
> two got a repeatable non-fatal error with some kind of error dump.
>
> Out of 12000 files, I only had 600 indexable files. I also found
> that I had a lot more stuff inside zip archives than I thought.
> Not having the filenames in the index killed the project. Finding
> the other 95% of the files is way more important to me than indexing
> 5%.
>
> Now, I'm back to looking for a way to index file names inside
> zip archives. Not found anything yet that will let me search
> filenames from an index when the actual files are not mounted.
>
> For now, looks like I'm stuck with mounting the archive media
> and using totalcommander to search it interactively. It's
> a better fit for my current needs.

Some remarks:

1) The 1.0.2 beta version searches in filenames, too.

2) Program freezes at the end of the indexing: I've seen this problem
before, it happens on very large folder structures. Essentially,
DocFetcher freezes because it's busy trying to register "watches" for
each indexed folder. That means it shouldn't freeze if you disable
the "Watch indexed folders" option before indexing.

3) Could you post the error dumps, please? They contain lots of
valuable debugging info. These error dumps are automatically written
to disk as "stacktrace_XXXXX.txt". If you're using the portable
version, they're in the DocFetcher folder. In the installed version,
they're inside "C:\Program Files\DocFetcher\" if I remember correctly.

4) Low percentage of indexed files: Have you by any chance tried to
index lots of HTML files? DocFetcher has a so-called HTML pairing
feature, i.e. it sees an HTML file and all the stuff in the associated
HTML folder as a single file. For example, "foo.html" and everything
in the folder "foo_files" is treated as a single document. This could
partly explain why DocFetcher indexes only 5% of your files. Also, if
you have a lot of files in some obscure file formats currently not
supported by DocFetcher, tell me about it, I'll see what I can do.
Other than that, I have absolutely no idea why the percentage of
indexed files is so low for you. Everything seems to work fine for
most of my users.

q:-) <= qforce
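[Editor's sketch] The HTML pairing described in remark 4 can be illustrated roughly as follows. This is only a minimal Python sketch of the idea, not DocFetcher's actual code (DocFetcher is a Java program); the function name group_documents and the exact matching rule ("name.html" owns the sibling folder "name_files") are assumptions made for illustration.

```python
from pathlib import PurePosixPath

HTML_SUFFIXES = {".html", ".htm"}

def group_documents(paths):
    """Group each HTML page with the files in its '<stem>_files' folder.

    Returns a dict mapping the owning document to the list of paths that
    count as part of it. Files outside any companion folder own themselves.
    """
    paths = [PurePosixPath(p) for p in paths]

    # Map each companion folder ("dir/foo_files") to its page ("dir/foo.html").
    folder_to_page = {}
    for p in paths:
        if p.suffix.lower() in HTML_SUFFIXES:
            folder_to_page[p.with_name(p.stem + "_files")] = p

    docs = {}
    for p in paths:
        owner = p
        # The nearest enclosing companion folder decides the owner.
        for parent in p.parents:
            if parent in folder_to_page:
                owner = folder_to_page[parent]
                break
        docs.setdefault(str(owner), []).append(str(p))
    return docs
```

With pairing like this, a saved web page with dozens of images and stylesheets counts as one document, which is one way a file count and a document count can drift far apart.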
From: mike on 17 Jan 2010 16:18

Nam Quang Tran wrote:
> On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
>> I attempted three times to index a whole dvd worth of data.
>> Once, it locked up completely, "program not responding". The other
>> two got a repeatable non-fatal error with some kind of error dump.
>>
>> Out of 12000 files, I only had 600 indexable files. I also found
>> that I had a lot more stuff inside zip archives than I thought.
>> Not having the filenames in the index killed the project. Finding
>> the other 95% of the files is way more important to me than indexing
>> 5%.
>>
>> Now, I'm back to looking for a way to index file names inside
>> zip archives. Not found anything yet that will let me search
>> filenames from an index when the actual files are not mounted.
>>
>> For now, looks like I'm stuck with mounting the archive media
>> and using totalcommander to search it interactively. It's
>> a better fit for my current needs.
>
> Some remarks:
>
> 1) The 1.0.2 beta versions searches in filenames, too.

All I have is 1.0.1 with the update you pointed me to.
I'll have to check the website.

>
> 2) Program freezes at the end of the indexing:

The program freeze was after 423 files indexed. Just stopped running.
The other error was not a lockup. There was a stack trace emitted
at the end of the indexing, but the program seemed to recover.

> I've seen this problem
> before, it happens on very large folder structures. Essentially,
> DocFetcher freezes because it's busy trying to register "watches" for
> each indexed folder. That means, it shouldn't freeze if you disable
> the "Watch indexed folders" option before indexing.

Yes, I disabled watch indexed folders first.

>
> 3) Could you post the error dumps, please?

I got this one twice at the end of the indexing.
Seemed to recover after displaying the error screen.

org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
    at net.sourceforge.docfetcher.parse.PDFParser.parse(PDFParser.java:68)
    at net.sourceforge.docfetcher.model.FileWrapper.parse(FileWrapper.java:66)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
    at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)
Caused by: java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(Unknown Source)
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
    at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    ... 10 more

This is the one that locked up, no disk or processor activity.

Exception in thread "Thread-9" org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:396)
    at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:401)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1897)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1880)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
    at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)

> They contain lots of
> valuable debugging info. These error dumps are automatically written
> to disk as "stacktrace_XXXXX.txt". If you're using the portable
> version, they're in the DocFetcher folder. In the installed version,
> they're inside "C:\Program Files\DocFetcher\" if I remember correctly.
>
> 4) Low percentage of indexed files: Do you by any chance have tried to
> index lots of HTML files?

This is my archive of downloaded stuff from the web.
Many are .exe program install files that are actually zip archives.
Also lots of drivers that end up as .zip files or .exe files.

> DocFetcher has a so-called HTML pairing
> feature, i.e. it sees HTML files and all the stuff in the associated
> HTML folders as a single file. For example, "foo.html" and everything
> in the folder "foo_files" is treated as a single document. This could
> partly explain why DocFetcher indexes only 5% of your files. Also, if
> you have a lot of files in some obscure file formats currently not
> supported by DocFetcher, tell me about it, I'll see what I can do.
> Other than that, I have absolutely no idea why the percentage of
> indexed files is so low for you. Everything seems to work fine for
> most of my users.

I don't think this is your problem. It's a consequence of me having many
.exe and .zip files in the archive... also got a few .gz and .rar.

I currently use an ancient program from win3.1 days called catfish16.
Indexes filenames just fine. Just doesn't index into the zip files
or index contents. Indexing filenames is a step forward.

>
> q:-) <= qforce
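[Editor's sketch] For the ".exe install files that are actually zip archives" case, one possible check is sketched below in Python. It relies on the standard zipfile module, which locates the ZIP directory from the end of the file and so generally tolerates an executable stub in front of the archive data. The function name archive_member_names is made up for this example, and nothing here reflects how DocFetcher or catfish16 work.

```python
import io
import zipfile

def archive_member_names(source):
    """Return the member names if `source` is a ZIP archive, else None.

    `source` may be a path or a seekable binary file object. The file
    extension is ignored on purpose, so a self-extracting .exe with ZIP
    data appended to it is treated like any other archive.
    """
    if not zipfile.is_zipfile(source):
        return None
    with zipfile.ZipFile(source) as zf:
        return zf.namelist()

def demo_self_extractor():
    """Build a fake 'installer' in memory: a stub followed by ZIP data."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("driver/readme.txt", "hello")
    stub = b"MZ" + b"\x00" * 64  # stands in for the executable part
    return io.BytesIO(stub + buf.getvalue())
```

A plain executable simply yields None, in which case only the outer filename would go into the list. Formats such as .rar and .gz are not covered by this check.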
From: mike on 17 Jan 2010 18:10

Nam Quang Tran wrote:
> On Jan 17, 6:28 pm, mike <spam...(a)go.com> wrote:
>> I attempted three times to index a whole dvd worth of data.
>> Once, it locked up completely, "program not responding". The other
>> two got a repeatable non-fatal error with some kind of error dump.
>>
>> Out of 12000 files, I only had 600 indexable files. I also found
>> that I had a lot more stuff inside zip archives than I thought.
>> Not having the filenames in the index killed the project. Finding
>> the other 95% of the files is way more important to me than indexing
>> 5%.
>>
>> Now, I'm back to looking for a way to index file names inside
>> zip archives. Not found anything yet that will let me search
>> filenames from an index when the actual files are not mounted.
>>
>> For now, looks like I'm stuck with mounting the archive media
>> and using totalcommander to search it interactively. It's
>> a better fit for my current needs.
>
> Some remarks:
>
> 1) The 1.0.2 beta versions searches in filenames, too.
>
> 2) Program freezes at the end of the indexing: I've seen this problem
> before, it happens on very large folder structures. Essentially,
> DocFetcher freezes because it's busy trying to register "watches" for
> each indexed folder. That means, it shouldn't freeze if you disable
> the "Watch indexed folders" option before indexing.
>
> 3) Could you post the error dumps, please? They contain lots of
> valuable debugging info. These error dumps are automatically written
> to disk as "stacktrace_XXXXX.txt". If you're using the portable
> version, they're in the DocFetcher folder. In the installed version,
> they're inside "C:\Program Files\DocFetcher\" if I remember correctly.
>
> 4) Low percentage of indexed files: Do you by any chance have tried to
> index lots of HTML files? DocFetcher has a so-called HTML pairing
> feature, i.e. it sees HTML files and all the stuff in the associated
> HTML folders as a single file. For example, "foo.html" and everything
> in the folder "foo_files" is treated as a single document. This could
> partly explain why DocFetcher indexes only 5% of your files. Also, if
> you have a lot of files in some obscure file formats currently not
> supported by DocFetcher, tell me about it, I'll see what I can do.
> Other than that, I have absolutely no idea why the percentage of
> indexed files is so low for you. Everything seems to work fine for
> most of my users.
>
> q:-) <= qforce

I downloaded DocFetcher v 1.0.2 Beta portable and indexed the dvd.
Got the error while indexing:

org.apache.pdfbox.exceptions.WrappedIOException
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
    at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:779)
    at net.sourceforge.docfetcher.parse.PDFParser.parse(PDFParser.java:68)
    at net.sourceforge.docfetcher.model.FileWrapper.parse(FileWrapper.java:66)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:347)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.indexNewFiles(RootScope.java:387)
    at net.sourceforge.docfetcher.model.RootScope.updateIndex(RootScope.java:190)
    at net.sourceforge.docfetcher.model.ScopeRegistry$2.run(ScopeRegistry.java:390)
Caused by: java.util.NoSuchElementException
    at java.util.AbstractList$Itr.next(Unknown Source)
    at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
    at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
    at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
    ... 10 more

Finished with errors after 612 files in the progress window.
The directory being indexed contains 9340 files.
Using dos dir command to get number of files by extension, there are at
least 800 files that should be indexed.

I'm not sure what I'm looking at in the dropdown box for "extensions".
It's dependent on what's being indexed, as if it scans the files and lists
all that it finds. Needs a "check all" box.
Not clear what it means to do a text scan on a zip file.
Need to do scans on specified extensions that are inside the zip file...
or .exe file if it's an archive. Scanning executable code for text
seems like not the right thing to do.

For executable .exe files, I'd just like the filename. I'd like to see
all the filenames inside archives and the contents of selected file
types inside archives. Not sure how
to define a user interface that makes it all work.

I scanned a smaller section of the archive. It does seem to list all
the filenames, but I still can't search them. I misunderstood your earlier
posting that a wildcard at the beginning of a substring would be allowed.
Doesn't seem to be the case.
________________________________________________________________________

### skipped: unable to read file.
This is the file: ubuntupocketguide-v1-1.pdf

http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.html;jsessionid=8E46584F020D601C1EEEE01B95678646.dc116

I downloaded it again and it's identical to the file in my archive.
________________________________________________________________________

Got a parser error on a very vanilla excel 2000 .xls file.
________________________________________________________________________

The more I get into this, the more I realize that I mostly want
a list of filenames that I can search. Indexing text inside .pdf
files is a valuable, but completely different thing.

I'm gonna go see if I can learn how to get the filenames out of
a zip archive in visual basic.
________________________________________________________________________
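[Editor's sketch] The per-extension count that the post above gets from the DOS dir command can also be produced with a few lines of Python, shown here as a rough illustration. The set ASSUMED_INDEXABLE is a guess made for the example; it is not DocFetcher's real list of supported formats, and the helper names are invented.

```python
import os
from collections import Counter

# Hypothetical set of extensions assumed to have a text parser.
ASSUMED_INDEXABLE = {".pdf", ".doc", ".xls", ".txt", ".html", ".htm"}

def count_extensions(filenames):
    """Count filenames per lower-cased extension ('(none)' if missing)."""
    counts = Counter()
    for name in filenames:
        ext = os.path.splitext(name)[1].lower()
        counts[ext or "(none)"] += 1
    return counts

def walk_filenames(root):
    """Yield every filename below `root`, e.g. a mounted DVD."""
    for _dirpath, _dirnames, filenames in os.walk(root):
        yield from filenames

def indexable_total(counts, indexable=ASSUMED_INDEXABLE):
    """Sum the counts of the extensions assumed to be indexable."""
    return sum(n for ext, n in counts.items() if ext in indexable)
```

Running count_extensions(walk_filenames(root)) on the indexed directory and comparing indexable_total with the number shown in the progress window would show how many expected files were skipped.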
From: Nam Quang Tran on 17 Jan 2010 19:45

New beta available:
http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta3_portable.zip/download

The crash should be gone now.

> Finished with errors after 612 files in the progress window.
> The directory being indexed contains 9340 files.
> Using dos dir command to get number of files by extension, there are at
> least 800 files that should be indexed.

I really have no idea why only 612 files out of more than 800 files
are indexed.

> I'm not sure what I'm looking at in the dropdown box for "extensions"
> It's dependent on what's being indexed, as if it scans the files and lists
> all that it finds. Needs a "check all" box.

In the dropdown box for file extensions, there's actually a "Check
All" entry in the context menu. You have to right-click on the list.
I don't think it will help, though, because you can't index ZIP and
EXE files with a text parser. You'd just get some binary junk out of
them.

> For executable .exe files, I'd just like the filename. I'd like to see
> all the filenames inside archives and the contents of selected file
> types inside archives. Not sure how
> to define a user interface that makes it all work.

If you just want to search in filenames, there are other programs out
there specifically written for that purpose. Some people say
'Everything' is the best filename searcher.

> I misunderstood your earlier
> posting that a wildcard at the beginning of a substring would be allowed.
> Doesn't seem to be the case.

Wildcards at the beginning are supported now. (This worked in the
earlier betas as well.)

> ### skipped: unable to read file.
> This is the file: ubuntupocketguide-v1-1.pdf
>
> http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.h...

Doesn't work for me either. This seems to be one of the cases where
the PDF library simply blows up, so there's nothing I can do about it.

Btw, DocFetcher will be less likely to crash if you split your archive
into multiple folders and create separate indexes for each of them.

Oh, and mike, you're one hell of a beta-tester ;-)
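[Editor's sketch] To show what a leading-wildcard search over a saved list of filenames looks like, here is a small Python example using the standard fnmatch module. It is not DocFetcher's search, which (judging by the stack traces in this thread) is built on Lucene; the function name search_names and the sample index entries are invented for the example.

```python
from fnmatch import fnmatchcase

def search_names(index, pattern):
    """Return the entries of `index` matching a shell-style pattern.

    Matching is case-insensitive, and '*' may appear anywhere in the
    pattern, including at the very beginning.
    """
    pattern = pattern.lower()
    return [name for name in index if fnmatchcase(name.lower(), pattern)]

# A made-up index of the kind one could save per DVD and search offline.
SAMPLE_INDEX = [
    "DVD07/drivers/nvidia_setup.exe",
    "DVD07/docs/ubuntupocketguide-v1-1.pdf",
    "DVD12/tools/Setup.zip",
]
```

Because the index is just a list of strings, it can be searched without the media mounted. A linear scan like this should be adequate for tens of thousands of names, although that has not been measured here.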
From: mike on 17 Jan 2010 22:44

Nam Quang Tran wrote:
> New beta available:
> http://sourceforge.net/projects/docfetcher/files/docfetcher/1.0.2%20beta/docfetcher_1.0.2-beta3_portable.zip/download
>
> The crash should be gone now.

My brain is mush, but I'll give it a try later.

>
>> Finished with errors after 612 files in the progress window.
>> The directory being indexed contains 9340 files.
>> Using dos dir command to get number of files by extension, there are at
>> least 800 files that should be indexed.
>
> I really have no idea why only 612 files out of more than 800 files
> are indexed.
>
>> I'm not sure what I'm looking at in the dropdown box for "extensions"
>> It's dependent on what's being indexed, as if it scans the files and lists
>> all that it finds. Needs a "check all" box.
>
> In the dropdown box for file extensions, there's actually a "Check
> All" entry in the context menu. You have to right-click on the list.
> Although I don't think it will help, because you can't index ZIP and
> EXE files with a text parser. You'd just get some binary junk out of
> them.
>
>> For executable .exe files, I'd just like the filename. I'd like to see
>> all the filenames inside archives and the contents of selected file
>> types inside archives. Not sure how
>> to define a user interface that makes it all work.
>
> If you just want to search in filenames, there are other programs out
> there specifically written for that purpose. Some people say
> 'Everything' is the best filename searcher.

Thanks, I'll take a look at it.
I've found many that work fine online, but can't search the
index for offline files.

I spent the afternoon figuring out how to recursively list the
contents of zip files inside zip files with VB6. Was pretty easy,
'cause I just copied what someone else already figured out, but I've
still got all manner of issues with file permissions and error
recovery.

  Zip Component
  License: Freeware
  Type: ActiveX dll
  Vendor: Belus Technology

  This component provides industry-standard Zip archive functionality.
  It is designed to be easy to use. You can pack/unpack a file or
  folder with a single line of code.

  Documentation, API reference and examples are available at
  http://xstandard.com

>
>> I misunderstood your earlier
>> posting that a wildcard at the beginning of a substring would be allowed.
>> Doesn't seem to be the case.
>
> Wildcards at the beginning are supported now. (This worked in the
> earlier betas as well.)
>
>> ### skipped: unable to read file.
>> This is the file: ubuntupocketguide-v1-1.pdf
>>
>> http://www.4shared.com/get/83091857/eb5bb617/ubuntupocketguide-v1-1.h...
>
> Doesn't work for me either. This seems to be one of the cases where
> the PDF library simply blows up, so there's nothing I can do about it.
>
> Btw, DocFetcher will be less likely to crash if you split your archive
> into multiple folders and create separate indexes for each of them.

Well, I've already split it into 50 DVDs ;-)
I got very annoyed during the testing because I had to navigate to the
test directory every time. Maybe the create archive process could
remember the last place you created... the .ini file is already there.
I didn't try drag/drop. Maybe that solves the problem.

>
> Oh, and mike, you're one hell of a beta-tester ;-)

Thanks, it's a curse...
I used to be the poster-boy for Murphy's Law, but so much went
wrong when I was around that they fired me.
Funny how employers don't like to hear what's wrong. ;-(
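[Editor's sketch] The recursive listing of zip files inside zip files described above was done in VB6 with an ActiveX component; that code is not shown in the thread. As a rough equivalent, here is a Python sketch using only the standard zipfile module. The function name list_zip_names and the "!/" separator for nested members are choices made for this example, and the error handling is deliberately minimal.

```python
import io
import zipfile

def list_zip_names(source, prefix=""):
    """List member names of a ZIP archive, descending into nested .zip files.

    `source` is a path or a seekable binary file object. Members of an
    inner archive are reported as 'outer.zip!/inner-name'.
    """
    names = []
    with zipfile.ZipFile(source) as zf:
        for info in zf.infolist():
            full = prefix + info.filename
            names.append(full)
            if info.is_dir() or not info.filename.lower().endswith(".zip"):
                continue
            try:
                # Inner archives are read into memory so that they can be
                # opened without unpacking anything to disk.
                nested = io.BytesIO(zf.read(info))
                names.extend(list_zip_names(nested, full + "!/"))
            except (zipfile.BadZipFile, RuntimeError):
                # Damaged or password-protected inner archive: keep its
                # own name in the list and move on.
                pass
    return names

def demo_nested_zip():
    """Build 'inner.zip' inside an outer archive, all in memory."""
    inner = io.BytesIO()
    with zipfile.ZipFile(inner, "w") as zf:
        zf.writestr("a.txt", "inner file")
    outer = io.BytesIO()
    with zipfile.ZipFile(outer, "w") as zf:
        zf.writestr("inner.zip", inner.getvalue())
        zf.writestr("b.txt", "outer file")
    return outer
```

Reading nested archives into memory keeps the sketch short but would be a poor fit for very large inner archives; extracting those to a temporary file first would be the obvious alternative.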