From: Nam Quang Tran on 15 Jan 2010 11:39

Hello again,

In response to the various comments about how DocFetcher fails to index certain files:

1) DocFetcher has a small built-in debugging tool called "Parser Testbox", which you can open by pressing F11. This lets you perform text extraction on single files, so you can see exactly what is extracted from a particular file.

2.1) Problems with PDF files: Some PDF files have a "can't extract text" permission flag. If this flag is set, you need the master password for the PDF file in order to extract text. DocFetcher does not support PDF passwords and decryption. (And you probably don't have the master password anyway.)

2.2) We, the DocFetcher developers, are not directly involved in the development of the various text extraction libraries that are used in DocFetcher. For PDF indexing, a library called PDFBox is used, which is a rock-solid piece of software used by many other Java applications. If this library fails, then it's usually because (a) the PDF file is encrypted, or (b) the PDF consists of scanned images without any real text.

3) The current version of DocFetcher (1.0.1) does not search in filenames. We might include this in a future release.

4) For more information on wildcards, have a look at the "Query syntax" section in the manual.

Best regards
q:-) <= qforce
From: Nam Quang Tran on 15 Jan 2010 11:52

By the way, there's an easy way to find out if the "can't extract text" flag in a PDF file is set: If you can't copy text from the PDF file to the clipboard using a standard PDF reader such as Adobe Reader or Foxit Reader, then this flag is probably set.
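
For readers who would rather check the flag programmatically than with the clipboard test, here is a minimal sketch. It assumes you have already obtained the integer stored under /P in the PDF's encryption dictionary; reading that value out of a real file is not shown. The function names are hypothetical and are not part of DocFetcher or PDFBox. The bit positions follow the permission table in the PDF specification (ISO 32000-1), where bit 5 governs copying and extracting text.

```python
# Permission bits from the PDF specification (ISO 32000-1, user access
# permissions table). Bit positions are 1-based, so bit 5 is 1 << 4.
PRINT = 1 << 2                  # bit 3: print the document
MODIFY = 1 << 3                 # bit 4: modify the contents
EXTRACT = 1 << 4                # bit 5: copy or extract text and graphics
EXTRACT_ACCESSIBILITY = 1 << 9  # bit 10: extract for accessibility tools


def can_extract_text(p_value: int) -> bool:
    """Return True if the /P value permits copying or extracting text.

    p_value is the signed integer stored under /P in the encryption
    dictionary (assumed input; obtaining it is outside this sketch).
    """
    return bool(p_value & EXTRACT)


def describe(p_value: int) -> dict:
    """Map a few permission names to booleans for a given /P value."""
    return {
        "print": bool(p_value & PRINT),
        "modify": bool(p_value & MODIFY),
        "extract": bool(p_value & EXTRACT),
        "extract_for_accessibility": bool(p_value & EXTRACT_ACCESSIBILITY),
    }
```

A file whose /P value has the extract bit cleared is the case described in 2.1: the text is there, but extracting it requires the master password.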
From: mike on 15 Jan 2010 13:46

If you send me a direct email (my email address here is valid) with your preferred email address, I can attach a 160 KB PDF file that I have not been able to index. I can, however, open it in Foxit Reader and copy text out of it.

mike
From: Nam Quang Tran on 15 Jan 2010 14:09

I'm not familiar with Google Groups, so could you just send it to my official developer address?
users.sourceforge.net <- qforce@
From: Howldog on 15 Jan 2010 15:01

I remember you, you were the one who took orders from that Italian slob Saladini. Did you trick the Jewtalians into forking over their dough to Ho Ho Ho Chi Minh?