From: tubby on 25 Jan 2007 16:05

I know this question comes up a lot, so here goes again. I want to read text from a PDF file, run re searches on the text, etc. I do not care about layout, fonts, borders, etc. I just want the text. I've been reading Adobe's PDF Reference Guide and I'm beginning to develop a better understanding of PDF in general, but I need a bit of help... this seems like it should be easier than it is. Here's some code:

    import zlib

    fp = open('test.pdf', 'rb')
    bytes = []
    while 1:
        byte = fp.read(1)
        #print byte
        bytes.append(byte)
        if not byte:
            break

    for byte in bytes:

        op = open('pdf.txt', 'a')

        dco = zlib.decompressobj()

        try:
            s = dco.decompress(byte)
            #print >> op, s
            print s
        except Exception, e:
            print e

        op.close()

    fp.close()

I know the text is compressed... that it would have stream and endstream markers and BT (Begin Text) and ET (End Text), and that the uncompressed text is enclosed in parentheses (this is my text). Has anyone here done this in a simple fashion? I've played with the pyPdf library some, but it seems overly complex for my needs (merge PDFs, write PDFs, etc). I just want a simple PDF text extractor.

Thanks
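
Editor's sketch (not part of the original post): the code above hands zlib one byte at a time, each with a fresh decompressor, so no compressed stream is ever decoded as a whole. A minimal version of the approach the post describes might look like the following, written in present-day Python 3 rather than the Python 2 used in the thread. It assumes the content streams use the Flate (zlib) filter and that the text appears as simple literal strings inside BT ... ET blocks. The names `extract_text`, `STREAM_RE`, `TEXT_OBJECT_RE`, `LITERAL_RE` and `ESCAPE_RE` are made up for this illustration. Hex strings, nested parentheses, octal escapes, font-specific encodings, and strings that themselves contain a standalone "ET" are not handled.

```python
import re
import zlib

# Raw bytes between a "stream" keyword and the next "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A BT ... ET text object inside a decompressed content stream.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string "( ... )" with backslash escapes but no nested parentheses.
LITERAL_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")
# Only the escapes \( \) and \\ are undone in this sketch.
ESCAPE_RE = re.compile(rb"\\([()\\])")


def extract_text(pdf_bytes):
    """Return the literal strings found in Flate-compressed text objects."""
    pieces = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj() ignores bytes after the end of the zlib data,
            # such as the end-of-line that precedes "endstream".
            data = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (image data, other filters, ...)
        for block in TEXT_OBJECT_RE.finditer(data):
            for literal in LITERAL_RE.finditer(block.group(1)):
                raw = ESCAPE_RE.sub(rb"\1", literal.group(1))
                # latin-1 never fails to decode; real PDFs may need more care.
                pieces.append(raw.decode("latin-1"))
    return pieces


if __name__ == "__main__":
    with open("test.pdf", "rb") as fp:
        for piece in extract_text(fp.read()):
            print(piece)
```

Reading the whole file and decompressing each stream in one call is the main difference from the original attempt. Whether the strings that come out are readable still depends on how the PDF's fonts encode their characters.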
From: Nils Oliver Kröger on 25 Jan 2007 16:40

Have a look at PDFlib (www.pdflib.com). Their Text Extraction Toolkit might be what you are looking for, though I'm not sure whether you can use it detached from PDFlib itself.

hth
Nils
From: David Boddie on 25 Jan 2007 16:46

On Thursday 25 January 2007 22:05, tubby wrote:

> I know this question comes up a lot, so here goes again. I want to read
> text from a PDF file, run re searches on the text, etc. I do not care
> about layout, fonts, borders, etc. I just want the text. I've been
> reading Adobe's PDF Reference Guide and I'm beginning to develop a
> better understanding of PDF in general, but I need a bit of help... this
> seems like it should be easier than it is.

It _seems_ that way. ;-)

One of the more promising suggestions for a way to solve this came up in a comp.lang.python thread last year:

http://groups.google.com/group/comp.lang.python/msg/cb6c97a44ce4cbe9?dmode=source

Basically, if you have access to the pdftotext command on a system that supports xpdf, you should be able to get something reasonable out of a PDF file.

> I know the text is compressed... that it would have stream and endstream
> markers and BT (Begin Text) and ET (End Text) and that the uncompressed
> text is enclosed in parentheses (this is my text). Has anyone here done
> this in a simple fashion? I've played with the pyPdf library some, but
> it seems overly complex for my needs (merge PDFs, write PDFs, etc). I
> just want a simple PDF text extractor.

The pdftotext tool may do what you want:

http://www.foolabs.com/xpdf/download.html

Let us know how you get on with it.

David
From: tubby on 25 Jan 2007 16:54

David Boddie wrote:

> The pdftotext tool may do what you want:
>
> http://www.foolabs.com/xpdf/download.html
>
> Let us know how you get on with it.

I have used this tool. However, I need to be able to read PDFs on Windows and Linux, and in the future on Macs. pdftotext works great on Linux, but poorly on Windows (100% sustained CPU usage, etc).

Thank you for the suggestion. I'll keep hammering away at a simple Python solution to this. Over the years, I have come to loathe Adobe's Portable Document Format!
From: tubby on 25 Jan 2007 17:09

David Boddie wrote:

> The pdftotext tool may do what you want:
>
> http://www.foolabs.com/xpdf/download.html
>
> Let us know how you get on with it.
>
> David

Perhaps I'm just using pdftotext wrong? Here's how I was using it:

    import os

    f = filename
    try:
        # "-" as the output name makes pdftotext write the text to stdout
        sout = os.popen('pdftotext "%s" - ' % f)
        data = sout.read().strip()
        print data
        sout.close()
    except Exception, e:
        print e
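
Editor's sketch (not part of the original post): the same call can be made through the `subprocess` module in present-day Python 3. Passing the arguments as a list avoids the manual quoting of the filename, and a timeout stops a runaway process such as the one described on Windows. The function names `build_command` and `pdf_to_text` and the 60-second default are assumptions made for this illustration. The only pdftotext option used is the `-` output name already shown in the thread. The output encoding depends on how pdftotext is built and configured, so decoding it as UTF-8 with replacement characters is also an assumption.

```python
import shutil
import subprocess


def build_command(pdf_path, executable="pdftotext"):
    """Return the argument list for 'pdftotext <file> -' (text to stdout)."""
    return [executable, pdf_path, "-"]


def pdf_to_text(pdf_path, timeout=60):
    """Run pdftotext on pdf_path and return the extracted text.

    Raises FileNotFoundError if pdftotext is not on PATH,
    subprocess.TimeoutExpired if it runs longer than `timeout` seconds,
    and subprocess.CalledProcessError if it exits with an error status.
    """
    executable = shutil.which("pdftotext")
    if executable is None:
        raise FileNotFoundError("pdftotext was not found on PATH")
    result = subprocess.run(
        build_command(pdf_path, executable),
        capture_output=True,  # collect stdout and stderr as bytes
        timeout=timeout,      # the child is killed if this expires
        check=True,           # a non-zero exit status raises an exception
    )
    return result.stdout.decode("utf-8", errors="replace").strip()


if __name__ == "__main__":
    print(pdf_to_text("test.pdf"))
```

This does not make pdftotext itself faster on any platform. It only makes a hang visible as an exception instead of a stalled script.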