From: David Boddie on 17 Mar 2010 18:40
On Wednesday 17 March 2010 00:47, Aahz wrote:

> In article
> <af0830ae-1d24-4db9-b721-d6602fedd540(a)15g2000yqi.googlegroups.com>,
> Peng Yu <pengyu.ut(a)gmail.com> wrote:
>>
>>I don't find a general pdf library in python that can do any
>>operations on pdfs.
>>
>>I want to automatically highlight certain words (using regex) in a
>>pdf. Could somebody let me know if there is a tool to do so in python?
>
> Did you Google at all? "python pdf" finds this as the first link, though
> I have no clue whether it does what you want:
>
> http://pybrary.net/pyPdf/

The original poster might also be interested in displaying the highlighted
words without modifying the original file. In which case, the Poppler library
is worth investigating:

http://poppler.freedesktop.org/

David

From: TP on 18 Mar 2010 15:36
On Wed, Mar 17, 2010 at 7:53 AM, Peng Yu <pengyu.ut(a)gmail.com> wrote:
> On Tue, Mar 16, 2010 at 11:12 PM, Patrick Maupin <pmaupin(a)gmail.com> wrote:
>> On Mar 4, 6:57 pm, Peng Yu <pengyu...(a)gmail.com> wrote:
>>> I don't find a general pdf library in python that can do any
>>> operations on pdfs.
>>>
>>> I want to automatically highlight certain words (using regex) in a
>>> pdf. Could somebody let me know if there is a tool to do so in python?
>>
>> The problem with PDFs is that they can be quite complicated. There is
>> the outer container structure, which isn't too bad (unless the
>> document author applied encryption or fancy multi-object compression),
>> but then inside the graphics elements, things could be stored as
>> regular ASCII, or as fancy indexes into font-specific tables. Not
>> rocket science, but the only industrial-strength solution for this is
>> probably reportlab's pagecatcher.
>>
>> I have a library which works (primarily with the outer container) for
>> reading and writing, called pdfrw. I also maintain a list of other
>> PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries It
>> may be that pdfminer (link on that page) will do what you want -- it
>> is certainly trying to be complete as a PDF reader. But I've never
>> personally used pdfminer.
>>
>> One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools
>> will read in preexisting PDFs and write them out to a reportlab
>> canvas. This works quite well on a few very simple ASCII PDFs, but
>> the font handling needs a lot of work and probably won't work at all
>> right now on unicode. (But if you wanted to improve it, I certainly
>> would accept patches or give you commit rights!)
>>
>> That pdfrw example does graphics reasonably well. I was actually
>> going down that path for getting better vector graphics into rst2pdf
>> (both uniconvertor and svglib were broken for my purposes), but then I
>> realized that the PDF spec allows you to include a page from another
>> PDF quite easily (the spec calls it a form xObject), so you don't
>> actually need to parse down into the graphics stream for that. So,
>> right now, the best way to do vector graphics with rst2pdf is either
>> to give it a preexisting PDF (which it passes off to pdfrw for
>> conversion into a form xObject), or to give it a .svg file and invoke
>> it with -e inkscape, and then it will use inkscape to convert the svg
>> to a pdf and then go through the same path.
>
> Thank you for your long reply! But I'm not sure if you get my question or not.
>
> Acrobat can highlight certain words in pdfs. I could add notes to the
> highlighted words as well. However, I find that I frequently end up
> with highlighting some words that can be expressed by a regular
> expression.
>
> To improve my productivity, I don't want to do this manually in Acrobat
> but rather do it in an automatic way, if there is such a tool
> available. People in reportlab mailing list said this is not possible
> with reportlab. And I don't see PyPDF can do this. If you know there
> is an API for this purpose, please let me know. Thank you!
>
> Regards,
> Peng
> --
> http://mail.python.org/mailman/listinfo/python-list
>

Take a look at the Acrobat SDK
(http://www.adobe.com/devnet/acrobat/?view=downloads). In particular, see
the Acrobat Interapplication Communication information at
http://www.adobe.com/devnet/acrobat/interapplication_communication.html.

"Spell-checking a document" shows how to spell check a PDF using Visual
Basic at
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.17.html

"Working with annotations" shows how to add an annotation with Visual
Basic at
http://livedocs.adobe.com/acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Acrobat9_HTMLHelp&file=IAC_DevApp_OLE_Support.100.16.html

Presumably combining the two examples with Python's win32com should allow
you to do what you want.