From: Peng Yu on 4 Mar 2010 18:57

I can't find a general PDF library in Python that can do arbitrary operations on PDFs.

I want to automatically highlight certain words (selected with a regex) in a PDF. Could somebody let me know if there is a tool to do this in Python?
From: Aahz on 16 Mar 2010 19:47

Did you Google at all? "python pdf" finds this as the first link, though I have no clue whether it does what you want:

http://pybrary.net/pyPdf/

-- Aahz (aahz(a)pythoncraft.com)
From: Patrick Maupin on 17 Mar 2010 00:12

On Mar 4, 6:57 pm, Peng Yu <pengyu...(a)gmail.com> wrote:
> I want to automatically highlight certain words (using regex) in a
> pdf. Could somebody let me know if there is a tool to do so in python?

The problem with PDFs is that they can be quite complicated. There is the outer container structure, which isn't too bad (unless the document author applied encryption or fancy multi-object compression), but then inside the graphics elements, text could be stored as regular ASCII, or as fancy indexes into font-specific tables. Not rocket science, but the only industrial-strength solution for this is probably reportlab's pagecatcher.

I have a library called pdfrw which works (primarily with the outer container) for reading and writing. I also maintain a list of other PDF tools at http://code.google.com/p/pdfrw/wiki/OtherLibraries

It may be that pdfminer (link on that page) will do what you want; it is certainly trying to be complete as a PDF reader. But I've never personally used pdfminer.

One of my pdfrw examples at http://code.google.com/p/pdfrw/wiki/ExampleTools will read in preexisting PDFs and write them out to a reportlab canvas. This works quite well on a few very simple ASCII PDFs, but the font handling needs a lot of work and probably won't work at all right now on Unicode. (But if you wanted to improve it, I certainly would accept patches or give you commit rights!)

That pdfrw example does graphics reasonably well. I was actually going down that path for getting better vector graphics into rst2pdf (both uniconvertor and svglib were broken for my purposes), but then I realized that the PDF spec allows you to include a page from another PDF quite easily (the spec calls it a form XObject), so you don't actually need to parse down into the graphics stream for that.

So, right now, the best way to do vector graphics with rst2pdf is either to give it a preexisting PDF (which it passes off to pdfrw for conversion into a form XObject), or to give it a .svg file and invoke it with -e inkscape, in which case it uses inkscape to convert the SVG to a PDF and then goes through the same path.

HTH,
Pat
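To make the "regular ASCII versus font-specific indexes" point concrete, here is a minimal sketch in Python. It is not taken from pdfrw or any other library mentioned in this thread. SAMPLE_STREAM is a hand-written, hypothetical fragment of an already-decompressed page content stream, and shown_strings and regex_hits are illustrative helper names. The sketch only recognizes simple Tj operators and ignores TJ arrays, escape sequences, and font encodings, so treat it as a picture of the problem rather than a parser.

```python
import re

# Hypothetical, hand-written fragment of an uncompressed page content stream.
# The first Tj shows a literal (ASCII) string; the second shows a hex string
# whose codes only make sense relative to the selected font.
SAMPLE_STREAM = b"""
BT
/F1 12 Tf
72 720 Td
(Hello regex world) Tj
0 -14 Td
<00480065006C006C006F> Tj
ET
"""

# Literal strings look like "(text) Tj"; hex strings look like "<hex> Tj".
LITERAL_TJ = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")
HEX_TJ = re.compile(rb"<([0-9A-Fa-f\s]*)>\s*Tj")


def shown_strings(stream):
    """Return (kind, payload) pairs for each simple Tj operator, in stream order."""
    found = []
    for m in LITERAL_TJ.finditer(stream):
        found.append((m.start(), "literal", m.group(1).decode("latin-1")))
    for m in HEX_TJ.finditer(stream):
        found.append((m.start(), "hex", m.group(1).decode("ascii")))
    return [(kind, payload) for _, kind, payload in sorted(found)]


def regex_hits(stream, pattern):
    """Apply a regex to the literal strings only; hex strings are skipped
    because their meaning depends on the font tables."""
    hits = []
    for kind, payload in shown_strings(stream):
        if kind == "literal":
            hits.extend(re.findall(pattern, payload))
    return hits


if __name__ == "__main__":
    print(shown_strings(SAMPLE_STREAM))
    print(regex_hits(SAMPLE_STREAM, r"reg\w+"))
```

A regex can be run directly against the literal string, but the hex string would first need to be mapped back to characters through the font, which is the part that takes real work in a general tool.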
From: Peng Yu on 17 Mar 2010 10:53

Thank you for your long reply! But I'm not sure whether you got my question.

Acrobat can highlight certain words in PDFs, and I can add notes to the highlighted words as well. However, I find that I frequently end up highlighting words that could be described by a regular expression.

To improve my productivity, I don't want to do this manually in Acrobat but rather do it automatically, if such a tool is available. People on the reportlab mailing list said this is not possible with reportlab, and I don't see that PyPDF can do this. If you know of an API for this purpose, please let me know. Thank you!

Regards,
Peng
From: Patrick Maupin on 17 Mar 2010 11:11
On Wed, Mar 17, 2010 at 9:53 AM, Peng Yu <pengyu.ut(a)gmail.com> wrote:
> If you know there is an API to for this purpose, please let me know.

I do not know of any API specific to this purpose, no. But I mentioned three libraries (pagecatcher, pdfminer, and pdfrw) that are capable, to a greater or lesser extent, of reading in PDFs and giving you the data from them, which you can then modify and write back out.

I would imagine this would be a piece of cake with pagecatcher. (I noticed you just posted on the reportlab mailing list, but you did not specifically mention pagecatcher.) It will probably take more work with either of the other two.

It is probable that none of them does exactly what you want, but also that any of them is a better starting point than coding what you want from scratch.

Regards,
Pat
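As a rough illustration of the "read the data, modify it, write it back out" idea, the sketch below shows only the middle step. It assumes some PDF reader has already produced each word with its bounding box in PDF points; the WordBox record and the highlight_annotations function are hypothetical names, not part of pagecatcher, pdfminer, or pdfrw. The dictionary keys follow the highlight annotation described in the PDF specification, but the QuadPoints corner order used here (top-left, top-right, bottom-left, bottom-right) is an assumption about what viewers expect and should be checked against your viewer.

```python
import re
from typing import NamedTuple


class WordBox(NamedTuple):
    """Hypothetical record: one word and its bounding box in PDF points."""
    text: str
    x0: float  # left
    y0: float  # bottom
    x1: float  # right
    y1: float  # top


def highlight_annotations(words, pattern):
    """Build one highlight-annotation dictionary per word matching the regex.

    The result is plain Python data; turning it into real PDF objects and
    attaching it to a page's /Annots array is left to whichever library
    you use to write the file.
    """
    regex = re.compile(pattern)
    annots = []
    for w in words:
        if regex.search(w.text) is None:
            continue
        annots.append({
            "/Type": "/Annot",
            "/Subtype": "/Highlight",
            "/Rect": [w.x0, w.y0, w.x1, w.y1],
            # Assumed corner order: top-left, top-right, bottom-left, bottom-right.
            "/QuadPoints": [w.x0, w.y1, w.x1, w.y1, w.x0, w.y0, w.x1, w.y0],
            # /Contents carries the note text shown for the annotation.
            "/Contents": w.text,
        })
    return annots


if __name__ == "__main__":
    sample = [
        WordBox("pdfrw", 72, 700, 100, 712),
        WordBox("library", 104, 700, 140, 712),
    ]
    for annot in highlight_annotations(sample, r"^pdf"):
        print(annot)
```

The hard parts remain on either side of this step: recovering words and positions from the content stream, and writing the annotations back into the file.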