From: rlevesque on 24 Jul 2010 10:59

Hi

I am working on a program that generates various pdf files in the /results folder.

"scenario1.pdf" results from scenario1
"scenario2.pdf" results from scenario2
etc

Once I am happy with scenario1.pdf and scenario2.pdf files, I would like to save them in the /check folder.

Now after having developed/modified the program to produce scenario3.pdf, I would like to be able to re-generate files
/results/scenario1.pdf
/results/scenario2.pdf

and compare them with
/check/scenario1.pdf
/check/scenario2.pdf

I tried using the md5 module to compare these files but md5 reports differences even though the code has *not* changed at all.

Is there a way to compare 2 pdf files generated at different time but identical in every other respect and validate by program that the files are identical (for all practical purposes)?
From: Peter Chant on 24 Jul 2010 11:38

rlevesque wrote:

> Is there a way to compare 2 pdf files generated at different time but
> identical in every other respect and validate by program that the
> files are identical (for all practical purposes)?

I wonder, do the PDFs have a timestamp within them from when they are created? That would ruin your MD5 plan.

Pete

--
http://www.petezilla.co.uk
From: Peter Otten on 24 Jul 2010 11:50

rlevesque wrote:

> I tried using the md5 module to compare these files but md5 reports
> differences even though the code has *not* changed at all.
>
> Is there a way to compare 2 pdf files generated at different time but
> identical in every other respect and validate by program that the
> files are identical (for all practical purposes)?

Here's a naive approach, but it may be good enough for your purpose. I've printed the same small text into 1.pdf and 2.pdf

(Bad practice warning: this session is slightly doctored; I hope I haven't introduced an error)

>>> a = open("1.pdf").read()
>>> b = open("2.pdf").read()
>>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
>>> len(diff)
2
>>> diff
[160, 161]
>>> a[150:170]
'0100724151412)\n>>\nen'
>>> a[140:170]
'nDate (D:20100724151412)\n>>\nen'
>>> a[130:170]
')\n/CreationDate (D:20100724151412)\n>>\nen'

OK, let's ignore "lines" starting with "/CreationDate " for our custom comparison function:

>>> from itertools import izip_longest
>>> def equal_pdf(fa, fb):
...     with open(fa) as a:
...         with open(fb) as b:
...             for la, lb in izip_longest(a, b, fillvalue=""):
...                 if la != lb:
...                     if not la.startswith("/CreationDate "): return False
...                     if not lb.startswith("/CreationDate "): return False
...             return True
...
>>> equal_pdf("1.pdf", "2.pdf")
True

Peter
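
For readers on Python 3, the same idea might look like the sketch below. It is an adaptation rather than the code from the session above: it assumes `itertools.zip_longest` (the Python 3 name for `izip_longest`) and opens the files in binary mode, since PDF files can contain bytes that are not valid text. The `ignore_prefix` parameter is a hypothetical addition for illustration.

```python
from itertools import zip_longest


def equal_pdf(path_a, path_b, ignore_prefix=b"/CreationDate "):
    """Compare two files line by line, tolerating differences on lines
    that start with ignore_prefix in BOTH files."""
    # Binary mode: PDFs may contain compressed streams that are not valid text.
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        # fillvalue=b"" makes a shorter file compare unequal to a longer one.
        for line_a, line_b in zip_longest(a, b, fillvalue=b""):
            if line_a == line_b:
                continue
            if not (line_a.startswith(ignore_prefix)
                    and line_b.startswith(ignore_prefix)):
                return False
    return True
```

Called as `equal_pdf("1.pdf", "2.pdf")`, it returns `True` only when every differing pair of lines is a pair of `/CreationDate` lines.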
From: rlevesque on 24 Jul 2010 12:49

On Jul 24, 11:50 am, Peter Otten <__pete...(a)web.de> wrote:

> OK, let's ignore "lines" starting with "/CreationDate " for our custom
> comparison function:
>
> [snip]

Thanks a lot Peter.

Unfortunately there is another pair of values that does not match and it is not obvious to me how to exclude it (as is done with the "/CreationDate" pair). To illustrate the problem, I have modified your code as follows:

from itertools import izip_longest

def equal_pdf(fa, fb):
    idx = 0
    with open(fa) as a:
        with open(fb) as b:
            for la, lb in izip_longest(a, b, fillvalue=""):
                idx += 1
                #print idx
                if la != lb:
                    #if not la.startswith(" /CreationDate"):
                    print "***", idx, la, '\n', lb
                    #return False
    print "Last idx:", idx
    return True

file1 = 'K:/results/Test2.pdf'
file1c = 'K:/check/Test2.pdf'
print equal_pdf(file1, file1c)

I got the following output:

*** 237 /CreationDate (D:20100724123129+05'00')
/CreationDate (D:20100724122802+05'00')
*** 324 [(,\315'\347\003_\253\325\365\265\006\)J\216\252\215) (,\315'\347\003_\253\325\365\265\006\)J\216\252\215)]
[(~s\211VIA\3426}\242XuV2\302\002) (~s\211VIA\3426}\242XuV2\302\002)]
Last idx: 331
True

As you can see, there are 331 pair comparisons and 2 of the comparisons do not match. Your code correctly handles the "/CreationDate" pair but the other one does not have a common element that can be used to handle it. :-(

As additional information in case it matters, the first pair compared equals '%PDF-1.4\n' and the pdf document is created using reportLab.

One hope I have is that item 324, which is near the last item (331), could be part of the 'trailing code' of the pdf file and might not reflect actual differences between the 2 files. In other words, maybe it would be sufficient for me to check all but the last 8 pairs...
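
The "check everything except the trailing part" idea could be sketched as below, in Python 3. This rests on assumptions the thread does not confirm: that mismatch 324 is the file identifier array (`/ID`) that the PDF format allows in the trailer dictionary, and that the trailer begins at a line consisting only of the keyword `trailer`. All function names are hypothetical. Rather than hard-coding "the last 8 pairs", the sketch locates the trailer and tolerates differences only after it, plus `/CreationDate` pairs anywhere.

```python
from itertools import zip_longest

DATE_PREFIX = b"/CreationDate"


def differing_lines(path_a, path_b):
    """Return (line_number, line_a, line_b) for every pair of lines that differ."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        pairs = enumerate(zip_longest(a, b, fillvalue=b""), start=1)
        return [(n, la, lb) for n, (la, lb) in pairs if la != lb]


def trailer_start(path):
    """1-based number of the last line that is exactly 'trailer', or None."""
    last = None
    with open(path, "rb") as f:
        for n, line in enumerate(f, start=1):
            if line.strip() == b"trailer":
                last = n
    return last


def equal_ignoring_volatile(path_a, path_b):
    """True if the files differ only in /CreationDate lines or inside the trailer."""
    start_a, start_b = trailer_start(path_a), trailer_start(path_b)
    if start_a != start_b:
        return False  # trailers begin at different lines: treat as a real change
    for n, la, lb in differing_lines(path_a, path_b):
        both_dates = la.startswith(DATE_PREFIX) and lb.startswith(DATE_PREFIX)
        in_trailer = start_a is not None and n > start_a
        if not (both_dates or in_trailer):
            return False
    return True
```

Ignoring the whole trailer is cruder than ignoring only the identifier line, so a change confined to the trailer would go unnoticed; printing the result of `differing_lines` first shows exactly what is being skipped.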
From: Peter Otten on 24 Jul 2010 13:34
rlevesque wrote:

> Unfortunately there is another pair of values that does not match and
> it is not obvious to me how to exclude it (as is done with the
> "/CreationDate" pair).

> and the pdf document is created using reportLab.

I dug into the reportlab source and in reportlab/rl_config.py found the line

invariant= 0 #produces repeatable,identical PDFs with same timestamp info (for regression testing)

I suggest that you edit that file or add

from reportlab import rl_config
rl_config.invariant = True

to your code.

Peter
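
If the invariant setting does make repeated runs byte-identical, as the comment in rl_config.py suggests, the original md5 plan should become workable again. A minimal Python 3 sketch of that check follows, assuming both the reference files and the regenerated files were produced with `rl_config.invariant` enabled. The function names and the directory layout are hypothetical, modelled on the /results and /check folders from the first post.

```python
import hashlib
from pathlib import Path


def md5_of(path):
    """Hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(Path(path).read_bytes()).hexdigest()


def changed_files(results_dir, check_dir):
    """Names of reference PDFs in check_dir whose regenerated counterpart
    in results_dir is missing or has a different digest."""
    changed = []
    for reference in sorted(Path(check_dir).glob("*.pdf")):
        candidate = Path(results_dir) / reference.name
        if not candidate.is_file() or md5_of(candidate) != md5_of(reference):
            changed.append(reference.name)
    return changed
```

An empty list from `changed_files("results", "check")` would mean every approved scenario was regenerated unchanged; any name in the list points at a scenario to inspect by hand.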