From: Lie Ryan on 5 Jun 2010 03:53 On 06/05/10 15:43, GZ wrote: > On Jun 4, 8:37 pm, Lie Ryan <lie.1...(a)gmail.com> wrote: >> On06/05/10 07:51, GZ wrote: >>> No, rsync does not solve my problem. >> >>> I want a library that does unix 'diff' like function, i.e. compare two >>> strings line by line and output the difference. Python's difflib does >>> not work perfectly for me, because the resulting differences are >>> pretty big. I would like an algorithm that generates the smallest >>> differences. >> >> is n=0 not short enough? >> >> pprint.pprint(list(difflib.context_diff(s, t, n=0))) > > This still does not do what I want it to do. It only displays the diff > results in a different format. I want a different algorithm to > generate a smaller diff -- in other words less differences No, I meant I was confirming that you already have turned off context lines (i.e. the n=0 part), right? Also, what's the nature of the changes? You might be able to minimize difflib's output by using word-based or character-based diff-ing instead of traditional line-based diff-ing. diff output is fairly compressable, so you might want to look at zipping the output.
From: Steven D'Aprano on 5 Jun 2010 04:04 On Fri, 04 Jun 2010 22:43:48 -0700, GZ wrote: > This still does not do what I want it to do. It only displays the diff > results in a different format. I want a different algorithm to generate > a smaller diff -- in other words less differences Can you give a *short* example, showing the output from Python's diff and what you would prefer instead? -- Steven
From: GZ on 5 Jun 2010 21:32 Hi Lie, On Jun 5, 2:53 am, Lie Ryan <lie.1...(a)gmail.com> wrote: > On 06/05/10 15:43, GZ wrote: > > > > > > > On Jun 4, 8:37 pm, Lie Ryan <lie.1...(a)gmail.com> wrote: > >> On06/05/10 07:51, GZ wrote: > >>> No, rsync does not solve my problem. > > >>> I want a library that does unix 'diff' like function, i.e. compare two > >>> strings line by line and output the difference. Python's difflib does > >>> not work perfectly for me, because the resulting differences are > >>> pretty big. I would like an algorithm that generates the smallest > >>> differences. > > >> is n=0 not short enough? > > >> pprint.pprint(list(difflib.context_diff(s, t, n=0))) > > > This still does not do what I want it to do. It only displays the diff > > results in a different format. I want a different algorithm to > > generate a smaller diff -- in other words less differences > > No, I meant I was confirming that you already have turned off context > lines (i.e. the n=0 part), right? > > Also, what's the nature of the changes? You might be able to minimize > difflib's output by using word-based or character-based diff-ing instead > of traditional line-based diff-ing. > > diff output is fairly compressable, so you might want to look at zipping > the output.- Hide quoted text - > > - Show quoted text - Thanks for your response. The verboseness of the format is not really my problem and I only care about line by line comparison for now. Let me think of a better way to express what I mean by a "smaller diff." After I diff the two strings, I will have something like this: AAA - BBB + CCC + DDD - EEE It means the first line does not change, the second line is replaced by the third line, the forth line is new, and the fifth line is deleted. I define the "smallness" of the diff algorithm as "the sum of the total number of minuses and pluses". In my above example, it is 4 (two minuses and 2 pluses). Note that no matter what format we use to represent the diff, this number is the same. Python's difflib does not really minimize this number. It tries to make this number small, but also tries to yield matches that look right to people at the cost of increasing this number. (http:// docs.python.org/library/difflib.html). What I am looking for is an algo that can really minimize this number.
From: Ben Finney on 5 Jun 2010 21:42 GZ <zyzhu2000(a)gmail.com> writes: > Let me think of a better way to express what I mean by a "smaller > diff." After I diff the two strings, I will have something like this: > > AAA > - BBB > + CCC > + DDD > - EEE > > It means the first line does not change, the second line is replaced > by the third line, the forth line is new, and the fifth line is > deleted. Are you drawing a distinction between: * “line FOO is replaced by line BAR” * “line FOO is deleted, line BAR is added” Your wording seems to make that distinction, but I don't see how it's useful or meaningful in a discussion about diff. Are they not exactly the same? -- \ “Injustice is relatively easy to bear; what stings is justice.” | `\ —Henry L. Mencken | _o__) | Ben Finney
From: Dave Angel on 5 Jun 2010 22:09
GZ wrote: > <snip> >>>>> I want a library that does unix 'diff' like function, i.e. compare two >>>>> strings line by line and output the difference. Python's difflib does >>>>> not work perfectly for me, because the resulting differences are >>>>> pretty big. I would like an algorithm that generates the smallest >>>>> differences. >>>>> >> <snip> > Thanks for your response. > > The verboseness of the format is not really my problem and I only care > about line by line comparison for now. > > Let me think of a better way to express what I mean by a "smaller > diff." After I diff the two strings, I will have something like this: > > AAA > - BBB > + CCC > + DDD > - EEE > > It means the first line does not change, the second line is replaced > by the third line, the forth line is new, and the fifth line is > deleted. > > I define the "smallness" of the diff algorithm as "the sum of the > total number of minuses and pluses". In my above example, it is 4 (two > minuses and 2 pluses). Note that no matter what format we use to > represent the diff, this number is the same. > > Python's difflib does not really minimize this number. It tries to > make this number small, but also tries to yield matches that �look > right� to people at the cost of increasing this number. (http:// > docs.python.org/library/difflib.html). > > What I am looking for is an algo that can really minimize this number. > > The - lines aren't needed, any more than the context lines are needed, so that will typically cut your results in half. But perhaps the real algorithm you're describing is one that permits lines to be moved as well as inserted and deleted. If you then consider that information to be free (it's not part of the measure you're specifying), you may do best with the following algorithm: Scan the two files looking for lines that appear exactly once in each file. Consider those lines the first invariants, and everything else a potential change. Now, starting with each pair (which I call a bridge between islands), look at the line previous and the line following, and if either or both are also matched, expand the two islands to include them. As an island grows to butt up against another island, merge them. Now, each bridge is a "move" operation, and every line left over in either file is either a plus or a minus, an insert or a delete. For most editing transactions, there will be relatively few of these. For example, if three functions were reordered, from A, B, C, to A, C, B, there would be no plus or minus lines at all. DaveA |