Question about Algorithm::Diff [Perl]

Prev: FAQ 9.19 How do I return the user's mail address?
Next: FAQ 9.21 How do I use MIME to make an attachment to a mail message?

From: Dilbert on 12 Apr 2010 16:24

I have a question about the Perl Module Algorithm::Diff.

I want to use Algorithm::Diff to calculate the diff of the two
sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
is a length of 2.

Theoretically there are 3 solutions with LCS = 2:

Solution 1:
---ab
a1bab

Solution 2:
a-b--
a1bab

Solution 3:
a---b
a1bab

I understand that any of those 3 solutions could be returned by
Algorithm::Diff, but I would argue that solution 1 is "better" than
solution 2 or 3, because solution 1 changes only once between '-' and
[ab], whereas solution 2 and 3 change more than once between '-' and
[ab].

This is my Perl program:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'a1bab';
my $d = Algorithm::Diff::sdiff(\@old, \@new);
my $line = Dumper($d);
$line =~ s{\s}''xmsg;
say $line;

The output is
$VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],
['u','b','b']];

This is in fact Solution 3:
a---b
a1bab

How can I teach Algorithm::Diff to choose Solution 1 (the best of the
3 possibilities) ?

From: Kyle T. Jones on 13 Apr 2010 03:01

Dilbert wrote:
> I have a question about the Perl Module Algorithm::Diff.
>
> I want to use Algorithm::Diff to calculate the diff of the two
> sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
> is a length of 2.
>
> Theoretically there are 3 solutions with LCS = 2:
>
> Solution 1:
> ---ab
> a1bab
>
> Solution 2:
> a-b--
> a1bab
>
> Solution 3:
> a---b
> a1bab
>

There is only one LCS solution - 'ab'. You're making too much of it and
you don't get to say there are 3 solutions with LCS = 2 (as if you were
assigning 2, and LCS=3 or LCS=1 were possible with the data you
provide), there is just *the* LCS solution, which, in this case, happens
to be 'ab'.

> I understand that any of those 3 solutions could be returned by
> Algorithm::Diff, but I would argue that solution 1 is "better" than
> solution 2 or 3, because solution 1 changes only once between '-' and
> [ab], whereas solution 2 and 3 change more than once between '-' and
> [ab].
>
> This is my Perl program:
>
> use strict;
> use warnings;
> use 5.010;
> use Algorithm::Diff;
> use Data::Dumper;
> my @old = split //, 'ab';
> my @new = split //, 'a1bab';
> my $d = Algorithm::Diff::sdiff(\@old, \@new);
> my $line = Dumper($d);
> $line =~ s{\s}''xmsg;
> say $line;
>
> The output is
> $VAR1=[['u','a','a'],['+','','1'],['+','','b'],['+','','a'],
> ['u','b','b']];
>
> This is in fact Solution 3:
> a---b
> a1bab
>

No, this is the smallest set of instructions for turning the first
sequence into the second sequence (same as diff), with what some find to
be an easier to read "side by side" format:

(unchanged, was 'a', stays 'a')
(add, was nothing(''), adds '1')
(add, was nothing(''), adds 'b')
(add, was nothing(''), adds 'a')
(unchanged, was 'b', stays 'b')

The same set of instructions in diff would just be:

[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]

> How can I teach Algorithm::Diff to choose Solution 1 (the best of the
> 3 possibilities) ?

That question doesn't make a lot of sense (to me).

http://search.cpan.org/dist/Algorithm-Diff/lib/Algorithm/Diff.pm

Cheers.

From: Kyle T. Jones on 13 Apr 2010 03:17

Dilbert wrote:

I may have misunderstood what you were asking a second ago... let me try
again (including diff for each of your three solutions):

> I have a question about the Perl Module Algorithm::Diff.
>
> I want to use Algorithm::Diff to calculate the diff of the two
> sequences "ab" and "a1bab", the "LCS" (the "Longest Common Sequence")
> is a length of 2.
>
> Theoretically there are 3 solutions with LCS = 2:
>
> Solution 1:
> ---ab
> a1bab
>

[
['+', 0, 'a'],
['+', 1, '1'],
['+', 2, 'b'],
]

> Solution 2:
> a-b--
> a1bab
>

[
['+', 1, '1'],
['+', 3, 'a'],
['+', 4, 'b'],
]

> Solution 3:
> a---b
> a1bab
>

[
['+', 1, '1'],
['+', 2, 'b'],
['+', 3, 'a'],
]

>
> How can I teach Algorithm::Diff to choose Solution 1 (the best of the
> 3 possibilities) ?

Why are any of the three better? Any way you crack this nut, the
smallest sequence of operations to turn ab into a1bab involves three
adds... so I'm confused as to why you pick one as "best"?

Cheers.

From: Dilbert on 13 Apr 2010 05:58

On 13 avr, 09:17, "Kyle T. Jones" <KBf...(a)realdomain.net> wrote:
> Dilbert wrote:
> > How can I teach Algorithm::Diff to choose Solution 1 (the best of the
> > 3 possibilities) ?
>
> Why are any of the three better? Any way you crack this nut, the
> smallest sequence of operations to turn ab into a1bab involves three
> adds... so I'm confused as to why you pick one as "best"?

Let me try a more realistic example:

use strict;
use warnings;
use 5.010;
use Algorithm::Diff;
use Data::Dumper;
my @old = split //, 'ab';
my @new = split //, 'Show manual contribution: ab :This word can be
displayed';
my $d = Algorithm::Diff::sdiff(\@old, \@new);
my $line = Dumper($d);
$line =~ s{\s}''xmsg;
say $line;

The output alignment (#al1) is:

-a-----------b---------------------------
manual contribution: ab :can be displayed

But what I want is (#al2):

---------------------ab------------------
manual contribution: ab :can be displayed

Why ?

Because, in case I have more than one optimal solution (and this is
the case here), I want to consider not only the value of a character
('a' eq 'a' and 'b' eq 'b'), but also the context each character is
in.

the 'a' in @old has context: {left => 'start', right => 'b'}
the 'b' in @old has context: {left => 'a', right => 'end'}

the 'a' in @new (#al1) context: {left => 'm', right => 'n'}
the 'b' in @new (#al1) context: {left => 'i', right => 'u'}

the 'a' in @new (#al2) context: {left => ' ', right => 'b'}
the 'b' in @new (#al2) context: {left => 'a', right => ' '}

It can easily be seen that @old and @new(#al1) have no context in
common

Whereas @old and @new(#al2) have two context matches, that is
@old('a')->right eq @new(#al2)'a'->right
@old('b')->left eq @new(#al2)'a'->left

That's why I prefer @new (#al2) over @new (#al1)

From: Dilbert on 13 Apr 2010 06:08

On 13 avr, 11:58, Dilbert <dilbert1...(a)gmail.com> wrote:
> my @new = split //, 'Show manual contribution: ab :This word can be
> displayed';

That should be...

my @new = split //, 'manual contribution: ab :can be displayed';

> Whereas @old and @new(#al2) have two context matches, that is
> @old('a')->right eq @new(#al2)'a'->right
> @old('b')->left eq @new(#al2)'a'->left

in the last line, replace
"...@new(#al2)'a'->left..." by
"...@new(#al2)'b'->left..."

| Next | Last
Pages: 1 2
Prev: FAQ 9.19 How do I return the user's mail address?
Next: FAQ 9.21 How do I use MIME to make an attachment to a mail message?