From: Doug Robbins - Word MVP on
It is probably several orders of magnitude more complex than you may have
first thought.

I am wondering, however, if some of the purpose-built plagiarism detection
software might not be the way to go.

--
Hope this helps,

Doug Robbins - Word MVP

Please reply only to the newsgroups unless you wish to obtain my services on
a paid professional basis.

"David Turner" <DavidTurner(a)discussions.microsoft.com> wrote in message
news:2A1C80E7-9C34-49A7-A13E-B1ECB60C46FA(a)microsoft.com...
>
>
> "Karl E. Peterson" wrote:
>
>> I'd suggest you start by clearly enumerating your actual requirements.
>> Is this a character comparison or a word comparison? Do the matches
>> need to start at the same position, or may they be anywhere within the
>> string. Etc, etc, etc...
>>
>> If you can't define the problem, you'll never solve it.
>
> Sorry. Word comparison, the matches can start anywhere in the strings and
> the strings could be of variable length, possibly ending in punctuation:
> two sentences, for example.
> I now realise it's a much more complex problem than I first thought.
>

From: Karl E. Peterson on
Doug Robbins - Word MVP wrote:
> It is probably several orders of magnitude more complex than you may have
> first thought.

It's sounding fairly simple, actually, as brute force tasks go. And,
likewise, as brute force tasks go, it's going to take exponentially
longer to execute as the string size(s) grow. It's not unlike cracking
the combination to a lock without knowing how many numbers are in the
combination, although we are rewarded with partial successes along the way,
which is what actually makes it possible.

You'd "just" start comparing the first character to every character in
the other string, noting if you have a match of one. If you don't find
the first char, you scan the second string for the 2nd char in the
first string, and so on. Once you have a match of one, you start
looking for a match of two. You'd scan the second string for the first
two chars in the first. If they're not there, you'd start scanning the
second string for the 2nd and 3rd chars in the first, and so on.
Simple, huh? <g> Assuming you know the rules. Must the case match?
Etc, etc, etc...
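
In code, the shape of that loop might be something like this -- just a rough
Python sketch of the brute-force idea, assuming a plain case-sensitive
character comparison (a real Word macro would of course do this in VBA
against the document text, and the function name is mine):

def longest_common_run(a, b):
    # Return the longest run of characters in a that also occurs in b.
    best = ""
    # Grow the candidate length one character at a time, exactly as
    # described above: first look for matches of one, then two, etc.
    length = 1
    while length <= len(a):
        found = None
        # Slide a window of the current length across the first string...
        for start in range(len(a) - length + 1):
            candidate = a[start:start + length]
            # ...and scan the second string for that exact run.
            if candidate in b:
                found = candidate
                break
        if found is None:
            break      # nothing matched at this length, so nothing longer will
        best = found   # remember the best so far and try one character longer
        length += 1
    return best

print(longest_common_run("the quick brown fox", "a quick brown cat"))
# prints " quick brown "

Every pass re-scans the second string from the start, which is where the
time goes as the strings grow.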

> I am wondering, however, if some of the purpose-built plagiarism detection
> software might not be the way to go.

That's not a bad thought.

--
.NET: It's About Trust!
http://vfred.mvps.org


From: Peter Jamieson on
In the simple case where you match character by character, white space
has to match exactly, etc., you're probably better off starting by
comparing the endpoints of the longest possible match, then reducing the
comparison length. If the endpoints don't match, the contents can't
match, so you do not actually need to compare them.

Can't prove it though. But I would have thought there was good coverage
of this kind of algorithm in standard works such as Knuth's Sorting and
Searching volume. I don't know where you find that stuff online.
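
As a rough Python sketch of that pruning idea (the name is mine, and I
haven't tried to tune it): work downwards from the longest possible match,
and only do the full comparison once the two endpoint characters agree.

def longest_common_run(a, b):
    # Try the longest possible match first, then shorter ones.
    for length in range(min(len(a), len(b)), 0, -1):
        for i in range(len(a) - length + 1):
            for j in range(len(b) - length + 1):
                # Cheap endpoint test first: if either end differs,
                # the two runs cannot be equal, so skip the full compare.
                if a[i] != b[j] or a[i + length - 1] != b[j + length - 1]:
                    continue
                if a[i:i + length] == b[j:j + length]:
                    return a[i:i + length]
    return ""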


Peter Jamieson

http://tips.pjmsn.me.uk

On 03/02/2010 22:33, Karl E. Peterson wrote:
> Doug Robbins - Word MVP wrote:
>> It is probably several orders of magnitude more complex than you may
>> have first thought.
>
> It's sounding fairly simple, actually, as brute force tasks go. And,
> likewise, as brute force tasks go, it's going to take exponentially
> longer to execute as the string size(s) grow. It's not unlike cracking
> the combination to a lock without knowing how many numbers are in the
> combination, although we are rewarded with partial successes along the way,
> which is what actually makes it possible.
>
> You'd "just" start comparing the first character to every character in
> the other string, noting if you have a match of one. If you don't find
> the first char, you scan the second string for the 2nd char in the first
> string, and so on. Once you have a match of one, you start looking for a
> match of two. You'd scan the second string for the first two chars in
> the first. If they're not there, you'd start scanning the second string
> for the 2nd and 3rd chars in the first, and so on. Simple, huh? <g>
> Assuming you know the rules. Must the case match? Etc, etc, etc...
>
>> I am wondering, however, if some of the purpose-built plagiarism
>> detection software might not be the way to go.
>
> That's not a bad thought.
>
From: Karl E. Peterson on
Peter Jamieson wrote:
> In the simple case where you match character by character, white space has to
> match exactly, etc. you're probably better off starting by comparing the
> endpoints of the longest possible match, then reducing the comparison length.
> If the endpoints don't match, the contents can't match so you do not actually
> need to compare them.

Yeah, that's a potentially good option, too. A bit more complex, but
that happens when you start trying to finesse brute force attacks. :-)

> Can't prove it though. But I would have thought there was good coverage of
> this kind of algorithm in standard works such as Knuth's Sorting and
> Searching volume. I don't know where you find that stuff online.

You'd think, yeah. But there aren't many folks who still think in 2N
versus N^2 terms. Most just throw a framework at it, and see what
sticks.

--
.NET: It's About Trust!
http://vfred.mvps.org


From: Peter Jamieson on
> You'd think, yeah. But there aren't many folks who still think in 2N
> versus N^2 terms. Most just throw a framework at it, and see what
> sticks.

Yeah. For the "build it from the ground up" types (which unfortunately
probably includes me),

http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml

could be handy. I got there via

http://www.itl.nist.gov/div897/sqg/dads/
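
For what it's worth, the dynamic-programming approach described on the
first of those pages boils down to something like this (again just a
Python sketch to show the table-filling idea, not a polished routine):

def longest_common_substring(a, b):
    # prev[j] / curr[j] hold the length of the common run ending at
    # a[i - 1] and b[j - 1]; only two rows of the table are kept at once.
    prev = [0] * (len(b) + 1)
    best_len = 0
    best_end = 0   # position in a just past the end of the best run
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]

That is on the order of len(a) * len(b) comparisons in total, rather than
re-scanning for every candidate length separately.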

Peter Jamieson

http://tips.pjmsn.me.uk

On 04/02/2010 01:09, Karl E. Peterson wrote:
> Peter Jamieson wrote:
>> In the simple case where you match character by character, white space
>> has to match exactly, etc. you're probably better off starting by
>> comparing the endpoints of the longest possible match, then reducing
>> the comparison length. If the endpoints don't match, the contents
>> can't match so you do not actually need to compare them.
>
> Yeah, that's a potentially good option, too. A bit more complex, but
> that happens when you start trying to finesse brute force attacks. :-)
>
>> Can't prove it though. But I would have thought there was good
>> coverage of this kind of algorithm in standard works such as Knuth's
>> Sorting and Searching volume. I don't know where you find that stuff
>> online.
>
> You'd think, yeah. But there aren't many folks who still think in 2N
> versus N^2 terms. Most just throw a framework at it, and see what sticks.
>