Prev: coverting iWord.pages file
Next: Word.ActiveDocument.AttachedTemplate.Saved = True - gives "Type Mismatch" error
From: Doug Robbins - Word MVP on 3 Feb 2010 17:20 It is probably several orders of magnitude more complex than you may have first though. I am wondering however is some of the purpose-built plagiarism detection software might not be the way to go. -- Hope this helps, Doug Robbins - Word MVP Please reply only to the newsgroups unless you wish to obtain my services on a paid professional basis. "David Turner" <DavidTurner(a)discussions.microsoft.com> wrote in message news:2A1C80E7-9C34-49A7-A13E-B1ECB60C46FA(a)microsoft.com... > > > "Karl E. Peterson" wrote: > >> I'd suggest you start by clearly enumerating your actual requirements. >> Is this a character comparison or a word comparison? Do the matches >> need to start at the same position, or may they be anywhere within the >> string. Etc, etc, etc... >> >> If you can't define the problem, you'll never solve it. > > Sorry. Word comparison, the matches can start anywhere in the strings and > the strings could be of variable length, possibly ending in punctuation: > two > sentences for example. > I now realise it's a much more complex problem than I first thought. >
From: Karl E. Peterson on 3 Feb 2010 17:33 Doug Robbins - Word MVP wrote: > It is probably several orders of magnitude more complex than you may have > first though. It's sounding fairly simple, actually, as brute force tasks go. And, likewise, as brute force tasks go, it's going to take exponentially longer to execute as the string size(s) grow. It's not unlike cracking the combination to a lock without knowing how many numbers are in the combination, although we are rewarded with partial success results which actually makes it possible. You'd "just" start comparing the first character to every character in the other string, noting if you have a match of one. If you don't find the first char, you scan the second string for the 2nd char in the first string, and so on. Once you have a match of one, you start looking for a match of two. You'd scan the second string for the first two chars in the first. If they're not there, you'd start scanning the second string for the 2nd and 3rd chars in the first, and so on. Simple, huh? <g> Assuming you know the rules. Must the case match? Etc, etc, etc... > I am wondering however is some of the purpose-built plagiarism detection > software might not be the way to go. That's not a bad thought. -- ..NET: It's About Trust! http://vfred.mvps.org
From: Peter Jamieson on 3 Feb 2010 19:53 In the simple case where you match character by character, white space has to match exactly, etc. you're probably better off starting by comparing the endpoints of the longest possible match, then reducing the comparison length. If the endpoints don't match, the contents can't match so you do not actually need to compare them. Can't prove it though. But I would have thought there was good coverage of this kind of algorithm in standard works such as Knuth's Sorting and Searching volume. I don't know where you find that stuff online. Peter Jamieson http://tips.pjmsn.me.uk On 03/02/2010 22:33, Karl E. Peterson wrote: > Doug Robbins - Word MVP wrote: >> It is probably several orders of magnitude more complex than you may >> have first though. > > It's sounding fairly simple, actually, as brute force tasks go. And, > likewise, as brute force tasks go, it's going to take exponentially > longer to execute as the string size(s) grow. It's not unlike cracking > the combination to a lock without knowing how many numbers are in the > combination, although we are rewarded with partial success results which > actually makes it possible. > > You'd "just" start comparing the first character to every character in > the other string, noting if you have a match of one. If you don't find > the first char, you scan the second string for the 2nd char in the first > string, and so on. Once you have a match of one, you start looking for a > match of two. You'd scan the second string for the first two chars in > the first. If they're not there, you'd start scanning the second string > for the 2nd and 3rd chars in the first, and so on. Simple, huh? <g> > Assuming you know the rules. Must the case match? Etc, etc, etc... > >> I am wondering however is some of the purpose-built plagiarism >> detection software might not be the way to go. > > That's not a bad thought. >
From: Karl E. Peterson on 3 Feb 2010 20:09 Peter Jamieson wrote: > In the simple case where you match character by character, white space has to > match exactly, etc. you're probably better off starting by comparing the > endpoints of the longest possible match, then reducing the comparison length. > If the endpoints don't match, the contents can't match so you do not actually > need to compare them. Yeah, that's a potentially good option, too. A bit more complex, but that happens when you start trying to finesse brute force attacks. :-) > Can't prove it though. But I would have thought there was good coverage of > this kind of algorithm in standard works such as Knuth's Sorting and > Searching volume. I don't know where you find that stuff online. You'd think, yeah. But there aren't many folks who still think in 2N versus N^2 terms. Most just throw a framework at it, and see what sticks. -- ..NET: It's About Trust! http://vfred.mvps.org
From: Peter Jamieson on 4 Feb 2010 04:39 > You'd think, yeah. But there aren't many folks who still think in 2N > versus N^2 terms. Most just throw a framework at it, and see what > sticks. Yeah. For the "build it from the ground up" types (which unfortunately probably includes me), http://www.cs.sunysb.edu/~algorith/files/longest-common-substring.shtml could be handy. I got there via http://www.itl.nist.gov/div897/sqg/dads/ Peter Jamieson http://tips.pjmsn.me.uk On 04/02/2010 01:09, Karl E. Peterson wrote: > Peter Jamieson wrote: >> In the simple case where you match character by character, white space >> has to match exactly, etc. you're probably better off starting by >> comparing the endpoints of the longest possible match, then reducing >> the comparison length. If the endpoints don't match, the contents >> can't match so you do not actually need to compare them. > > Yeah, that's a potentially good option, too. A bit more complex, but > that happens when you start trying to finesse brute force attacks. :-) > >> Can't prove it though. But I would have thought there was good >> coverage of this kind of algorithm in standard works such as Knuth's >> Sorting and Searching volume. I don't know where you find that stuff >> online. > > You'd think, yeah. But there aren't many folks who still think in 2N > versus N^2 terms. Most just throw a framework at it, and see what sticks. >
First
|
Prev
|
Next
|
Last
Pages: 1 2 3 Prev: coverting iWord.pages file Next: Word.ActiveDocument.AttachedTemplate.Saved = True - gives "Type Mismatch" error |