From: karthikbalaguru on 20 Dec 2009 10:42 Hi, Two set of codes can be compared by using beyond compare or other equivalent software to determine the areas of differences and the areas of similarities. But, there are scenarios in which the same logic would be used by 2 set of programs/softwares but that the variable names might be different. So, how to determine this without executing the program ? Is there any tool that will help in identifying the existence of similar logics in two different programs ? Any ideas ? Thx in advans, Karthik Balaguru
From: Malcolm McLean on 20 Dec 2009 14:20 "karthikbalaguru" <karthikbalaguru79(a)gmail.com> wrote in message news: > Two set of codes can be compared by using > beyond compare or other equivalent software > to determine the areas of differences and the > areas of similarities. But, there are scenarios > in which the same logic would be used by 2 set of > programs/softwares but that the variable names > might be different. So, how to determine this > without executing the program ? Is there any tool > that will help in identifying the existence of similar > logics in two different programs ? Any ideas ? > I think such exist. Whilst in the general case "do two programs have the same output for the same input?" is the same as the halting problem, you are probably not interested in the general case, but intwo programs which have been copied from the same source, with identifiers changed and maybe a few minor control paths added.
From: spinoza1111 on 22 Dec 2009 07:10 On Dec 20, 11:42 pm, karthikbalaguru <karthikbalagur...(a)gmail.com> wrote: > Hi, > Two set of codes can be compared by using > beyond compare or other equivalent software > to determine the areas of differences and the > areas of similarities. But, there are scenarios > in which the same logic would be used by 2 set of > programs/softwares but that the variable names > might be different. So, how to determine this > without executing the program ? Is there any tool > that will help in identifying the existence of similar > logics in two different programs ? Any ideas ? > > Thx in advans, > Karthik Balaguru As others have pointed out, the general problem is equivalent to the Halting problem and is not solvable by an automated tool. However, there are useful levels of similarity, where "similarity" is a generalized identity: (1) Similarity after compiler lexical analysis: "printf( \"Hello world\ \n\" );" as a fragment is similar in this way to "printf(\"Hello world\ \n\");" (2) Similarity after parsing: "int a; a = 1+1;" is similar in this sense to "int b; b = 1+1;" (3) Similarity after code motion, elimination and optimization that preserves intent: "int a; int b; a = 0; b = 0;" is similar to "int b; int a; b = 0; a = 0;". "if (b || -1) c();" is similar in this sense to "c();" Factoring compilers better and providing an open architecture would allow us to use compilers for far more than merely compiling. Instead of shipping compilers, we should ship components. But, it's much more amusing here to engage in the politics of personal destruction.
From: Joe Beanfish on 22 Dec 2009 14:05 karthikbalaguru wrote: > Hi, > Two set of codes can be compared by using > beyond compare or other equivalent software > to determine the areas of differences and the > areas of similarities. But, there are scenarios > in which the same logic would be used by 2 set of > programs/softwares but that the variable names > might be different. So, how to determine this > without executing the program ? Is there any tool > that will help in identifying the existence of similar > logics in two different programs ? Any ideas ? As others say, it would be difficult at best to find dups. But a couple approaches that could help. One might be to run both codes through an obfuscator which would replace the programmer's meaningful variable names with generated names. Then compare. But there's still a lot of room for programming style to mess you up. A second approach would be to compile both to assembler and compare the assembler code. That should remove a fair amount of programmer style problems but not all.
From: Kaz Kylheku on 22 Dec 2009 14:53
On 2009-12-20, karthikbalaguru <karthikbalaguru79(a)gmail.com> wrote: > Hi, > Two set of codes can be compared by using > beyond compare or other equivalent software > to determine the areas of differences and the > areas of similarities. But, there are scenarios > in which the same logic would be used by 2 set of > programs/softwares but that the variable names > might be different. So, how to determine this > without executing the program ? Is there any tool > that will help in identifying the existence of similar > logics in two different programs ? Any ideas ? The structural similarity you are hinting at can be easily found if you parse the code into a nice abstract syntax tree representation. Then it's just a matter of writing an equality function which compares two pieces of code. This is easiest to understand and illustrate through the Lisp language. The comparison is similar to a routine that compares two trees, except that it has to understand all ``special forms'' in the language, and their semantics. That is to say, the comparison has to behave as a ``code walker'', with respect to two trees at the same time. It has to recognize all special syntactic constructs, so that it can apply the right kind of comparison. It also must be capable of performing variable renaming. So for instance we might want these two to be found equivalent: (let ((x 1)) (+ x x)) (let ((y 1)) (+ y y)) For thaat we have to recognize that we have two matching constructs (both are ``let''). They both bind the same number of variables, using the same initialization values. Next, we have to normalize the bodies. We can do that by renaming the variables on each side: e.g. both x and y get renamed to t0 (temporary variable #0). Each time we rename a variable we generate a new temporary. When we rename x to t0 and substitute, we end up with: (let ((t0 1)) (+ t0 t0)) And when we do the same substitution on the variable y on the second form we of course also end up with: (let ((t0 1)) (+ t0 t0)) So now having done this substituion over the let, we can recursively process the body of each let and continue comparing: we now end up comparing (+ t0 t0) to (+ t0 t0). We have a positive match on the same symbol +, and we know that it's a function call. The two functions calls have the same number of arguments, which we can compare one by one, finding an exact match in ech position: the expression t0 trivially matches the expression t0. Thus having reached bottom, we conclude that the two constructs are equivalent (modulo variable naming). This applies to the comparison of C programs too because for the C language, you can pick this kind of tree representation with symbols and atoms. Parse the code to that, and write a comparison which applies the right kind of rules. The question is the semantic depth that you want in the comparison. Should the expression !(!P || !Q) match P && Q, by the application of De Morgan's law? Does your equivalence function fold constant expressions, so that 2 + 2 matches 1 + 3? There is a huge range of things you can do in between the most naive static comparison (in which even differences in variable names do matter) to running the program and trying to identify that it does the same things to all inputs (running into the halting problem). Now about implementing this. Turns out, the annoying part of parsing C to structure is already done. For instance, there is this Japanese gentleman's project called SC: http://super.para.media.kyoto-u.ac.jp/~tasuku/sc/index-e.html With this software, you would be able to just concentrate on writing the equivalence function. |