Sunday 21 July 2013

Code copying (cut + paste)

I recently started looking into the concept of code copying. The
idea is that in a large body of code, it is normal to cut/paste blocks
of code, typically small "idioms", e.g. iterate over a list, or connect
to a database etc.

Theres some tools out there, but I wanted to add this to CRiSP, so thought
this wouldnt be too hard. Its not. But the ideas behind it are interesting.
If you have never thought about this, then go and do so.

Some people use tools like this for plagiarism - thats nice, and may be
warranted, but even on your own code base, or some one elses, it gives
rise to code quality metrics. (Code quality metrics cannot tell that
code is good or bad, but can merely hint and give the reader a chance
to inspect areas of code - with some clues about where to start).

Anyway, I ran this on the CRiSP code base - and the results are
surprising. CRiSP is closed source (there is an open source very
early version out there). There is no "cut/paste" code from external
sources, other than the odd API calls. But what I found was lots of
examples of "code copying". (It reminds me that many algorithms are
the same, albeit in different flavors, and I may copy code, but usually
this is no more than a few lines, e.g. to iterate over a list or extract
some information from a datastructure).

Firstly, in my unoptimised code, looking at 100k lines of code over
79 source files, it took 15s of CPU to do a scan. This is not optimised
code, but it is CPU intensive.

Next, even in this small sample, there is code similarities. To do
this, one effectively needs to compare each region of lines against every
other. This is a O(n^3) or worse operation - using a naive algorithm.
(A file of 1000 lines for instance, contains 1m regions - consider
line #1 can be in a group by itself, or lines 1+2, or 1+2+3, etc).

Comparing regions of size "1" is silly - all the blank lines or "{"
lines would show as code dups. So we need to exclude trivial blocks. I set
my block to 10 or more consecutive lines. And I found genuine cut/paste
examples.

Looking at the dups, they are mostly insignificant - the longest is
a 21 line match.


Post created by CRiSP v11.0.18a-b6589


No comments:

Post a Comment