Dotplots rediscovered

After watching Dan Kaminsky's CCC talk, the idea of dotplots stuck in my head.

The dotplot idea is not new: Jonathan Isaac Helfman described it in his paper (pdf) about Similarity Patterns in Language at AT&T Bell Labs.

The question that I am trying to answer is: can the dotplot idea help me identify interesting parts in huge amounts of hex dumps?

I have written a small Haskell program that takes a text file and plots its contents, word by word, as a Portable Network Graphics (PNG) file. So far this is very inefficient, at least on my 800MHz board, so I will implement a smarter way of plotting those dots. Using words as the unit is also pretty arbitrary: the algorithm works on any list whose element type implements equality.
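To make the idea concrete, here is a minimal sketch of the comparison step. It is not the code from the darcs repository: the names dotplot and dotplotOrd are made up for illustration, and PNG output is left out. The first function needs nothing but equality and compares every pair of positions. The second is one possible "smarter way", under the assumption that the elements can also be ordered: it groups positions by element first and only emits pairs within a group.

```haskell
import qualified Data.Map.Strict as Map

-- | Naive dotplot: every pair of positions (i, j) whose elements are equal.
-- Needs only Eq, but compares all n * n pairs.
dotplot :: Eq a => [a] -> [(Int, Int)]
dotplot xs =
  [ (i, j)
  | (i, x) <- indexed
  , (j, y) <- indexed
  , x == y
  ]
  where
    indexed = zip [0 ..] xs

-- | Hypothetical faster variant (an assumption, not the original program):
-- collect the positions of each distinct element in a Map, then emit all
-- pairs within each group. Requires Ord instead of Eq, and returns the
-- same dots as 'dotplot', though in a different order.
dotplotOrd :: Ord a => [a] -> [(Int, Int)]
dotplotOrd xs =
  [ (i, j)
  | positions <- Map.elems groups
  , i <- positions
  , j <- positions
  ]
  where
    groups = Map.fromListWith (++) [ (x, [i]) | (i, x) <- zip [0 ..] xs ]
```

In GHCi, dotplot (words "a b a") evaluates to [(0,0),(0,2),(1,1),(2,0),(2,2)]. The grouped variant avoids comparing elements that can never match, but it still has to emit one dot per matching pair, so heavily repetitive input stays expensive.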

The code for this can be downloaded from this machine with

>darcs get http://pestilenz.org/~ckeen/dotplot

I appreciate patches sent in with darcs send.

As an example of how this works, here is a plot of the GPLv2:

The straight line is where every word is compared with itself; all the other dots indicate redundant information in the text.
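The same picture can be reproduced in miniature as text. The sketch below is only an illustration with a made-up function name, renderDotplot; it prints a '*' wherever two elements are equal. In this rendering the self-comparisons fall on the diagonal from top left to bottom right. How the line is oriented in the PNG depends on how the real program maps positions to pixels, which I am not assuming anything about here.

```haskell
-- | Render a dotplot as text: row i, column j holds '*' when the
-- i-th and j-th elements are equal, and '.' otherwise.
renderDotplot :: Eq a => [a] -> String
renderDotplot xs =
  unlines [ [ if x == y then '*' else '.' | y <- xs ] | x <- xs ]
```

For example, putStr (renderDotplot (words "the cat saw the dog")) prints:

```
*..*.
.*...
..*..
*..*.
....*
```

The two dots off the diagonal come from the repeated word "the".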

As text usually does not repeat the exact same words, this shows little redundancy. I expect code to be more redundant. For code, though, better symbols than words will have to be used...
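One hypothetical way to put a number on "little redundancy" is to count how many of the off-diagonal cells are dots. This is not something the program does today, and the name redundancy is made up; it is just a rough sketch for comparing inputs such as prose and code.

```haskell
-- | Fraction of off-diagonal cells that are dots, in the range 0 to 1.
-- 0 means no element repeats; 1 means all elements are equal.
-- Inputs with fewer than two elements have no off-diagonal cells
-- and are reported as 0.
redundancy :: Eq a => [a] -> Double
redundancy xs
  | n < 2     = 0
  | otherwise = fromIntegral offDiagonal / fromIntegral (n * n - n)
  where
    n = length xs
    indexed = zip [0 :: Int ..] xs
    -- Count matching pairs at different positions.
    offDiagonal =
      length [ () | (i, x) <- indexed, (j, y) <- indexed, i /= j, x == y ]
```

With this measure, redundancy (words "a b c") is 0 and redundancy "aaaa" is 1. Like the naive plot it compares all pairs, so it is only practical for small inputs.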

Stay tuned for more pretty pictures as I encounter them!

Code on this site is licensed under a 2-clause BSD license; everything else, unless noted otherwise, is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.