Jon Aquino's Mental Garden: [Programming] Command-line tool for finding frequent substrings

Hiroki Arimura has written a nifty tool called wasa that you can use to find the most frequent substrings in a text file. (It’s implemented using a suffix array.) I modified it to add an -xc option that displays the count for each substring.

I was pretty excited about the tool at first, because I thought that finding frequent substrings would solve the problem of summarizing log files. But alas, the output is too noisy to be useful: overlapping fragments of the same phrase each get their own entry.

For example, given the following log file:

May 14 13:26:13 kenya sshd[41244]: Invalid user louis from 85.21.206.18
May 14 13:26:16 kenya sshd[41246]: Invalid user louis from 85.21.206.18
May 15 04:00:58 kenya sshd[50672]: Did not receive identification string from 61.152.157.166
May 15 04:04:04 kenya sshd[50699]: Invalid user test from 61.152.157.166
May 15 04:04:06 kenya sshd[50701]: Invalid user test from 61.152.157.166
May 15 04:04:08 kenya sshd[50705]: Invalid user test from 61.152.157.166

The most frequent substrings are as follows:

prompt> wasa -xc -m 1 c:\junk\foo.log | sort -nr
6 sshd
6 may
6 kenya sshd
6 from
5 user
5 invalid user
4 may 15 04
4 from 61
4 61
4 166
4 157
4 152
4 15 04
4 04
3 user test from 61
3 test from 61
3 invalid user test from 61
2 user louis from 85
2 may 14 13
2 louis from 85
2 invalid user louis from 85
2 from 85
2 85
2 26
2 21
2 206
2 18
2 14 13
2 13
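
To give a feel for how a suffix-array approach can produce counts like these, here is a minimal Python sketch. It is not wasa's code, and several things in it are assumptions made for illustration: the function name frequent_phrases, the min_count and max_words parameters, and the choice to lowercase the text and split it into runs of letters and digits. The idea is that once all word-level suffixes are sorted, suffixes that start with the same phrase sit next to each other, so counting a phrase is just measuring the length of a run.

```python
import re
from itertools import groupby


def frequent_phrases(text, min_count=2, max_words=5):
    """Return (count, phrase) pairs for word sequences that repeat.

    Illustrative sketch only: a naive sorted-suffix list over word
    tokens, not the algorithm or tokenization that wasa itself uses.
    """
    # Collect every word-level suffix of every line, truncated to
    # max_words tokens so the sort stays cheap.
    suffixes = []
    for line in text.splitlines():
        tokens = re.findall(r"[a-z0-9]+", line.lower())
        suffixes.extend(
            tuple(tokens[i:i + max_words]) for i in range(len(tokens))
        )

    # Sorting puts suffixes that share a leading phrase side by side.
    suffixes.sort()

    results = []
    for n in range(1, max_words + 1):
        # Only suffixes with at least n words can start an n-word phrase.
        long_enough = [s for s in suffixes if len(s) >= n]
        for prefix, group in groupby(long_enough, key=lambda s: s[:n]):
            count = sum(1 for _ in group)
            if count >= min_count:
                results.append((count, " ".join(prefix)))

    # Highest counts first, similar to piping through sort -nr.
    return sorted(results, reverse=True)
```

Calling frequent_phrases on the contents of the log file returns a list of (count, phrase) pairs. Because this sketch counts every occurrence, including repeats within a single line, some of its numbers can differ from the ones wasa prints, and it shows the same kind of noise: every sub-phrase of a repeated phrase is reported as well.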