[Programming] Command-line tool for finding frequent substrings
Hiroki Arimura has written a nifty tool called wasa that you can use to find the most frequent substrings in a text file. (It’s implemented using a suffix array.) I modified it to have an -xc option for displaying the counts.
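To illustrate the suffix-array idea, here is a minimal Python sketch. It is not wasa's actual implementation. The function name, the word-level tokenization, and the min_count and max_words parameters are all assumptions made for illustration. It sorts every word-level suffix of every line, so suffixes that begin with the same words end up next to each other and can be counted in a single pass.

```python
import re
from itertools import groupby

def frequent_word_substrings(text, min_count=2, max_words=5):
    """Count repeated word sequences using a sorted list of suffixes.

    Hypothetical helper for illustration only; not wasa's algorithm or API.
    """
    # Assumption: split each line into lowercase alphanumeric words,
    # so punctuation such as ':' and '.' acts as a separator.
    lines = [re.findall(r"[a-z0-9]+", line.lower()) for line in text.splitlines()]

    # A naive "suffix array": every word-level suffix of every line, sorted.
    # (A real suffix array stores start offsets instead of copying suffixes.)
    suffixes = sorted(words[i:] for words in lines for i in range(len(words)))

    counts = {}
    for k in range(1, max_words + 1):
        # After sorting, suffixes that share their first k words are adjacent,
        # so each run of equal k-word prefixes is one substring's occurrences.
        prefixes = (tuple(s[:k]) for s in suffixes if len(s) >= k)
        for prefix, run in groupby(prefixes):
            n = sum(1 for _ in run)
            if n >= min_count:
                counts[" ".join(prefix)] = n
    return counts
```

This sketch counts every occurrence of a word sequence within a line, so its numbers will not necessarily match what wasa reports. Copying each suffix also costs quadratic memory in the worst case, which is fine for a small log but not for a large one.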
I was pretty excited about wasa at first because I thought it would solve the problem of summarizing log files: just find the frequent substrings. But alas, its output is too noisy to be useful.
For example, given the following log file:
May 14 13:26:13 kenya sshd[41244]: Invalid user louis from 85.21.206.18
May 14 13:26:16 kenya sshd[41246]: Invalid user louis from 85.21.206.18
May 15 04:00:58 kenya sshd[50672]: Did not receive identification string from 61.152.157.166
May 15 04:04:04 kenya sshd[50699]: Invalid user test from 61.152.157.166
May 15 04:04:06 kenya sshd[50701]: Invalid user test from 61.152.157.166
May 15 04:04:08 kenya sshd[50705]: Invalid user test from 61.152.157.166
The most frequent substrings are as follows:
prompt> wasa -xc -m 1 c:\junk\foo.log | sort -nr
6 sshd
6 may
6 kenya sshd
6 from
5 user
5 invalid user
4 may 15 04
4 from 61
4 61
4 166
4 157
4 152
4 15 04
4 04
3 user test from 61
3 test from 61
3 invalid user test from 61
2 user louis from 85
2 may 14 13
2 louis from 85
2 invalid user louis from 85
2 from 85
2 85
2 26
2 21
2 206
2 18
2 14 13
2 13
It finds interesting substrings (“invalid user” and “invalid user louis from 85”), but it also finds many uninteresting ones, such as the lone timestamp and IP address fragments “04” and “157”.
Ah well. Interesting tool nonetheless. ∎
2 Comments:
Unfortunately the download link to wasa0.9_with_xc_option.zip is no longer valid, and I had no luck with the Wayback Machine either. Any chance of downloading this file?
By pappalapub, at 1/25/2013 3:09 p.m.
Thx Jonathan for updated DL link
By pappalapub, at 1/26/2013 6:34 a.m.