Do-It-Yourself Semantic Clustering of Tags using Google Directory
There has recently been a surge of articles about using the Porter Stemming Algorithm to find similar tags by reducing them to a common word stem. But how about clustering tags by similar meaning? That's a hard problem, you say.
Or is it? It turns out that Google Directory gives us hierarchical categories for any given keyword. For example, the word "coding" is categorized under Science/Math/Applications/Communication_Theory/Coding_Theory and Computers/Programming/Languages/Java/Coding_Standards.
So if we want to organize del.icio.us and Flickr tags semantically, i.e. cluster related tags together, we can simply ask Google Directory for the categories of each tag and group the tags that share a category.
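To make the grouping step concrete, here is a minimal sketch in Python. It is not the script I actually used: it assumes the category paths for each tag have already been looked up, and the function name `cluster_by_category`, the `depth` parameter and the "java" entry are made up for illustration. Only the two "coding" paths come from the example above.

```python
from collections import defaultdict


def cluster_by_category(tag_categories, depth=2):
    """Group tags under the first `depth` levels of each category path.

    `tag_categories` maps a tag to the list of category paths found for it.
    """
    clusters = defaultdict(set)
    for tag, paths in tag_categories.items():
        for path in paths:
            # Keep only the top levels so related tags land in the same bucket.
            prefix = "/".join(path.split("/")[:depth])
            clusters[prefix].add(tag)
    return {prefix: sorted(tags) for prefix, tags in clusters.items()}


# Hypothetical lookup results (the "java" entry is invented for this example).
tag_categories = {
    "coding": [
        "Science/Math/Applications/Communication_Theory/Coding_Theory",
        "Computers/Programming/Languages/Java/Coding_Standards",
    ],
    "java": ["Computers/Programming/Languages/Java"],
}

clusters = cluster_by_category(tag_categories)
for category, tags in clusters.items():
    print(category, tags)
```

Cutting the paths at a fixed depth is just one possible choice: a small depth gives a few broad clusters, a larger one gives many narrow clusters.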
In fact, I have implemented this idea, using screen scraping to get the information from Google Directory. (Hope you won't mind too much, Google!) I have written one script to extract the most frequently occurring tags from my del.icio.us links, and a second script to cluster those tags into categories scraped from Google Directory. The end result is a list of categories representing my interests. Check it out!
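For the curious, the tag-counting part of the first script can be sketched in a few lines of Python. This is an illustration rather than the original script: it assumes the links have already been fetched as lists of tags, and the name `top_tags` and the sample data are hypothetical.

```python
from collections import Counter


def top_tags(bookmarks, n=10):
    """Return the n most frequent tags as (tag, count) pairs.

    `bookmarks` is a list of tag lists, one list per saved link.
    Tags are lowercased so that "Coding" and "coding" count as one tag.
    """
    counts = Counter(tag.lower() for tags in bookmarks for tag in tags)
    return counts.most_common(n)


# Made-up sample data standing in for a real del.icio.us export.
bookmarks = [
    ["python", "coding"],
    ["coding", "java"],
    ["Coding", "design"],
]

top = top_tags(bookmarks, n=2)
print(top)
```

The resulting tags would then be fed, one by one, to the category lookup and clustering step.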
Update: A neat side effect of using Google Directory is that it even understands tags from other languages. For instance, someone tagged one of my links with "desenvolvimento". Now, I don't know what this word means, but Google Directory still gave me some clusters for it: World, Português/Regional/Brasil/Governo/Ministérios e Agências.
1 Comment:
Ha! Thanks J for the tip!
By Jonathan, at 3/16/2005 9:20 a.m.