11 October 2010

Playing with the Wordle algorithm: a tag cloud of Mesh Terms

The paper describing Wordle has been recently published (http://www.research.ibm.com/visual/papers/wordle_final2.pdf ). The algorithm was briefly described: “The most distinctive geometric aspect of a Wordle is the layout algorithm, which packs words to make efficient use of space. While many space-filling visualizations exist, they typically work by recursiveLayout proceeds according to this pseudocode:

sort words by weight, decreasing
for each word w:
w.position := makeInitialPosition(w);
while w intersects other words:

The two key procedures here are "makeInitialPosition" and "updatePosition". The makeInitialPosition routine picks a point at random according to a distribution that takes into account the desired overall shape, and, if desired, alphabetical order. The updatePosition routine moves the word on a spiral of increasing radius, radiating from the word's starting position. The updatePosition routine is aware of constraints on the overall shape of the Wordle. Constraining the layout to a rectangular shape causes updatePosition to prefer positions inside of the strict boundaries of the playing field; a blobby overall shape accepts boundary violations. The rectangular constraint is relaxed when the spiral radius exceeds either playing field dimension. ”
Your browser does not support the <CANVAS> element !

Just for fun , I've implemented my own version of the Wordle Alogrithm. The Java code is available on github at http://github.com/lindenb/jsandbox/blob/master/src/sandbox/MyWordle.java. I won't describe the program here, but I'll just say that the code invokes a java.awt.font.TextLayout class to get the shape of the text:
Graphics2D g=(...)
FontRenderContext frc = g.getFontRenderContext();
Font font=new Font("Dialog",Font.BOLD,fontSize);
TextLayout textLayout=new TextLayout(w.getText(), font, frc);
Shape shape=textLayout.getOutline(null);
and a java.awt.geom.Area to test if two shapes intersects.

Ok, let's test this code. FIrst I'm going to dump a pubmed query as XML with another simple tool named PubmedDump. This XML file is then parsed with a javascript program called from the SAX parser saxstream.jar (previously described here):
var content=null;
var mesh2count={};

function startElement(uri,localName,name,atts)
function characters(s)
if(content!=null) content+=s;
function endElement(uri,localName,name)
var count=mesh2count[content];
if(count===undefined) count=0;
function endDocument()
var w= new MyWordle();
for(var s in mesh2count)
var word= new MyWordle.Word(s,mesh2count[s]);
w.setUseArea(true);/* use shape area instead of bounding boxes */
w.setSortType(1);/* sort by weight */

var f=new File("result.svg");
This javascript code counts the occurence of each MESH term (<DescriptorName>), and when the document has been parsed, a new instance of MyWordle class is created, filled and the result is saved to a SVG file.


Here is an example for the query "Rotavirus NSP3 NSP1":
java -jar pubmeddump.jar "Rotavirus NSP3 NSP1" |\
java -cp mywordle.jar:saxscript.jar org.lindenb.tinytools.SAXScript -n -f mesh.js


That's it,


Anonymous said...

Excellent work and very fine analysis!

Anonymous said...

Thanks a LOT! Yours was the only implementation of a word cloud purely in Java that I could find!

Regu said...


Good work there. I was looking for a Java based implementation of tag cloud. Your source code provided a great start. I have created an OSS project for unstructured data analysis called Sift (https://github.com/regunathb/Sift) and tag cloud is one of the supported features.