22 June 2009

Event-driven XML parsing (SAX) with Java+JavaScript

I just wrote SAXScript, an event-driven SAX parser written in Java that invokes JavaScript callbacks. It makes it quick to write a small script that parses a huge XML file.

Download


Download saxscript.jar from http://code.google.com/p/lindenb/downloads/list

Invoke


java -jar saxscript.jar (options) [file|url]s

Options


-h (help) this screen
-f read javascript script from file
-e 'script' read javascript script from argument
-D add a variable (as string) in the scripting context.
__FILENAME__ is the current uri.
-n SAX parser is NOT namespace aware (default true)
-v SAX parser is validating (default false)
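
Since -n changes what the parser hands to the startElement and endElement callbacks described below, a script meant to work in both modes can normalize the element name first. Here is a minimal sketch, assuming SAXScript passes the standard SAX arguments through unchanged (without namespace awareness, parsers usually report an empty localName and only the qualified name). `simpleName` is a hypothetical helper, not part of SAXScript.

```javascript
// Hypothetical helper: returns the element name without any prefix,
// whether or not the parser was namespace aware.
function simpleName(localName, name) {
  var local = (localName == null) ? "" : String(localName);
  if (local.length > 0) return local;   // namespace-aware mode
  var qName = String(name);             // fall back to the qualified name
  var colon = qName.indexOf(":");
  return colon === -1 ? qName : qName.substring(colon + 1);
}

// Namespace aware: localName is filled in.
console.log(simpleName("title", "dc:title"));
// Not namespace aware: only the qualified name is available.
console.log(simpleName("", "dc:title"));
console.log(simpleName("", "WebEnv"));
```

A callback could then test `simpleName(localName,name)=="WebEnv"` instead of comparing `name` directly.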

Callbacks


function startDocument()
{println("Start doc");}
function endDocument()
{println("End doc");}
function startElement(uri,localName,name,atts)
{
    print(""+__FILENAME__+" START uri: "+uri+" localName:"+localName);
    for(var i=0;atts!=undefined && i< atts.getLength();++i)
    {
        print(" @"+atts.getQName(i)+"="+atts.getValue(i));
    }
    println("");
}
function characters(s)
{println("Characters :" +s);}
function endElement(uri,localName,name)
{println("END: uri: "+uri+" localName:"+localName);}
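
To see the order in which these callbacks fire without running the jar, the events can be replayed by hand. The sketch below is plain JavaScript and not part of SAXScript: `fakeAtts` and `replay` are hypothetical helpers, and the event list is what a SAX parser would typically report for `<a x="1">hi</a>`. The handlers write to a log array instead of calling print/println so that the sequence can be inspected.

```javascript
// Hypothetical stand-in for org.xml.sax.Attributes, exposing only the
// three methods used by the callbacks above.
function fakeAtts(pairs) {
  return {
    getLength: function () { return pairs.length; },
    getQName: function (i) { return pairs[i][0]; },
    getValue: function (i) { return pairs[i][1]; }
  };
}

// Collect a log of events instead of printing them.
var log = [];
var handlers = {
  startDocument: function () { log.push("startDocument"); },
  startElement: function (uri, localName, name, atts) {
    var line = "start " + name;
    for (var i = 0; atts !== undefined && i < atts.getLength(); ++i) {
      line += " @" + atts.getQName(i) + "=" + atts.getValue(i);
    }
    log.push(line);
  },
  characters: function (s) { log.push("chars " + s); },
  endElement: function (uri, localName, name) { log.push("end " + name); },
  endDocument: function () { log.push("endDocument"); }
};

// Hypothetical driver: calls the handler named by each event with its arguments.
function replay(events, h) {
  for (var i = 0; i < events.length; ++i) {
    var e = events[i];
    h[e[0]].apply(null, e.slice(1));
  }
}

// Hand-written events for <a x="1">hi</a>.
replay([
  ["startDocument"],
  ["startElement", "", "a", "a", fakeAtts([["x", "1"]])],
  ["characters", "hi"],
  ["endElement", "", "a", "a"],
  ["endDocument"]
], handlers);

console.log(log.join("\n"));
```

A real parser is free to deliver the text of one element in several `characters` calls, which is why the scripts in the example further down concatenate rather than assign.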

Example


The following shell script calls NCBI/ESearch to obtain a key (WebEnv) for all the bibliographic references about rotaviruses (8793 references).
This key is then used to download every PubMed entry with EFetch, and the script counts how many times each journal (element "MedlineTA") appears.
#!/bin/sh
JAVA=${JAVA_HOME}/bin/java
WEBENV=`${JAVA} -jar saxscript.jar \
-e '
var WebEnv=null;
function startElement(uri,localName,name,atts)
{
    if(name=="WebEnv") WebEnv="";
}

function characters(s)
{
    if(WebEnv!=null) WebEnv+=s;
}

function endElement(uri,localName,name)
{
    if(WebEnv!=null)
    {
        print(WebEnv);
        WebEnv=null;
    }
}
' \
"http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&usehistory=y&retmode=xml&term=Rotavirus"`


${JAVA} -jar saxscript.jar -e '
var content=null;
var hash=new Array();
function startElement(uri,localName,name,atts)
{
    if(name=="MedlineTA") content="";
}

function characters(s)
{
    if(content!=null) content+=s;
}

function endElement(uri,localName,name)
{
    if(content!=null)
    {
        var c=hash[content];
        hash[content]=(c==null?1:c+1);
        content=null;
    }
}
function endDocument()
{
    for(var content in hash)
    {
        println(content+"\t"+ hash[content]);
    }
}
' "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&query_key=1&WebEnv=${WEBENV}&retmode=xml" |\
sort -t "$(printf '\t')" -k2n
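
The counting logic of the second script can be tried on its own. The sketch below is plain JavaScript run outside SAXScript. It feeds a few hand-written events into the same three callbacks and then sorts by count, which is the job of the `sort` step in the shell. One title is split over two `characters` calls, as a SAX parser is allowed to do. The sample titles and the `sortedCounts` helper are illustrative assumptions. A plain object with a `hasOwnProperty` check replaces `new Array()` so that inherited properties are never mistaken for counts.

```javascript
var content = null;   // text of the current MedlineTA element, or null
var counts = {};      // journal abbreviation -> number of occurrences

function startElement(uri, localName, name, atts) {
  if (name == "MedlineTA") content = "";
}

function characters(s) {
  // May be called several times for one element, hence the concatenation.
  if (content != null) content += s;
}

function endElement(uri, localName, name) {
  if (content != null) {
    var seen = Object.prototype.hasOwnProperty.call(counts, content);
    counts[content] = seen ? counts[content] + 1 : 1;
    content = null;
  }
}

// Hypothetical helper: [journal, count] pairs in ascending order of count.
function sortedCounts(table) {
  var rows = [];
  for (var key in table) {
    if (Object.prototype.hasOwnProperty.call(table, key)) rows.push([key, table[key]]);
  }
  rows.sort(function (a, b) { return a[1] - b[1]; });
  return rows;
}

// Hand-written events standing in for three PubMed records (made-up sample).
var titles = [["J Virol"], ["Vaccine"], ["J ", "Virol"]];
for (var i = 0; i < titles.length; ++i) {
  startElement("", "MedlineTA", "MedlineTA", undefined);
  for (var j = 0; j < titles[i].length; ++j) characters(titles[i][j]);
  endElement("", "MedlineTA", "MedlineTA");
}

var rows = sortedCounts(counts);
for (var k = 0; k < rows.length; ++k) console.log(rows[k][0] + "\t" + rows[k][1]);
```

Sorting inside the script keeps everything in one place, but on the full data set piping the tab-separated output to `sort` as above works just as well.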

Result


Acta Gastroenterol Latinoam 1
Acta Histochem Suppl 1
Acta Microbiol Acad Sci Hung 1
Acta Microbiol Hung 1
Acta Microbiol Immunol Hung 1
Acta Pathol Microbiol Scand C 1
Acta Vet Acad Sci Hung 1
Adv Neonatal Care 1
Adv Nurse Pract 1
Adv Ther 1
Adv Vet Med 1
Afr J Med Med Sci 1
Age Ageing 1
AIDS Res Hum Retroviruses 1
AJNR Am J Neuroradiol 1
AJR Am J Roentgenol 1
Akush Ginekol (Sofiia) 1
(...)
Appl Environ Microbiol 87
J Pediatr Gastroenterol Nutr 97
J Virol Methods 130
Lancet 130
Vaccine 158
Pediatr Infect Dis J 177
J Gen Virol 217
Arch Virol 254
J Med Virol 262
J Infect Dis 265
Virology 278
J Virol 460
J Clin Microbiol 514


That's it!
Pierre

1 comment:

Anonymous said...

You might also want to look at vtd-xml, the next generation XML processing model that is far more powerful than DOM and SAX

http://vtd-xml.sf.net