In this post I describe how I used XProc, the XML pipeline language, to build a workflow that queries the NCBI for a few SNPs and produces an HTML table describing those markers.
According to the W3C, XProc, the XML Pipeline Language, is a language for describing operations to be performed on XML documents.
An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output. Pipelines are made up of simple steps which perform atomic operations on XML documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.
The implementation I've chosen is Norman Walsh's XMLCalabash. It seemed to be the de facto standard implementation. However, I found it a little bit slow and I didn't like the fact that it sent 'log' messages to http://xproc.org/. For this work, XMLCalabash requires the Saxon and apache-httpclient libraries.
The XProc language itself was not easy to learn: it lacks good examples for each feature.
A first workflow
Say the file "rslist.xml" contains a list of SNPs packed in an HTML list:
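As a minimal sketch, such a file might look like the listing below. The rs identifiers and the bare <ul> layout are placeholders chosen for illustration, not necessarily the ones I used.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical rslist.xml: one <li> per SNP, identifiers are placeholders -->
<ul>
  <li>rs25</li>
  <li>rs26</li>
</ul>
```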
The following XProc script reads an XML file and returns the original input.
Here XMLCalabash was invoked with the input port named listOfSnp bound to our file "rslist.xml".
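A minimal sketch of such an identity pipeline in XProc 1.0 could look like this. The port name listOfSnp comes from this post; the step name main and the file name pipeline.xpl used in the note after the listing are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <!-- The document to read is bound to this port when the pipeline is run -->
  <p:input port="listOfSnp"/>
  <!-- The pipeline result: here, the unchanged input document -->
  <p:output port="result"/>

  <!-- p:identity copies its source to its result without modification -->
  <p:identity>
    <p:input port="source">
      <p:pipe step="main" port="listOfSnp"/>
    </p:input>
  </p:identity>
</p:declare-step>
```

With XMLCalabash on the classpath, a pipeline like this would be run with something along the lines of `java com.xmlcalabash.drivers.Main -i listOfSnp=rslist.xml pipeline.xpl`; the exact option syntax may differ between versions.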
It returns the original file:
In this second workflow, we loop over the SNPs and echo each node. The @sequence="true" attribute in <input> tells XMLCalabash that the result will be a sequence of XML documents.
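The loop can be sketched as follows, assuming the SNPs are stored in <li> elements that carry no namespace. In this sketch sequence="true" is placed on the p:output declaration, because in XProc 1.0 that is the port that must be allowed to carry several documents when the loop emits one document per SNP.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <p:input port="listOfSnp"/>
  <!-- One output document per <li>, hence a sequence -->
  <p:output port="result" sequence="true"/>

  <p:for-each>
    <!-- Each matching <li> becomes its own document for one iteration -->
    <p:iteration-source select="//li">
      <p:pipe step="main" port="listOfSnp"/>
    </p:iteration-source>
    <!-- Echo the current <li> document unchanged -->
    <p:identity/>
  </p:for-each>
</p:declare-step>
```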
And here is the result
This third workflow loops over each SNP, builds a URI pointing to the SNP's XML definition at the NCBI and fetches that document.
Here is the result, two concatenated <ExchangeSet> documents:
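One way to sketch this step is to compute the href option of p:load from the text of the current <li>. The choice of p:load, the E-utilities URL and its parameters are assumptions made for illustration, and the current NCBI service may no longer return the <ExchangeSet> format described here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <p:input port="listOfSnp"/>
  <!-- One fetched document per SNP -->
  <p:output port="result" sequence="true"/>

  <p:for-each>
    <p:iteration-source select="//li">
      <p:pipe step="main" port="listOfSnp"/>
    </p:iteration-source>
    <!-- The select expression is evaluated against the current <li> document:
         strip the leading 'rs' and append the number to the (assumed) EFetch URL -->
    <p:load>
      <p:with-option name="href"
        select="concat('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&amp;retmode=xml&amp;id=',
                       substring-after(normalize-space(.), 'rs'))"/>
    </p:load>
  </p:for-each>
</p:declare-step>
```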
This fourth workflow is the same as the previous one, but it uses <p:unwrap> to strip the enclosing <ExchangeSet> element from each NCBI response while keeping its children. At the end of the workflow, <p:wrap-sequence wrapper="ExchangeSet"> is called to merge all those children into a new <ExchangeSet>.
So, at the end, a single well-formed document is returned:
This time, we will use a list of Entrez queries grouped in an HTML list:
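The unwrap-then-wrap idea can be sketched as below. As before, the EFetch URL is an assumption, and match="/*" is used so that the sketch does not depend on the namespace of the root element returned by the NCBI.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <p:input port="listOfSnp"/>
  <!-- A single merged document, so no sequence="true" is needed here -->
  <p:output port="result"/>

  <p:for-each>
    <p:iteration-source select="//li">
      <p:pipe step="main" port="listOfSnp"/>
    </p:iteration-source>
    <!-- Fetch the record for the current SNP (assumed URL) -->
    <p:load>
      <p:with-option name="href"
        select="concat('https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&amp;retmode=xml&amp;id=',
                       substring-after(normalize-space(.), 'rs'))"/>
    </p:load>
    <!-- Remove the root element of each response, leaving only its children -->
    <p:unwrap match="/*"/>
  </p:for-each>

  <!-- Collect the children from every iteration under one new root element -->
  <p:wrap-sequence wrapper="ExchangeSet"/>
</p:declare-step>
```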
The next workflow calls NCBI ESearch on the SNP database for each query. It then calls NCBI EFetch to retrieve the information about the SNPs, using the parameters (QueryKey) returned by the preceding ESearch call. At the end, the ExchangeSet document is transformed into HTML using an XSLT stylesheet defined inline in the workflow:
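The final transformation step alone can be sketched as follows; the ESearch and EFetch calls are left out. The sketch assumes the merged document arrives on a port named source and that each SNP is an <Rs> element with an rsId attribute, which is an assumption about the NCBI format rather than something shown in this post. The stylesheet is embedded with p:inline.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc" version="1.0" name="main">
  <!-- The merged ExchangeSet document -->
  <p:input port="source"/>
  <p:output port="result"/>

  <p:xslt>
    <p:input port="source">
      <p:pipe step="main" port="source"/>
    </p:input>
    <!-- The stylesheet is written directly inside the pipeline -->
    <p:input port="stylesheet">
      <p:inline>
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                        version="2.0" exclude-result-prefixes="p">
          <xsl:template match="/">
            <html>
              <body>
                <table>
                  <tr><th>SNP</th></tr>
                  <!-- local-name() avoids depending on the NCBI namespace -->
                  <xsl:for-each select="/*/*[local-name()='Rs']">
                    <tr><td><xsl:value-of select="concat('rs', @rsId)"/></td></tr>
                  </xsl:for-each>
                </table>
              </body>
            </html>
          </xsl:template>
        </xsl:stylesheet>
      </p:inline>
    </p:input>
    <!-- No stylesheet parameters are passed -->
    <p:input port="parameters">
      <p:empty/>
    </p:input>
  </p:xslt>
</p:declare-step>
```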
And here is the HTML returned:
I've uploaded this workflow to myExperiment:
To my knowledge, this is the first XProc script available there.
That's it !