XML Pipelines/ XProc for bioinformatics: my notebook
In this post I describe how I used XProc, the XML pipeline language, to build a workflow that queries the NCBI for some SNPs and builds an HTML table describing those markers.
W3C: XProc, the XML Pipeline Language, is a language for describing operations to be performed on XML documents.
An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output. Pipelines are made up of simple steps which perform atomic operations on XML documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.
The implementation I've chosen is Norman Walsh's XMLCalabash. It seemed to be the de facto standard implementation. However, I found it a little slow and I didn't like the fact that it sent 'log' messages to http://xproc.org/. Here, XMLCalabash requires the SAXON and the apache-httpclient libraries.
The XProc language itself was not easy to learn: it lacks good examples for each feature.
A first workflow
Say the file "rslist.xml" contains a list of SNPs packed in an HTML list:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>
The following XProc script reads an XML file and returns the original input.
<p:documentation>Reads a list of SNP and echoes it</p:documentation>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:identity/>
</p:declare-step>
Here XMLCalabash was called by assigning the port called listOfSnp to our file "rslist.xml"
It returns the original file:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>
Workflow 2
In this second workflow, we loop over the SNPs and echo each node. The attribute sequence="true" on <p:output> tells XMLCalabash that the result will be a sequence of XML documents.
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:identity/>
</p:for-each>
</p:declare-step>
And here is the result
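To make the iteration concrete, here is a rough Python analogue of what <p:for-each> with <p:iteration-source select="/ul/li"/> followed by <p:identity/> does. It is only a sketch for illustration, not part of the XProc pipeline: the function name iterate_li is hypothetical, and the input is the "rslist.xml" content shown earlier.

```python
import xml.etree.ElementTree as ET

# Same content as rslist.xml in the post.
RSLIST = "<ul><li>rs25</li><li>rs26</li></ul>"


def iterate_li(xml_text):
    """Yield each /ul/li element serialized as its own small document."""
    root = ET.fromstring(xml_text)
    if root.tag != "ul":
        return
    for li in root.findall("li"):
        # The identity step changes nothing: each <li> is emitted as-is.
        yield ET.tostring(li, encoding="unicode").strip()


documents = list(iterate_li(RSLIST))
print(documents)
```

Each item of the list stands for one document of the output sequence, which is why the output port has to accept a sequence.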
Workflow 3
This third workflow loops over each SNP, builds a URI pointing to the XML definition of this SNP at the NCBI, and loads that document with <p:load>.
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&amp;retmode=xml&amp;id=',$rsId)"/>
</p:load>
</p:for-each>
</p:declare-step>
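The XPath expression substring(.,3) is 1-based, so it keeps everything from the third character on and turns "rs25" into "25". A minimal Python sketch of the same URI construction follows; the function name efetch_url is hypothetical, and the URL is the one used in the workflow above.

```python
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(rs_name):
    """Build the EFetch URL for an identifier such as 'rs25'."""
    # XPath substring(., 3) starts at the 3rd character (1-based),
    # which is index 2 in Python (0-based): the 'rs' prefix is dropped.
    rs_id = rs_name[2:]
    return f"{EFETCH}?db=snp&retmode=xml&id={rs_id}"


print(efetch_url("rs25"))
```

Note that inside the XProc file the ampersands of the URL must be written as &amp;amp; because they appear in an XML attribute.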
Here is the result, two concatenated <ExchangeSet> documents:
.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
<RsLinkout resourceId="1" linkValue="25"/>
<hgvs>NM_015204.1:c.1454-1398A>G</hgvs>
<hgvs>NT_007819.16:g.11073100T>C</hgvs>
</Rs>
(...)
</ExchangeSet>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="26" snpClass="mixed" snpType="notwithdrawn" molType="cDNA" bitField="030100080001000000000700">
<Validation byCluster="true"/>
<Create build="36" date="2000-09-19 17:02"/>
(...)
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ExchangeSet>
Workflow 4
This fourth workflow is the same as the previous one, but it uses <p:unwrap> to strip the <ExchangeSet> root element from each document returned by the NCBI, keeping only its children. At the end of the workflow, <p:wrap-sequence wrapper="ncbi:ExchangeSet"> is called to merge all those children into a new <ExchangeSet>.
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&amp;retmode=xml&amp;id=',$rsId)"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet">
</p:wrap-sequence>
</p:declare-step>
So, at the end, a single well-formed document is returned:
<Rs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum" rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
(...)
</PrimarySequence>
<RsLinkout resourceId="1" linkValue="26"/>
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ncbi:ExchangeSet>
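The unwrap-then-wrap idea can be sketched in Python with the standard library. This is an analogue for illustration only, not how XMLCalabash implements these steps: the function name merge_exchange_sets is hypothetical, and the two input documents are heavily trimmed stand-ins for the real EFetch responses.

```python
import xml.etree.ElementTree as ET

NS = "http://www.ncbi.nlm.nih.gov/SNP/docsum"
# Serialize the namespace with the 'ncbi' prefix, as in the workflow.
ET.register_namespace("ncbi", NS)


def merge_exchange_sets(xml_documents):
    """Merge the children of several <ExchangeSet> roots into one new root."""
    # Equivalent of p:wrap-sequence: one new wrapper element.
    merged = ET.Element(f"{{{NS}}}ExchangeSet")
    for text in xml_documents:
        root = ET.fromstring(text)
        # Equivalent of p:unwrap: drop the root, keep its children.
        merged.extend(list(root))
    return merged


# Trimmed stand-ins for the two documents fetched for rs25 and rs26.
doc1 = f'<ExchangeSet xmlns="{NS}"><Rs rsId="25"/></ExchangeSet>'
doc2 = f'<ExchangeSet xmlns="{NS}"><Rs rsId="26"/></ExchangeSet>'

merged = merge_exchange_sets([doc1, doc2])
rs_ids = [rs.get("rsId") for rs in merged.findall(f"{{{NS}}}Rs")]
print(rs_ids)
```

The point of the design is that the output port no longer needs sequence="true": the pipeline emits exactly one document.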
Workflow 5
This time, we will use a list of Entrez queries grouped in an HTML list:
<ul>
<li>"snp_gene_clin"[Filter] AND "snp_pubmed_cited"[Filter] AND 2[CHR]</li>
<li>(1000[CHRPOS] : 5000[CHRPOS]) AND 2[CHR] AND "homo sapiens"[Organism]</li>
</ul>
The next workflow calls NCBI ESearch on the SNP database for each query. It then calls NCBI EFetch to retrieve the information about the SNPs, using the parameters (WebEnv and QueryKey) found in the previous ESearch call. At the end, the ExchangeSet document is transformed into HTML using an XSLT stylesheet defined inline in the workflow:
<p:input port="queries" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverQueries">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="term" select="encode-for-uri(.)"/>
<p:load>
<p:with-option name="href" select="concat('http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&amp;usehistory=y&amp;term=',$term)"/>
</p:load>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&amp;retmode=xml&amp;WebEnv=', encode-for-uri(/eSearchResult/WebEnv), '&amp;query_key=', encode-for-uri(/eSearchResult/QueryKey) )"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet"/>
<p:xslt name="tr2html">
<p:input port="parameters">
<p:empty/>
</p:input>
<p:input port="stylesheet">
<p:inline>
<xsl:stylesheet version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html><body><table>
<thead>
<tr>
<th>rs#</th>
<th>dbSnpBuild</th>
<th>group</th>
<th>build</th>
<th>Chrom</th>
<th>Position</th>
</tr>
</thead>
<tbody>
<xsl:for-each select="/ncbi:ExchangeSet/ncbi:Rs">
<xsl:variable name="rsId"><xsl:value-of select="@rsId"/></xsl:variable>
<xsl:for-each select="ncbi:Assembly">
<xsl:variable name="dbSnpBuild"><xsl:value-of select="@dbSnpBuild"/></xsl:variable>
<xsl:variable name="groupLabel"><xsl:value-of select="@groupLabel"/></xsl:variable>
<xsl:variable name="genomeBuild"><xsl:value-of select="@genomeBuild"/></xsl:variable>
<xsl:for-each select="ncbi:Component">
<xsl:variable name="chromosome"><xsl:value-of select="@chromosome"/></xsl:variable>
<xsl:for-each select="ncbi:MapLoc">
<xsl:element name="tr">
<td>
<xsl:element name="a">
<xsl:attribute name="href">http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=<xsl:value-of select="$rsId"/></xsl:attribute>rs<xsl:value-of select="$rsId"/></xsl:element>
</td>
<td><xsl:value-of select="$dbSnpBuild"/></td>
<td><xsl:value-of select="$groupLabel"/></td>
<td><xsl:value-of select="$genomeBuild"/></td>
<td>chr<xsl:value-of select="$chromosome"/></td>
<td><xsl:value-of select="@physMapInt"/></td>
</xsl:element>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</tbody>
</table></body></html>
</xsl:template>
</xsl:stylesheet>
</p:inline>
</p:input>
</p:xslt>
</p:declare-step>
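The ESearch-then-EFetch chaining in the two <p:load> steps can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, no network call is made, and the sample eSearchResult is a trimmed stand-in whose WebEnv value is a placeholder (real values are opaque tokens returned by the NCBI). Python's quote(..., safe="") is used here as a close counterpart of XPath's encode-for-uri().

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

ESEARCH = "http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def esearch_url(term):
    """URL of the first call: search dbSNP and keep the result on the history server."""
    return f"{ESEARCH}?db=snp&usehistory=y&term={quote(term, safe='')}"


def efetch_url_from_history(esearch_xml):
    """URL of the second call, built from the WebEnv and QueryKey of the first answer."""
    root = ET.fromstring(esearch_xml)
    web_env = root.findtext("WebEnv")
    query_key = root.findtext("QueryKey")
    return (
        f"{EFETCH}?db=snp&retmode=xml"
        f"&WebEnv={quote(web_env, safe='')}"
        f"&query_key={quote(query_key, safe='')}"
    )


# Placeholder answer: the WebEnv value below is NOT a real token.
sample = (
    "<eSearchResult><QueryKey>1</QueryKey>"
    "<WebEnv>HYPOTHETICAL_ENV</WebEnv></eSearchResult>"
)

print(esearch_url("2[CHR]"))
print(efetch_url_from_history(sample))
```

Passing WebEnv and query_key instead of a list of identifiers lets EFetch return every hit of the search in one request.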
And here is the HTML returned:
rs# | dbSnpBuild | group | build | Chrom | Position |
---|---|---|---|---|---|
rs28928870 | 129 | Celera | 36_3 | chr2 | 49031258 |
rs28928870 | 129 | HuRef | 36_3 | chr2 | 48924538 |
rs28928870 | 129 | reference | 36_3 | chr2 | 49044117 |
(...) | | | | | |
rs10168026 | 129 | Celera | 36_3 | chr2 | 71241 |
rs10168026 | 129 | HuRef | 36_3 | chr2 | 4967 |
rs10168026 | 129 | reference | 36_3 | chr2 | 4484 |
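Each row of this table comes from one Rs/Assembly/Component/MapLoc path in the ExchangeSet document, which is what the four nested <xsl:for-each> loops walk. A small Python sketch of the same flattening is given below; the function name rows is hypothetical, and the sample document is a trimmed stand-in carrying only the attributes the stylesheet reads, with values taken from one row of the table above.

```python
import xml.etree.ElementTree as ET

NS = {"ncbi": "http://www.ncbi.nlm.nih.gov/SNP/docsum"}

# Trimmed stand-in: only the attributes used by the stylesheet are kept.
SAMPLE = """
<ExchangeSet xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum">
  <Rs rsId="28928870">
    <Assembly dbSnpBuild="129" groupLabel="reference" genomeBuild="36_3">
      <Component chromosome="2">
        <MapLoc physMapInt="49044117"/>
      </Component>
    </Assembly>
  </Rs>
</ExchangeSet>
"""


def rows(exchange_set_xml):
    """Yield one tuple per MapLoc, carrying values down from the outer levels."""
    root = ET.fromstring(exchange_set_xml)
    for rs in root.findall("ncbi:Rs", NS):
        for assembly in rs.findall("ncbi:Assembly", NS):
            for component in assembly.findall("ncbi:Component", NS):
                for maploc in component.findall("ncbi:MapLoc", NS):
                    yield (
                        "rs" + rs.get("rsId"),
                        assembly.get("dbSnpBuild"),
                        assembly.get("groupLabel"),
                        assembly.get("genomeBuild"),
                        "chr" + component.get("chromosome"),
                        maploc.get("physMapInt"),
                    )


table = list(rows(SAMPLE))
print(table)
```

An SNP mapped on several assemblies therefore produces several rows, as rs28928870 does in the table.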
I've uploaded this workflow to myExperiment; to my knowledge, this is the first XProc script available there.
That's it !
Pierre
3 comments:
I prefer Makefiles over xml workflows, because they are a lot easier to write (at least for the day-to-day pipelines).
In our laboratory, we have also had some discussion over this. Have a look at these slides I have prepared to introduce make to the others:
- http://bioinfoblog.it/2009/03/seminar-on-makefiles-in-bioinformatics/
cheers :)
The "phone home" part is optional, you can completely disable it with -Dcom.xmlcalabash.phonehome=false
good to know. Thank you Norman