W3C:XProc: (the) XML Pipeline Language, (is) a language for describing operations to be performed on XML documents.
An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output. Pipelines are made up of simple steps which perform atomic operations on XML documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.
The implementation I've choosen is Norman Walsh's XMLCalabash. It seemed to be the de-facto standard implementation. However I found it a little bit slow and I didn't like the fact that it sent 'log' messages to http://xproc.org/. XMLCalabash requires, here, the SAXON and the apache-httpclient libraries.
The XProc language itself was not easy to learn: it is missing some good examples for each feature.
A first workflow
Say , the file "rslist.xml" list of SNP packed in a HTML list:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>
<li>rs25</li>
<li>rs26</li>
</ul>
The folling XProc script reads a XML file and returns the original input.
<p:declare-step name="snp">
<p:documentation>Reads a list of SNP and echoes it</p:documentation>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:identity/>
</p:declare-step>
<p:documentation>Reads a list of SNP and echoes it</p:documentation>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:identity/>
</p:declare-step>
Here XMLCalabash was called by assigning the port called listOfSnp to our file "rslist.xml"
java -cp ${CLASSPATH} com.xmlcalabash.drivers.Main --input listOfSnp=rslist.xml xproc01.xpl
It returns the original file:
<ul>
<li>rs25</li>
<li>rs26</li>
</ul>
<li>rs25</li>
<li>rs26</li>
</ul>
Workflow 2
In this second workflow, we loop over the SNPs and we echo each node. The attribute in <input> @sequence="true" tells xmlcalabash that the result will be a sequence of XML documents.
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:identity/>
</p:for-each>
</p:declare-step>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:identity/>
</p:for-each>
</p:declare-step>
And here is the result
<li>rs25</li><li>rs26</li>
Workflow 3
This third workflow loops over each SNP, builds a URI for this SNP pointing to its XML definition at the NCBI
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
</p:for-each>
</p:declare-step>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true" sequence="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
</p:for-each>
</p:declare-step>
Here is the result, two concatened <ExchangeSet> documents :
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www
.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum htt
p://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDN
A" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
<RsLinkout resourceId="1" linkValue="25"/>
<hgvs>NM_015204.1:c.1454-1398A>G</hgvs>
<hgvs>NT_007819.16:g.11073100T>C</hgvs>
</Rs>
(...)
</ExchangeSet>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xm
lns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.ni
h.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="26" snpClass="mixed" snpType="notwithdrawn" molType="c
DNA" bitField="030100080001000000000700">
<Validation byCluster="true"/>
<Create build="36" date="2000-09-19 17:02"/>
(...)
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ExchangeSet>
.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum htt
p://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDN
A" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
<RsLinkout resourceId="1" linkValue="25"/>
<hgvs>NM_015204.1:c.1454-1398A>G</hgvs>
<hgvs>NT_007819.16:g.11073100T>C</hgvs>
</Rs>
(...)
</ExchangeSet>
<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xm
lns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.ni
h.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="26" snpClass="mixed" snpType="notwithdrawn" molType="c
DNA" bitField="030100080001000000000700">
<Validation byCluster="true"/>
<Create build="36" date="2000-09-19 17:02"/>
(...)
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ExchangeSet>
Workflow 4
This fourth workflow is the same than the previous one, but it uses <p:unwrap> to remove all the children from <ExchangeSet> for each call at the NCBI, and at the end of the workflow, <wrap-sequence wrapper="ExchangeSet"p> is called to merge all those children in a new <ExchangeSet>
<p:declare-step name="snp">
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet">
</p:wrap-sequence>
</p:declare-step>
<p:input port="listOfSnp" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverRs">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="rsId" select="substring(.,3)"/>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet">
</p:wrap-sequence>
</p:declare-step>
So, at the end, a one and only well defined document is returned:
<ncbi:ExchangeSet xmlns:ncbi="http://www.ncbi.nlm.nih.gov/SNP/docsum">
<Rs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm
.nih.gov/SNP/docsum" rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
(...)
</PrimarySequence>
<RsLinkout resourceId="1" linkValue="26"/>
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ncbi:ExchangeSet>
<Rs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm
.nih.gov/SNP/docsum" rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
<Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
(...)
</PrimarySequence>
<RsLinkout resourceId="1" linkValue="26"/>
<hgvs>NM_015204.1:c.1454-727A>G</hgvs>
<hgvs>NT_007819.16:g.11072429T>C</hgvs>
</Rs>
</ncbi:ExchangeSet>
Workflow 5
This time, we well use a list of ENTREZ queries grouped in a HTML list:
<ul>
<li>"snp_gene_clin"[Filter] AND "snp_pubmed_cited"[Filter] AND 2[CHR]</li>
<li>(1000[CHRPOS] : 5000[CHRPOS]) AND 2[CHR] AND "homo sapiens"[Organism]</li>
</ul>
<li>"snp_gene_clin"[Filter] AND "snp_pubmed_cited"[Filter] AND 2[CHR]</li>
<li>(1000[CHRPOS] : 5000[CHRPOS]) AND 2[CHR] AND "homo sapiens"[Organism]</li>
</ul>
The next workflow calls NCBI ESearch for each query for the SNP database. It then calls NCBI EFetch and retrieve the information about the SNP, using the parameters (
WebEnv
and QueryKey
) found in the previous EFetch call.At the end, the
ExchangeSet
document is transformed into HTML using an XSLT stylesheet defined inline in the workflow:<p:declare-step name="snp">
<p:input port="queries" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverQueries">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="term" select="encode-for-uri(.)"/>
<p:load>
<p:with-option name="href" select="concat('http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&usehistory=y&term=',$term)"/>
</p:load>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&WebEnv=', encode-for-uri(/eSearchResult/WebEnv), '&query_key=', encode-for-uri(/eSearchResult/QueryKey) )"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet"/>
<p:xslt name="tr2html">
<p:input port="parameters">
<p:empty/>
</p:input>
<p:input port="stylesheet">
<p:inline>
<xsl:stylesheet version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html><body><table>
<thead>
<tr>
<th>rs#</th>
<th>dbSnpBuild</th>
<th>group</th>
<th>build</th>
<th>Chrom</th>
<th>Position</th>
</tr>
</thead>
<tbody>
<xsl:for-each select="/ncbi:ExchangeSet/ncbi:Rs">
<xsl:variable name="rsId"><xsl:value-of select="@rsId"/></xsl:variable>
<xsl:for-each select="ncbi:Assembly">
<xsl:variable name="dbSnpBuild"><xsl:value-of select="@dbSnpBuild"/></xsl:variable>
<xsl:variable name="groupLabel"><xsl:value-of select="@groupLabel"/></xsl:variable>
<xsl:variable name="genomeBuild"><xsl:value-of select="@genomeBuild"/></xsl:variable>
<xsl:for-each select="ncbi:Component">
<xsl:variable name="chromosome"><xsl:value-of select="@chromosome"/></xsl:variable>
<xsl:for-each select="ncbi:MapLoc">
<xsl:element name="tr">
<td>
<xsl:element name="a">
<xsl:attribute name="href">http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=<xsl:value-of select="$rsId"/></xsl:attribute>rs<xsl:value-of select="$rsId"/></xsl:element>
</td>
<td><xsl:value-of select="$dbSnpBuild"/></td>
<td><xsl:value-of select="$groupLabel"/></td>
<td><xsl:value-of select="$genomeBuild"/></td>
<td>chr<xsl:value-of select="$chromosome"/></td>
<td><xsl:value-of select="@physMapInt"/></td>
</xsl:element>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</tbody>
</table></body></html>
</xsl:template>
</xsl:stylesheet>
</p:inline>
</p:input>
</p:xslt>
</p:declare-step>
<p:input port="queries" primary="true">
</p:input>
<p:output port="result" primary="true"/>
<p:for-each name="loopOverQueries">
<p:iteration-source select="/ul/li"/>
<p:output port="efetch-doc" sequence="true"/>
<p:variable name="term" select="encode-for-uri(.)"/>
<p:load>
<p:with-option name="href" select="concat('http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&usehistory=y&term=',$term)"/>
</p:load>
<p:load>
<p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&WebEnv=', encode-for-uri(/eSearchResult/WebEnv), '&query_key=', encode-for-uri(/eSearchResult/QueryKey) )"/>
</p:load>
<p:unwrap match="/ncbi:ExchangeSet"/>
</p:for-each>
<p:wrap-sequence wrapper="ncbi:ExchangeSet"/>
<p:xslt name="tr2html">
<p:input port="parameters">
<p:empty/>
</p:input>
<p:input port="stylesheet">
<p:inline>
<xsl:stylesheet version="1.0">
<xsl:output method="html"/>
<xsl:template match="/">
<html><body><table>
<thead>
<tr>
<th>rs#</th>
<th>dbSnpBuild</th>
<th>group</th>
<th>build</th>
<th>Chrom</th>
<th>Position</th>
</tr>
</thead>
<tbody>
<xsl:for-each select="/ncbi:ExchangeSet/ncbi:Rs">
<xsl:variable name="rsId"><xsl:value-of select="@rsId"/></xsl:variable>
<xsl:for-each select="ncbi:Assembly">
<xsl:variable name="dbSnpBuild"><xsl:value-of select="@dbSnpBuild"/></xsl:variable>
<xsl:variable name="groupLabel"><xsl:value-of select="@groupLabel"/></xsl:variable>
<xsl:variable name="genomeBuild"><xsl:value-of select="@genomeBuild"/></xsl:variable>
<xsl:for-each select="ncbi:Component">
<xsl:variable name="chromosome"><xsl:value-of select="@chromosome"/></xsl:variable>
<xsl:for-each select="ncbi:MapLoc">
<xsl:element name="tr">
<td>
<xsl:element name="a">
<xsl:attribute name="href">http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=<xsl:value-of select="$rsId"/></xsl:attribute>rs<xsl:value-of select="$rsId"/></xsl:element>
</td>
<td><xsl:value-of select="$dbSnpBuild"/></td>
<td><xsl:value-of select="$groupLabel"/></td>
<td><xsl:value-of select="$genomeBuild"/></td>
<td>chr<xsl:value-of select="$chromosome"/></td>
<td><xsl:value-of select="@physMapInt"/></td>
</xsl:element>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</xsl:for-each>
</tbody>
</table></body></html>
</xsl:template>
</xsl:stylesheet>
</p:inline>
</p:input>
</p:xslt>
</p:declare-step>
And here is the HTML returned:
rs# | dbSnpBuild | group | build | Chrom | Position |
---|---|---|---|---|---|
rs28928870 | 129 | Celera | 36_3 | chr2 | 49031258 |
rs28928870 | 129 | HuRef | 36_3 | chr2 | 48924538 |
rs28928870 | 129 | reference | 36_3 | chr2 | 49044117 |
(...) | |||||
rs10168026 | 129 | Celera | 36_3 | chr2 | 71241 |
rs10168026 | 129 | HuRef | 36_3 | chr2 | 4967 |
rs10168026 | 129 | reference | 36_3 | chr2 | 4484 |
I've uploaded this workflow in myExperiment:
to my knowledge, this is the first XProc script available there.
That's it !
Pierre
I prefer Makefiles over xml workflows, because they are a lot easier to write (at least for the day-to-day pipelines).
ReplyDeleteIn our laboratory, we have also had some discussion over this. Have a look at these slides I have prepared to introduce make to the others:
- http://bioinfoblog.it/2009/03/seminar-on-makefiles-in-bioinformatics/
cheers :)
The "phone home" part is optional, you can completely disable it with -Dcom.xmlcalabash.phonehome=false
ReplyDeletegood to know. Thank you Norman
ReplyDelete