YOKOFAKUN: XML Pipelines/ XProc for bioinformatics: my notebook

In this post I describe how I used XProc, the XML "pipeline language" to create a workflow of XML data calling the NCBI for some SNP and building a HTML table describing those markers.
W3C:XProc: (the) XML Pipeline Language, (is) a language for describing operations to be performed on XML documents.

An XML Pipeline specifies a sequence of operations to be performed on zero or more XML documents. Pipelines generally accept zero or more XML documents as input and produce zero or more XML documents as output. Pipelines are made up of simple steps which perform atomic operations on XML documents and constructs similar to conditionals, iteration, and exception handlers which control which steps are executed.
The implementation I've choosen is Norman Walsh's XMLCalabash. It seemed to be the de-facto standard implementation. However I found it a little bit slow and I didn't like the fact that it sent 'log' messages to http://xproc.org/. XMLCalabash requires, here, the SAXON and the apache-httpclient libraries.
The XProc language itself was not easy to learn: it is missing some good examples for each feature.

A first workflow

Say , the file "rslist.xml" list of SNP packed in a HTML list:

<ul>
<li>rs25</li>
<li>rs26</li>
</ul>

The folling XProc script reads a XML file and returns the original input.

<p:declare-step name="snp">
 <p:documentation>Reads a list of SNP and echoes it</p:documentation>
 <p:input port="listOfSnp" primary="true">
 </p:input>
 <p:output port="result" primary="true"/>
  <p:identity/>
</p:declare-step>

Here XMLCalabash was called by assigning the port called listOfSnp to our file "rslist.xml"

java -cp ${CLASSPATH} com.xmlcalabash.drivers.Main  --input listOfSnp=rslist.xml   xproc01.xpl

It returns the original file:

<ul>
<li>rs25</li>
<li>rs26</li>
</ul>

Workflow 2

In this second workflow, we loop over the SNPs and we echo each node. The attribute in <input> @sequence="true" tells xmlcalabash that the result will be a sequence of XML documents.

<p:declare-step name="snp">
 <p:input port="listOfSnp" primary="true">
 </p:input>
 <p:output port="result" primary="true" sequence="true"/>
  <p:for-each name="loopOverRs">
    <p:iteration-source select="/ul/li"/>
        <p:identity/>
  </p:for-each>
</p:declare-step>

And here is the result

<li>rs25</li><li>rs26</li>

Workflow 3

This third workflow loops over each SNP, builds a URI for this SNP pointing to its XML definition at the NCBI

<p:declare-step name="snp">
 <p:input port="listOfSnp" primary="true">
 </p:input>
 <p:output port="result" primary="true" sequence="true"/>
  <p:for-each name="loopOverRs">
    <p:iteration-source select="/ul/li"/>
     <p:variable name="rsId" select="substring(.,3)"/>
     <p:load>
        <p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
     </p:load>
  </p:for-each>
</p:declare-step>

Here is the result, two concatened <ExchangeSet> documents :

<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www
.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.nih.gov/SNP/docsum htt
p://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDN
A" genotype="true" bitField="030000080001020500020101">
  <Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>
  <RsLinkout resourceId="1" linkValue="25"/>
  <hgvs>NM_015204.1:c.1454-1398A&gt;G</hgvs>
  <hgvs>NT_007819.16:g.11073100T&gt;C</hgvs>
</Rs>
(...)
</ExchangeSet>



<ExchangeSet xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xm
lns="http://www.ncbi.nlm.nih.gov/SNP/docsum" xsi:schemaLocation="http://www.ncbi.nlm.ni
h.gov/SNP/docsum http://www.ncbi.nlm.nih.gov/SNP/docsum/eudocsum.xsd">
<Rs rsId="26" snpClass="mixed" snpType="notwithdrawn" molType="c
DNA" bitField="030100080001000000000700">
  <Validation byCluster="true"/>
  <Create build="36" date="2000-09-19 17:02"/>
(...)
  <hgvs>NM_015204.1:c.1454-727A&gt;G</hgvs>
  <hgvs>NT_007819.16:g.11072429T&gt;C</hgvs>
</Rs>
</ExchangeSet>

Workflow 4

This fourth workflow is the same than the previous one, but it uses <p:unwrap> to remove all the children from <ExchangeSet> for each call at the NCBI, and at the end of the workflow, <wrap-sequence wrapper="ExchangeSet"p> is called to merge all those children in a new <ExchangeSet>

<p:declare-step name="snp">
 <p:input port="listOfSnp" primary="true">
 </p:input>
 <p:output port="result" primary="true"/>

  <p:for-each name="loopOverRs">
    <p:iteration-source select="/ul/li"/>
     <p:output port="efetch-doc" sequence="true"/>
     <p:variable name="rsId" select="substring(.,3)"/>
     <p:load>
        <p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&id=',$rsId)"/>
     </p:load>
     <p:unwrap match="/ncbi:ExchangeSet"/>
  </p:for-each>

<p:wrap-sequence wrapper="ncbi:ExchangeSet">
</p:wrap-sequence>
</p:declare-step>

So, at the end, a one and only well defined document is returned:

<ncbi:ExchangeSet xmlns:ncbi="http://www.ncbi.nlm.nih.gov/SNP/docsum">
<Rs xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.ncbi.nlm
.nih.gov/SNP/docsum" rsId="25" snpClass="snp" snpType="notwithdrawn" molType="cDNA" genotype="true" bitField="030000080001020500020101">
  <Het type="est" value="0.499585956335068" stdError="0.0143825300037861"/>

(...)

  </PrimarySequence>
  <RsLinkout resourceId="1" linkValue="26"/>
  <hgvs>NM_015204.1:c.1454-727A&gt;G</hgvs>
  <hgvs>NT_007819.16:g.11072429T&gt;C</hgvs>
</Rs>

</ncbi:ExchangeSet>

Workflow 5

This time, we well use a list of ENTREZ queries grouped in a HTML list:

<ul>
<li>"snp_gene_clin"[Filter] AND "snp_pubmed_cited"[Filter] AND 2[CHR]</li>
<li>(1000[CHRPOS] : 5000[CHRPOS]) AND 2[CHR] AND "homo sapiens"[Organism]</li>
</ul>

The next workflow calls NCBI ESearch for each query for the SNP database. It then calls NCBI EFetch and retrieve the information about the SNP, using the parameters ( WebEnv and QueryKey) found in the previous EFetch call.
At the end, the ExchangeSet document is transformed into HTML using an XSLT stylesheet defined inline in the workflow:

<p:declare-step name="snp">
 <p:input port="queries" primary="true">
 </p:input>
 <p:output port="result" primary="true"/>

  <p:for-each name="loopOverQueries">
    <p:iteration-source select="/ul/li"/>
     <p:output port="efetch-doc" sequence="true"/>
     <p:variable name="term" select="encode-for-uri(.)"/>
     <p:load>
        <p:with-option name="href" select="concat('http://www.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=snp&usehistory=y&term=',$term)"/>
     </p:load>

        <p:load>
          <p:with-option name="href" select="concat('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&retmode=xml&WebEnv=',     encode-for-uri(/eSearchResult/WebEnv),     '&query_key=',     encode-for-uri(/eSearchResult/QueryKey)     )"/>
        </p:load>
        <p:unwrap match="/ncbi:ExchangeSet"/>
  </p:for-each>
  <p:wrap-sequence wrapper="ncbi:ExchangeSet"/>
  <p:xslt name="tr2html">
    <p:input port="parameters">
        <p:empty/>
    </p:input>
    <p:input port="stylesheet">
       <p:inline>
          <xsl:stylesheet version="1.0">
            <xsl:output method="html"/>
            <xsl:template match="/">
                <html><body><table>
                <thead>
                  <tr>
                     <th>rs#</th>
                     <th>dbSnpBuild</th>
                     <th>group</th>
                     <th>build</th>
                     <th>Chrom</th>
                     <th>Position</th>
                  </tr>
                </thead>
                <tbody>
                <xsl:for-each select="/ncbi:ExchangeSet/ncbi:Rs">
                <xsl:variable name="rsId"><xsl:value-of select="@rsId"/></xsl:variable>
                <xsl:for-each select="ncbi:Assembly">
                <xsl:variable name="dbSnpBuild"><xsl:value-of select="@dbSnpBuild"/></xsl:variable>
                <xsl:variable name="groupLabel"><xsl:value-of select="@groupLabel"/></xsl:variable>
                <xsl:variable name="genomeBuild"><xsl:value-of select="@genomeBuild"/></xsl:variable>
                <xsl:for-each select="ncbi:Component">
                <xsl:variable name="chromosome"><xsl:value-of select="@chromosome"/></xsl:variable>
                <xsl:for-each select="ncbi:MapLoc">
                <xsl:element name="tr">

                        <td>
                                <xsl:element name="a">
                                <xsl:attribute name="href">http://www.ncbi.nlm.nih.gov/SNP/snp_ref.cgi?rs=<xsl:value-of select="$rsId"/></xsl:attribute>rs<xsl:value-of select="$rsId"/></xsl:element>
                        </td>
                        <td><xsl:value-of select="$dbSnpBuild"/></td>
                        <td><xsl:value-of select="$groupLabel"/></td>
                        <td><xsl:value-of select="$genomeBuild"/></td>
                        <td>chr<xsl:value-of select="$chromosome"/></td>
                        <td><xsl:value-of select="@physMapInt"/></td>
                 </xsl:element>
                </xsl:for-each>
                </xsl:for-each>
                </xsl:for-each>
                </xsl:for-each>
                </tbody>
                </table></body></html>
            </xsl:template>
          </xsl:stylesheet>
       </p:inline>
    </p:input>
  </p:xslt>


</p:declare-step>

And here is the HTML returned:

rs#	dbSnpBuild	group	build	Chrom	Position
rs28928870	129	Celera	36_3	chr2	49031258
rs28928870	129	HuRef	36_3	chr2	48924538
rs28928870	129	reference	36_3	chr2	49044117
(...)
rs10168026	129	Celera	36_3	chr2	71241
rs10168026	129	HuRef	36_3	chr2	4967
rs10168026	129	reference	36_3	chr2	4484

I've uploaded this workflow in myExperiment:

to my knowledge, this is the first XProc script available there.

That's it !
Pierre

3 comments:

dalloliogm said...: I prefer Makefiles over xml workflows, because they are a lot easier to write (at least for the day-to-day pipelines).

In our laboratory, we have also had some discussion over this. Have a look at these slides I have prepared to introduce make to the others:
- http://bioinfoblog.it/2009/03/seminar-on-makefiles-in-bioinformatics/

cheers :); Friday, 22 May, 2009
Norman Walsh said...: The "phone home" part is optional, you can completely disable it with -Dcom.xmlcalabash.phonehome=false; Saturday, 23 May, 2009
Pierre Lindenbaum said...: good to know. Thank you Norman; Saturday, 23 May, 2009

YOKOFAKUN

21 May 2009

XML Pipelines/ XProc for bioinformatics: my notebook

A first workflow

Workflow 2

Workflow 3

Workflow 4

Workflow 5

3 comments:

About Me

Feeds

Blog Archive

Web2.0

Labels