
15 May 2014

How I start a bioinformatics project

Phil Ashton tweeted a link to a paper about how to set up a bioinformatics project file hierarchy: "A Quick Guide to Organizing Computational Biology Projects".

Nick Loman posted his version yesterday, "How I start a bioinformatics project": http://nickloman.github.io/2014/05/14/how-i-start-a-bioinformatics-project/

Here is mine (simplified):

I start by creating a directory managed by git.
I create a JSON-based description of my data (config.json), including the paths to the software and to the reference files.
I create a git submodule for a project hosting an Apache Velocity template that generates a Makefile from config.json.
The Makefile is generated using jsvelocity.
The Makefile is invoked with the option -j N (allow N jobs at once) using GNU Make or qmake (distributed parallel make, scheduled by Sun Grid Engine).
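
The idea behind these steps (a description of the data goes in, a Makefile comes out) can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class MakefileSketch, the sample names and the "bwa mem" rule are hypothetical, and a StringBuilder stands in for the Velocity template and jsvelocity used in the real workflow.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; the real workflow uses a Velocity template and jsvelocity. */
final class MakefileSketch {

    /** Renders one alignment rule per sample; 'aligner' and 'reference' would come from config.json. */
    static String render(String aligner, String reference, Map<String, String> fastqBySample) {
        StringBuilder sb = new StringBuilder(".PHONY: all\n\nall:");
        for (String sample : fastqBySample.keySet()) {
            sb.append(' ').append(sample).append(".sam");
        }
        sb.append('\n');
        for (Map.Entry<String, String> e : fastqBySample.entrySet()) {
            // $< and $@ are GNU Make automatic variables (first prerequisite, target).
            sb.append('\n').append(e.getKey()).append(".sam: ").append(e.getValue()).append('\n')
              .append('\t').append(aligner).append(" mem ").append(reference).append(" $< > $@\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed content of the data description: two samples, one FASTQ each.
        Map<String, String> samples = new LinkedHashMap<>();
        samples.put("S1", "S1.fastq.gz");
        samples.put("S2", "S2.fastq.gz");
        System.out.print(render("/path/to/bwa", "ref.fa", samples));
    }
}
```

Because each .sam target depends only on its own FASTQ, running make with -j 2 on such a Makefile could build S1.sam and S2.sam at the same time.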

That's it,

Pierre

23 October 2013

Inside the variation toolkit: Generating a structured document describing an Illumina directory.

I wrote a tool named "Illuminadir": it creates a structured (JSON or XML) representation of a directory containing Illumina FASTQs (I have only tested it with HiSeq paired-end data and indexes).

Motivation

Illuminadir scans folders, searches for FASTQs and generates a structured summary of the files (XML or JSON).
It has currently only been tested with HiSeq data having an index.
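
To give an idea of what such a scan involves, here is a minimal sketch of how a FASTQ file name could be split into sample, index, lane, side and split. It is not the Illuminadir source code: the class FastqName and the regular expression are assumptions, based only on names of the form SAMPLE1_ATCACG_L008_R1_002.fastq.gz.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not the Illuminadir source code. */
final class FastqName {
    // Assumed layout: <sample>_<index>_L<lane>_R<side>_<split>.fastq.gz
    private static final Pattern NAME = Pattern.compile(
        "(.+)_([ACGTN]+|Undetermined)_L(\\d{3})_R([12])_(\\d{3})\\.fastq\\.gz");

    final String sample;
    final String index; // null for reads whose index could not be determined
    final int lane;
    final int side;
    final int split;

    private FastqName(String sample, String index, int lane, int side, int split) {
        this.sample = sample;
        this.index = index;
        this.lane = lane;
        this.side = side;
        this.split = split;
    }

    /** Returns the parsed fields, or null if the name does not follow the assumed layout. */
    static FastqName parse(String filename) {
        Matcher m = NAME.matcher(filename);
        if (!m.matches()) return null;
        String index = m.group(2).equals("Undetermined") ? null : m.group(2);
        return new FastqName(m.group(1), index,
            Integer.parseInt(m.group(3)),
            Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)));
    }

    public static void main(String[] args) {
        FastqName f = parse("SAMPLE1_ATCACG_L008_R1_002.fastq.gz");
        System.out.println(f.sample + " index=" + f.index + " lane=" + f.lane
            + " side=" + f.side + " split=" + f.split);
    }
}
```

The method expects a bare file name without its directory; files that do not match the assumed layout are simply reported as null.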

Compilation

Options

Option	Description
IN=File	Root directories. This option may be specified 1 or more times.
JSON=Boolean	JSON output. Default value: false. The output can be used with jsvelocity https://github.com/lindenb/jsvelocity

Example

$ java  -jar dist/illuminadir.jar \
	I=dir1 \
	I=dir2 | xsltproc xml2script.xslt > script.bash
(...)

XML output

The XML output looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<illumina>
  <!--com.github.lindenb.jvarkit.tools.misc.IlluminaDirectory IN=[RUN62_XFC2DM8ACXX/data]    JSON=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false-->
  <directory path="RUN62_XFC2DM8ACXX/data">
    <samples>
      <sample name="SAMPLE1">
        <pair md5="cd4b436ce7aff4cf669d282c6d9a7899" lane="8" index="ATCACG" split="2">
          <fastq md5filename="3369c3457d6603f06379b654cb78e696" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.gz" file-size="359046311"/>
          <fastq md5filename="832039fa00b5f40108848e48eb437e0b" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_002.fastq.gz" file-size="359659451"/>
        </pair>
        <pair md5="b3050fa3307e63ab9790b0e263c5d240" lane="8" index="ATCACG" split="3">
          <fastq md5filename="091727bb6b300e463c3d708e157436ab" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_003.fastq.gz" file-size="206660736"/>
          <fastq md5filename="20235ef4ec8845515beb4e13da34b5d3" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_003.fastq.gz" file-size="206715143"/>
        </pair>
        <pair md5="9f7ee49e87d01610372c43ab928939f6" lane="8" index="ATCACG" split="1">
          <fastq md5filename="54cb2fd33edd5c2e787287ccf1595952" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_001.fastq.gz" file-size="354530831"/>
          <fastq md5filename="e937cbdf32020074e50d3332c67cf6b3" side="2" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_001.fastq.gz" file-size="356908963"/>
        </pair>
        <pair md5="0697846a504158eef523c0f4ede85288" lane="7" index="ATCACG" split="2">
          <fastq md5filename="6fb35d130efae4dcfa79260281504aa3" side="1" path="RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L007_R1_002.fastq.gz" file-size="357120615"/>
(...)
      <pair md5="634cbb29ca64604174963a4fff09f37a" lane="7" split="1">
        <fastq md5filename="bc0b283a58946fd75a95b330e0aefdc8" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane7/lane7_Undetermined_L007_R1_001.fastq.gz" file-size="371063045"/>
        <fastq md5filename="9eab26c5b593d50d642399d172a11835" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane7/lane7_Undetermined_L007_R2_001.fastq.gz" file-size="372221753"/>
      </pair>
      <pair md5="bf31099075d6c3c7ea052b8038cb4a03" lane="8" split="2">
        <fastq md5filename="f229389da36a3efc20888bffdec09b80" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_002.fastq.gz" file-size="374331268"/>
        <fastq md5filename="417fd9f28d24f63ce0d0808d97543315" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_002.fastq.gz" file-size="372181102"/>
      </pair>
      <pair md5="95cab850b0608c53e8c83b25cfdb3b2b" lane="8" split="3">
        <fastq md5filename="23f5be8a962697f50e2a271394242e2f" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_003.fastq.gz" file-size="60303589"/>
        <fastq md5filename="3f39f212c36d0aa884b81649ad56630c" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_003.fastq.gz" file-size="59123627"/>
      </pair>
      <pair md5="ab108b1dda7df86f33f375367b86bfe4" lane="8" split="1">
        <fastq md5filename="14f8281cf7d1a53d29cd03cb53a45b4a" side="1" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R1_001.fastq.gz" file-size="371255111"/>
        <fastq md5filename="977fd388e1b3451dfcdbf9bdcbb89ed4" side="2" path="RUN62_XFC2DM8ACXX/data/Undetermined_indices/Sample_lane8/lane8_Undetermined_L008_R2_001.fastq.gz" file-size="370744530"/>
      </pair>
    </undetermined>
  </directory>
</illumina>

How to use that file? Here is an example of an XSLT stylesheet that generates a Makefile, which in turn produces a LaTeX report on the number of reads per Lane/Sample/Index:

<?xml version='1.0'  encoding="ISO-8859-1"?>
<xsl:stylesheet
	xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
	version='1.0' 
	> 
<xsl:output method="text"/>


<xsl:template match="/">
.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $&lt;

report.tex : all.count
	echo 'T&lt;-read.table("$&lt;",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2&lt;-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" &gt; $@
	echo "\usepackage{graphicx}" &gt;&gt; $@
	echo "\date{\today}" &gt;&gt; $@
	echo "\title{FastQ Report}" &gt;&gt; $@
	echo "\begin{document}" &gt;&gt; $@
	echo "\maketitle" &gt;&gt; $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" &gt;&gt; $@ ; )
	echo "\end{document}" &gt;&gt; $@
	


all.count : $(addsuffix .count, <xsl:for-each select="//fastq" ><xsl:value-of select="@md5filename"/><xsl:text> </xsl:text></xsl:for-each>) 
	echo -e "Lane\tsplit\tside\tsize\tcount\tIndex\tSample"  &gt; $@ &amp;&amp; \
	cat $^ &gt;&gt; $@

<xsl:apply-templates select="//fastq" mode="count"/>

clean:
	rm -f all.count report.pdf report.tex $(addsuffix .count, <xsl:for-each select="//fastq" ><xsl:value-of select="@md5filename"/><xsl:text> </xsl:text></xsl:for-each>) 

</xsl:template>

<xsl:template match="fastq" mode="count">
$(addsuffix .count, <xsl:value-of select="@md5filename"/>): <xsl:value-of select="@path"/>
	gunzip -c $&lt; | awk '(NR%4==1)' | wc -l  | xargs  printf "<xsl:value-of select="../@lane"/>\t<xsl:value-of select="../@split"/>\t<xsl:value-of select="@side"/>\t<xsl:value-of select="@file-size"/>\t%s\t<xsl:choose><xsl:when test="../@index"><xsl:value-of select="../@index"/></xsl:when><xsl:otherwise>Undetermined</xsl:otherwise></xsl:choose>\t<xsl:choose><xsl:when test="../../@name"><xsl:value-of select="../../@name"/></xsl:when><xsl:otherwise>Undetermined</xsl:otherwise></xsl:choose>\n"   &gt; $@

</xsl:template>
</xsl:stylesheet>

$ xsltproc illumina2makefile.xsl illumina.xml > Makefile

output:

.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $<

report.tex : all.count
	echo 'T<-read.table("$<",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2<-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" > $@
	echo "\usepackage{graphicx}" >> $@
	echo "\date{\today}" >> $@
	echo "\title{FastQ Report}" >> $@
	echo "\begin{document}" >> $@
	echo "\maketitle" >> $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" >> $@ ; )
	echo "\end{document}" >> $@



all.count : $(addsuffix .count, 3369c3457d6603f06379b654cb78e696 832039fa00b5f40108848e48eb437e0b 091727bb6b300e463c3d708e157436ab 20235ef4ec88....)
	echo -e "Lane\tsplit\tside\tsize\tcount\tIndex\tSample"  > $@ && \
	cat $^ >> $@


$(addsuffix .count, 3369c3457d6603f06379b654cb78e696): RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.gz
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "8\t2\t1\t359046311\t%s\tATCACG\tSAMPLE1\n"   > $@


$(addsuffix .count, 832039fa00b5f40108848e48eb437e0b): RUN62_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R2_002.fastq.gz
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "8\t2\t2\t359659451\t%s\tATCACG\tSAMPLE1\n"   > $@
(....)
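
Each generated rule counts the reads of one file by keeping one line out of four (awk '(NR%4==1)') and counting the result. The same logic can be sketched in Java as follows. The class FastqCount and its gzip helper are hypothetical and only serve the demonstration; like the awk filter, the sketch assumes that every FASTQ record spans exactly four lines.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper mirroring: gunzip -c file | awk '(NR%4==1)' | wc -l */
final class FastqCount {

    /** Counts the header lines (line 1, 5, 9, ...) of a gzipped FASTQ stream. */
    static long countReads(InputStream gzipped) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.US_ASCII))) {
            long lineNumber = 0;
            long reads = 0;
            while (in.readLine() != null) {
                lineNumber++;
                if (lineNumber % 4 == 1) reads++; // same test as awk '(NR%4==1)'
            }
            return reads;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo helper: gzips a string in memory so that the example needs no file. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.US_ASCII));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String twoReads = "@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n";
        System.out.println(countReads(new ByteArrayInputStream(gzip(twoReads))));
    }
}
```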

JSON output

The JSON output looks like this:

[{"directory":"RUN62_XFC2DM8ACXX/data","samples":[{"sample":"SAMPLE1","files":[{
"md5pair":"cd4b436ce7aff4cf669d282c6d9a7899","lane":8,"index":"ATCACG","split":2
,"forward":{"md5filename":"3369c3457d6603f06379b654cb78e696","path":"20131001_SN
L149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_002.fastq.g
z","side":1,"file-size":359046311},"reverse":{"md5filename":"832039fa00b5f401088
48e48eb437e0b","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/
SAMPLE1_ATCACG_L008_R2_002.fastq.gz","side":2,"file-size":359659451}},{"md5pair"
:"b3050fa3307e63ab9790b0e263c5d240","lane":8,"index":"ATCACG","split":3,"forward
":{"md5filename":"091727bb6b300e463c3d708e157436ab","path":"20131001_SNL149_0062
_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_003.fastq.gz","side"
:1,"file-size":206660736},"reverse":{"md5filename":"20235ef4ec8845515beb4e13da34
b5d3","path":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_A
TCACG_L008_R2_003.fastq.gz","side":2,"file-size":206715143}},{"md5pair":"9f7ee49
e87d01610372c43ab928939f6","lane":8,"index":"ATCACG","split":1,"forward":{"md5fi
lename":"54cb2fd33edd5c2e787287ccf1595952","path":"20131001_SNL149_0062_XFC2DM8A
CXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L008_R1_001.fastq.gz","side":1,"file-
size":354530831},"reverse":{"md5filename":"e937cbdf32020074e50d3332c67cf6b3","pa
th":"20131001_SNL149_0062_XFC2DM8ACXX/data/OUT/Sample_SAMPLE1/SAMPLE1_ATCACG_L00
8_R2_001.fastq.gz","side":2,"file-size":356908963}},{"md5pair":"0697846a504158ee
f523c0f4ede85288","lane":7,"index":"ATCACG","split":2,"forward":{"md5filename":"

It can be processed using a tool like jsvelocity to generate the same kind of Makefile:

The velocity template for jsvelocity

#macro(maketarget $fastq)

$(addsuffix .count, ${fastq.md5filename}): ${fastq.path}
	gunzip -c $< | awk '(NR%4==1)' | wc -l  | xargs  printf "${fastq.parentNode.lane}\t${fastq.parentNode.split}\t${fastq.side}\t${fastq['file-size']}\t%s\t#if(${fastq.parentNode.containsKey("index")})${fastq.parentNode.index}#{else}Undetermined#{end}\t#if(${fastq.parentNode.parentNode.containsKey("name")})${fastq.parentNode.parentNode.name}#{else}Undetermined#{end}\n"   > $@

#end

.PHONY:all clean

all: report.pdf

report.pdf: report.tex 
	pdflatex $<

report.tex : all.count
	echo 'T<-read.table("$<",head=TRUE,sep="\t");$(foreach FTYPE,Index Sample Lane, T2<-tapply(T$$count,T$$${FTYPE},sum);png("${FTYPE}.png");barplot(T2,las=3);dev.off();)' | R --no-save
	echo "\documentclass{report}" > $@
	echo "\usepackage{graphicx}" >> $@
	echo "\date{\today}" >> $@
	echo "\title{FastQ Report}" >> $@
	echo "\begin{document}" >> $@
	echo "\maketitle" >> $@
	$(foreach FTYPE,Index Sample Lane, echo "\section{By ${FTYPE}}#\begin{center}#\includegraphics{${FTYPE}.png}#\end{center}" | tr "#" "\n" >> $@ ; )
	echo "\end{document}" >> $@

all.count : $(addsuffix .count, #foreach($dir in $all) #foreach($sample in ${dir.samples})#foreach($pair in ${sample.files}) ${pair.forward.md5filename}  ${pair.reverse.md5filename} #end #end #foreach($pair in   ${dir.undetermined}) ${pair.forward.md5filename}  ${pair.reverse.md5filename} #end  #end )
	echo -e "Lane\tsplit\tside\tsize\tcount\tIndex\tSample"  > $@ && \
	cat $^ >> $@



#foreach($dir in $all)
#foreach($sample in ${dir.samples})
#foreach($pair in ${sample.files})
#maketarget($pair.forward)
#maketarget($pair.reverse)
#end
#end
#foreach($pair in   ${dir.undetermined})
#maketarget($pair.forward)
#maketarget($pair.reverse)
#end 
#end


clean:
	rm -f all.count  $(addsuffix .count,  #foreach($dir in $all)
#foreach($sample in ${dir.samples})
#foreach($pair in ${sample.files}) ${pair.forward.md5filename}  ${pair.reverse.md5filename}  #end #end
#foreach($pair in   ${dir.undetermined}) ${pair.forward.md5filename}  ${pair.reverse.md5filename}  #end  #end )

transform using jsvelocity:

java -jar dist/jsvelocity.jar \
     -d all illumina.json \
      illumina.vm > Makefile

output: same as above

That's it,

Pierre

PS: This post was generated using the XSLT stylesheet "github2html.xsl" and https://github.com/lindenb/jvarkit/wiki/Illuminadir.

17 October 2013

Rapid prototyping of a read-only Lims using JSON and Apache Velocity.

In a previous post, I showed how to use the Apache Velocity template engine to format JSON data.

Since that post, I've moved my application to a github repository: https://github.com/lindenb/jsvelocity. The project contains a java-based standalone tool to process the JSON data.
Here is an example: The JSON data:

{
individuals:[
    {
    name: "Riri",
    age: 8,
    duck: true
    },
    {
    name: "Fifi",
    age: 9,
    duck: true
    },
    {
    name: "Loulou",
    age: 10,
    duck: true
    }
    ]
}

.... and the velocity template:

#foreach($indi in ${all.individuals})
<h1>${indi['name']}</h1>
Age:${indi.age}<br/>${indi.duck}
#end

... with the following command line ...

$ java -jar dist/jsvelocity.jar \
    -f all test.json \
    test.vm

... produces the following output ...

<h1>Riri</h1>
Age:8<br/>true
<h1>Fifi</h1>
Age:9<br/>true
<h1>Loulou</h1>
Age:10<br/>true

Today I wrote a web version of the tool using the jetty server. I wanted to quickly write a web interface to display various summaries for our NGS experiments.
My JSON input looks like this:

{
"sequencer":[
 {
 "name":"HiSeq"
 },
 {
 "name":"MiSeq"
 }
 ],
"run":[ {
 "sequencer":"HiSeq",
 "flowcell":"C2AVTACXX",
 "name":"131010_C2AVTACXX_61",
 "date":"2013-10-10",
 "comment":"A comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 10
  },
  {
  "name":"CD0002",
  "meancov": 20.0
  }
  ,
  {
  "name":"CD0003",
  "meancov": 30.0
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C3VATACYY",
 "name":"131011_C3VATACYY_62",
 "date":"2013-10-11",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0001",
  "meancov": 11
  },
  {
  "name":"CD0006",
  "meancov": 21.0
  }
  ,
  {
  "name":"CD0008",
  "meancov": null
  }
  ]
 },
 {
 "sequencer":"MiSeq",
 "flowcell":"C4VATACYZ",
 "name":"131012_C4VATACYZ_63",
 "date":"2013-10-12",
 "comment":"Another comment",
 "samples":[
  {
  "name":"CD0010",
  "meancov":1,
  "comment":"Failed, please, re-sequence"
  }
  ]
 }
 
 
 ],
"samples":[ 
 { "name":"CD0001" },
 { "name":"CD0002" },
 { "name":"CD0003" }, 
 { "name":"CD0004" }, 
 { "name":"CD0005" }, 
 { "name":"CD0006" }, 
 { "name":"CD0007" }, 
 { "name":"CD0008" },
 { "name":"CD0009" },
 { "name":"CD0010" }
 ],
"projects":[
 {
 "name":"Disease1",
 "description": "sequencing Project 1",
 "samples":["CD0001","CD0002","CD0006","CD0009"]
 },
 {
 "name":"Disease2",
 "description": "sequencing Project 2",
 "samples":["CD0002","CD0003","CD0008","CD0009"]
 }
 ]

}

One velocity template is used to browse this 'database': https://github.com/lindenb/jsvelocity/blob/master/src/test/resources/velocity/lims.vm.
The server is started like:

java -jar dist/webjsvelocity.jar  \
    -F lims src/test/resources/json/lims.json \
    src/test/resources/velocity/lims.vm

2013-10-17 12:43:35.566:INFO:oejs.Server:main: jetty-9.1.0.M0
2013-10-17 12:43:35.602:INFO:oejs.ServerConnector:main: Started ServerConnector@72dcb6{HTTP/1.1}{0.0.0.0:8080}
(...)

And here is a screenshot of the result:

That's it,

Pierre

06 November 2009

Handling RDF Statements with Apache Velocity

This post is about using Apache Velocity (a Java-based template engine) with the Jena RDF library. My aim was to use Velocity to handle the content of one or more RDF stores without compiling anything, just by writing a custom Velocity template. This idea was largely inspired by Egon Willighagen's posts, where the RDF was handled with a scripting engine embedded in Bioclipse. It also seems that I'm not the first to have had the idea of using Velocity+RDF: see [here].
OK, my experimental source code for the program JenaVelocity is available here:

http://code.google.com/p/lindenb/source/browse/trunk/proj/tinytools/src/org/lindenb/tinytools/JenaVelocity.java

Describing the RDFstores

On the command line, one (or more) RDF dataset is described as a JSON document. In the following example this is a remote file, but it could also be the description of a persistent database, an N3 file, etc. This RDF file will also be used later for resolving some names from the bio2rdf repository, which is why I also added a table for the prefix mappings. This RDF model will be inserted in the VelocityContext under the name "$store1".

[
{
"name": "store1",
"url":"http://www.lri.fr/~pietriga/foaf.rdf",
"prefix-mapping":{
"uniprot":"http://bio2rdf.org/uniprot:"
}
}
]

Example 1

In the following example, all the RDF stores are inserted in the VelocityContext as $rdfstores. The template loops over the stores and builds an HTML list containing the names of all the individuals of the FOAF file described previously.

<html><body>
#set($RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#")
#set($FOAF="http://xmlns.com/foaf/0.1/")
<h1>Staff</h1>
<ul>
#foreach($store in $rdfstores)
#set($pred = ${store.model.createProperty("${FOAF}","name")})
#foreach($stmt in
${store.model.listStatements(null,${store.model.createProperty("${FOAF}","name")},null,null)})
<li>${stmt.object.string}</li>
#end
#end
</ul></body></html>

After running JenaVelocity, I got the following result:

Staff

Jean-Daniel Fekete

Chris Bizer

Caroline Appert

Ralph Swick

Vincent Quint

Jean-Yves Vion-Dury

Yves Guiard

Eric Miller

Renaud Blanch

Emmanuel Pietriga

Jose Kahan

Eric Prud'hommeaux

Catherine Letondal

Olivier Chapuis

Michel Beaudouin-Lafon

Nicolas Roussel

Ryan Lee

Wendy Mackay

Example 2

Here, I've inserted an object called $sparql in the VelocityContext. This object is used to send a SPARQL query to the bio2rdf SPARQL endpoint: the statements related to the rdf:type http://bio2rdf.org/ns/uniprot:Strain are fetched and displayed in an HTML table. For each Resource, we try to get a short form of its URI using our previously defined $store1. If the object of a statement is a literal, the quoted string is printed.

<html><body>
#set($RDF="http://www.w3.org/1999/02/22-rdf-syntax-ns#")
#set($FOAF="http://xmlns.com/foaf/0.1/")
<h1>Strains</h1>
<table>
#foreach($row in
$sparql.select("http://quebec.bio2rdf.org/sparql","select distinct ?s
?p ?o where { ?s a <http://bio2rdf.org/ns/uniprot:Strain> . ?s ?p ?o}
LIMIT 100"))
<tr>
<td><a href="${row.get("s").getURI()}">${store1.shortForm(${row.get("s").getURI()})}</a></td>
<td><a href="${row.get("p").getURI()}">${store1.shortForm(${row.get("p").getURI()})}</a></td>
<td>#if(${row.get("o").isResource()})
<a href="${row.get("o").getURI()}">${store1.shortForm(${row.get("o").getURI()})}</a>
#else
<span>"$row.get("o").string"</span>
#end</td>
</tr>
#end
</table>
</body></html>

After running JenaVelocity, I got the following result:

Strains

uniprot:Q8C7G5_5	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7G5_5	dc:title	"C57BL/6J"
uniprot:Q8C7G5_8	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7G5_8	dc:title	"FVB/N"
uniprot:Q8C7H1_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7H1_2	dc:title	"C57BL/6J"
uniprot:Q8C7K6_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7K6_2	dc:title	"C57BL/6J"
uniprot:Q8C7K6_5	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7K6_5	dc:title	"C57BL/6"
uniprot:Q8C7M3_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7M3_2	dc:title	"C57BL/6J"
uniprot:Q8C7M3_A	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7M3_A	dc:title	"C57BL/6"
uniprot:Q8C7N7_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7N7_2	dc:title	"C57BL/6J"
uniprot:Q8C7N7_3	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7N7_3	dc:title	"NOD"
uniprot:Q8C7N7_7	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7N7_7	dc:title	"C57BL/6"
uniprot:Q8C7Q4_3	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7Q4_3	dc:title	"C57BL/6J"
uniprot:Q8C7R4_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7R4_2	dc:title	"C57BL/6J"
uniprot:Q8C7R4_5	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7R4_5	dc:title	"C57BL/6"
uniprot:Q8C7U1_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7U1_2	dc:title	"C57BL/6J"
uniprot:Q8C7U1_3	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7U1_3	dc:title	"NOD"
uniprot:Q8C7U1_7	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7U1_7	dc:title	"FVB/N"
uniprot:Q8C7U7_4	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7U7_4	dc:title	"C57BL/6J"
uniprot:Q8C7U7_5	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7U7_5	dc:title	"NOD"
uniprot:Q8C7V3_4	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7V3_4	dc:title	"C57BL/6J"
uniprot:Q8C7V3_7	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7V3_7	dc:title	"C57BL/6"
uniprot:Q8C7V8_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7V8_2	dc:title	"C57BL/6J"
uniprot:Q8C7V8_3	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7V8_3	dc:title	"NOD"
uniprot:Q8C7V8_8	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7V8_8	dc:title	"C57BL/6"
uniprot:Q8C7W7_2	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7W7_2	dc:title	"C57BL/6J"
uniprot:Q8C7W7_3	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7W7_3	dc:title	"NOD"
uniprot:Q8C7W7_8	rdf:type	http://bio2rdf.org/ns/uniprot:Strain
uniprot:Q8C7W7_8	dc:title	"Czech II"
(...)	(...)	(...)
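
The shortForm calls in the template replace a known namespace by its prefix and leave any other URI untouched, which is why http://bio2rdf.org/ns/uniprot:Strain stays in its long form above. A minimal sketch of that behaviour in plain Java is shown below. It does not use Jena: the class ShortForm is hypothetical and the real PrefixMapping implementation may differ in its details.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical re-implementation of the prefix shortening, for illustration only. */
final class ShortForm {

    /** Returns prefix:localName if a namespace matches, otherwise the URI unchanged. */
    static String shortForm(Map<String, String> namespaceByPrefix, String uri) {
        for (Map.Entry<String, String> e : namespaceByPrefix.entrySet()) {
            String namespace = e.getValue();
            if (uri.startsWith(namespace)) {
                return e.getKey() + ":" + uri.substring(namespace.length());
            }
        }
        return uri;
    }

    public static void main(String[] args) {
        // Same mapping as the "prefix-mapping" entry of the JSON description.
        Map<String, String> prefixes = new LinkedHashMap<>();
        prefixes.put("uniprot", "http://bio2rdf.org/uniprot:");
        System.out.println(shortForm(prefixes, "http://bio2rdf.org/uniprot:Q8C7G5_5"));
        System.out.println(shortForm(prefixes, "http://bio2rdf.org/ns/uniprot:Strain"));
    }
}
```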

Conclusion: Velocity templates make it possible to handle and render RDF data without compiling anything. However, prior knowledge of the Jena API is required.

That's it.

Pierre

03 October 2008

Java Wrappers for the tables of the UCSC/GoldenPath

Years ago, Jim Kent (UCSC), the author of the BLAT algorithm, published "autoSql and autoXml: Code Generators from the Genome Project" in the Linux Journal. These tools generate database definitions for SQL, write C header files with your data definitions and function prototypes, write C code to get data to and from C structures, and generate C code for an XML parser.

For example, the following 'as' file (http://hgwdev.cse.ucsc.edu/~kent/src/unzipped/hg/lib/cytoBand.as), which is the definition of the table called cytoBand:

table cytoBand
"Describes the positions of cytogenetic bands with a chromosome"
   (
   string chrom;    "Reference sequence chromosome or scaffold"
   uint   chromStart;  "Start position in genoSeq"
   uint   chromEnd;    "End position in genoSeq"
   string name;       "Name of cytogenetic band"
   string   gieStain; "Giemsa stain results"
   )

will be used to generate the following SQL definition:

# cytoBand.sql was originally generated by the autoSql program, which also 
# generated cytoBand.c and cytoBand.h.  This creates the database representation of
# an object which can be loaded and saved from RAM in a fairly 
# automatic way.

#Describes the positions of cytogenetic bands with a chromosome
CREATE TABLE cytoBand (
    chrom varchar(255) not null, # Human chromosome number
    chromStart int unsigned not null, # Start position in genoSeq
    chromEnd int unsigned not null, # End position in genoSeq
    name varchar(255) not null, # Name of cytogenetic band
    gieStain varchar(255) not null, # Giemsa stain results
              #Indices
    PRIMARY KEY(chrom(12),chromStart),
    UNIQUE(chrom(12),chromEnd)
);

as well as the C code and the C header.

As a Java programmer, I wanted to create my own wrappers to use the data of the UCSC. I wrote a custom ANT task (the code is available here) that uses the public MySQL server of the UCSC to get the structure of each table (e.g. desc cytoBand) and the description of each table (e.g. select autoSqlDef from tableDescriptions where tableName="cytoBand"). Each structure is parsed and injected into an apache-velocity template (the template is available here).

Here is an example of a source generated by the ant task:

http://pastie.org/284507

As you can see, 'enum' and 'set' columns are transformed into Java enums, and the getters and a TableModel (for GUI/Swing) are created. Each class also comes with some useful static methods creating the instances from a SQL query. For example, here is what I wrote today to grab the information about the genes/cytobands/hapmap for a set of SNPs.

for(RsId rsid: rsSet)
    {
    // fetch the SNP by its name
    PreparedStatement pstmt=con.prepareStatement("select * from snp129 where name=?");
    pstmt.setString(1, rsid.getName());
    Hg18Snp129 snp= Hg18Snp129.selectOneOrZero(pstmt.executeQuery());
    if(snp==null)
        {
        cout().println(rsid.getName()+TAB+"##Not FOUND");
        continue;
        }

    // (... print information about this snp ...)

    // cytobands overlapping the SNP
    pstmt=con.prepareStatement("select * from cytoBand where chrom=? and chromStart<=? and chromEnd>=?");
    pstmt.setString(1, snp.getChrom());
    pstmt.setInt(2, snp.getChromStart());
    pstmt.setInt(3, snp.getChromEnd());

    for(Hg18CytoBand band:Hg18CytoBand.select(pstmt.executeQuery()))
        {
        // (... print info about this cytoband ...)
        }
    cout().print(TAB);

    // genes overlapping the SNP, printed as a comma-separated list
    pstmt=con.prepareStatement("select * from refGene where chrom=? and txStart<=? and txEnd>=?");
    pstmt.setString(1, snp.getChrom());
    pstmt.setInt(2, snp.getChromStart());
    pstmt.setInt(3, snp.getChromEnd());
    i=0;
    for(Hg18RefGene  gene : Hg18RefGene.select(pstmt.executeQuery()))
        {
        if(i>0) cout().print(",");
        cout().print(gene.getName()+"/"+gene.getName2());
        ++i;
        }
    cout().print(TAB);

    // one lookup per hapmap table
    for(String hapmapDb:HAPMAPDB)
        {
        pstmt=con.prepareStatement("select * from "+hapmapDb+" where name=?");
        pstmt.setString(1, snp.getName());
        Hg18HapmapSnps hapmap= Hg18HapmapSnps.selectOneOrZero(pstmt.executeQuery());
        if(hapmap==null)
            {
            // (... print empty fields ...)
            }
        else
            {
            // (... print result ...)
            }
        }
    }

Note: Hibernate is also a popular tool to map objects to databases. But here everything is read-only (we don't need any transaction) and the relationships between the tables are too complicated to be described using a mapping file (e.g. see the numerous "Connected Tables and Joining Fields" for the table knownGene).

That's it

Pierre

17 September 2008

Generating C code with apache-velocity

I'm currently working on Operon (http://regulon.cng.fr/), a database developed by Mario Foglio at the National Center of Genotyping. The whole database/storage is developed around the Berkeley DB C API and I've been asked to write a clean C API to access the data. Most data are stored in C structures and I wanted to quickly write the methods to:
* create a new instance of each structure
* free the resources allocated by each structure
* create a vector of those structures with the common methods (addElement, removeElement, getSize, clear, etc...)
* etc...

I wrote a description of a few structures in XML. Something like this:

<?xml version="1.0" encoding="UTF-8"?>
<op:operon
        xmlns:h="http://www.w3.org/1999/xhtml"
        xmlns:op="http://operon.cng.fr"
        >
<op:table name="SnpIds">
        <op:description>
                SNPIDS Berkeley Hash db: stores all SNP ids. The key for this
                database is the acn, and duplicate acn keys are allowed.
        </op:description>
        <op:column name="fid" type="char*">
                <op:description>fid: SNP feature id</op:description>
        </op:column>
        <op:column name="acn" type="char*">
                <op:description>acn: SNP accession</op:description>
        </op:column>
</op:table>
</op:operon>

To generate my C code, I first tried XSLT but soon found it too ugly.
I then looked for something like a standalone version of JavaServer Pages (JSP). I didn't find one (it would have been nice to re-use the custom tags).
I then tried apache-velocity (http://velocity.apache.org/), a Java template engine, and this is the technology I used.

OK, these C structures can be described with Java interfaces:

import java.util.Collection;

public interface CField
  {
  public String getName();
  public String getType();
  // (...)
  }

public interface CStructure
  {
  public Collection<CField> getFields();
  public String getName();
  // (...)
  }

Those objects are created by parsing the XML description of the structures and are then associated with a string in the 'context' of velocity. (source code [here]).

CStructure mystructure;
  // (...)
  velocityContext.put("struct",mystructure);

The Velocity engine is then called; it uses Java reflection to resolve the references found in the template. For example, the following template:

 typedef struct $struct.typedef
  {
  #foreach($field in ${struct.fields})
        /**
    * ${field.name}
    * ${field.description}
    */
   ${field.type} ${field.name};
  #end
  } ${struct.name}, *${struct.name}Ptr;

will generate the C header for this structure.
The Velocity templates generating the *.c and the *.h are available [here] and [here] (warning: this is a work in progress).
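
A reference such as ${field.name} is resolved at run time by looking for a matching getter on the object stored in the context. The sketch below shows that mechanism with the standard java.lang.reflect API only. It is a simplified illustration: the classes PropertyLookup and DemoField are hypothetical, and Velocity's own introspection handles more cases than a single getXxx() lookup.

```java
import java.lang.reflect.Method;

/** Hypothetical, simplified version of the getter lookup done by a template engine. */
final class PropertyLookup {

    /** Resolves the property "name" on a bean by calling bean.getName(). */
    static Object resolve(Object bean, String property) {
        String getter = "get" + Character.toUpperCase(property.charAt(0)) + property.substring(1);
        try {
            Method method = bean.getClass().getMethod(getter);
            return method.invoke(bean);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No readable property: " + property, e);
        }
    }

    /** Stand-in for an implementation of the CField interface (an assumption of this sketch). */
    public static final class DemoField {
        private final String name;
        private final String type;

        public DemoField(String name, String type) {
            this.name = name;
            this.type = type;
        }

        public String getName() { return name; }
        public String getType() { return type; }
    }

    public static void main(String[] args) {
        DemoField field = new DemoField("fid", "char*");
        // Equivalent of the template line: ${field.type} ${field.name};
        System.out.println(resolve(field, "type") + " " + resolve(field, "name") + ";");
    }
}
```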

But that is not all: I also wanted to query each Berkeley database without having to write new code for each new kind of query. So I've used Velocity to generate Flex/lex and Bison/yacc files. Those tools then generate a simple parser that builds a concrete syntax tree, which is then used to search each database.

YNodePtr search = mydatabaseParseQuery("AND(LT([chromEnd],10000),GT([chromStart],100))");
myDatabaseArray array= myDatabaseSearch(search);

The Velocity templates for flex and bison are available [here] and [here] (again, warning: this is a work in progress).
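
To make the query syntax more concrete, here is a small hand-written recursive-descent evaluator for expressions such as AND(LT([chromEnd],10000),GT([chromStart],100)). It is only a sketch under several assumptions: the class QuerySketch is hypothetical, only the operators AND, OR, LT and GT are supported, whitespace is not allowed, and the expression is evaluated directly against one record instead of building a syntax tree as the Flex/Bison parser does.

```java
import java.util.Map;

/** Hypothetical evaluator for the prefix query syntax; not the generated Flex/Bison parser. */
final class QuerySketch {
    private final String query;
    private final Map<String, Long> row;
    private int pos = 0;

    private QuerySketch(String query, Map<String, Long> row) {
        this.query = query;
        this.row = row;
    }

    /** Returns true if the record (column name -> value) satisfies the query. */
    static boolean matches(String query, Map<String, Long> row) {
        QuerySketch parser = new QuerySketch(query, row);
        boolean result = parser.expr();
        if (parser.pos != query.length()) {
            throw new IllegalArgumentException("Trailing characters at " + parser.pos);
        }
        return result;
    }

    // expr := (AND|OR) '(' expr ',' expr ')'  |  (LT|GT) '(' '[' column ']' ',' number ')'
    private boolean expr() {
        String op = read(true);
        expect('(');
        boolean result;
        if (op.equals("AND") || op.equals("OR")) {
            boolean left = expr();
            expect(',');
            boolean right = expr();
            result = op.equals("AND") ? (left && right) : (left || right);
        } else if (op.equals("LT") || op.equals("GT")) {
            expect('[');
            String column = read(true);
            expect(']');
            expect(',');
            long limit = Long.parseLong(read(false));
            Long value = row.get(column);
            if (value == null) throw new IllegalArgumentException("Unknown column " + column);
            result = op.equals("LT") ? (value < limit) : (value > limit);
        } else {
            throw new IllegalArgumentException("Unknown operator '" + op + "'");
        }
        expect(')');
        return result;
    }

    /** Reads a run of letters (letters == true) or of digits (letters == false). */
    private String read(boolean letters) {
        int start = pos;
        while (pos < query.length() && (letters
                ? Character.isLetter(query.charAt(pos))
                : Character.isDigit(query.charAt(pos)))) {
            pos++;
        }
        return query.substring(start, pos);
    }

    private void expect(char c) {
        if (pos >= query.length() || query.charAt(pos) != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at " + pos);
        }
        pos++;
    }

    public static void main(String[] args) {
        Map<String, Long> record = Map.of("chromStart", 500L, "chromEnd", 9000L);
        System.out.println(matches("AND(LT([chromEnd],10000),GT([chromStart],100))", record));
    }
}
```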

That's it

Pierre
