23 April 2009

A Tag Cloud for my Resume.

I'm revising my CV as I'll move to Nantes, and I wanted to create a tag cloud to illustrate my resume. Paul and Richard suggested using Wordle to generate the cloud, but I wanted to generate it on the fly, for any language, whenever I want, etc.
So I stored my skills in an RDF file which looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE rdf:RDF [
<!ENTITY info "plum">
<!ENTITY bio "blue">
<!ENTITY other "lightgray">
<!ENTITY devtool "magenta">
<!ENTITY devlang "darkRed">
<!ENTITY os "purple">
<!ENTITY database "orange">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns="http://ontology.lindenb.org/tagcloud/">
<Tag rdf:about="https://javacc.dev.java.net/">
<weight>25</weight>
<label>Javacc</label>
<title xml:lang="en">JavaCC is a parser/scanner generator for java</title>
<title xml:lang="fr">Un generateur de parser pour java</title>
<color>magenta</color>
</Tag>
(...)
<Tag rdf:about="http://en.wikipedia.org/wiki/Awk">
<weight>25</weight>
<label>Awk</label>
<title>Awk</title>
<color>darkRed</color>
</Tag>

<Tag rdf:about="http://en.wikipedia.org/wiki/GNU_bison">
<weight>25</weight>
<label>Lex/Yacc</label>
<title>Lex/Yacc &amp; Flex/Bison</title>
<color>magenta</color>
</Tag>
(...)
</rdf:RDF>

Advantages: I can store the labels in various languages, use XML entities like <!ENTITY devlang "darkRed"> to quickly change a color, etc.
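The stylesheet itself is not reproduced here, but the essential computation of any tag-cloud renderer, whether done in XSLT or elsewhere, is mapping each tag's weight onto a font-size range. Here is a minimal sketch of that mapping in Java; the size bounds are my own assumption, not values from the stylesheet:

```java
// Map a tag weight onto a font-size range by linear interpolation.
// A tag-cloud XSLT stylesheet typically performs this same computation.
public class TagCloudScale {
    static final int MIN_PT = 10, MAX_PT = 36; // hypothetical size bounds

    static int fontSize(int weight, int minWeight, int maxWeight) {
        if (maxWeight == minWeight) return (MIN_PT + MAX_PT) / 2;
        double t = (weight - minWeight) / (double) (maxWeight - minWeight);
        return (int) Math.round(MIN_PT + t * (MAX_PT - MIN_PT));
    }

    public static void main(String[] args) {
        // a tag of weight 25 on a 0..100 scale, like the entries above
        System.out.println(fontSize(25, 0, 100) + "pt");
    }
}
```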

This XML file is then transformed with the following XSLT stylesheet


And (tada!) here is the result:



Waves... :-)

(And the icing on the cake: it is an RDFa output.)

Note: Pawel Szczesny did a great job with his CV too.


Pierre

21 April 2009

Hadoop, my notebook: HDFS

This post is about Apache Hadoop, an open-source framework implementing the MapReduce algorithm. This first notebook focuses on HDFS, the Hadoop Distributed File System, and follows the great Yahoo! Hadoop Tutorial Home. Forget the clusters: I'm running this Hadoop engine on my one and only laptop.

Downloading & Installing


~/tmp/HADOOP> wget "http://apache.multidist.com/hadoop/core/hadoop-0.19.1/hadoop-0.19.1.tar.gz"
Saving to: `hadoop-0.19.1.tar.gz'

100%[======================================>] 55,745,146 487K/s in 1m 53s

2009-04-21 20:52:04 (480 KB/s) - `hadoop-0.19.1.tar.gz' saved [55745146/55745146]
~/tmp/HADOOP> tar xfz hadoop-0.19.1.tar.gz
~/tmp/HADOOP> rm hadoop-0.19.1.tar.gz
~/tmp/HADOOP> mkdir -p hdfs/data
~/tmp/HADOOP> mkdir -p hdfs/name
# hum... this step was not clear as I'm not an ssh guru. I had to give my root password to make the server start
~/tmp/HADOOP> ssh-keygen -t rsa -P 'password' -f ~/.ssh/id_rsa
Generating public/private rsa key pair.
Your identification has been saved in /home/pierre/.ssh/id_rsa.
Your public key has been saved in /home/pierre/.ssh/id_rsa.pub.
The key fingerprint is:
17:c0:29:b4:56:d1:d3:dd:ae:d5:ba:3e:5b:33:b0:99 pierre@linux-zfgk
~/tmp/HADOOP> cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Editing the Cluster configuration


Edit the file hadoop-0.19.1/conf/hadoop-site.xml.
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
<description>This is the URI (protocol specifier, hostname, and port) that describes the NameNode (main Node) for the cluster.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/home/pierre/tmp/HADOOP/hdfs/data</value>
<description>This is the path on the local file system in which the DataNode instance should store its data</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/home/pierre/tmp/HADOOP/hdfs/name</value>
<description>This is the path on the local file system of the NameNode instance where the NameNode metadata is stored.</description>
</property>
</configuration>

Formatting HDFS


HDFS the Hadoop Distributed File System "HDFS is a block-structured file system: individual files are broken into blocks of a fixed size. These blocks are stored across a cluster of one or more machines with data storage capacity. A file can be made of several blocks, and they are not necessarily stored on the same machine(...)If several machines must be involved in the serving of a file, then a file could be rendered unavailable by the loss of any one of those machines. HDFS combats this problem by replicating each block across a number of machines (3, by default)."
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop namenode -format
09/04/21 21:11:18 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = linux-zfgk.site/127.0.0.2
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 0.19.1
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.19 -r 745977; compiled by 'ndaley' on Fri Feb 20 00:16:34 UTC 2009
************************************************************/
Re-format filesystem in /home/pierre/tmp/HADOOP/hdfs/name ? (Y or N) Y
09/04/21 21:11:29 INFO namenode.FSNamesystem: fsOwner=pierre,users,dialout,video
09/04/21 21:11:29 INFO namenode.FSNamesystem: supergroup=supergroup
09/04/21 21:11:29 INFO namenode.FSNamesystem: isPermissionEnabled=true
09/04/21 21:11:29 INFO common.Storage: Image file of size 96 saved in 0 seconds.
09/04/21 21:11:29 INFO common.Storage: Storage directory /home/pierre/tmp/HADOOP/hdfs/name has been successfully formatted.
09/04/21 21:11:29 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at linux-zfgk.site/127.0.0.2
************************************************************/
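As a back-of-the-envelope illustration of the block structure described in the quote above, here is a sketch computing how many blocks a file occupies on HDFS, using the defaults of that era (64 MB blocks, replication factor 3; both are configurable):

```java
// How HDFS carves a file into fixed-size blocks.
// Defaults in Hadoop 0.19: 64 MB blocks, each replicated 3 times.
public class HdfsBlocks {
    static final long BLOCK = 64L * 1024 * 1024; // dfs.block.size default
    static final int REPLICATION = 3;            // dfs.replication default

    static long blockCount(long fileSize) {
        return (fileSize + BLOCK - 1) / BLOCK; // ceiling division
    }

    public static void main(String[] args) {
        long size = 55745146L; // the hadoop-0.19.1.tar.gz downloaded above
        System.out.println(blockCount(size) + " block(s), "
            + blockCount(size) * REPLICATION + " replica(s) stored");
    }
}
```

The 55 MB tarball fits in a single 64 MB block, so on a real cluster it would occupy three replicas of one block.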

Starting HDFS


~/tmp/HADOOP> hadoop-0.19.1/bin/start-dfs.sh
starting namenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-namenode-linux-zfgk.out
Password:
localhost: starting datanode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-datanode-linux-zfgk.out
Password:
localhost: starting secondarynamenode, logging to /home/pierre/tmp/HADOOP/hadoop-0.19.1/bin/../logs/hadoop-pierre-secondarynamenode-linux-zfgk.out

Playing with HDFS


First, download a few SNPs from UCSC/dbSNP into ~/local.xls.
~/tmp/HADOOP> mysql -N --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg18 -e 'select name,chrom,chromStart,avHet from snp129 where avHet!=0 and name like "rs12345%" ' > ~/local.xls

Creating directories
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -mkdir /user/pierre

Copying a file "local.xls" from your local file system to HDFS
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -put ~/local.xls stored.xls

Recursive listing of HDFS
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -lsr /
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user
drwxr-xr-x - pierre supergroup 0 2009-04-21 21:45 /user/pierre
-rw-r--r-- 3 pierre supergroup 308367 2009-04-21 21:45 /user/pierre/stored.xls
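Each line of the recursive listing mirrors ls -l: permissions, replication factor (a dash for directories), owner, group, size, modification date/time, and path; the 3 on the file line is the default replication factor. A small illustrative parser, assuming whitespace-separated fields:

```java
// Split one line of `hadoop dfs -lsr` output into its fields.
// Field layout: perms, replication, owner, group, size, date, time, path.
public class LsrLine {
    static String[] fields(String line) {
        return line.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String line = "-rw-r--r-- 3 pierre supergroup 308367 2009-04-21 21:45 /user/pierre/stored.xls";
        String[] f = fields(line);
        System.out.println("replication=" + f[1] + " size=" + f[4] + " path=" + f[7]);
    }
}
```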

'cat' the first lines of the SNP file stored on HDFS:
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -cat /user/pierre/stored.xls | head
rs12345003 chr9 1765426 0.02375
rs12345004 chr9 2962430 0.055768
rs12345006 chr9 74304094 0.009615
rs12345007 chr9 73759324 0.112463
rs12345008 chr9 88421765 0.014184
rs12345013 chr9 78951530 0.104463
rs12345014 chr9 78542260 0.490608
rs12345015 chr9 10121973 0.201446
rs12345016 chr9 2698257 0.456279
rs12345027 chr9 8399632 0.04828

Removing a file. Note: "On startup, the NameNode enters a special state called Safemode." I could not delete a file before running "dfsadmin -safemode leave".
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfsadmin -safemode leave
Safe mode is OFF
~/tmp/HADOOP> hadoop-0.19.1/bin/hadoop dfs -rm /user/pierre/stored.xls
Deleted hdfs://localhost:9000/user/pierre/stored.xls

Check that there is NO file named stored.xls in the local file system!
~/tmp/HADOOP> find hdfs/
hdfs/
hdfs/data
hdfs/data/detach
hdfs/data/in_use.lock
hdfs/data/tmp
hdfs/data/current
hdfs/data/current/blk_3340572659657793789
hdfs/data/current/dncp_block_verification.log.curr
hdfs/data/current/blk_3340572659657793789_1002.meta
hdfs/data/current/VERSION
hdfs/data/storage
hdfs/name
hdfs/name/in_use.lock
hdfs/name/current
hdfs/name/current/edits
hdfs/name/current/VERSION
hdfs/name/current/fsimage
hdfs/name/current/fstime
hdfs/name/image
hdfs/name/image/fsimage


Stop HDFS


~/tmp/HADOOP> hadoop-0.19.1/bin/stop-dfs.sh
stopping namenode
Password:
localhost: stopping datanode
Password:
localhost: stopping secondarynamenode



Pierre

10 April 2009

Resolving LSID: my notebook

This post is about LSID (The Life Science Identifier) and was inspired by the recent activity of Roderic Page on Twitter and by Roderic's paper "LSID Tester, a tool for testing Life Science Identifier resolution services".

OK.
At the beginning, there is a LSID

urn:lsid:ubio.org:namebank:11815

ubio.org is the authority. It is followed by a namespace (here, the namebank database) and an id.
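Splitting an LSID into those parts is straightforward; here is a sketch (the optional revision field of the LSID syntax is ignored here):

```java
// An LSID has the form urn:lsid:<authority>:<namespace>:<id>[:<revision>]
public class LsidParts {
    static String[] parse(String lsid) {
        String[] t = lsid.split(":");
        if (t.length < 5 || !t[0].equalsIgnoreCase("urn") || !t[1].equalsIgnoreCase("lsid"))
            throw new IllegalArgumentException("not an LSID: " + lsid);
        return new String[]{t[2], t[3], t[4]}; // authority, namespace, id
    }

    public static void main(String[] args) {
        String[] p = parse("urn:lsid:ubio.org:namebank:11815");
        System.out.println("authority=" + p[0] + " namespace=" + p[1] + " id=" + p[2]);
    }
}
```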
We need to resolve this authority to find some metadata about this LSID object. On unix, we prepend _lsid._tcp to this authority and use the host command to ask the DNS for the SRV record of the lsid service with TCP as the network protocol (I'm not really sure what that really means, and I guess this can be a problem for other bioinformaticians too).
%host -t srv _lsid._tcp.ubio.org
_lsid._tcp.ubio.org has SRV record 1 0 80 ANIMALIA.ubio.org.
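The four fields of the answer are the standard DNS SRV fields: priority 1, weight 0, port 80, and the target host. A sketch turning such a record into the authority endpoint, assuming the /authority path convention used below:

```java
// Interpret an SRV answer such as "1 0 80 ANIMALIA.ubio.org." (RFC 2782):
// the fields are priority, weight, port, and target host.
public class SrvToUrl {
    static String endpoint(String srvRecord) {
        String[] f = srvRecord.trim().split("\\s+");
        int port = Integer.parseInt(f[2]);
        // DNS names end with a trailing dot; strip it for the URL
        String host = f[3].endsWith(".") ? f[3].substring(0, f[3].length() - 1) : f[3];
        return "http://" + host + (port == 80 ? "" : ":" + port) + "/authority/";
    }

    public static void main(String[] args) {
        System.out.println(endpoint("1 0 80 ANIMALIA.ubio.org."));
    }
}
```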

So http://ANIMALIA.ubio.org is the location of the LSID service. We append /authority and we get a WSDL file at http://animalia.ubio.org/authority/ (this WSDL is another issue for me: are there really that many bioinformaticians who know how to read this format?).

<wsdl:definitions targetNamespace="http://www.hyam.net/lsid/Authority">
<import namespace="http://www.omg.org/LSID/2003/AuthorityServiceHTTPBindings"
location="LSIDAuthorityServiceHTTPBindings.wsdl"
/>

<wsdl:service name="MyAuthorityHTTPService">
<wsdl:port name="MyAuthorityHTTPPort" binding="httpsns:LSIDAuthorityHTTPBinding">
<httpsns:address location="http://animalia.ubio.org/authority/index.php"/>
</wsdl:port>
</wsdl:service>
</wsdl:definitions>

At http://animalia.ubio.org/authority/LSIDAuthorityServiceHTTPBindings.wsdl we get the Http bindings.
<definitions targetNamespace="http://www.omg.org/LSID/2003/AuthorityServiceHTTPBindings">
<import namespace="http://www.omg.org/LSID/2003/Standard/WSDL" location="LSIDPortTypes.wsdl"/>
<binding name="LSIDAuthorityHTTPBinding" type="sns:LSIDAuthorityServicePortType">
<http:binding verb="GET"/>
<operation name="getAvailableServices">
<http:operation location="/authority/"/>
<input>
<http:urlEncoded/>
</input>
<output>
<mime:multipartRelated>
<mime:part>
<mime:content part="wsdl" type="application/octet-stream"/>
</mime:part>
</mime:multipartRelated>
</output>
</operation>
</binding>
</definitions>

This WSDL tells us that http://animalia.ubio.org/authority/ is the URL where we can find some metadata about the LSID, using HTTP GET. By appending metadata.php (why this .php extension? this is not clear to me) you'll get the following RDF metadata about urn:lsid:ubio.org:namebank:11815 (very cool, I like this idea of getting RDF from one identifier). The process of resolving the WSDL can be carried out once and cached.

<rdf:RDF>
<rdf:Description rdf:about="urn:lsid:ubio.org:namebank:11815">
<dc:identifier>urn:lsid:ubio.org:namebank:11815</dc:identifier>
<dc:creator rdf:resource="http://www.ubio.org"/>
<dc:subject>Pternistis leucoscepus (Gray, GR) 1867</dc:subject>
<ubio:taxonomicGroup>Aves</ubio:taxonomicGroup>
<ubio:recordVersion>4</ubio:recordVersion>
<ubio:canonicalName>Pternistis leucoscepus</ubio:canonicalName>
<dc:title>Pternistis leucoscepus</dc:title>
<dc:type>Scientific Name</dc:type>
<ubio:lexicalStatus>Unknown (Default)</ubio:lexicalStatus>
<gla:rank>Species</gla:rank>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:954940"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:954941"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:1564236"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:783787"/>
<gla:vernacularName rdf:resource="urn:lsid:ubio.org:namebank:1580313"/>
<gla:mapping rdf:resource="http://starcentral.mbl.edu/microscope/portal.php?pagetitle=classification&amp;BLCHID=12-4498"/>
<gla:mapping rdf:resource="http://www.cbif.gc.ca/pls/itisca/next?v_tsn=553857&amp;taxa=&amp;p_format=&amp;p_ifx=cbif&amp;p_lang="/>
<gla:hasBasionym rdf:resource="urn:lsid:ubio.org:namebank:12292"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:12292"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762007"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762032"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:1762051"/>
<gla:objectiveSynonym rdf:resource="urn:lsid:ubio.org:namebank:3408791"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1116259"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1137821"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1173817"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1174615"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1416177"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1672192"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:2233032"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:13853963"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:1909656"/>
<ubio:hasCAVConcept rdf:resource="urn:lsid:ubio.org:classificationbank:2304281"/>
<dcterms:bibliographicCitation>Sclater, W.L., Systema Avium Ethiopicarum, p. 91</dcterms:bibliographicCitation>
</rdf:Description>
</rdf:RDF>


notebook EOF.

XML to DOM using XSLT

A short post. I was fed up with writing JavaScript/Java code for creating dynamic web interfaces (you know, all those document.createElementNS, document.createTextNode, node.appendChild, etc. statements for building the DOM), so I wrote an XSL stylesheet that takes an XML file as input and echoes the code that should be used to build the document. The stylesheet is available at:


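The transformation the stylesheet performs is mechanical: walk the input tree and emit one createElementNS / appendChild / setAttribute statement per node. The same generation can be sketched in Java with a recursive DOM walk; the sequential identifier-numbering here is my own simplification of the generate-id()-style ids seen in the output below:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;

// Walk a DOM tree and print the JavaScript statements that would rebuild it,
// mimicking the output of an xml2dom-style stylesheet.
public class Xml2Dom {
    static int counter = 0; // stands in for XSLT's generate-id()

    static String emit(Element e, String parentVar, StringBuilder out) {
        String var = e.getNodeName() + "_id" + (++counter);
        out.append("var ").append(var).append("= document.createElementNS(XUL.NS,\"")
           .append(e.getNodeName()).append("\");\n");
        if (parentVar != null)
            out.append(parentVar).append(".appendChild(").append(var).append(");\n");
        NamedNodeMap atts = e.getAttributes();
        for (int i = 0; i < atts.getLength(); i++) {
            Attr a = (Attr) atts.item(i);
            out.append(var).append(".setAttribute(\"").append(a.getName())
               .append("\",\"").append(a.getValue()).append("\");\n");
        }
        for (Node c = e.getFirstChild(); c != null; c = c.getNextSibling())
            if (c.getNodeType() == Node.ELEMENT_NODE) emit((Element) c, var, out);
        return var;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<window id=\"example-window\"><listbox/></window>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        StringBuilder out = new StringBuilder();
        emit(doc.getDocumentElement(), null, out);
        System.out.print(out);
    }
}
```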
For example the following XUL document:
<window id="example-window" title="Example 2.5.4">
<listbox>
<listhead>
<listheader label="Name"/>
<listheader label="Occupation"/>
</listhead>
<listcols>
<listcol/>
<listcol flex="1"/>
</listcols>
<listitem>
<listcell label="George"/>
<listcell label="House Painter"/>
</listitem>
<listitem>
<listcell label="Mary Ellen"/>
<listcell label="Candle Maker"/>
</listitem>
<listitem>
<listcell label="Roger"/>
<listcell label="Swashbuckler"/>
</listitem>
</listbox>
</window>

will be transformed (using xsltproc xml2dom.xsl file.xul) into the following JavaScript code:
var window_id2244179= document.createElementNS(XUL.NS,"window");
window_id2244179.setAttribute("id","example-window");
window_id2244179.setAttribute("title","Example 2.5.4");
var listbox_id2244186= document.createElementNS(XUL.NS,"listbox");
window_id2244179.appendChild(listbox_id2244186);
var listhead_id2244188= document.createElementNS(XUL.NS,"listhead");
listbox_id2244186.appendChild(listhead_id2244188);
var listheader_id2244190= document.createElementNS(XUL.NS,"listheader");
listhead_id2244188.appendChild(listheader_id2244190);
listheader_id2244190.setAttribute("label","Name");
var listheader_id2244194= document.createElementNS(XUL.NS,"listheader");
listhead_id2244188.appendChild(listheader_id2244194);
listheader_id2244194.setAttribute("label","Occupation");
var listcols_id2244200= document.createElementNS(XUL.NS,"listcols");
listbox_id2244186.appendChild(listcols_id2244200);
var listcol_id2244202= document.createElementNS(XUL.NS,"listcol");
listcols_id2244200.appendChild(listcol_id2244202);
var listcol_id2244204= document.createElementNS(XUL.NS,"listcol");
listcols_id2244200.appendChild(listcol_id2244204);
listcol_id2244204.setAttribute("flex","1");
var listitem_id2244209= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244209);
var listcell_id2244211= document.createElementNS(XUL.NS,"listcell");
listitem_id2244209.appendChild(listcell_id2244211);
listcell_id2244211.setAttribute("label","George");
var listcell_id2244215= document.createElementNS(XUL.NS,"listcell");
listitem_id2244209.appendChild(listcell_id2244215);
listcell_id2244215.setAttribute("label","House Painter");
var listitem_id2244221= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244221);
var listcell_id2244223= document.createElementNS(XUL.NS,"listcell");
listitem_id2244221.appendChild(listcell_id2244223);
listcell_id2244223.setAttribute("label","Mary Ellen");
var listcell_id2244227= document.createElementNS(XUL.NS,"listcell");
listitem_id2244221.appendChild(listcell_id2244227);
listcell_id2244227.setAttribute("label","Candle Maker");
var listitem_id2244232= document.createElementNS(XUL.NS,"listitem");
listbox_id2244186.appendChild(listitem_id2244232);
var listcell_id2244234= document.createElementNS(XUL.NS,"listcell");
listitem_id2244232.appendChild(listcell_id2244234);
listcell_id2244234.setAttribute("label","Roger");
var listcell_id2244238= document.createElementNS(XUL.NS,"listcell");
listitem_id2244232.appendChild(listcell_id2244238);
listcell_id2244238.setAttribute("label","Swashbuckler");


Note: I also have a XML2HTML stylesheet here.

That's it.
Pierre

06 April 2009

Go West !

After one year at the Center for the Study of Human Polymorphisms, I will follow my wife to Nantes (France) on September 1st, 2009. Hum... that is not the best period to find a new occupation, so I hope I'll find a new job there (related to science or to the semantic web). Wanna hire me? Here is my profile on LinkedIn.


Nantes (image via Wikipedia).

03 April 2009

Consequences: SNP, cDNA, proteins, etc.

This post is about Consequences, a tool finding the consequences of a set of mutations mapped on the human genome. It was motivated by a recent post on FriendFeed, where Daniel MacArthur asked: “Given a list of human b36 coordinates for a list of genic SNPs (most not in dbSNP), what would be the quickest way to get a list of the genes they're found in and, if possible, the amino acid position they would affect?”

About one year ago, I wrote a tool named "Consequences" answering this question, but the sources are somewhere in a tar.gz, burned on an old CD, in a cardboard box, in my cellar... so it was faster to re-write this simple code from scratch. The result should be fine but please tell me if you find a bug.

This tool takes as input a tab delimited file containing the following fields:

  1. a name for your SNP
  2. the chromosome, e.g. 'chr2' (at this time only one chromosome per input is supported)
  3. the position on the chromosome; the first base is indexed at 0
  4. the base observed ON THE PLUS STRAND OF THE GENOME

The sequence of the chromosome is then downloaded using the DAS server of the UCSC, and the genes are downloaded using the UCSC mysql server and the 'knownGene' table. Then I simply look at the consequence of each mutation. Here is a sample of the output:

<consequences chrom="chr1">
<observed-mutation position="1116" name="snp1" base="A">
<gene name="uc001aaa.2" exon-count="3" strand="+" txStart="1115" txEnd="4121" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
<gene name="uc009vip.1" exon-count="2" strand="+" txStart="1115" txEnd="4272" cdsStart="1115" cdsEnd="1115">
<in-utr-3/>
</gene>
</observed-mutation>
(...)
</observed-mutation>
<observed-mutation position="1149167" name="snp282" base="A">
<gene name="uc009vjv.1" exon-count="6" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142754" cdsEnd="1149171">
<in-exon name="Exon 2" codon-wild="CAG" codon-mut="TAG" aa-wild="Q" aa-mut="*" base-wild="C" base-mut="T" index-cdna="3" index-protein="1">
<wild-cDNA>ATG C AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</wild-cDNA>
<mut-cDNA>ATG T AGCGCTGGATCATGGAGAAGACGGCCGAGCACTTCCAGGAGGCCATGGAGGAGAGCAAGACACACTTCCGCGCCGTGGACCCTGACGGGGACGGTCACGTGTCTTGGGACGAGTATAAGGTGAAGTTTTTGGCGAGTAAAGGCCATAGCGAGAAGGAGGTTGCCGACGCCATCAGGCTCAACGAGGAACTCAAAGTGGATGAGGAAACACAGGAAGTCCTGGAGAACCTGAAGGACCGCTGGTACCAGGCGGACAGCCCCCCTGCAGACCTGCTGCTGACGGAGGAGGAGTTCCTGTCGTTCCTCCACCCCGAGCACAGCCGGGGAATGCTCAGGTTCATGGTGAAGGAGATCGTCCGGGACCTGGACCAGGACGGTGACAAGCAGCTCTCTGTGCCCGAGTTCATCTCCCTGCCCGTGGGCACCGTGGAGAACCAGCAGGGCCAGGACATTGACGACAACTGGGTGAAAGACAGAAAAAAGGAGTTTGAGGAGCTCATTGACTCCAACCACGACGGCATCGTGACCGCCGAGGAGCTGGAGAGCTACATGGACCCCATGAACGAGTACAACGCGCTGAACGAGGCCAAGCAGATGATCGCCGTCGCCGACGAGAACCAGAACCACCACCTGGAGCCCGAGGAGGTGCTCAAGTACAGCGAGTTCTTCACGGGCAGCAAGCTGGTGGACTACGCGCGCAGCGTGCACGAGGAGTTTTGA</mut-cDNA>
<wild-protein>M Q RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</wild-protein>
<mut-protein>M * RWIMEKTAEHFQEAMEESKTHFRAVDPDGDGHVSWDEYKVKFLASKGHSEKEVADAIRLNEELKVDEETQEVLENLKDRWYQADSPPADLLLTEEEFLSFLHPEHSRGMLRFMVKEIVRDLDQDGDKQLSVPEFISLPVGTVENQQGQDIDDNWVKDRKKEFEELIDSNHDGIVTAEELESYMDPMNEYNALNEAKQMIAVADENQNHHLEPEEVLKYSEFFTGSKLVDYARSVHEEF*</mut-protein>
</in-exon>
</gene>
<gene name="uc009vjw.1" exon-count="7" strand="-" txStart="1142150" txEnd="1157310" cdsStart="1142150" cdsEnd="1142150">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
<observed-mutation position="1205906" name="snp195" base="A">
<gene name="uc001adt.1" exon-count="18" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1205904" cdsEnd="1216853">
<in-exon name="Exon 1" codon-wild="ATG" codon-mut="ATA" aa-wild="M" aa-mut="I" base-wild="G" base-mut="A" index-cdna="2" index-protein="0">
<wild-cDNA>AT G AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGG
CCAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</wild-cDNA>
<mut-cDNA>AT A AGGGCAGTGCTGTCACAGAAGACAACACCGCTCCCTCGTTACCTGTGGCCCGGCCACCTCAGCGGCCCAAGGAGGCTCACCTGGTCATGGTGCAGTGACCACAGGACCCCCACATGCCGGGAGCTGGGTTCGCCCCACCCCACCCCCTGCACCGGGCCAGCGAGGGGATGGCCCAGAAGAGGGGGAGGACCATGTGGATTCACCAGTGCTGGACATGTGCTCTGTGGCTACCCCCTCTGCCTACTCTCTGGCCCGATACAGGGGTGTGGGACAGGCCTGGGTGACTCCAGCATGGCTTTCCTCTCCAGGACGTCACCGGTGGCAGCTGCTTCCTTCCAGAGCCGGCAGGAGGCCAGAGGCTCCATCCTGCTTCAGAGCTGCCAGCTGCCCCCGCAATGGCTGAGCACCGAAGCATGGACGGGAGAATGGAAGCAGCCACACGGGGGGGCTCTCACCTCCAGATCGCCTGGGCCTGTGGCTCCCCAGAGGCCCTGCCACCTGAAGGGATGGCAGCACAGACCCACTCAGCACAACGCTGCCTGCAAACAGGGCCAGGCTGCAGCCCAGACGCCCCCCAGGCCGGGGCCACCATCAGCACCACCACCACCACCCAAGGAGGGGCACCAGGAGGGGCTGGTGGAGCTGCCCGCCTCGTTCCGGGAGCTGCTCACCTTCTTCTGCACCAATGCCACCATCCACGGCGCCATCCGCCTGGTCTGCTCCCGCGGGAACCGCCTCAAGACGACGTCCTGGGGGCTGCTGTCCCTGGGAGCCCTGGTCGCGCTCTGCTGGCAGCTGGGGCTCCTCTTTGAGCGTCACTGGCACCGCCCGGTCCTCATGGCCGTCTCTGTGCACTCGGAGCGCAAGCTGCTCCCGCTGGTCACCCTGTGTGACGGGAACCCACGTCGGCCGAGTCCGGTCCTCCGCCATCTGGAGCTGCTGGACGAGTTTGCCAGGGAGAACATTGACTCCCTGTACAACGTCAACCTCAGCAAAGGCAGAGCCGCCCTCTCCGCCACTGTCCCCCGCCACGAGCCCCCCTTCCACCTGGACCGGGAGATCCGTCTGCAGAGGCTGAGCCACTCGGGCAGCCGGGTCAGAGTGGGGTTCAGACTGTGCAACAGCACGGGCGGCGACTGCTTTTACCGAGGCTACACGTCAGGCGTGGCGGCTGTCCAGGACTGGTACCACTTCCACTATGTGGATATCCTGGCCCTGCTGCCCGCGGCATGGGAGGACAGCCACGGGAGCCAGGACGGCCACTTCGTCCTCTCCTGCAGTTACGATGGCCTGGACTGCCAGGCCCGACAGTTCCGGACCTTCCACCACCCCACCTACGGCAGCTGCTACACGGTCGATGGCGTCTGGACAGCTCAGCGCCCCGGCATCACCCACGGAGTCGGCCTGGTCCTCAGGGTTGAGCAGCAGCCTCACCTCCCTCTGCTGTCCACGCTGGCCGGCATCAGGGTCATGGTTCACGGCCGTAACCACACGCCCTTCCTGGGGCACCACAGCTTCAGCGTCCGGCCAGGGACGGAGGCCACCATCAGCATCCGAGAGGACGAGGTGCACCGGCTCGGGAGCCCCTACGGCCACTGCACCGCCGGCGGGGAAGGCGTGGAGGTGGAGCTGCTACACAACACCTCCTACACCAGGCAGGCCTGCCTGGTGTCCTGCTTCCAGCAGCTGATGGTGGAGACCTGCTCCTGTGGCTACTACCTCCACCCTCTGCCGGCGGGGGCTGAGTACTGCAGCTCTGCCCGGCACCCTGCCTGGGGACACTGCTTCTACCGCCTCTACCAGGACCTGGAGACCCACCGGCTCCCCTGTACCTCCCGCTGCCCCAGGCCCTGCAGGGAGTCTGCATTCAAGCTCTCCACTGGGACCTCCAGGTGGCCTTCCGCCAAGTCAGCTGGATGGACTCTGGCCACGCTAGGTGAACAGGGGCTGCCGCATCAGAGCCACAGACAGAGGAGCAGCCTGGC
CAAAATCAACATCGTCTACCAGGAGCTCAACTACCGCTCAGTGGAGGAGGCGCCCGTGTACTCGGTGCCGCAGCTGCTCTCGGCCATGGGCAGCCTCTGCAGCCTGTGGTTTGGGGCCTCCGTCCTCTCCCTCCTGGAGCTCCTGGAGCTGCTGCTCGATGCTTCTGCCCTCACCCTGGTGCTAGGCGGCCGCCGGCTCCGCAGGGCGTGGTTCTCCTGGCCCAGAGCCAGCCCTGCCTCAGGGGCGTCCAGCATCAAGCCAGAGGCCAGTCAGATGCCCCCGCCTGCAGGCGGCACGTCAGATGACCCGGAGCCCAGCGGGCCTCATCTCCCACGGGTGATGCTTCCAGGGGTTCTGGCGGGAGTCTCAGCCGAAGAGAGCTGGGCTGGGCCCCAGCCCCTTGAGACTCTGGACACCTGA</mut-cDNA>
<wild-protein> M RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</wild-protein>
<mut-protein> I RAVLSQKTTPLPRYLWPGHLSGPRRLTWSWCSDHRTPTCRELGSPHPTPCTGPARGWPRRGGGPCGFTSAGHVLCGYPLCLLSGPIQGCGTGLGDSSMAFLSRTSPVAAASFQSRQEARGSILLQSCQLPPQWLSTEAWTGEWKQPHGGALTSRSPGPVAPQRPCHLKGWQHRPTQHNAACKQGQAAAQTPPRPGPPSAPPPPPKEGHQEGLVELPASFRELLTFFCTNATIHGAIRLVCSRGNRLKTTSWGLLSLGALVALCWQLGLLFERHWHRPVLMAVSVHSERKLLPLVTLCDGNPRRPSPVLRHLELLDEFARENIDSLYNVNLSKGRAALSATVPRHEPPFHLDREIRLQRLSHSGSRVRVGFRLCNSTGGDCFYRGYTSGVAAVQDWYHFHYVDILALLPAAWEDSHGSQDGHFVLSCSYDGLDCQARQFRTFHHPTYGSCYTVDGVWTAQRPGITHGVGLVLRVEQQPHLPLLSTLAGIRVMVHGRNHTPFLGHHSFSVRPGTEATISIREDEVHRLGSPYGHCTAGGEGVEVELLHNTSYTRQACLVSCFQQLMVETCSCGYYLHPLPAGAEYCSSARHPAWGHCFYRLYQDLETHRLPCTSRCPRPCRESAFKLSTGTSRWPSAKSAGWTLATLGEQGLPHQSHRQRSSLAKINIVYQELNYRSVEEAPVYSVPQLLSAMGSLCSLWFGASVLSLLELLELLLDASALTLVLGGRRLRRAWFSWPRASPASGASSIKPEASQMPPPAGGTSDDPEPSGPHLPRVMLPGVLAGVSAEESWAGPQPLETLDT*</mut-protein>
</in-exon>
</gene>
<gene name="uc001adu.1" exon-count="17" strand="+" txStart="1205678" txEnd="1217272" cdsStart="1209267" cdsEnd="1216853">
<in-utr-5/>
</gene>
</observed-mutation>
(...)
</consequences>
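The snp282 record above shows the core computation: the observed base A on the plus strand is complemented to T for this minus-strand gene, substituted into the coding sequence, and the affected codon is re-translated (CAG, glutamine, becomes TAG, a stop codon). That single step can be sketched as follows, with the codon table truncated to just the codons appearing above:

```java
import java.util.HashMap;
import java.util.Map;

// Apply a base substitution inside a codon and translate before/after,
// as in the snp282 record: CAG (Q) -> TAG (stop).
public class CodonChange {
    // Deliberately truncated codon table: only the codons used in the output above.
    static final Map<String, Character> TABLE = new HashMap<String, Character>();
    static {
        TABLE.put("CAG", 'Q'); TABLE.put("TAG", '*');
        TABLE.put("ATG", 'M'); TABLE.put("ATA", 'I');
    }

    static char translate(String codon) { return TABLE.get(codon); }

    static String mutate(String codon, int indexInCodon, char newBase) {
        StringBuilder sb = new StringBuilder(codon);
        sb.setCharAt(indexInCodon, newBase);
        return sb.toString();
    }

    public static void main(String[] args) {
        String wild = "CAG";
        String mut = mutate(wild, 0, 'T'); // observed A, complemented on the minus strand
        System.out.println(wild + " (" + translate(wild) + ") -> "
            + mut + " (" + translate(mut) + ")");
    }
}
```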


The source code is available here:

A 'jar' is available for download at http://lindenb.googlecode.com/files/consequences.jar.
Running the tool:
java -cp {path}/mysql-connector-java-xxxx-bin.jar:consequences.jar org.lindenb.tinytools.Consequences your-list-of-snp.txt


Well, that is not big science but it might be helpful.
That's it.

Pierre

12 March 2009

A few nightmares before biohackathon 2009.

OK, after Scifoo 2007 (http://plindenbaum.blogspot.com/2007/07/scifoo-07-anxiety-from-homebody.html), here are my apprehensions for BioHackathon 2009. You know, I'm lost and anxious when I cannot see the Périphérique ;-)


My notebook for a Stupid RDF Server

In this post, I'm writing my notes about the Java package com.sun.net.httpserver.*. This package contains a lightweight HTTP server, and I've used it to create a simple-and-stupid RDF server.
The source code is available at:



OK. First we need a 'Statement' class holding an RDF triple:
private static class Statement
{
/** subject of this statement */
private URI subject;
/** predicate of this statement */
private URI predicate;
/** value of this statement a String or a URI*/
private Object value;


boolean isLiteral()
{
return value.getClass()==String.class;
}

@Override
public int hashCode() {
final int prime = 31;
int result = 1;
result = prime * result + predicate.hashCode();
result = prime * result + subject.hashCode();
result = prime * result + value.hashCode();
return result;
}

@Override
public boolean equals(Object obj) {
if (this == obj) return true;
if (obj == null || getClass() != obj.getClass()) return false;
Statement other = (Statement) obj;
return subject.equals(other.subject) &&
predicate.equals(other.predicate) &&
value.equals(other.value)
;
}


@Override
public String toString() {
return "<"+subject+"> <"+predicate+"> "+(isLiteral()
?"\""+C.escape(String.class.cast(value))+"\" ."
:"<"+value+"> ."
);
}
}

The statements are just stored in a synchronized set. That is to say, when the server stops, all the statements are lost :-).
private Set<Statement> statements= Collections.synchronizedSet(new HashSet<Statement>());


The 'servlet' StupidRDFServer is a class implementing com.sun.net.httpserver.HttpHandler, that is to say it must implement the method public void handle(HttpExchange http) throws IOException.

The following code starts the server.
public static void main(String[] args)
{
try
{
HttpServer server = HttpServer.create(new InetSocketAddress(PORT), 0);
server.createContext(CONTEXT, new StupidRDFServer());
server.start();
}
catch(IOException err)
{
err.printStackTrace();
}
}
If the query is empty, the following form is returned to the client.


The 'Add Statement' button adds a statement to the set:
respHeader.add("Content-Type", "text/html");
http.sendResponseHeaders(200, 0);
boolean added=this.statements.add(stmt);
printForm(out,(added?"Statement Added":"Statement already in model"));
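The printForm method itself is not shown in the post; a plausible minimal version (all names and markup are assumptions of mine) would simply emit the HTML form with the three triple fields and the submit buttons:

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Sketch of a printForm-like method (the original is not shown in the post).
public class FormSketch {
    static void printForm(PrintWriter out, String message) {
        out.println("<html><body>");
        if (message != null) out.println("<p>" + message + "</p>");
        out.println("<form method='GET'>");
        out.println("Subject: <input name='subject'/><br/>");
        out.println("Predicate: <input name='predicate'/><br/>");
        out.println("Value: <input name='value'/><br/>");
        out.println("<input type='submit' name='action' value='Add Statement'/>");
        out.println("<input type='submit' name='action' value='Query N3'/>");
        out.println("<input type='submit' name='action' value='Query RDF'/>");
        out.println("</form></body></html>");
    }

    public static void main(String[] args) {
        // render the form into a string, as the handler would into the response body
        StringWriter sw = new StringWriter();
        printForm(new PrintWriter(sw, true), "Statement Added");
        System.out.print(sw);
    }
}
```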

The 'Query N3' button prints the triples, using the parameters of the form as a filter:
respHeader.add("Content-Type", "text/plain");
http.sendResponseHeaders(200, 0);
synchronized(statements)
    {
    Iterator<Statement> iter= statements.iterator();
    while(iter.hasNext())
        {
        Statement triple=iter.next();
        if( (stmt.subject==null || stmt.subject.equals(triple.subject)) &&
            (stmt.predicate==null || stmt.predicate.equals(triple.predicate)) &&
            (stmt.value==null || stmt.value.equals(triple.value)))
            {
            out.println(triple);
            }
        }
    }
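The filter logic above — a null field acts as a wildcard — is worth isolating. A sketch with plain strings (class and method names are mine):

```java
// Sketch: the null-means-wildcard match used by the query handlers above,
// reduced to plain strings.
public class TripleFilter {
    // a null pattern field matches anything
    static boolean matches(String pattern, String actual) {
        return pattern == null || pattern.equals(actual);
    }

    // pattern and triple are both {subject, predicate, value}
    static boolean matches(String[] pattern, String[] triple) {
        return matches(pattern[0], triple[0])
            && matches(pattern[1], triple[1])
            && matches(pattern[2], triple[2]);
    }

    public static void main(String[] args) {
        String[] triple = {"urn:me", "foaf:name", "Pierre"};
        System.out.println(matches(new String[]{null, "foaf:name", null}, triple)); // true
        System.out.println(matches(new String[]{"urn:you", null, null}, triple));   // false
    }
}
```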


The 'Query RDF' button prints the triples as RDF, using the parameters of the form as a filter. Example of output:
<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Statement>
<rdf:subject rdf:resource="foaf%3Ame"/>
<rdf:predicate rdf:resource="foaf%3Aname"/>
<rdf:object>Pierre</rdf:object>
</rdf:Statement>
<rdf:Statement>
<rdf:subject rdf:resource="urn%3Aother"/>
<rdf:predicate rdf:resource="foaf%3Aknows"/>
<rdf:object rdf:resource="foaf%3Aknows"/>
</rdf:Statement>
</rdf:RDF>


That's it

07 March 2009

A lightweight java parser for RDF

About one year ago, I wrote a lightweight java parser for RDF based on the Streaming API for XML (StAX). It is far from perfect: for example, it does not handle reified statements, xml:base, etc. But it is small (24K) and works fine with most RDF files. Inspired by the SAX parsers for XML, this RDF parser doesn't keep the statements in memory but calls a method "found" each time a triple is found. This method can be overridden to implement your own code.

Source code


The code is available at

RDFEvent


First we need a small internal class to record the content of each triple
private static class RDFEvent
    {
    URI subject=null;
    URI predicate=null;
    Object value=null;
    URI valueType=null;
    String lang=null;
    int listIndex=-1;
    (...)
    }

Searching for rdf:RDF


First we scan the elements of the document until the <rdf:RDF> element is found. Then, the method parseRDF is called.
this.parser = this.factory.createXMLEventReader(in);

while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
        {
        StartElement start=(StartElement)event;
        if(name2string(start).equals(RDF.NS+"RDF"))
            {
            parseRDF();
            }
        }
    }
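A self-contained version of this scanning loop, runnable against an in-memory document (the class name and namespace constant are mine, not the parser's), may look like this:

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.StartElement;
import javax.xml.stream.events.XMLEvent;

// Sketch: scan a document with StAX until the rdf:RDF root is found.
public class RdfRootScanner {
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** returns true if an element named rdf:RDF occurs in the document */
    static boolean hasRdfRoot(String xml) throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance()
            .createXMLEventReader(new StringReader(xml));
        while (reader.hasNext()) {
            XMLEvent event = reader.nextEvent();
            if (event.isStartElement()) {
                StartElement start = event.asStartElement();
                if (RDF_NS.equals(start.getName().getNamespaceURI())
                    && "RDF".equals(start.getName().getLocalPart())) {
                    return true; // here the real parser would call parseRDF()
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws XMLStreamException {
        String doc = "<rdf:RDF xmlns:rdf='" + RDF_NS + "'/>";
        System.out.println(hasRdfRoot(doc));     // true
        System.out.println(hasRdfRoot("<html/>")); // false
    }
}
```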

parseRDF: Searching the statements


All the nodes are then scanned. The method parseDescription is called for each element.
while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
        {
        return;
        }
    else if(event.isStartElement())
        {
        parseDescription(event.asStartElement());
        }
    else if(event.isProcessingInstruction())
        {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
        }
    else if(event.isCharacters() &&
        event.asCharacters().getData().trim().length()>0)
        {
        throw new XMLStreamException("Found text in RDF ???");
        }
    }

parseDescription: Parsing the subject of a triple


The current element will be the subject of the triple.
The URI of this subject needs to be extracted.
First we check whether this URI can be extracted from an attribute rdf:about:
Attribute att= description.getAttributeByName(new QName(RDF.NS,"about"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:nodeID is searched:
att= description.getAttributeByName(new QName(RDF.NS,"nodeID"));
if(att!=null) descriptionURI= createURI( att.getValue());

If it was not found, the attribute rdf:ID is searched.
att= description.getAttributeByName(new QName(RDF.NS,"ID"));
if(att!=null) descriptionURI= resolveBase(att.getValue());

If it was not found, this is an anonymous node. We create a random URI.
descriptionURI= createAnonymousURI();
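The four cases above reduce to a first-non-null fallback chain. Sketched with plain strings (class and method names are mine; the real parser wraps the values in createURI()/resolveBase() rather than returning raw strings):

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: the rdf:about / rdf:nodeID / rdf:ID / anonymous fallback chain,
// with the attribute values passed in as nullable strings.
public class SubjectResolver {
    static final AtomicLong ID_GENERATOR = new AtomicLong();

    static String subjectURI(String about, String nodeID, String id, String base) {
        if (about  != null) return about;
        if (nodeID != null) return nodeID;
        if (id     != null) return base + "#" + id;               // resolveBase, simplified
        return "urn:anonymous:" + ID_GENERATOR.incrementAndGet(); // createAnonymousURI
    }

    public static void main(String[] args) {
        System.out.println(subjectURI("http://example.org/x", null, null, "http://base"));
        System.out.println(subjectURI(null, "b1", null, "http://base"));
        System.out.println(subjectURI(null, null, "frag", "http://base"));
        System.out.println(subjectURI(null, null, null, "http://base"));
    }
}
```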


rdf:type


The qualified name of the element contains the rdf:type of this statement. We can emit a new triple about this type:
QName qn=description.getName();
if(!(qn.getNamespaceURI().equals(RDF.NS) &&
     qn.getLocalPart().equals("Description")))
    {
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= createURI(RDF.NS+"type");
    evt.value=name2uri(qn);
    found(evt);
    }


Other attributes


The other attributes of the current element may contain some new triples.
for(Iterator<?> i=description.getAttributes();
    i.hasNext();)
    {
    att=(Attribute)i.next();
    qn= att.getName();
    String local= qn.getLocalPart();
    if(qn.getNamespaceURI().equals(RDF.NS) &&
       ( local.equals("about") ||
         local.equals("ID") ||
         local.equals("nodeID")))
        {
        continue;
        }
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= name2uri(qn);
    evt.value= att.getValue();
    found(evt);
    }

Searching the predicates


We then loop over the children of the current element. Those nodes are the predicates of the current subject. The method parsePredicate is called, each time a new element is found.
while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isEndElement())
        {
        return descriptionURI;
        }
    else if(event.isStartElement())
        {
        parsePredicate(descriptionURI,event.asStartElement());
        }
    else if(event.isProcessingInstruction())
        {
        throw new XMLStreamException("Found Processing Instruction in RDF ???");
        }
    else if(event.isCharacters() &&
        event.asCharacters().getData().trim().length()>0)
        {
        throw new XMLStreamException("Found text in RDF ??? \""+
            event.asCharacters().getData()+"\"");
        }
    }

parsePredicate: Parsing the predicate of the current triple


First the property attributes of the current element are scanned, and some new triples may be created, e.g.:
<rdf:Description ex:fullName="Dave Beckett">
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>
</rdf:Description>

During this process, the value of the attribute rdf:parseType is noted if it is present.
Furthermore, if there is an attribute rdf:resource, then this element is a new triple linking to another resource:
<ex:homePage rdf:resource="http://purl.org/net/dajobe/"/>

If rdf:parseType="Literal" then we transform the children of the current node into a string, and a new triple is created.
if(parseType.equals("Literal"))
    {
    StringBuilder b= parseLiteral();
    RDFEvent evt= new RDFEvent();
    evt.subject=descriptionURI;
    evt.predicate= predicateURI;
    evt.value= b.toString();
    evt.lang=lang;
    evt.valueType=datatype;
    found(evt);
    }

If rdf:parseType="Resource", the current node is a blank node: the enclosing rdf:Description element is omitted. A blank node is created and we call parsePredicate recursively, using this blank node as the new subject.
URI blanck = createAnonymousURI();
RDFEvent evt= new RDFEvent();
evt.subject=descriptionURI;
evt.predicate= predicateURI;
evt.value=blanck;
evt.lang=lang;
evt.valueType=datatype;
found(evt);

while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
        {
        parsePredicate(blanck, event.asStartElement());
        }
    else if(event.isEndElement())
        {
        return;
        }
    }

If rdf:parseType="Collection", the child elements give the set of subject nodes of the collection. We call parseDescription recursively for each of these nodes.
int index=0;
while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
        {
        URI value= parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value=value;
        evt.lang=lang;
        evt.valueType=datatype;
        evt.listIndex=(++index);
        found(evt);
        }
    else if(event.isEndElement())
        {
        return;
        }
    }

Otherwise this is the default rdf:parseType.
If a child element is found, it is the subject of a new resource (we call parseDescription recursively); otherwise the object of the current statement is a literal and we concatenate all the text.
StringBuilder b= new StringBuilder();
while(getReader().hasNext())
    {
    XMLEvent event = getReader().nextEvent();
    if(event.isStartElement())
        {
        URI childURI=parseDescription(event.asStartElement());
        RDFEvent evt= new RDFEvent();
        evt.subject=descriptionURI;
        evt.predicate= predicateURI;
        evt.value= childURI;
        found(evt);
        b.setLength(0);
        foundResourceAsChild=true;
        }
    else if(event.isCharacters())
        {
        b.append(event.asCharacters().getData());
        }
    else if(event.isEndElement())
        {
        if(!foundResourceAsChild)
            {
            RDFEvent evt= new RDFEvent();
            evt.subject=descriptionURI;
            evt.predicate= predicateURI;
            evt.value= b.toString();
            evt.lang=lang;
            evt.valueType=datatype;
            found(evt);
            }
        else
            {
            if(b.toString().trim().length()!=0) throw new XMLStreamException("Found bad text "+b);
            }
        return;
        }
    }

Testing


The following code parses go.rdf.gz (1744 KB) and returns the number of statements.
long now= System.currentTimeMillis();
URL url= new URL("ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/rdf/go.rdf.gz");
InputStream r= new GZIPInputStream(url.openStream());
RDFHandler h= new RDFHandler()
    {
    @Override
    public void found(URI subject, URI predicate, Object value,
        URI dataType, String lang, int index)
        throws IOException
        {
        ++count;
        }
    };
h.parse(r);
r.close();
System.out.println("time:"+((System.currentTimeMillis()-now)/1000)+" secs count:"+count+" triples");

Result:
time:17 secs count:188391 triples



That's it.

Pierre

03 March 2009

String Challenge: My (brute-force) solution.

In this post, I present my (brute-force/quick-and-dirty) solution to the recent 'String Challenge' submitted by Thomas Mailund on his blog: http://www.mailund.dk/index.php/2009/03/02/a-string-algorithms-challenge/. Briefly, here is the problem:

Given an input string X, an integer k and a frequency f, report all k-mers that occur with frequency higher than f. Expect the length of X to be from a few hundred thousands to a few millions and k to be between 5 and 20.



I wrote a simple java program to solve this problem. This is not big science as I used a brute-force algorithm, but you might be interested in how the k-mers were mapped to their occurrences. Here, my code uses the java implementation of the BerkeleyDB API (http://www.oracle.com/technology/documentation/berkeley-db/je/index.html) to map the k-mers to their occurrences.
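Before the BerkeleyDB version, the idea can be sketched entirely in memory: count every k-mer with a HashMap and report those above the frequency threshold. This baseline (names and sample values are mine) works only while the distinct k-mers fit in RAM, which is exactly the limit the BerkeleyDB approach below avoids:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: an in-memory brute-force baseline for the challenge.
public class KmerBaseline {
    // count every k-mer of x with a sliding window
    static Map<String, Integer> countKmers(String x, int k) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (int i = 0; i + k <= x.length(); ++i) {
            String kmer = x.substring(i, i + k);
            Integer n = counts.get(kmer);
            counts.put(kmer, n == null ? 1 : n + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        int k = 2;
        String x = "ABABAB";
        int total = x.length() - k + 1; // number of k-mer windows
        double f = 0.5;
        // report k-mers whose frequency is at least f
        for (Map.Entry<String, Integer> e : countKmers(x, k).entrySet()) {
            if (e.getValue() / (double) total >= f) {
                System.out.println(e.getKey() + "\t" + e.getValue() + "/" + total);
            }
        }
        // prints: AB	3/5
    }
}
```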

The source code is available here: http://anybody.cephb.fr/perso/lindenb/tmp/StringChallenge.java (Oops, please remove the extra (c=fin.read();); I cannot change this at this time)


The code



First the BerkeleyDB environment is initialized and we create a temporary database that will map the k-mers to their counts.

File envHome=new File(System.getProperty("java.io.tmpdir"));
EnvironmentConfig envCfg= new EnvironmentConfig();
envCfg.setAllowCreate(true);
envCfg.setReadOnly(false);
Environment env= new Environment(envHome,envCfg);

//create a first database mapping k-mers to count
DatabaseConfig cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
Database db= env.openDatabase(null, "kmers", cfg);
DatabaseEntry key = new DatabaseEntry();
DatabaseEntry value = new DatabaseEntry();


The sequence is then scanned and an array of bytes of length k is filled with the characters.

FileReader fin= new FileReader(file);
byte array[]=new byte[kmer];
int c=-1;
int array_size=0;
//c=fin.read(); oops, I should have removed this...

while((c=fin.read())!=-1)
    {
    if(Character.isWhitespace(c)) continue;
    c=Character.toUpperCase(c);
    array[array_size++]=(byte)c;
    }



Once the array is full, it is looked up in the database and its count is incremented. The content of the array is then shifted one position to the left.

key.setData(array);

int count=0;
//does this data already exist?
if(db.get(null,key, value, null)==OperationStatus.SUCCESS)
    {
    count =IntegerBinding.entryToInt(value);
    }

IntegerBinding.intToEntry(count+1, value);
db.put(null,key, value);

//shift to the left
for(int i=1;i< kmer;++i)
    {
    array[i-1]=array[i];
    }
array_size--;


At the end, in order to get a set of ordered results, a reverse database is created. It maps the counts to the k-mers.



//create a second database mapping count to k-mers
cfg= new DatabaseConfig();
cfg.setAllowCreate(true);
cfg.setReadOnly(false);
cfg.setTemporary(true);
cfg.setSortedDuplicates(true);
Database db2= env.openDatabase(null, "occ",cfg);
key = new DatabaseEntry();
value = new DatabaseEntry();

Cursor cursor= db.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
{
int count=IntegerBinding.entryToInt(value);
if((count/(float)total)< freq) continue;
db2.put(null, value, key);
}
cursor.close();


This second database is then scanned to print the ordered results.

//re-open a cursor, this time on the second database
cursor= db2.openCursor(null, null);
while(cursor.getNext(key, value,null)==OperationStatus.SUCCESS)
    {
    int count=IntegerBinding.entryToInt(key);
    String seq= new String(value.getData(),value.getOffset(),value.getSize());
    System.out.println(seq+"\t"+count+"/"+total);
    }
cursor.close();
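The same reverse-index trick can be sketched in plain Java: a TreeMap plays the role of the second ("occ") database, ordering counts ascending and keeping duplicate counts in a list (all names here are mine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: order k-mers by their count, as the second BerkeleyDB database does.
public class OrderedCounts {
    static List<String> ordered(Map<String, Integer> kmer2count) {
        // reverse index: count -> k-mers sharing that count
        TreeMap<Integer, List<String>> count2kmers = new TreeMap<Integer, List<String>>();
        for (Map.Entry<String, Integer> e : kmer2count.entrySet()) {
            List<String> kmers = count2kmers.get(e.getValue());
            if (kmers == null) {
                kmers = new ArrayList<String>();
                count2kmers.put(e.getValue(), kmers);
            }
            kmers.add(e.getKey());
        }
        // TreeMap iterates counts in ascending order
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<Integer, List<String>> e : count2kmers.entrySet()) {
            for (String kmer : e.getValue()) {
                lines.add(kmer + "\t" + e.getKey());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = new HashMap<String, Integer>();
        m.put("AAAA", 7); m.put("TTTT", 3); m.put("GGGG", 5);
        for (String line : ordered(m)) System.out.println(line);
        // prints the k-mers by ascending count: TTTT, GGGG, AAAA
    }
}
```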


Compilation



javac -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge.java

Execution


java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge

Test with the Human chr22 (~49E6 bp)


time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 8 -f 0.001 chr22.fa
AAAAAAAA 71811/49691430
TTTTTTTT 72474/49691430
NNNNNNNN 14840030/49691430

real 5m44.240s
user 5m56.186s
sys 0m3.178s

time java -cp ${BERKELEY-JE-PATH}/lib/je-3.3.75.jar:. StringChallenge -k 10 -f 0.001 chr22.fa
AAAAAAAAAA 50128/49691428
TTTTTTTTTT 50841/49691428
NNNNNNNNNN 14840010/49691428

real 8m15.152s
user 8m41.429s
sys 0m3.748s



That's it. Now I'd like to know how the 'elegant' solutions will be implemented :-)