18 February 2010

eXist: The Open Source Native XML Database : My notebook

In a previous post, I played with Oracle's BerkeleyDB-XML. Here, I try eXist-db, an open source database management system built using XML technology. It stores XML data according to the XML data model and features efficient, index-based XQuery processing.

Download & Install

wget http://downloads.sourceforge.net/project/exist/Stable/1.4/eXist-setup-1.4.0-rev10440.jar
java -jar eXist-setup-1.4.0-rev10440.jar
export EXIST_HOME=PATH_TO_EXIST/eXist
And that's it: it was far easier than installing (compiling...) BerkeleyDB-XML.

Starting the Server

eXist/bin/startup.sh

Using locale: en_US.UTF-8
18 Feb 2010 15:31:32,579 [main] INFO (JettyStart.java [run]:90) - Configuring eXist from EXIST/eXist/conf.xml
18 Feb 2010 15:31:32,580 [main] INFO (JettyStart.java [run]:91) -
18 Feb 2010 15:31:32,580 [main] INFO (JettyStart.java [run]:92) - Running with Java 1.6.0_07 [Sun Microsystems Inc. (Java HotSpot(TM) Server VM) in /usr/local/package/jdk1.6.0_07/jre]
(...)

Inserting the data


First, using the web console, I created a 'collection' named '/db/dbsnp'.
I then downloaded about 1000 XML documents from dbSNP:
for S in `mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select right(name,length(name)-2) from snp130 limit 1000' -N`
do
wget -O rs${S}.xml "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=${S}&retmode=xml"
done
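To make the loop above easier to follow, here is a minimal Python sketch of what it computes for each SNP: the SQL expression `right(name,length(name)-2)` drops the leading "rs", and the remaining number is placed in the efetch URL. The helper names (`rs_number`, `efetch_url`, `download_plan`) are mine and purely illustrative; the sketch only builds the file names and URLs and does not download anything.

```python
from urllib.parse import urlencode

# Same endpoint as in the wget command above.
EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def rs_number(name: str) -> str:
    """Strip the leading 'rs', like right(name, length(name)-2) in the SQL."""
    if not name.startswith("rs"):
        raise ValueError(f"not an rs identifier: {name!r}")
    return name[2:]


def efetch_url(rs_num: str) -> str:
    """Build the efetch URL for one SNP number."""
    query = urlencode({"db": "snp", "id": rs_num, "retmode": "xml"})
    return f"{EFETCH}?{query}"


def download_plan(names):
    """Return (output file name, URL) pairs, one per rs identifier."""
    plan = []
    for name in names:
        num = rs_number(name)
        plan.append((f"rs{num}.xml", efetch_url(num)))
    return plan


if __name__ == "__main__":
    for filename, url in download_plan(["rs10", "rs1000031"]):
        print(filename, url)
```

Each pair corresponds to one `wget -O rs${S}.xml "..."` call in the shell loop.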
Those documents were then inserted into the XML database:
EXIST/eXist/bin/client.sh -u admin --password=mypassword -s -m 'db/dbsnp' -p rs*.xml
(...)
parsed 7704 bytes in 33ms.
storing document rs10.xml (1 of 1) ...done.
parsing 16306 bytes took 36ms.

parsed 16306 bytes in 36ms.

Using XQuery

The following XQuery searches the collection '/db/dbsnp' for all the SNPs having a heterozygosity greater than 0.49. For each SNP it prints its name, its sequence and its position on the reference genome.
xquery version "1.0";

declare namespace s="http://www.ncbi.nlm.nih.gov/SNP/docsum";

<MyListOfSnp>
{
for $x in collection("/db/dbsnp")/s:ExchangeSet/s:Rs
where data($x/s:Het/@value)>0.49
return <SNP>
<name>rs{data($x/@rsId)}</name>
<sequence>{data($x/s:Sequence/s:Seq5)}[{data($x/s:Sequence/s:Observed)}]{data($x/s:Sequence/s:Seq3)}</sequence>

{
for $as in $x/s:Assembly
where $as/@groupLabel="reference" return

for $comp in $as/s:Component return

for $maploc in $comp/s:MapLoc
return
<map>
<chromosome>chr{data($comp/@chromosome)}</chromosome>
<position>{data($maploc/@physMapInt)}</position>
</map>
}
</SNP>
}
</MyListOfSnp>
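For readers less used to FLWOR expressions, the logic of the query above can be mirrored in a few lines of Python with the standard `xml.etree.ElementTree` module. This is only a sketch to show the filtering and the nested loops; it is not how eXist evaluates the query. The `SAMPLE` document is a hypothetical, heavily shortened record containing only the elements and attributes the query touches, and `select_snps` is a name I made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Same namespace as declared in the XQuery.
NS = {"s": "http://www.ncbi.nlm.nih.gov/SNP/docsum"}

# Hypothetical, shortened document (not real dbSNP data).
SAMPLE = """<ExchangeSet xmlns="http://www.ncbi.nlm.nih.gov/SNP/docsum">
  <Rs rsId="1">
    <Het value="0.5"/>
    <Sequence><Seq5>AAA</Seq5><Observed>C/T</Observed><Seq3>GGG</Seq3></Sequence>
    <Assembly groupLabel="reference">
      <Component chromosome="4"><MapLoc physMapInt="100"/></Component>
    </Assembly>
    <Assembly groupLabel="other">
      <Component chromosome="4"><MapLoc physMapInt="999"/></Component>
    </Assembly>
  </Rs>
  <Rs rsId="2">
    <Het value="0.1"/>
    <Sequence><Seq5>CCC</Seq5><Observed>A/G</Observed><Seq3>TTT</Seq3></Sequence>
  </Rs>
</ExchangeSet>"""


def select_snps(root, min_het=0.49):
    """Mirror the query: keep Rs elements with Het/@value > min_het."""
    results = []
    for rs in root.findall("s:Rs", NS):
        het = rs.find("s:Het", NS)
        # A missing Het or @value never matches, as in the 'where' clause.
        if het is None or het.get("value") is None:
            continue
        if float(het.get("value")) <= min_het:
            continue
        # Concatenate Seq5[Observed]Seq3 like the <sequence> element.
        parts = [
            rs.findtext(f"s:Sequence/s:{tag}", default="", namespaces=NS)
            for tag in ("Seq5", "Observed", "Seq3")
        ]
        sequence = f"{parts[0]}[{parts[1]}]{parts[2]}"
        # Only assemblies labelled "reference" contribute map positions.
        maps = []
        for assembly in rs.findall("s:Assembly", NS):
            if assembly.get("groupLabel") != "reference":
                continue
            for comp in assembly.findall("s:Component", NS):
                for maploc in comp.findall("s:MapLoc", NS):
                    maps.append(("chr" + comp.get("chromosome", ""),
                                 maploc.get("physMapInt", "")))
        results.append({"name": "rs" + rs.get("rsId", ""),
                        "sequence": sequence, "maps": maps})
    return results


if __name__ == "__main__":
    for snp in select_snps(ET.fromstring(SAMPLE)):
        print(snp)
```

On the sample, only the first record passes the heterozygosity filter, and only its "reference" assembly yields a map entry.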
Executing the query:
EXIST/eXist/bin/client.sh -u admin --password=mypassword -F input.xquery
Result:
<MyListOfSnp>
<SNP>
<name>rs10000300</name>
<sequence>ATCAAATACCCAAGCAAAGATTTACATTCAAATCTGTTTACTGAAGTTCTATTTATAATACAATGCAATGAACATAATAGTATATATTTACACGTAATGTAATAAACACAAATATTCAATGGTATAAAAATGGTCAATAAATCGTGGCATAGCCACAGCTTAGAGTACCTGTTTAATGTTCTCAGCTATTTTAACTTTGCTAAATAATATTTAAAGATATGcggtagtcccccttcatctgaggaggacctgttccaagacccccagtggatgcctgaaacctctgatagtaatgaaccctatatatactgttttttcctatacatatttacatatataatacatacctatgattaagtttaatttataaattaggcacagtaagagattaacaacaacaataataaaatgtaacaattatagcaatactctaataataaagttatgtgagtgtggtctctctctctgtctcaaaatatcatactgtatgcctctatttt[G/T]ggaatacagttgacaacgggtaactgaaaccgagaaaagtgaaactgcagatgggggctgactactgTATATGAAAATTAAACAATCagccaggcatggtggctcacgcctgtaataccagcactttgggaggccgaggcgggaggatcacgaggtcaggagatcgagaccacggtgaaaccccgtctctattaaaaatacaaaaaaaaaattagccgggtacagtggcaggcacctgtagtcccagctactcgggaggctgaggcaggagaatggcgtgaacccgggaggcagagcttgcagtgagccgagatcgcgccactgcactccagcctgggcgacagagcaagactctgtctcaaaaaaaaaaaaaaaaaaaaaaaaaaGGAAAAGAAAATTAAACAACCAAACAAAATCAGAGTAAATACACCATGTTAATTCTGGTTATATTTGGATTGTGGGCTTATGGGTAGATTTTGTTACATTTTTCTATAATTTCC</sequence>
<map>
<chromosome>chr4</chromosome>
<position>40161303</position>
</map>
</SNP>
<SNP>
<name>rs10000307</name>
<sequence>GTTCAAAGACTCCTGATTAGAGTGTCCTTTCTATAACCAATCTTGTTCCTTAAAACATCTTGAATGATTTGATCTCAGATCCCCTGAAGGGACTGCTGAGATCATCTGCACCAATCCTAAAAAAAAAAATCTTTCATGCCCAAACCCTTAGCAAAGCTAGTTTCTTGTGGGACTCTTAATCCCTCTTATCCTGCTTGACACAGAGGTGCTCACCTGCTGTGCATCAGAAACACTATGGATACTTCTTGAAAGTGCCTGCAACAGAGATACTGATGCATCTGTTGTGGGGTGGGCCCTAGGTATCAGTAATTTAAAAAGTTTTTTAAAATACG[C/T]CCCAGATAATTCTGATTGCTTGTAAATGGCAAAGGTTGAGAAGCACTGCTGGAAGCTTTTGAGCTCCTGTTGGGTAAGTTCAAGCGACAGGAGAATCTCATAGTGATCATAAAACAGCACTCTGAATTCTTGGAGAAACCCAGACTCATCTTATGTGACTAATTTCCTTAATGTGTACCCCAAAACTATCCTAGCGCGTTCACAGGTACACCAGGTAATGCTATTCTGATTGAGCACCCAAGAGTCTC</sequence>
<map>
<chromosome>chr4</chromosome>
<position>188340930</position>
</map>
</SNP>
<SNP>
<name>rs1000031</name>
<sequence>CTTTGAGGATCTCGATGAAAAATCTGCACCTCTCCCAGAAAAATGCACCTCTGCACAGGTTCACAGATGTCTGCATACAATTTCAGGGTTCTCAGACCCTGAAGGCCACCAAGGGACCCAAGTACATGAGCCTTACACAGCACAACCTAAATCGTCAATGGCAATGTCTCAGGAGTGTAGGACAGTGACTGCCTCTGTAAGACCATCAGCACAGCCATGGCCACACATGTTGTCTGGAGGATCAGGTGGCCTTTTTCTGTGGCTTTTGAGGTTGAGGCTGGGTACCCTTGTGGCTAATGCATAATGCCAGGATGGCCAATAAAGACACCATAAAAATTCCCTGCCGTGTGCCTGACACTGGACAGATTTAATCTCCAGGTCTTCTGGGAACCCCGCAgaggcaggggctgttttctcattttactgatggaaactgaggctcaaggaagtgaaggaatttgtttcaagtcccaggcagtacca[C/T]gaacatgggatttgaaatcacgcaagtctgacACGCAAACCTTGGTTCTTTCCTTTTTCCCTTCTCACAGAGGGTGCTTTTCGCTTCCCGGAAGCTGGCAGGGAGTTCCTCTAAAGCGCAGGTTGGAGTGGTCAGAAGGGAGCGAACTGACAGCACGAGGAAGGCTCAGCGCATGCCAGCTCCACTCACGGGAAATGACTCACTGCAGCCCTGCTGCTCTCGGGCTCCGGGGGACACATCCACATTTCCTGTATCTCGGCTAGAGCCTTGGGCAGTGTGAGCTGGCAGGGCAGATCGCTGAAGGCGGCTAGAGATAGAAAACCACCCAGCTCTGCATCCTGAGACAAAGAAGCCTTTCCCTGGGCTCATATGATAGAGGTACGTTGCctctgggcctcagttttgccatctgtaaaatagggTGAAGGTCAGATTAGATTGGGCATATTCAGTGTG</sequence>
<map>
<chromosome>chr18</chromosome>
<position>44615438</position>
</map>
</SNP>
(...)
</MyListOfSnp>



That's it
Pierre

1 comment:

nesya said...

Hey, what if I want to build a web application using a native XML database (not a relational database such as Oracle, MySQL, etc.), querying the database with XPath? It would be integrated with a web service that uses SOAP as its protocol for exchanging messages.

Can I still use eXist-db to build my web application given the requirements I described above?

Thx for sharing :)