19 December 2006

CiteXplore – integrating biomedical literature and data

Via Prosper:


CiteXplore combines literature search with text mining tools for biology. Search results are cross referenced to EBI applications based on publication identifiers. Links to full text versions are provided where available.

CiteXplore uses powerful text-mining
tools developed by EMBL-EBI researchers to link literature
and databases automatically, so that at the touch of
a button the biological terms are identified in the text
and you can call up the record of the molecule that you
are looking for.


http://www.ebi.ac.uk/citexplore/


15 December 2006

JAVA 1.6 Mustang, Derby/JavaDB and Bioinformatics.

Java1.6 nows contains an embedded SQL database engine called derby. In this post I will show how I have tested derby to store some fasta sequences. What is cool is that you can call any public java static methods directly from the SQL queries.

The source code is available here

When the SQL driver is called, it creates a new directory (here derby4fasta) where derby will store its data


File database=new File("derby4fasta");
String DRIVER="org.apache.derby.jdbc.EmbeddedDriver";
Class<?> driver=Class.forName(DRIVER);
boolean create=!database.exists();
driver.newInstance();
String url="jdbc:derby:";
Properties props= new Properties();
if(create)
{
props.setProperty("create", "true");
}
props.setProperty("user", "dba");
props.setProperty("password", "");
props.setProperty("databaseName","derby4fasta");
this.connection = DriverManager.getConnection(url,props);


I declare a static method returning the GC% of a sequence.

public static double gcPercent(String sequence)
{
int n=0;
for(int i=0;i< sequence.length();++i)
{
char base= Character.toUpperCase(sequence.charAt(i));
n+=(base=='G' || base=='C'?1:0);
}
return (double)(n/(double)sequence.length());
}


I declare a new function that will call this method.

/* the function FASTA.GC call org.lindenb.sandbox.Derby4Fasta.gcPercent() */
statement.executeUpdate(
"create function FASTA.GC( seq VARCHAR(2000) ) returns DOUBLE "+
" LANGUAGE JAVA "+
" NO SQL "+
" PARAMETER STYLE JAVA "+
" EXTERNAL NAME \'org.lindenb.sandbox.Derby4Fasta.gcPercent\'"
);




The fasta sequences are read and inserted in the database. We can now select the sequences having a GC% greater than 55%.

//find sequences having a GC% > 55%
Statement stmt= app.connection.createStatement();
ResultSet row=stmt.executeQuery("select FASTA.GC(seq)*100.0,name,length(seq) from FASTA.SEQUENCE " +
"where FASTA.GC(seq)>0.55");
while(row.next())
{
System.out.println(row.getInt(1)+"%\t"+row.getString(2)+"("+row.getInt(3)+")");
}


compiling and executing....

55% >gi|11448650|gb|BF436335.1|BF436335 7p06d03.x1 NCI_CGA
55% >gi|23257376|gb|BU583411.1|BU583411 mai04h08.y1 McCarr
57% >gi|10811235|gb|BF057339.1|BF057339 7k19e02.x1 NCI_CGA
57% >gi|11271233|gb|BF321955.1|BF321955 uz66f08.y1 NCI_CGA
57% >gi|11271233|gb|BF321955.1|BF321955 uz66f08.y1 NCI_CGA
58% >gi|13675777|gb|BG625264.1|BG625264 pgn1c.pk002.c2 Nor
58% >gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Nor
58% >gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Nor
63% >gi|84131965|gb|CV878005.1|CV878005 PDUts1172A01 Porci


That's it.

Pierre

JAVA 1.6 Mustang, StAXand Bioinformatics

about StAX via xml.com ; Most current XML APIs fall into one of two broad classes: event-based APIs like SAX and XNI or tree-based APIs like DOM and JDOM. Most programmers find the tree-based APIs to be easier to use; but such APIs are less efficient, especially with respect to memory usage. (...) However, the common streaming APIs like SAX are all push APIs. They feed the content of the document to the application as soon as they see it, whether the application is ready to receive that data or not. SAX and XNI are fast and efficient, but the patterns they require programmers to adopt are unfamiliar and uncomfortable to many developers. (...)

StAX shares with SAX the ability to read arbitrarily large documents. However, in StAX the application is in control rather than the parser. The application tells the parser when it wants to receive the next data chunk rather than the parser telling the client when the next chunk of data is ready. Furthermore, StAX exceeds SAX by allowing programs to both read existing XML documents and create new ones. Unlike SAX, StAX is a bidirectional API.

I 've tested StaX to see how it could be used to read the NCBI/TinySeqXML format.
For each xml all TSeq sequence was parsed using the StaX API (XMLEventReader). Once in memory all sequences were printed to stdout using a XMLStreamWriter.

[the source code is here]


(...)
XMLInputFactory factory = XMLInputFactory.newInstance();
factory.setProperty("javax.xml.stream.isNamespaceAware", Boolean.FALSE);
factory.setProperty("javax.xml.stream.isCoalescing", Boolean.TRUE);
/** create a XML Event parser */
XMLEventReader parser = factory.createXMLEventReader(in);
TSeq seq= null;


/** loop over the events */
while(parser.hasNext()) {
XMLEvent event = parser.nextEvent();

if(event.isStartElement())
{
StartElement start=((StartElement)event);
String localName= start.getName().getLocalPart();
if(localName.equals("TSeq"))
{
seq= new TSeq();
this.TSeqSet.addElement(seq);
}
else if(localName.equals("TSeq_seqtype"))
{
seq.type= start.getAttributeByName(new QName("value")).getValue();
}
else if(localName.equals("TSeq_gi"))
{
seq.gi= Integer.parseInt(parser.getElementText());
}
else if(localName.equals("TSeq_accver"))
{
seq.accver= parser.getElement
(...)

... and to write the sequences...
 (...)
XMLOutputFactory factory= XMLOutputFactory.newInstance();
XMLStreamWriter w= factory.createXMLStreamWriter(out);
w.writeStartDocument();
w.writeStartElement("TSeqSet");

for(TSeq seq: TSeqSet)
{
w.writeStartElement("TSeqSet");
w.writeEmptyElement("TSeq_seqtype");
w.writeAttribute("value", seq.type);
w.writeStartElement("TSeq_gi");
w.writeCharacters(String.valueOf(seq.gi));
w.writeEndElement();
w.writeStartElement("TSeq_accver");
w.writeCharacters(seq.accver);
w.writeEndElement();
w.writeStartElement("TSeq_sid");
w.writeCharacters(seq.sid);
w.writeEndElement();
w.writeStartElement("TSeq_taxid");
w.writeCharacters(String.valueOf(seq.taxid));
w.writeEndElement();
w.writeStartElement("TSeq_orgname");
w.writeCharacters(seq.orgname);
w.writeEndElement();
w.writeStartElement("TSeq_defline");
w.writeCharacters(seq.defline);
w.writeEndElement();
w.writeStartElement("TSeq_length");
w.writeCharacters(String.valueOf(seq.length));
w.writeEndElement();
w.writeStartElement("TSeq_sequence");
w.writeCharacters(seq.sequence);
w.writeEndElement();
w.writeEndElement();
}

w.writeEndElement();
w.writeEndDocument();
w.flush();
(....)

compiling and running...

pierre@linux:> javac org/lindenb/sandbox/STAXTinySeq.java

pierre@linux:> java org/lindenb/sandbox/STAXTinySeq tinyseq.xml


<?xml version="1.0" ?><TSeqSet><TSeqSet><TSeq_se
qtype value="nucleotide"/><TSeq_gi>27592135</TSeq_gi>&
lt;TSeq_accver>CB017399.1</TSeq_accver><TSeq_sid>gnl|d
bEST|16653996</TSeq_sid><TSeq_taxid>9031</TSeq_taxid&g
t;<TSeq_orgname>Gallus gallus</TSeq_orgname><TSeq_defl
ine>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus g
allus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 simila
r to Rotavirus X associated non-structural protein (RoXaN) [Mus muscu
lus] ref|XP_193795.1| similar to Rotavirus X as></TSeq_defline&
gt;<TSeq_length>671</TSeq_length><TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
TATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCGGGAAGAACAGCAACAGCAAGAAGCAATGG
CAGCAGCACATCCAGTCAGAGAAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGG
AGCTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCACATGGCTCTGTTTGATCC
CAGAAGTGATGACTACTTAGTGGTAAAAACACATTTCCAGACACACAACTTCAGAAAATGAGTGCAAGC
TTCAAGTCTGCCCTTTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTGGCTCA
ATTCTTGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTCTCACGCCCTCTGAATCAGTGCCTT



That's it.
Pierre

JAVA 1.6 Mustang, JAXB and Bioinformatics

JAXB provides a convenient way to bind an XML schema to a representation in Java code. It makes it easy for you to incorporate XML data and processing functions in applications based on Java technology without having to know much about XML itself. The Architecture for XML Binding is now included in the new Java1.6. I wanted to test JAXB to see how it could be used to parse the NCBI/TinySeqXML format. First I created a XSD description of a TSeq:

Source tinyseq.xsd
(...)
<xs:element name="TSeqSet">
<xs:annotation>
<xs:documentation>Set of sequences</xs:documentation>
</xs:annotation>
<xs:complextype>
<xs:sequence>
<xs:element ref="TSeq" maxoccurs="unbounded">
</xs:element>
</xs:sequence>
</xs:complextype>
(...)
I then invoked the binding compiler XJC.
XJC generates Java classes acorresponding to the elements. It
parsed my tinyseq xsd schema and created three files:
pierre@linux:~> xjc org/lindenb/sandbox/tinyseq.xsd -d ./ -p org.lindenb.sandbox.tinyseq
parsing a schema...
compiling a schema...
org/lindenb/sandbox/tinyseq/ObjectFactory.java
org/lindenb/sandbox/tinyseq/TSeq.java
org/lindenb/sandbox/tinyseq/TSeqSet.java
I then wrote a java class using the JAXB API to read an then write a TinySeq file .

[source is here]
/** find the JAXB context in the defined path */
JAXBContext jc = JAXBContext.newInstance("org.lindenb.sandbox.tinyseq");
Unmarshaller u = jc.createUnmarshaller();
/** read the sequence */
TSeqSet seqSet = (TSeqSet)u.unmarshal(new FileInputStream("org/lindenb/sandbox/tinyseq.xml"));

Marshaller m= jc.createMarshaller();
/** echo the sequence to stdout */
m.marshal(seqSet, System.out);
compiling and running...
jpierre@linux> javac org/lindenb/sandbox/JAXBTinySeq.java org/lindenb/sandbox/tinyseq/ObjectFactory.java
jpierre@linux> java -cp . org.lindenb.sandbox.JAXBTinySeq tinyseq.xml

<?xml version="1.0" ?><TSeqSet><TSeqSet><TSeq_se
qtype value="nucleotide"/><TSeq_gi>27592135</TSeq_gi><
TSeq_accver>CB017399.1</TSeq_accver><TSeq_sid>gnl|d
bEST|16653996</TSeq_sid><TSeq_taxid>9031</TSeq_taxid
><TSeq_orgname>Gallus gallus</TSeq_orgname><TSeq_defl
ine>pgn1c.pk016.a18 Chicken lymphoid cDNA library (pgn1c) Gallus g
allus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1 simila
r to Rotavirus X associated non-structural protein (RoXaN) [Mus muscu
lus] ref|XP_193795.1| similar to Rotavirus X as></TSeq_defline>
<TSeq_length>671</TSeq_length><TSeq_sequence>GGA
AGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGCGGGAGGTTGTCTGAGTGACTTC
ACGGGTCGCCTTTGTGCAGTACTAGATATGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCT
GGGAAGCCTGGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGATGCCCACTGAC
(...)


That's it, all the classes and the methods to store and parse the XML were generated using xjc and everything was ready for direct use.

Pierre

13 December 2006

JAVA 1.6 Mustang , Scripting and Bioinformatics.

Java 1.6 has been released and it is now open source. Among the new features of java, there is an embedded sql engine and a scripting engine. I've to tested this later one to see if I could create a simple filter for fasta sequences just like awk do with regular files. What is interesting is that you can call some (complex) methods already coded in java (see below with the trivial static functions 'reverseComplement' and 'gcPercent'.

The source code using the new Scripting API is at the bottom of this post.

It was compiled like this:

javac org/lindenb/sandox/FastaAWK.java

and here is a test: the program takes as input somes fasta sequences and it only prints the one where (the length is greater than 900 pb or the GC percent is lower than 0.45 or (the reverse complement contains ATGCTTCTTG and the name contains Xenopus)).

cat ~/roxan.fasta | java org/lindenb/sandox/FastaAWK "FastaAWK.gcPercent(sequence)< 0.45 || sequence.length>90 ||  (FastaAWK .reverseComplement(sequence).toUpperCase().indexOf('ATGCTTCTTG')!=-1 && name.indexOf('Xenopus')!=-1 )"


>gi|27592135|gb|CB017399.1|CB017399 pgn1c.pk016.a18 Chicken lymphoid cDNA librar
y (pgn1c) Gallus gallus cDNA clone pgn1c.pk016.a18 5' similar to ref|XP_176823.1
similar to Rotavirus X associated non-structural protein (RoXaN) [Mus musculus]
ref|XP_193795.1| similar to Rotavirus X associated non-structural protein (RoXa
N) [Mus musculus], mRNA sequence
GGAAGGGCTGCCCCACCATTCATCCTTTTCTCGTAGTTTGTGCACGGTGC
GGGAGGTTGTCTGAGTGACTTCACGGGTCGCCTTTGTGCAGTACTAGATA
TGCAGCAGACCTATGACATGTGGCTAAAGAAACACAATCCTGGGAAGCCT
GGAGAGGGAACACCACTCACTTCGCGAGAAGGGGAGAAACAGATCCAGAT
GCCCACTGACTATGCTGACATCATGATGGGCTACCACTGCTGGCTCTGCG
GGAAGAACAGCAACAGCAAGAAGCAATGGCAGCAGCACATCCAGTCAGAG
AAGCACAAGGAGAAGGTCTTCACCTCAGACAGTGACTCCAGCTGCTGGAG
CTATCGCTTCCCTATGGGCGAGTTCCAGCTCTGTGAAAGGTACCATGCAC
ATGGCTCTGTTTGATCCCAGAAGTGATGACTACTTAGTGGTAAAAACACA
TTTCCAGACACACAACTTCAGAAAATGAGTGCAAGCTTCAAGTCTGCCCT
TTGTAGCCATAATGTGCTCAGCTCTCGGTCTGCTGAACAGAGTCTACTTG
GCTCAATTCTTGGGGGAATCCCAGATGCTTTATTAGATTGTTTGAATGTC
TCACGCCCTCTGAATCAGTGCCTTGAGGTGCCTTCAGAAGGCTTGTGATG
GTTAGNNNTNGCATTTTGGTT
>gi|13675786|gb|BG625273.1|BG625273 pgn1c.pk002.d1 Normalized chicken lymphoid c
DNA library Gallus gallus cDNA clone pgn1c.pk002.d1 5' similar to gb|AAF05541.1|
AF188530_1 (AF188530) ubiquitous tetratricopeptide containing protein RoXaN [Hom
o sapiens]G, mRNA sequence
CAATTGATGACATCGAAACAGACTGCTCTATGGACCTGCAGTGCCTGCCA
GCTCCTGTGGCCACCTCCATCTCTGTGAGCGAGGGGCTGTCCCCTTTGCA
(...)




The source:

package org.lindenb.sandox;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

/**
* my first test for java 1.6 mustang/ ScriptEngine
* @author Pierre Lindenbaum PhD
*
* example: cat *.fasta | java org.lindenb.sandox.FastaAWK "FastaAWK.gcPercent(sequence)<0.45 || (sequence.length>900) || (FastaAWK .reverseComplement(sequence).toUpperCase().indexOf('ATGCTTCTTG')!=-1 && name.indexOf('Xenopus')!=-1 );"
*
*/

public class FastaAWK {
/** header of fasta sequence */
private String name=null;
/** dna sequence */
private StringBuilder sequence= new StringBuilder();
/** compiled user script */
private CompiledScript compiledScript=null;


/** constructor
* initialize the statements
* @param statements
*/

public FastaAWK(String args[]) throws ScriptException
{
//copy statements to add the 'importClass'
String statements[]= new String[args.length+1];
//import this class to get a handle on gcPercent and reverseComplement
statements[0]="importClass(Packages."+this.getClass().getName()+");";
System.arraycopy(args, 0, statements, 1, args.length);

//get a javascript engine
ScriptEngineManager sem = new ScriptEngineManager();
ScriptEngine scriptEngine = sem.getEngineByName("js");
ScriptEngineFactory scriptEngineFactory= scriptEngine.getFactory();
String program = scriptEngineFactory.getProgram(statements);
//compile this program
this.compiledScript=((Compilable) scriptEngine).compile(program);
}

/**
*
* @param sequence the dna sequence
* @return the GC%
*/

public static double gcPercent(String sequence)
{
int n=0;
for(int i=0;i< sequence.length();++i)
{
char base= Character.toUpperCase(sequence.charAt(i));
n+=(base=='G' || base=='C'?1:0);
}
return (double)(n/(double)sequence.length());
}

/**
* return the reverse complement of a sequence
* @param sequence the dna sequence
* @return the reverse complement of a sequence
*/

public static String reverseComplement(String sequence)
{
StringBuilder b= new StringBuilder();
for(int i=sequence.length()-1;i>=0;--i)
{
switch(sequence.charAt(i))
{
case 'A': b.append('T');break;
case 'T': b.append('A');break;
case 'G': b.append('C');break;
case 'C': b.append('G');break;
case 'a': b.append('t');break;
case 't': b.append('a');break;
case 'g': b.append('c');break;
case 'c': b.append('g');break;
default: b.append('N');break;
}
}

return b.toString();
}

/**
* print the current fasta sequence if the compiled script return true
* @throws ScriptException
*/

public void eval() throws ScriptException
{
/* bind name and sequence to javascript variable 'name' and 'seq
uence'*/

SimpleBindings bindings= new SimpleBindings();
bindings.put("name", this.name);
bindings.put("sequence", this.sequence.toString());
//invoke the script with the current binding and get the result
Object o=this.compiledScript.eval(bindings);
if(o==null || !(o instanceof Boolean ) ) return;
//if the result is true: print the fasta sequence
Boolean b=(Boolean)o;
if(b.equals(Boolean.FALSE))
{
return;
}
System.out.print(name);
for(int i=0;i< sequence.length();++i)
{
if(i%50==0) System.out.println();
System.out.print(sequence.charAt(i));
}
System.out.println();

}

/**
* @param args
*/

public static void main(String[] args)
{
try {
FastaAWK awk= new FastaAWK(args);
//loop over fasta sequences
BufferedReader in= new BufferedReader(new InputStreamReader(System.in));
String line=null;
while((line=in.readLine())!=null)
{
if(line.startsWith(">"))
{
if(awk.sequence.length()>0)
{
awk.eval();
}
awk.sequence.setLength(0);
awk.name=line.trim();
}
else
{
awk.sequence.append(line.trim());
}
}
if(awk.sequence.length()>0)
{
awk.eval();
}

in.close();
}
catch (Throwable e)
{
e.printStackTrace();
}

}

}




Pierre Lindenbaum

PS: hey I put my latest presentation (in french) for the first time on slideshare.net !

06 December 2006

Visual Pipeline Editor (Continued)

Following my post about amatea, Egon has suggested me to have a look at two softwares:

KNIME:
KNIME is a modular data exploration platform based on Eclipse that enables the user to visually create pipelines, selectively execute some or all analysis steps, and later investigate the results through interactive views on data and models.


NIME offers the possibility to extend its functionallity by creating your own nodes.

Taverna
The Taverna project aims to provide a language and software tools to facilitate easy use of workflow and distributed compute technology within the eScience community. At first glance, it works with web services, but I need to investigate a little more about it.


04 December 2006

Visual Unix Pipeline

A few days ago I was presented a really impressive demo of Amadea. In a simplistic view, it looks like a "visual unix pipeline": just drag your 'grep','sort',etc... on your desktop, draw the links, choose your data sources (mysql, excel spreadsheets...) and run your analysis.

Amadea Screenshot
Via ISoft: The central window enables the user to draw and control the execution of the transformation process. Selecting the output of one of the operators automatically updates the data grid to reflect the transformations processed by this operator on the input table. The frames on the left of the screen give access to transformation operators. The parameters of each operator can be spelt out in the right of the screen.


I wondered if there is any other tool which acts like such a visual editor and which could help people from my lab to perform some simple operation on their data ?

On the other hand and without pretention, it might be easy to create a naive version of those unix filters in java just by extending java.io.InputStream. For example, for a Grep, I would write:

public class GrepInputStream
extends InputStream
{
/** parent stream */
private InputStream parent;

/** regular expression */
private Pattern pattern;

/** byte buffer */
private byte byteBuffer[];

/** position in buffer */
private int curr;


/** constructor */
public GrepInputStream(
InputStream parent,
Pattern pattern)
{
this.parent= parent;
this.pattern=pattern;
this.byteBuffer=null;
}

@Override
public int read() throws IOException
{
/** byte buffer already exists */
if(byteBuffer!=null && curr< byteBuffer.length)
{
byte b= byteBuffer[curr++];
if(curr== byteBuffer.length)
{
byteBuffer=null;
curr=0;
}
return b;
}

while(true)
{
/** read next line from parent */
int i;
ByteArrayOutputStream byteOutputStream=new ByteArrayOutputStream();
while((i=parent.read())!=-1)
{
byteOutputStream.write(i);
if(i=='\n') break;//eol reached
}
this.byteBuffer= byteOutputStream.toByteArray();
if(byteBuffer.length==0) return -1;

/** creates the line. remove the CR if needed */
String line= new String(
this.byteBuffer,
0,
byteBuffer[byteBuffer.length-1]=='\n'?byteBuffer.length-1:byteBuffer.length
);

Matcher match= this.pattern.matcher(line);
/* this line matcth our pattern */
if(match.find()) break;
}
this.curr=1;
return this.byteBuffer[0];
}

}



Pierre