25 January 2013

Samtools tview as a library to display the BAM

I've forked samtools and modified the tview code so that it can be used as a library to display alignments. The original code is an interactive interface built on the ncurses library. I changed the C 'struct tview' and added a few callbacks to make it more object-oriented:

(...)
typedef struct AbstractTview {
    int mrow, mcol;
    (...)
    khash_t(kh_rg) *rg_hash;
    /* callbacks */
    void (*my_destroy)(struct AbstractTview* );
    void (*my_mvprintw)(struct AbstractTview* ,int,int,const char*,...);
    void (*my_mvaddch)(struct AbstractTview*,int,int,int);
    void (*my_attron)(struct AbstractTview*,int);
    void (*my_attroff)(struct AbstractTview*,int);
    void (*my_clear)(struct AbstractTview*);
    int (*my_colorpair)(struct AbstractTview*,int);
    int (*my_drawaln)(struct AbstractTview*,int,int);
    int (*my_loop)(struct AbstractTview*);
    int (*my_underline)(struct AbstractTview*);
} tview_t;

These callbacks separate the 'view' from the 'model'. Tview can now be used as a library, and a new kind of view can be created by extending 'tview_t' (the 'struct AbstractTview' above). I've created two new interfaces. The first one produces HTML:


samtools tview -d H examples/sorted.bam  -p seq1:20 
seq1:20
 21        31        41        51        61        71        81                 
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGSTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTG
ATGTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGGTG
ATGTGTGGTTTAACTCGTCC   GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTG
ATGTGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
ATGTGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
TTTTTTGTTTTAACTCTTCTCT       CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTG
TTTTTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
TTTTTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
ATGTGTGGTTTAACTCGTCCATGG      ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               
ATGTGTGGTTTAACTCGTCCATGG       TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCATGG       TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCCTGGCCCA   TTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGG              
ATGTGTGGTTTAACTCGTCCATGGCCCAG   TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCCTGGCCCA    TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCATGGCCCAG   TAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGG            
ATGTGTGGTTTAACTCGTCCCTGGCCCA           CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATGGCCCAG          CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATTGCCCAGC         CTGTGGACCCTGCAGCCTGGCTGTGGGGGGCGCCG      
ATGTGTGGTTTAACTCGTCCATTGCCCAGC          TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
ATGTGTGGTTTAACTCGTCCATTGCCCAGC          TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
atgtgtggtttaactcgtccatggcccagcatt       TGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG   TGTGGACCCTGCAGCCTGGCTGGGGGGGGCGCAGT     
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG    GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTTGGG    GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
  GTGTGGTTTAACTCGTCCATGGCCCAGCATTAGGG    GTGGACCCTGCAGCCTGGCTGGGGGGGGCACGGGG    
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC GTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTG    
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC   GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
    GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC   GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
        TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT    CCTGCAGCCTGGCTGTGGGGGCCGCAGCGGGTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
         TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG      GCAGCCTGGCTGTGGGGGCCGCAGTGGCTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGGTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGCTG
           AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA             CTGTGGGGGCCGCAGTGGCTG
                 TCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGGTG
                 TCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTG
                                           GGACCCTGCAGCCTGGCTGTGGGGGCCGCTGTGGG  
                                                            TGTGGGGGCCGCAGTGGCTG
                                                            TGTGGGGGCCGCAGTGGCTG
                                                            TGTGGGGGCCGCAGTGGCTG

The second one produces plain text (with colors on a terminal):

samtools tview -d t examples/sorted.bam  -p seq1:23

        31        41        51        61        71        81        91          
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
TGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTGGGGGCCGCAGTGGSTGAGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG  GCCTGGCTGTGGGGGCACCAGCCGCTGCGG
TGTGGTTTAACTCGTCCATGGCCCAGCATTAGGGCGCTGTGGACCCTGC       CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGT     GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGGTGAGG
TGTGGTTTAACTCGTCC   GCCCAGCATTAGGGAGCTGTGGACCCTGCAGCCTG CTGTGGGGGCCGCAGTGGCTGAGG
TGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TGTGGTTTAACTCGTCC         CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TTTGTTTTAACTCTTCTCT       CATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGTTG          GGCTGAGG
TTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG
TTTGTTTTAACTCTTCTCT        ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG
TGTGGTTTAACTCGTCCATGG      ATTAGGGAGCTGTGGACCCTGCAGCCTGGCTGGGG               AGG
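
The callback approach described above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions, not the code of the fork: 'miniview_t', 'textview_t', 'text_view_new', 'draw_read' and 'text_view_print' are hypothetical names, and the struct keeps just two callbacks. The drawing code only talks to the callbacks, so the same code could feed a text view, an HTML view or an ncurses view.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, reduced version of the view struct: the screen size
   plus two callbacks. The real tview_t has many more members. */
typedef struct MiniView {
    int mrow, mcol;
    void (*my_mvaddch)(struct MiniView *, int, int, int);
    void (*my_destroy)(struct MiniView *);
} miniview_t;

/* A concrete view "extends" the base struct by embedding it as its
   first member, so a miniview_t* can be cast back to a textview_t*. */
typedef struct TextView {
    miniview_t base;
    char *screen; /* mrow * mcol characters, row after row */
} textview_t;

/* Callback: store one character in the off-screen buffer. */
static void text_mvaddch(miniview_t *v, int row, int col, int ch)
{
    textview_t *tv = (textview_t *)v;
    if (row < 0 || row >= v->mrow || col < 0 || col >= v->mcol)
        return; /* outside the screen: ignore */
    tv->screen[(size_t)row * (size_t)v->mcol + (size_t)col] = (char)ch;
}

/* Callback: release everything owned by the text view. */
static void text_destroy(miniview_t *v)
{
    textview_t *tv = (textview_t *)v;
    free(tv->screen);
    free(tv);
}

/* Constructor: allocate a blank screen and plug in the callbacks. */
miniview_t *text_view_new(int mrow, int mcol)
{
    textview_t *tv;
    size_t n;
    if (mrow <= 0 || mcol <= 0) return NULL;
    n = (size_t)mrow * (size_t)mcol;
    tv = malloc(sizeof *tv);
    if (tv == NULL) return NULL;
    tv->screen = malloc(n);
    if (tv->screen == NULL) { free(tv); return NULL; }
    memset(tv->screen, ' ', n);
    tv->base.mrow = mrow;
    tv->base.mcol = mcol;
    tv->base.my_mvaddch = text_mvaddch;
    tv->base.my_destroy = text_destroy;
    return &tv->base;
}

/* "Model" side: draws through the callback only and never needs to
   know which concrete view it is talking to. */
void draw_read(miniview_t *v, int row, int col, const char *bases)
{
    assert(v != NULL && v->my_mvaddch != NULL);
    for (; *bases != '\0'; ++bases, ++col)
        v->my_mvaddch(v, row, col, *bases);
}

/* Dump the buffer of a text view, one line per screen row. */
void text_view_print(miniview_t *v, FILE *out)
{
    textview_t *tv = (textview_t *)v;
    int r;
    for (r = 0; r < v->mrow; ++r) {
        fwrite(tv->screen + (size_t)r * (size_t)v->mcol, 1, (size_t)v->mcol, out);
        fputc('\n', out);
    }
}
```

A caller would create the view with text_view_new(2, 8), draw with draw_read(v, 1, 2, "ACGT"), print the result with text_view_print(v, stdout) and finally release it with v->my_destroy(v). An HTML view would only need its own my_mvaddch and my_destroy; draw_read would stay unchanged.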

The code is available on GitHub.


That's it,

Pierre


09 January 2013

An XML schema (XSD) for GeneOntology

The GeneOntology can be downloaded as an RDF/XML file from http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz.
Although it is an RDF file, its XML structure is always the same. As a consequence, it is shipped with a DTD that describes the structure of the document (http://www.geneontology.org/dtd/go.dtd).
I've just written an XML schema (XSD) for this RDF file. This schema is available on GitHub at:
https://github.com/lindenb/xsd-sandbox/tree/master/schemas/bio/go.

Validation with xmllint

The RDF file validates successfully against my XSD schema:
$ curl "http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz" |\
 gunzip -c | grep -v "<!DOCTYPE " > go.xml

$ xmllint --noout --schema go.xsd go.xml
go.xml validates
Note: the schema leaves out the elements that are defined in the DTD but absent from the RDF file.

Code Generation with XJC

XJC can be used to generate the Java classes for this schema:
$ xjc go.xsd
parsing a schema...
compiling a schema...
org/w3/_1999/_02/_22_rdf_syntax_ns_/ObjectFactory.java
org/w3/_1999/_02/_22_rdf_syntax_ns_/RDF.java
org/w3/_1999/_02/_22_rdf_syntax_ns_/package-info.java
org/geneontology/dtds/go/AbstractRelation.java
org/geneontology/dtds/go/Go.java
org/geneontology/dtds/go/IsA.java
org/geneontology/dtds/go/NegativelyRegulates.java
org/geneontology/dtds/go/ObjectFactory.java
org/geneontology/dtds/go/PartOf.java
org/geneontology/dtds/go/PositivelyRegulates.java
org/geneontology/dtds/go/Regulates.java
org/geneontology/dtds/go/package-info.java

Java Parsing

We can now parse the GO terms in Java without writing a new parser and without any external dependency (JAXB was bundled with the JDK at the time of writing). For example, the following code reads the whole ontology into memory and prints it back to stdout as XML:
import java.io.File;
import javax.xml.bind.JAXBContext;
import javax.xml.bind.Marshaller;
import javax.xml.bind.Unmarshaller;

public class TestGo
    {
    public static void main(String[] args) throws Exception
        {
        /* JAXB context covering the two packages generated by xjc */
        JAXBContext jaxbCtxt = JAXBContext.newInstance(
            "org.geneontology.dtds.go:org.w3._1999._02._22_rdf_syntax_ns_");
        Unmarshaller unmarshaller = jaxbCtxt.createUnmarshaller();
        Marshaller marshaller = jaxbCtxt.createMarshaller();
        /* pretty-print the XML output */
        marshaller.setProperty(Marshaller.JAXB_FORMATTED_OUTPUT, Boolean.TRUE);
        /* load the whole ontology, then echo it to stdout */
        Object go = unmarshaller.unmarshal(new File("go.xml"));
        marshaller.marshal(go, System.out);
        }
    }
Compile and run:
$ javac TestGo.java \
  org/w3/_1999/_02/_22_rdf_syntax_ns_/ObjectFactory.java \
  org/geneontology/dtds/go/ObjectFactory.java

$ java TestGo  | head -n 100
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<go xmlns="http://www.geneontology.org/dtds/go.dtd#" xmlns:ns2="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <ns2:RDF>
        <term ns2:about="http://www.geneontology.org/go#GO:0000001">
            <accession>GO:0000001</accession>
            <name>mitochondrion inheritance</name>
            <synonym>mitochondrial inheritance</synonym>
            <definition>The distribution of mitochondria, including the mitochondrial genome, into daughter cells after mitosis or meiosis, mediated by interactions between mitochondria and the cytoskeleton.</definition>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0048308"/>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0048311"/>
        </term>
        <term ns2:about="http://www.geneontology.org/go#GO:0000002">
            <accession>GO:0000002</accession>
            <name>mitochondrial genome maintenance</name>
            <definition>The maintenance of the structure and integrity of the mitochondrial genome; includes replication and segregation of the mitochondrial chromosome.</definition>
            <is_a ns2:resource="http://www.geneontology.org/go#GO:0007005"/>
            <dbxref ns2:parseType="Resource">
                <database_symbol>InterPro</database_symbol>
                <reference>IPR009446</reference>

That's it,

Pierre

PS: many thanks to @bdoughan for his help on SO.