01 December 2011

Suggest some new terms for the EDAM Ontology for Bioinformatics

EDAM is an ontology of general bioinformatics concepts, including topics and data types, formats, identifiers and operations.
Is your specific subject of research (e.g. "RNA-Seq") present in this ontology? Have a look at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM. If it is not, feel free to suggest a new term in the form below. Your term might be included in the next version of the ontology, and it might be offered as a possible choice in the Bioinformatics Career Survey 2011/2012.

That's it, Pierre

20 November 2011

Processing json data with apache velocity.

I've written a tool, VelocityJson, which parses JSON data and processes it with Apache Velocity (a template engine). The (JavaCC) source code is available here:


https://github.com/lindenb/jsandbox/blob/master/src/sandbox/VelocityJson.jj

Example

Say you have defined some classes using JSON:

[
  {
    "type": "record",
    "name": "Exon",
    "fields" : [
      {"name": "start", "type": "int"},
      {"name": "end", "type": "int"}
    ]
  },
  {
    "type": "record",
    "name": "Gene",
    "fields" : [
      {"name": "chrom", "type": "string"},
      {"name": "name", "type": "string"},
      {"name": "txStart", "type": "int"},
      {"name": "txEnd", "type": "int"},
      {"name": "cdsStart", "type": "int"},
      {"name": "cdsEnd", "type": "int"},
      {"name": "exons", "type":{"type":"array","items":"Exon"}}
    ]
  } 
 ]
And here is a Velocity template transforming this JSON structure to Java:

#macro(javaName $s)$s.substring(0,1).toUpperCase()$s.substring(1)#end
#macro(setter $s)set#javaName($s)#end
#macro(getter $s)get#javaName($s)#end
#macro(javaType $f)
#if($f.type.equals("string"))
java.lang.String#elseif($f.type.equals("boolean"))
boolean#elseif($f.type.equals("long"))
long#elseif($f.type.equals("float"))
float#elseif($f.type.equals("double"))
double#elseif($f.type.equals("int"))
int#elseif($f.items)
$f.items#elseif($f.type.type.equals("array"))
java.util.List<#javaType($f.type)>#else
$f.type
#end
#end

#foreach( $class in $avro)

class $class.name
{
#foreach( $field in $class.fields )
private  #javaType($field) $field.name;
#end

public ${class.name}()
 {
 }

public ${class.name}(#foreach( $field in $class.fields )
 #if($velocityCount>1),#end#javaType($field) $field.name
 #end
 )
 {
 #foreach( $field in $class.fields )
 this.$field.name=$field.name;
 #end
 }
 


#foreach( $field in $class.fields )
public void #setter($field.name)(#javaType($field) $field.name)
 {
 this.$field.name=$field.name;
 }
public #javaType($field) #getter($field.name)()
 {
 return this.$field.name;
 }
#end
}
#end
The json file can be processed with velocity using the following command line:

$ java -jar velocityjson.jar -f avro structure.json json2java.vm

Result

class Exon
{
private  int start;
private  int end;

public Exon()
 {
 }

public Exon( int start
  ,int end
  )
 {
  this.start=start;
  this.end=end;
  }
 


public void setStart(int start)
 {
 this.start=start;
 }
public int getStart()
 {
 return this.start;
 }
public void setEnd(int end)
 {
 this.end=end;
 }
public int getEnd()
 {
 return this.end;
 }
}

class Gene
{
private  java.lang.String chrom;
private  java.lang.String name;
private  int txStart;
private  int txEnd;
private  int cdsStart;
private  int cdsEnd;
private  java.util.List<Exon> exons;

public Gene()
 {
 }

public Gene( java.lang.String chrom
  ,java.lang.String name
  ,int txStart
  ,int txEnd
  ,int cdsStart
  ,int cdsEnd
  ,java.util.List<Exon> exons
  )
 {
  this.chrom=chrom;
  this.name=name;
  this.txStart=txStart;
  this.txEnd=txEnd;
  this.cdsStart=cdsStart;
  this.cdsEnd=cdsEnd;
  this.exons=exons;
  }
 


public void setChrom(java.lang.String chrom)
 {
 this.chrom=chrom;
 }
public java.lang.String getChrom()
 {
 return this.chrom;
 }
public void setName(java.lang.String name)
 {
 this.name=name;
 }
public java.lang.String getName()
 {
 return this.name;
 }
public void setTxStart(int txStart)
 {
 this.txStart=txStart;
 }
public int getTxStart()
 {
 return this.txStart;
 }
public void setTxEnd(int txEnd)
 {
 this.txEnd=txEnd;
 }
public int getTxEnd()
 {
 return this.txEnd;
 }
public void setCdsStart(int cdsStart)
 {
 this.cdsStart=cdsStart;
 }
public int getCdsStart()
 {
 return this.cdsStart;
 }
public void setCdsEnd(int cdsEnd)
 {
 this.cdsEnd=cdsEnd;
 }
public int getCdsEnd()
 {
 return this.cdsEnd;
 }
public void setExons(java.util.List<Exon> exons)
 {
 this.exons=exons;
 }
public java.util.List<Exon> getExons()
 {
 return this.exons;
 }
}
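
The type mapping done by the #javaType macro can also be sketched outside Velocity. The Python snippet below is only an illustration and is not part of the tool: the names `PRIMITIVES`, `java_type` and `java_fields` are made up for this example, and it assumes the same Avro-like JSON structure as the one shown above.

```python
import json

# Primitive type names mapped to Java types, mirroring the #javaType macro.
PRIMITIVES = {
    "string": "java.lang.String",
    "boolean": "boolean",
    "long": "long",
    "float": "float",
    "double": "double",
    "int": "int",
}

def java_type(type_decl):
    """Return the Java type for the value of a field's "type" key."""
    if isinstance(type_decl, dict):
        if type_decl.get("type") == "array":
            return "java.util.List<%s>" % java_type(type_decl["items"])
        raise ValueError("unsupported type: %r" % (type_decl,))
    # Unknown names (e.g. "Exon") are assumed to be other records.
    return PRIMITIVES.get(type_decl, type_decl)

def java_fields(record):
    """Return the private field declarations for one record."""
    return ["private %s %s;" % (java_type(f["type"]), f["name"])
            for f in record["fields"]]

schema = json.loads("""
[{"type": "record", "name": "Gene", "fields": [
   {"name": "chrom", "type": "string"},
   {"name": "txStart", "type": "int"},
   {"name": "exons", "type": {"type": "array", "items": "Exon"}}]}]
""")

declarations = java_fields(schema[0])
for line in declarations:
    print(line)
```

A template engine keeps the generated text readable; the sketch only shows that the mapping itself is a small lookup plus one recursive case for arrays.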


That's it,

Pierre

16 November 2011

"VCF annotation" with the NHLBI GO Exome Sequencing Project (JAX-WS)

The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders."
In the current post, I'll show how I've used this web service to annotate a VCF file with this information.
The web service provided by the ESP is based on the SOAP protocol.
Here is an example of the XML response:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ns3:local xmlns:ns2="http://webservice.evs.gs.washington.edu/" xmlns:ns3="uri">
<chromosome>1</chromosome>
<start>120457968</start>
<stop>120457969</stop>
<strand>+</strand>
<snpList>
<positionString>1:120457968</positionString>
<chrPosition>120457968</chrPosition>
<alleles>T/C</alleles>
<uaAlleleCounts>1/2701</uaAlleleCounts>
<aaAlleleCounts>0/2176</aaAlleleCounts>
<totalAlleleCounts>1/4877</totalAlleleCounts>
<uaAlleleAndCount>T=1/C=2701</uaAlleleAndCount>
<aaAlleleAndCount>T=0/C=2176</aaAlleleAndCount>
<totalAlleleAndCount>T=1/C=4877</totalAlleleAndCount>
<uaMAF>0.037</uaMAF>
<aaMAF>0.0</aaMAF>
<totalMAF>0.0205</totalMAF>
<avgSampleReadDepth>198</avgSampleReadDepth>
<geneList>NOTCH2</geneList>
<snpFunction>
<chromosome>1</chromosome>
<position>120457968</position>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>5.5</conservationScoreGERP>
<snpFxnList>
<mrnaAccession>NM_024408</mrnaAccession>
<fxnClassGVS>missense</fxnClassGVS>
<aminoAcids>MET,ILE</aminoAcids>
<proteinPos>2459/2472</proteinPos>
<cdnaPos>7377</cdnaPos>
<pphPrediction>unknown</pphPrediction>
<granthamScore>10</granthamScore>
</snpFxnList>
<refAllele>C</refAllele>
<ancestralAllele>C</ancestralAllele>
<firstRsId>0</firstRsId>
<secondRsId>0</secondRsId>
<filters>PASS</filters>
<clinicalLink>unknown</clinicalLink>
</snpFunction>
<conservationScore>1.0</conservationScore>
<conservationScoreGERP>5.5</conservationScoreGERP>
<refAllele>C</refAllele>
<altAlleles>T</altAlleles>
<ancestralAllele>C</ancestralAllele>
<chromosome>1</chromosome>
<hasAtLeastOneAccession>true</hasAtLeastOneAccession>
<rsIds>none</rsIds>
<filters>PASS</filters>
<clinicalLink>unknown</clinicalLink>
</snpList>
<setOfSiteCoverageInfo>
<chromosome>1</chromosome>
<position>120457968</position>
<totalSamplesCovered>2439</totalSamplesCovered>
<avgSampleReadDepth>198.0</avgSampleReadDepth>
<eaSamplesCovered>1351</eaSamplesCovered>
<avgEaSampleReadDepth>202.0</avgEaSampleReadDepth>
<aaSamplesCovered>1088</aaSamplesCovered>
<avgAaSampleReadDepth>194.0</avgAaSampleReadDepth>
</setOfSiteCoverageInfo>
<setOfSiteCoverageInfo>
<chromosome>1</chromosome>
<position>120457969</position>
<totalSamplesCovered>2439</totalSamplesCovered>
<avgSampleReadDepth>197.0</avgSampleReadDepth>
<eaSamplesCovered>1351</eaSamplesCovered>
<avgEaSampleReadDepth>201.0</avgEaSampleReadDepth>
<aaSamplesCovered>1088</aaSamplesCovered>
<avgAaSampleReadDepth>193.0</avgAaSampleReadDepth>
</setOfSiteCoverageInfo>
</ns3:local>
We can generate the java classes for a client invoking this Web Service by using ${JAVA_HOME}/bin/wsimport.
$ wsimport -keep "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl"

parsing WSDL...
generating code...
compiling code...

Here is the Java code running this client. It scans the VCF, calls the web service for each variation and inserts the annotation as JSON in a new column.
/**
* Author:
* Pierre Lindenbaum PhD
* WWW:
* http://plindenbaum.blogspot.com
* Date:
* 2011-11-16
* Motivation:
* annotate VCF with data from http://evs.gs.washington.edu/EVS/
*/
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.lang.reflect.Method;
import java.util.List;
import java.util.regex.Pattern;
import edu.washington.gs.evs.webservice.*;
/* first, generate the classes with wsimport -keep "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl" */
public class EVSClient
{
public static void main(String[] args)
{
try
{
Pattern tab=Pattern.compile("[\t]");
DataQueryService service=new DataQueryService();
DataQuery port=service.getDataQueryPort();
BufferedReader in=new BufferedReader(new InputStreamReader(System.in));
String line;
while((line=in.readLine())!=null)
{
if(line.startsWith("#"))
{
System.out.print(line);
if(!line.startsWith("##"))
{
System.out.print("\tEVS");
}
System.out.println();
continue;
}
String tokens[]=tab.split(line);
int position=Integer.parseInt(tokens[1]);
//calls the service chr:start-end
EvsData data=port.getEvsData(tokens[0]+":"+position+"-"+(position));
System.out.print(line);
System.out.print("\t");
if(data==null || data.getStart()!=position)
{
System.out.print(".");
}
else
{
printjson(data);
}
System.out.println();
}
}
catch(Throwable err)
{
err.printStackTrace();
}
}
/** transforms a java object to json using reflection */
private static void printjson(Object o)throws Exception
{
if(o==null)
{
System.out.print("null");
}
else if(o instanceof Number || o.getClass()==Boolean.class)
{
System.out.print(o.toString());
}
else if(o.getClass()==String.class)
{
String s=o.toString();
System.out.print("\"");
for(int i=0;i< s.length();++i)
{
switch(s.charAt(i))
{
case '\"': System.out.print("\\\"");break;
// a single quote is printed as-is: JSON defines no \' escape sequence
case '\\': System.out.print("\\\\");break;
case '\n': System.out.print("\\n");break;
case '\t': System.out.print("\\t");break;
default:System.out.print(s.charAt(i));break;
}
}
System.out.print("\"");
}
else if(o instanceof List)
{
@SuppressWarnings("rawtypes")
List L=(List)o;
System.out.print("[");
for(int i=0;i< L.size();++i)
{
if(i>0) System.out.print(",");
printjson(L.get(i));
}
System.out.print("]");
}
else
{
boolean first=true;
System.out.print("{");
for(Method method:o.getClass().getMethods())
{
String name=method.getName();
if(name.equals("getClass")) continue;
if(!name.startsWith("get")) continue;
if(method.getParameterTypes().length != 0) continue;
if(Void.TYPE.equals(method.getReturnType())) continue;
if(!first) System.out.print(",");
first=false;
name=name.substring(3);
printjson(name.substring(0, 1).toLowerCase()+name.substring(1));
System.out.print(":");
printjson(method.invoke(o));
}
System.out.print("}");
}
}
}
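
The line-by-line logic of the client (copy the "##" lines, add a column name to the "#CHROM" line, append one cell to every variant) does not depend on the web service itself. Here is a minimal Python sketch of that loop only. The function `annotate_vcf` and the callback `fake_service` are hypothetical names invented for this illustration; the real client calls the SOAP service instead.

```python
import io
import json

def annotate_vcf(lines, annotate, column="EVS"):
    """Yield VCF lines with one extra column.

    `annotate` is a hypothetical callback taking (chrom, pos) and returning
    a JSON-serialisable object, or None when nothing is known.
    """
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            yield line                       # meta lines are copied unchanged
        elif line.startswith("#"):
            yield line + "\t" + column       # the header line gets the new name
        else:
            tokens = line.split("\t")
            data = annotate(tokens[0], int(tokens[1]))
            if data is None:
                cell = "."                   # same placeholder as the Java code
            else:
                cell = json.dumps(data, separators=(",", ":"))
            yield line + "\t" + cell

# Stand-in for the web service: only one position is "known".
def fake_service(chrom, pos):
    return {"chromosome": chrom, "start": pos} if pos == 69511 else None

vcf = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\n"
    "1\t10469\trs117577454\n"
    "1\t69511\t.\n"
)
annotated = list(annotate_vcf(vcf, fake_service))
for row in annotated:
    print(row)
```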
... and the makefile:
evsclient.jar: EVSClient.java edu
	javac $<
	echo "Main-Class: EVSClient" > manifest.txt
	jar cvfm $@ manifest.txt EVSClient.class edu
edu:
	wsimport -keep "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl"
test: evsclient.jar
	curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz" |\
	gunzip -c |\
	java -jar evsclient.jar
clean:
	rm -rf *.class evsclient.jar edu

Result (some columns have been cut)

curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz" |\
 gunzip -c |\
 java -jar evsclient.jar 



##fileformat=VCFv4.0
##filedat=20101112
##datarelease=20100804
##samples=629
##description="Where BI calls are present, genotypes and alleles are from BI.  In there absence, UM genotypes are used.  If neither are available, no genotype information is present and the alleles are from the NCBI calls."
(...)
#CHROM POS ID EVS
1 10469 rs117577454 {"start":10469,"chromosome":"1","stop":10470,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 10583 rs58108140 {"start":10583,"chromosome":"1","stop":10584,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 11508 . {"start":11508,"chromosome":"1","stop":11509,"strand":"
(...)
1 69511 . {"start":69511,"chromosome":"1","stop":69512,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"0.5","refAllele":"A","ancestralAllele":"G","filters":"PASS","clinicalLink":"unknown","positionString":"1:69511","chrPosition":69511,"alleles":"G/A","uaAlleleCounts":"1373/47","aaAlleleCounts":"880/600","totalAlleleCounts":"2253/647","uaAlleleAndCount":"G=1373/A=47","aaAlleleAndCount":"G=880/A=600","totalAlleleAndCount":"G=2253/A=647","uaMAF":3.3099,"aaMAF":40.5405,"totalMAF":22.3103,"avgSampleReadDepth":185,"geneList":"OR4F5","snpFunction":{"chromosome":"1","position":69511,"conservationScore":"1.0","conservationScoreGERP":"0.5","snpFxnList":[{"mrnaAccession":"NM_001005484","fxnClassGVS":"missense","aminoAcids":"THR,ALA","proteinPos":"141/306","cdnaPos":421,"pphPrediction":"benign","granthamScore":"58"}],"refAllele":"A","ancestralAllele":"G","firstRsId":75062661,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"G","hasAtLeastOneAccession":"true","rsIds":"rs75062661"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":69511,"avgSampleReadDepth":185.0,"totalSamplesCovered":1452,"eaSamplesCovered":712,"avgEaSampleReadDepth":157.0,"aaSamplesCovered":740,"avgAaSampleReadDepth":211.0},{"chromosome":"1","position":69512,"avgSampleReadDepth":180.0,"totalSamplesCovered":1501,"eaSamplesCovered":739,"avgEaSampleReadDepth":153.0,"aaSamplesCovered":762,"avgAaSampleReadDepth":207.0}]}
(...)
1 901923 . {"start":901923,"chromosome":"1","stop":901924,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"5.0","refAllele":"C","ancestralAllele":"C","filters":"PASS","clinicalLink":"unknown","positionString":"1:901923","chrPosition":901923,"alleles":"A/C","uaAlleleCounts":"2/2542","aaAlleleCounts":"52/1934","totalAlleleCounts":"54/4476","uaAlleleAndCount":"A=2/C=2542","aaAlleleAndCount":"A=52/C=1934","totalAlleleAndCount":"A=54/C=4476","uaMAF":0.0786,"aaMAF":2.6183,"totalMAF":1.1921,"avgSampleReadDepth":35,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":901923,"conservationScore":"1.0","conservationScoreGERP":"5.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"missense","aminoAcids":"SER,ARG","proteinPos":"4/612","cdnaPos":12,"pphPrediction":"probably-damaging","granthamScore":"110"}],"refAllele":"C","ancestralAllele":"C","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"A","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":901923,"avgSampleReadDepth":35.0,"totalSamplesCovered":2280,"eaSamplesCovered":1272,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1008,"avgAaSampleReadDepth":38.0},{"chromosome":"1","position":901924,"avgSampleReadDepth":35.0,"totalSamplesCovered":2283,"eaSamplesCovered":1273,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1010,"avgAaSampleReadDepth":38.0}]}
1 902069 rs116147894 {"start":902069,"chromosome":"1","stop":902070,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"1.0","refAllele":"T","ancestralAllele":"T","filters":"PASS","clinicalLink":"unknown","positionString":"1:902069","chrPosition":902069,"alleles":"C/T","uaAlleleCounts":"2/320","aaAlleleCounts":"18/212","totalAlleleCounts":"20/532","uaAlleleAndCount":"C=2/T=320","aaAlleleAndCount":"C=18/T=212","totalAlleleAndCount":"C=20/T=532","uaMAF":0.6211,"aaMAF":7.8261,"totalMAF":3.6232,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902069,"conservationScore":"0.0","conservationScoreGERP":"1.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"intron","aminoAcids":"none","proteinPos":"NA","cdnaPos":-1,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"T","ancestralAllele":"T","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"C","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902069,"avgSampleReadDepth":13.0,"totalSamplesCovered":304,"eaSamplesCovered":169,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":135,"avgAaSampleReadDepth":12.0},{"chromosome":"1","position":902070,"avgSampleReadDepth":12.0,"totalSamplesCovered":338,"eaSamplesCovered":190,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":148,"avgAaSampleReadDepth":12.0}]}
1 902108 rs62639981 {"start":902108,"chromosome":"1","stop":902109,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"-8.7","refAllele":"C","ancestralAllele":"unknown","filters":"PASS","clinicalLink":"unknown","positionString":"1:902108","chrPosition":902108,"alleles":"T/C","uaAlleleCounts":"5/333","aaAlleleCounts":"0/248","totalAlleleCounts":"5/581","uaAlleleAndCount":"T=5/C=333","aaAlleleAndCount":"T=0/C=248","totalAlleleAndCount":"T=5/C=581","uaMAF":1.4793,"aaMAF":0.0,"totalMAF":0.8532,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902108,"conservationScore":"0.0","conservationScoreGERP":"-8.7","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"coding-synonymous","aminoAcids":"none","proteinPos":"36/612","cdnaPos":108,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"C","ancestralAllele":"unknown","firstRsId":62639981,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"T","hasAtLeastOneAccession":"true","rsIds":"rs62639981"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902108,"avgSampleReadDepth":13.0,"totalSamplesCovered":294,"eaSamplesCovered":170,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":124,"avgAaSampleReadDepth":13.0},{"chromosome":"1","position":902109,"avgSampleReadDepth":13.0,"totalSamplesCovered":309,"eaSamplesCovered":177,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":132,"avgAaSampleReadDepth":13.0}]}
(...)
That's it
Pierre

01 November 2011

The paper about BioStar has been published in "PLoS Computational Biology"

The article describing BioStar has been published in PLoS Computational Biology:

BioStar: An Online Question & Answer Resource for the Bioinformatics Community


Laurence D. Parnell, Pierre Lindenbaum, Khader Shameer, Giovanni Marco Dall'Olio, Daniel C. Swan, Lars Juhl Jensen, Simon J. Cockell, Brent S. Pedersen, Mary E. Mangan, Christopher A. Miller, Istvan Albert. 2011
PLoS Comput Biol 7(10)
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002216
Giovanni has already blogged about this paper here, and on my side, I've collected some tweets about this paper.

Many thanks to all the Biostar users and to the contributors of this paper.

That's it
Pierre

21 October 2011

A reference genome with or without the 'chr' prefix

The names of the chromosomes in the fasta file for the human genome are prefixed with 'chr':

$  grep ">" hg19.fa 
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
(...)
The FAIDX index for this fasta file looks like this:
chr1 249250621 6 50 51
chr2 243199373 254235646 50 51
chr3 198022430 502299013 50 51
chr4 191154276 704281898 50 51
chr5 180915260 899259266 50 51
chr6 171115067 1083792838 50 51
(...)
Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header of those BAM files. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.fai
and to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
 ln -s hg19.fa hg19_NOPREFIX.fa
The result:
ls -lah

-rw-r--r-- 1 root root 3.0G Jan  4  2011 hg19.fa
-rw-r--r-- 1 root    root   788 Jan 27  2011 hg19.fa.fai
lrwxrwxrwx 1 root    root     7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa
-rw-r--r-- 1 root    root   713 Oct 20 16:12 hg19_NOPREFIX.fa.fai
So far, this solution has worked with samtools mpileup.
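
Why does this work? The columns of a .fai line are the sequence name, its length, the byte offset of its first base in the fasta file, the number of bases per line and the number of bytes per line. Only the first column changes, and the offsets still point into the same, unmodified hg19.fa through the symbolic link. As an illustration, here is the sed command rewritten as a small Python sketch (the function name `strip_chr_prefix` is made up for this example):

```python
import re

def strip_chr_prefix(fai_lines):
    """Remove a leading 'chr' from the sequence name of each .fai line."""
    return [re.sub(r"^chr", "", line) for line in fai_lines]

# Two lines taken from the index shown above (tab-delimited).
fai = [
    "chr1\t249250621\t6\t50\t51",
    "chr2\t243199373\t254235646\t50\t51",
]
renamed = strip_chr_prefix(fai)
for line in renamed:
    print(line)
```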

That's it,

Pierre

07 October 2011

Knime4Bio: a set of custom nodes for the interpretation of NGS data with KNIME

Our paper has just been published in Bioinformatics  :-)

http://bioinformatics.oxfordjournals.org/content/early/2011/10/07/bioinformatics.btr554.abstract



Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME.
   Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon


Summary: Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.
Availability: http://code.google.com/p/knime4bio/.





That's it,
Pierre

05 October 2011

Grouping mutations/Gene=f(sample)

GroupByGene is a small C++ tool grouping the data:

  • CHROM
  • POS
  • REF
  • ALT
  • GENE
  • SAMPLE
by gene=f(sample). This tool is available on Google Code: http://code.google.com/p/variationtoolkit/source/browse/trunk/src/groupbygene.cpp
Example:
$ cat input.tsv

#CHROM	POS	REF	ALT	GENE	SAMPLE	
chr1	10	A	T	gene1	indi1
chr1	10	A	T	gene1	indi2
chr1	11	C	G	gene1	indi2
chr2	110	C	G	gene2	indi3
chr3	210	A	T	gene3	indi1
chr3	211	C	T	gene3	indi2
chr3	211	C	T	gene3	indi3
chr3	215	C	G	gene3	indi3
chr3	216	C	T	gene3	indi3
chr4	390	C	T	gene4	indi1
chr4	390	C	A	gene4	indi3

Calling "groupbygene":


$ groupbygene  --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 < input.tsv

GENE	CHROM	START	END	count SAMPLES	distinct MUTATIONS	count(indi1)	count(indi2)	count(indi3)
gene1	chr1	10	11	2	2	1	2	0
gene2	chr2	110	110	1	1	0	0	1
gene3	chr3	210	216	3	4	1	1	3
gene4	chr4	390	390	2	2	1	0	1


$ groupbygene  --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 --norefalt < input.tsv

GENE	CHROM	START	END	count SAMPLES	distinct MUTATIONS	count(indi1)	count(indi2)	count(indi3)
gene1	chr1	10	11	2	2	1	2	0
gene2	chr2	110	110	1	1	0	0	1
gene3	chr3	210	216	3	4	1	1	3
gene4	chr4	390	390	2	1	1	0	1
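
To make the grouping explicit, here is a rough Python sketch that reproduces the numbers of the two runs above from the same input. It is not the C++ implementation: the names `parse` and `group_by_gene` are invented for this example, and I assume that the per-sample columns count input rows and that --norefalt means "identify a mutation by CHROM and POS only". Both assumptions agree with the example.

```python
from collections import defaultdict

INPUT = """\
chr1 10 A T gene1 indi1
chr1 10 A T gene1 indi2
chr1 11 C G gene1 indi2
chr2 110 C G gene2 indi3
chr3 210 A T gene3 indi1
chr3 211 C T gene3 indi2
chr3 211 C T gene3 indi3
chr3 215 C G gene3 indi3
chr3 216 C T gene3 indi3
chr4 390 C T gene4 indi1
chr4 390 C A gene4 indi3
"""

def parse(text):
    """Parse whitespace-delimited CHROM POS REF ALT GENE SAMPLE rows."""
    rows = []
    for line in text.splitlines():
        chrom, pos, ref, alt, gene, sample = line.split()
        rows.append((chrom, int(pos), ref, alt, gene, sample))
    return rows

def group_by_gene(rows, use_ref_alt=True):
    """Return (samples, table); one table row per gene, sorted by gene name."""
    samples = sorted({row[5] for row in rows})
    genes = {}
    for chrom, pos, ref, alt, gene, sample in rows:
        g = genes.setdefault(gene, {"chrom": chrom, "start": pos, "end": pos,
                                    "mutations": set(), "hits": defaultdict(int)})
        g["start"] = min(g["start"], pos)
        g["end"] = max(g["end"], pos)
        # Assumption: without REF/ALT a mutation is identified by its position.
        g["mutations"].add((chrom, pos, ref, alt) if use_ref_alt else (chrom, pos))
        g["hits"][sample] += 1
    table = []
    for name in sorted(genes):
        g = genes[name]
        counts = [g["hits"].get(s, 0) for s in samples]
        n_samples = sum(1 for c in counts if c > 0)
        table.append([name, g["chrom"], g["start"], g["end"],
                      n_samples, len(g["mutations"])] + counts)
    return samples, table

rows = parse(INPUT)
samples, with_ref_alt = group_by_gene(rows)
_, without_ref_alt = group_by_gene(rows, use_ref_alt=False)
for record in with_ref_alt:
    print("\t".join(str(x) for x in record))
```

The only difference between the two runs is gene4: its two variants share position 390 but have different ALT alleles, so they collapse into one mutation when REF/ALT are ignored.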


That's it,

Pierre

Verticalize: printing the input stream vertically.

A useful tool: verticalize is a small C++ tool printing the input stream vertically. The source is available on github : https://github.com/lindenb/ccsandbox/blob/master/src/verticalize.cpp.
An Example with 1000genomes.org :

$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz"|\
gunzip  -c | grep -v "##" |\
verticalize  | head -n 30

>>>	2
$1	#CHROM	1
$2	POS   	10327
$3	ID    	rs112750067
$4	REF   	T
$5	ALT   	C
$6	QUAL  	.
$7	FILTER	PASS
$8	INFO  	DP=65;AF=0.208;CB=BC,NCBI
<<<	2

>>>	3
$1	#CHROM	1
$2	POS   	10469
$3	ID    	rs117577454
$4	REF   	C
$5	ALT   	G
$6	QUAL  	.
$7	FILTER	PASS
$8	INFO  	DP=2055;AF=0.020;CB=UM,BC,NCBI
<<<	3

(...)
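
For readers who don't want to compile the C++ source, the idea can be sketched in a few lines of Python. This is an approximation written for this post, not a port of verticalize.cpp: the function name `verticalize` is reused for convenience, and the output layout is simply copied from the example above (the first line is the header, so the first record is numbered 2).

```python
def verticalize(lines, delim="\t"):
    """Yield one block of 'index, column name, value' lines per data row."""
    it = iter(lines)
    first = next(it, None)
    if first is None:            # empty input: nothing to print
        return
    header = first.rstrip("\n").split(delim)
    width = max(len(name) for name in header)
    for nrow, line in enumerate(it, start=2):    # the header is row 1
        tokens = line.rstrip("\n").split(delim)
        yield ">>>%s%d" % (delim, nrow)
        for i, name in enumerate(header):
            value = tokens[i] if i < len(tokens) else ""
            yield "$%d%s%s%s%s" % (i + 1, delim, name.ljust(width), delim, value)
        yield "<<<%s%d" % (delim, nrow)
        yield ""

table = [
    "#CHROM\tPOS\tID\n",
    "1\t10327\trs112750067\n",
]
out = list(verticalize(table))
print("\n".join(out))
```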
That's it, Pierre

26 September 2011

PostScript as a Programming Language for Bioinformatics: mynotebook

"PostScript (PS) is an interpreted, stack-based programming language. It is best known for its use as a page description language in the electronic and desktop publishing areas."[wikipedia]. In this post, I'll show how I've used it to create a simple and lightweight view of the genome.

Introduction: just a simple postscript program

The following PS program fills a gray rectangle. You can display the result using ghostview, a2ps, etc...
%!PS
newpath
50 50 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
closepath
0.5 setgray
fill
showpage

Some global variables

The page width

/screenWidth 1000 def

The page height

/screenHeight 1000 def

The minimum 5' position

/minChromStart 1E9 def

The maximum 3' position

/maxChromEnd -1 def

The size of a genomic feature

/featureHeight 20 def

The distance between two 'ticks' for drawing the orientation

/ticksx 20 def

The font size

/theFontSize 9 def
The variable knownGene is a PS array of genes.

/knownGene [
[(uc002zkr.3) (chr22) (-) 161242...
...]
] def

Each gene is a PS array holding the structure of the UCSC knownGene table, that is to say: name, chromosome, strand, txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds:

[(uc002zmh.2) (chr22) (-) 17618410 17646177 17618910 17646134
   [17618410 17619439 17621948 17623987 17625913 17629337 17630431 17646098 ]
   [17619247 17619628 17622123 17624021 17626007 17629450 17630635 17646177 ]
]
A simple command line can be used to fetch those data:
%  curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" |\
gunzip -c | grep chr22 | head -n 20 |\
awk '{printf("[(%s) (%s) (%s) %s %s %s %s [%s] [%s] ]\n",$1,$2,$3,$4,$5,$6,$7,$9,$10);}' |\
tr "," " " > result.txt 
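
The awk/tr pipeline simply rewrites each tab-delimited row of knownGene.txt as a PS array literal. The same transformation is sketched below in Python for clarity. The function name `known_gene_to_ps` is made up, and the sample row is the gene shown above, truncated to its first ten columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds).

```python
def known_gene_to_ps(line):
    """Turn one knownGene row into a PostScript array literal."""
    cols = line.rstrip("\n").split("\t")
    name, chrom, strand, tx_start, tx_end, cds_start, cds_end = cols[:7]
    # Columns 9 and 10 (1-based) are comma-separated lists with a trailing
    # comma; like `tr "," " "`, every comma becomes a space.
    exon_starts = cols[8].replace(",", " ")
    exon_ends = cols[9].replace(",", " ")
    return "[(%s) (%s) (%s) %s %s %s %s [%s] [%s] ]" % (
        name, chrom, strand, tx_start, tx_end, cds_start, cds_end,
        exon_starts, exon_ends)

row = "\t".join([
    "uc002zmh.2", "chr22", "-", "17618410", "17646177", "17618910", "17646134", "8",
    "17618410,17619439,17621948,17623987,17625913,17629337,17630431,17646098,",
    "17619247,17619628,17622123,17624021,17626007,17629450,17630635,17646177,",
])
ps = known_gene_to_ps(row)
print(ps)
```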

Some utilities

Converting a PS object to a string

/toString
{
20 string cvs 
} bind def

Converting a string to an integer (loop over each character and update the current value)

/toInteger
{
3 dict begin
/s exch def
/i 0 def
/n 0 def
s {
  n 10 mul /n exch def
  s i get 48 sub n add /n exch def %48=ascii('0')
  i 1 add /i exch def
  } forall
n % leave n on the stack
end
} bind def
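
The loop multiplies the running value by 10 and adds the next digit, so "123" is evaluated as ((0*10+1)*10+2)*10+3. The same accumulation in Python, as an illustration only (like the PS version, this sketch does not validate its input and only makes sense for strings of decimal digits):

```python
def to_integer(s):
    """Convert a string of decimal digits to an integer, digit by digit."""
    n = 0
    for ch in s:
        n = n * 10 + (ord(ch) - 48)   # 48 == ord('0')
    return n

value = to_integer("17618410")
print(value)
```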

Convert a genomic position to an index on the page 'x' axis

/convertPos2pixel
{
minChromStart sub maxChromEnd minChromStart sub div screenWidth mul
} bind def
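
In infix notation this is x = (pos - minChromStart) / (maxChromEnd - minChromStart) * screenWidth, a plain linear mapping of the interval [minChromStart, maxChromEnd] onto [0, screenWidth]. A Python sketch of the same formula, where the globals of the PS program are passed as parameters (an assumption made to keep the example self-contained):

```python
def convert_pos_to_pixel(pos, min_chrom_start, max_chrom_end, screen_width=1000):
    """Map a genomic position linearly onto the horizontal axis of the page."""
    return (pos - min_chrom_start) / (max_chrom_end - min_chrom_start) * screen_width

left = convert_pos_to_pixel(100, 100, 200)
middle = convert_pos_to_pixel(150, 100, 200)
right = convert_pos_to_pixel(200, 100, 200)
print(left, middle, right)
```

Positions outside the interval map to values below 0 or above the page width.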

Extract the chromosome (that is to say, extract the element at index 1, i.e. the 2nd element, of the current array on the stack)

/getChrom
{
1 get
} bind def

Create a hyperlink to the UCSC genome browser

/getHyperLink
{
3 dict begin
/E exch def %% END
/S exch def %% START
/C exch def %% CHROMOSOME
[ (http://genome.ucsc.edu/cgi-bin/hgTracks?position=) C (:) S toString (-) E toString (&) (&db=hg19) ] concatstringarray
end
} bind def
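
The procedure pops the end, the start and the chromosome (in that order) and concatenates the pieces of the URL. As an illustration, a hypothetical Python equivalent is shown below. Note that this sketch joins the two parameters with a single "&", whereas the PS array above contains both (&) and (&db=hg19).

```python
def get_hyperlink(chrom, start, end, db="hg19"):
    """Build a UCSC genome browser URL for the region chrom:start-end."""
    return ("http://genome.ucsc.edu/cgi-bin/hgTracks?position=%s:%d-%d&db=%s"
            % (chrom, start, end, db))

url = get_hyperlink("chr22", 17618410, 17646177)
print(url)
```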

Paint a rectangle

/box
{
4 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
x y moveto
width 0 rlineto
0 height rlineto
width -1 mul 0 rlineto
0 height -1 mul rlineto
end
} bind def

Paint a gray gradient

/gradient
{
4 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
/i 0 def
height 2 div /i exch def

0 1 height 2 div {
	1 i height 2.0 div div sub setgray
	newpath
	x  
	y height 2 div i sub  add
	width
	i 2 mul
	box
	closepath
	fill
	i 1 sub /i exch def
	}for
newpath
0 setgray
0.4 setlinewidth
x y width height box
closepath
stroke
end
} bind def

Methods extracting data about the current gene on the PS stack.

Extract the transcription start:

/getTxStart
{
3 get
} bind def

Extract the transcription end:

/getTxEnd
{
4 get
} bind def

Extract the CDS start:

/getCdsStart
{
5 get
} bind def

Extract the CDS end:

/getCdsEnd
{
6 get
} bind def

Extract the strand:

/getStrand
{
2 get (+) eq {1} {-1} ifelse
} bind def
Get the gene name

/getKgName
{
0 get
} bind def

Get the number of exons:

/getExonCount
{
7 get length
} bind def

Get the start position of the i-th exon:

/getExonStart
{
2 dict begin
/i exch def
/gene exch def
gene 7 get i get
end
} bind def

Get the end position of the i-th exon:

/getExonEnd
{
2 dict begin
/i exch def
/gene exch def
gene 8 get i get
end
} bind def

Should we draw this gene on the page ?

/isVisible
{
1 dict begin
/gene exch def
minChromStart gene getTxEnd gt 
	{
	false
	}
	{
	gene getTxStart maxChromEnd gt
		{
		false
		}
		{
		true
		}ifelse
	}ifelse
end
}bind def
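
This is the usual interval-overlap test: a gene is hidden if it ends before the window starts or starts after the window ends. The nested ifelse can be flattened as in the following Python sketch (the PS globals are passed as parameters here, which is an assumption made for the example):

```python
def is_visible(tx_start, tx_end, min_chrom_start, max_chrom_end):
    """Return True if [tx_start, tx_end] overlaps the displayed window."""
    if min_chrom_start > tx_end:      # the gene ends before the window
        return False
    if tx_start > max_chrom_end:      # the gene starts after the window
        return False
    return True

window = (17600000, 17700000)
inside = is_visible(17618410, 17646177, *window)
before = is_visible(17000000, 17500000, *window)
after = is_visible(17800000, 17900000, *window)
print(inside, before, after)
```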

Methods for an array of genes

Loop over the genes and extract the lowest 5' index:

/getMinChromStart
{
3 dict begin
/genes exch def
/pos 10E9 def
/i 0 def
genes length {
	genes i get getTxStart pos min /pos  exch def
	i 1 add /i exch def
	}repeat
pos
end
} bind def

Loop over the genes and extract the highest 3' index:

/getMaxChromEnd
{
3 dict begin
/genes exch def
/pos -1E9 def
/i 0 def
genes length {
	genes i get getTxEnd pos max /pos  exch def
	i 1 add /i exch def
	}repeat
pos
end
} bind def

Painting ONE Gene

/paintGene
{
5 dict begin
/gene exch def %% the GENE argument
/midy featureHeight 2.0 div def %the middle of the row
/x0 gene getTxStart convertPos2pixel def % 5' side of the gene in pixel
/x1 gene getTxEnd convertPos2pixel def % 3' side of the gene in pixel
/i 0 def
0.1 setlinewidth

1 0 0 setrgbcolor

newpath
x0 midy moveto
x1 midy lineto
closepath
stroke


% paint ticks
0 1 x1 x0 sub ticksx div{
	newpath
	gene getStrand 1 eq 
		{
		x0 ticksHeight sub i add midy ticksHeight add moveto
		x0 i add midy lineto
		x0 ticksHeight sub i add midy ticksHeight sub lineto
		}
	%else
		{
		x0 ticksHeight add i add midy ticksHeight add moveto
		x0 i add midy lineto
		x0 ticksHeight add i add midy ticksHeight sub lineto
		} ifelse
	stroke
	i ticksx add /i exch def
	} for

%paint Transcript start-end
0 0 1 setrgbcolor
newpath
gene getCdsStart convertPos2pixel
midy cdsHeight 2 div sub
gene getCdsEnd convertPos2pixel gene getCdsStart convertPos2pixel sub 
cdsHeight box
closepath
fill

% loop over exons
0 /i exch def
gene getExonCount
	{
	gene i getExonStart convertPos2pixel
	midy exonHeight 2 div sub
	gene i getExonEnd convertPos2pixel gene i getExonStart convertPos2pixel sub
	exonHeight gradient
	i 1 add /i exch def
	} repeat
0 setgray
gene getTxEnd convertPos2pixel 10 add midy moveto
gene getKgName show

%URL 
[ /Rect [x0 0 x1 1 add featureHeight]
/Border [1 0 0]
/Color [1 0 0]
/Action << /Subtype /URI /URI gene getChrom gene getTxStart gene getTxEnd getHyperLink  >>
/Subtype /Link
/ANN pdfmark

end
} bind def

Paint all Genes

/paintGenes
{
3 dict begin
/genes exch def %the GENE argument (an array)
/i 0 def % loop iterator
/j 0 def % row iterator


% draw 10 vertical lines
i 0 /i exch def
0 setgray
0 1 10 {
	%draw a vertical line
	screenWidth 10 div i mul 0 moveto
	screenWidth 10 div i mul screenHeight lineto
	stroke
	% print the position at the top rotate by 90°
	screenWidth 10 div i mul 10 add screenHeight 5 sub moveto
	-90 rotate
	maxChromEnd minChromStart sub i 10 div mul minChromStart add toString show
	90 rotate
	i 1 add /i exch def
	} for

0 /i exch def
genes length {
	genes i get isVisible
		{
		gsave
		0 j  featureHeight 2 add mul translate
		genes i get paintGene
		j 1 add /j exch def
		grestore
		} if
	i 1 add /i exch def
	}repeat
end
} bind def

All in one: the postscript code

%!PS
%%Creator: Pierre Lindenbaum PhD plindenbaum@yahoo.fr http://plindenbaum.blogspot.com
%%Title: Genome Browser
%%CreationDate: 2011-09-25
%%BoundingBox: 0 0 1000 1000
%%Pages: 1
%
% curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" | gunzip -c | head -n 20 | awk '{printf("[(%s) (%s) (%s) %s %s %s %s [%s] [%s] ]\n",$1,$2,$3,$4,$5,$6,$7,$9,$10);}' | tr "," " "
%
/pdfmark where {pop} {userdict /pdfmark /cleartomark load put} ifelse
/screenWidth 1000 def
/screenHeight 1000 def
/minChromStart 1E9 def
/maxChromEnd -1 def
/featureHeight 20 def
/ticksx 20 def
/theFontSize 9 def
/Courier findfont theFontSize scalefont setfont
%% http://en.wikibooks.org/wiki/PostScript_FAQ#How_to_concatenate_strings.3F
/concatstringarray % [(a) (b) ... (z)] --> (ab...z)
{ 0 1 index { length add } forall string
0 3 2 roll
{ 3 copy putinterval
length add
}
forall pop
} bind def
/toString
{
20 string cvs
} bind def
/toInteger
{
3 dict begin
/s exch def
/i 0 def
/n 0 def
s {
n 10 mul /n exch def
s i get 48 sub n add /n exch def %48=ascii('0')
i 1 add /i exch def
} forall
n
end
} bind def
/convertPos2pixel
{
minChromStart sub maxChromEnd minChromStart sub div screenWidth mul
} bind def
/getChrom
{
1 get
} bind def
/getTxStart
{
3 get
} bind def
/getTxEnd
{
4 get
} bind def
/getCdsStart
{
5 get
} bind def
/getCdsEnd
{
6 get
} bind def
/getStrand
{
2 get (+) eq {1} {-1} ifelse
} bind def
/getKgName
{
0 get
} bind def
/getExonCount
{
7 get length
} bind def
/getExonStart
{
2 dict begin
/i exch def
/gene exch def
gene 7 get i get
end
} bind def
/getExonEnd
{
2 dict begin
/i exch def
/gene exch def
gene 8 get i get
end
} bind def
/isVisible
{
1 dict begin
/gene exch def
minChromStart gene getTxEnd gt
{
false
}
{
gene getTxStart maxChromEnd gt
{
false
}
{
true
}ifelse
}ifelse
end
}bind def
/getHyperLink
{
3 dict begin
/E exch def
/S exch def
/C exch def
[ (http://genome.ucsc.edu/cgi-bin/hgTracks?position=) C (:) S toString (-) E toString (&db=hg19) ] concatstringarray
end
} bind def
/getMinChromStart
{
3 dict begin
/genes exch def
/pos 10E9 def
/i 0 def
genes length {
genes i get getTxStart pos min /pos exch def
i 1 add /i exch def
}repeat
pos
end
} bind def
/getMaxChromEnd
{
3 dict begin
/genes exch def
/pos -1E9 def
/i 0 def
genes length {
genes i get getTxEnd pos max /pos exch def
i 1 add /i exch def
}repeat
pos
end
} bind def
/box
{
4 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
x y moveto
width 0 rlineto
0 height rlineto
width -1 mul 0 rlineto
0 height -1 mul rlineto
end
} bind def
/gradient
{
5 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
height 2 div /i exch def
0 1 height 2 div {
pop % discard the loop counter pushed by 'for'; i counts down by hand below
1 i height 2.0 div div sub setgray
newpath
x
y height 2 div i sub add
width
i 2 mul
box
closepath
fill
i 1 sub /i exch def
}for
newpath
0 setgray
0.4 setlinewidth
x y width height box
closepath
stroke
end
} bind def
/ticksHeight
{
featureHeight 0.2 mul
} bind def
/cdsHeight
{
featureHeight 0.5 mul
} bind def
/exonHeight
{
featureHeight 0.8 mul
} bind def
%*********************************
%
% paint one gene
% @arg one knownGene array
% @return void
%
/paintGene
{
5 dict begin
/gene exch def %% the GENE argument
/midy featureHeight 2.0 div def %the middle of the row
/x0 gene getTxStart convertPos2pixel def % 5' side of the gene in pixel
/x1 gene getTxEnd convertPos2pixel def % 3' side of the gene in pixel
/i 0 def
0.1 setlinewidth
1 0 0 setrgbcolor
newpath
x0 midy moveto
x1 midy lineto
closepath
stroke
% paint ticks
0 1 x1 x0 sub ticksx div{
pop % discard the loop counter pushed by 'for'; i advances by ticksx below
newpath
gene getStrand 1 eq
{
x0 ticksHeight sub i add midy ticksHeight add moveto
x0 i add midy lineto
x0 ticksHeight sub i add midy ticksHeight sub lineto
}
%else
{
x0 ticksHeight add i add midy ticksHeight add moveto
x0 i add midy lineto
x0 ticksHeight add i add midy ticksHeight sub lineto
} ifelse
stroke
i ticksx add /i exch def
} for
% paint the coding region (cdsStart-cdsEnd)
0 0 1 setrgbcolor
newpath
gene getCdsStart convertPos2pixel
midy cdsHeight 2 div sub
gene getCdsEnd convertPos2pixel gene getCdsStart convertPos2pixel sub
cdsHeight box
closepath
fill
% loop over exons
0 /i exch def
gene getExonCount
{
gene i getExonStart convertPos2pixel
midy exonHeight 2 div sub
gene i getExonEnd convertPos2pixel gene i getExonStart convertPos2pixel sub
exonHeight gradient
i 1 add /i exch def
} repeat
0 setgray
gene getTxEnd convertPos2pixel 10 add midy moveto
gene getKgName show
%URL
[ /Rect [x0 0 x1 1 add featureHeight]
/Border [1 0 0]
/Color [1 0 0]
/Action << /Subtype /URI /URI gene getChrom gene getTxStart gene getTxEnd getHyperLink >>
/Subtype /Link
/ANN pdfmark
end
} bind def
%********************************************
%
% count the number of visible genes
%
%
/countVisibleGenes
{
3 dict begin
/genes exch def %the GENE argument (an array)
/i 0 def % loop iterator
/n 0 def % the count
genes length {
genes i get isVisible
{
n 1 add /n exch def
} if
i 1 add /i exch def
}repeat
n
end
} bind def
%********************************************
%
% draw an array of genes
%
% @arg an array of knownGene
% @return void
%
/paintGenes
{
3 dict begin
/genes exch def %the GENE argument (an array)
/i 0 def % loop iterator
/j 0 def % row iterator
% draw the vertical grid lines (i = 0..10)
0 /i exch def
0 setgray
0 1 10 {
pop % discard the loop counter pushed by 'for'; i is incremented by hand below
%draw a vertical line
screenWidth 10 div i mul 0 moveto
screenWidth 10 div i mul screenHeight lineto
stroke
% print the position at the top rotate by 90°
screenWidth 10 div i mul 10 add screenHeight 5 sub moveto
-90 rotate
maxChromEnd minChromStart sub i 10 div mul minChromStart add toString show
90 rotate
i 1 add /i exch def
} for
0 /i exch def
genes length {
genes i get isVisible
{
gsave
0 j featureHeight 2 add mul translate
genes i get paintGene
j 1 add /j exch def
grestore
} if
i 1 add /i exch def
}repeat
end
} bind def
/knownGene [
[(uc002zkr.3) (chr22) (-) 16124263 16193004 16124263 16124263 [16124263 16162396 16186810 16187164 16192905 ] [16124973 16162487 16186946 16187302 16193004 ] ]
[(uc002zks.3) (chr22) (-) 16150261 16193004 16150261 16150261 [16150261 16162396 16186810 16187164 16189031 16189263 16190680 16192905 ] [16151821 16162487 16186946 16187302 16189143 16189411 16190791 16193004 ] ]
[(uc002zkt.2) (chr22) (+) 16162065 16172264 16162065 16162065 [16162065 16164481 16171951 ] [16162388 16164569 16172264 ] ]
[(uc002zku.2) (chr22) (-) 16179619 16181004 16179619 16179619 [16179619 ] [16181004 ] ]
[(uc002zkv.3) (chr22) (-) 16187164 16193004 16187164 16187164 [16187164 16189031 16189263 16190680 16192905 ] [16187302 16189143 16189378 16190791 16193004 ] ]
[(uc002zkw.2) (chr22) (-) 16240244 16240281 16240244 16240244 [16240244 ] [16240281 ] ]
[(uc002zkx.1) (chr22) (-) 16240300 16240339 16240300 16240300 [16240300 ] [16240339 ] ]
[(uc002zky.1) (chr22) (+) 16241086 16241125 16241086 16241086 [16241086 ] [16241125 ] ]
[(uc002zkz.1) (chr22) (-) 16242000 16242030 16242000 16242000 [16242000 ] [16242030 ] ]
[(uc002zla.1) (chr22) (-) 16243380 16243414 16243380 16243380 [16243380 ] [16243414 ] ]
[(uc002zlb.2) (chr22) (-) 16243908 16243948 16243908 16243908 [16243908 ] [16243948 ] ]
[(uc002zlc.1) (chr22) (-) 16245151 16245185 16245151 16245151 [16245151 ] [16245185 ] ]
[(uc002zld.2) (chr22) (-) 16245679 16245719 16245679 16245679 [16245679 ] [16245719 ] ]
[(uc002zle.1) (chr22) (-) 16248998 16249023 16248998 16248998 [16248998 ] [16249023 ] ]
[(uc002zlf.1) (chr22) (-) 16251234 16254941 16251234 16251234 [16251234 ] [16254941 ] ]
[(uc002zlg.1) (chr22) (-) 16256331 16287425 16256331 16256331 [16256331 16258184 16262903 16266928 16268136 16269872 16275206 16277747 16279194 16282144 16282477 16287253 ] [16256677 16258303 16262952 16267095 16268181 16269943 16275277 16277885 16279301 16282318 16282592 16287425 ] ]
[(uc002zlh.1) (chr22) (-) 16256331 16287425 16258185 16280411 [16256331 16258184 16266928 16268136 16269872 16275206 16277747 16279194 16280333 16282144 16282477 16287253 ] [16256677 16258303 16267095 16268181 16269943 16275277 16277885 16279301 16280589 16282318 16282592 16287425 ] ]
[(uc010gqp.2) (chr22) (-) 16256331 16287937 16258185 16287885 [16256331 16258184 16266928 16268136 16269872 16275206 16277747 16279194 16282144 16282477 16287253 ] [16256677 16258303 16267095 16268181 16269943 16275277 16277885 16279301 16282318 16282592 16287937 ] ]
[(uc002zlj.1) (chr22) (-) 16266928 16287937 16266930 16287390 [16266928 16268136 16269872 16275206 16277747 16279194 16282144 16282477 16287253 16287537 ] [16267095 16268181 16269943 16275277 16277885 16279301 16282318 16282592 16287425 16287937 ] ]
[(uc002zlk.2) (chr22) (+) 16274557 16278598 16274557 16274557 [16274557 16276480 ] [16275003 16278598 ] ]
[(uc002zll.1) (chr22) (+) 16373080 16377057 16373080 16373080 [16373080 16373829 16375448 ] [16373121 16373911 16377057 ] ]
[(uc011agd.1) (chr22) (-) 16448825 16449804 16448826 16449804 [16448825 ] [16449804 ] ]
[(uc002zln.1) (chr22) (+) 16492811 16492932 16492811 16492811 [16492811 ] [16492932 ] ]
[(uc011age.1) (chr22) (+) 17029052 17029078 17029052 17029052 [17029052 ] [17029078 ] ]
[(uc002zlo.1) (chr22) (+) 17029615 17029643 17029615 17029615 [17029615 ] [17029643 ] ]
[(uc002zlp.1) (chr22) (-) 17071647 17073700 17071766 17073440 [17071647 ] [17073700 ] ]
[(uc010gqq.2) (chr22) (+) 17082800 17095998 17082800 17082800 [17082800 17092547 17094966 17095588 ] [17083105 17092783 17095068 17095998 ] ]
[(uc002zlq.3) (chr22) (+) 17082800 17095998 17082800 17082800 [17082800 17092547 17094966 ] [17083105 17092783 17095998 ] ]
[(uc002zlr.2) (chr22) (+) 17082800 17129719 17082800 17082800 [17082800 17092547 17094966 17103730 17117929 17119468 17128057 17128552 17129416 ] [17083105 17092783 17095068 17103787 17117980 17119630 17128147 17128675 17129719 ] ]
[(uc002zls.1) (chr22) (+) 17082800 17179521 17082800 17082800 [17082800 17119468 17178385 ] [17083105 17119630 17179521 ] ]
[(uc002zlt.2) (chr22) (+) 17100506 17134580 17100506 17100506 [17100506 17100941 17103730 17117929 17119468 17128494 17129416 17134399 ] [17100610 17101050 17103787 17117980 17119630 17128675 17129538 17134580 ] ]
[(uc002zlu.2) (chr22) (-) 17227759 17229328 17227759 17227759 [17227759 17229165 ] [17228629 17229328 ] ]
[(uc002zlv.2) (chr22) (-) 17264312 17302584 17264508 17288963 [17264312 17280660 17288628 17302496 ] [17265299 17280914 17288973 17302584 ] ]
[(uc011agf.1) (chr22) (-) 17264312 17302584 17264508 17288963 [17264312 17280660 17288628 17302496 ] [17265299 17280914 17288976 17302584 ] ]
[(uc010gqr.1) (chr22) (+) 17308363 17310225 17308363 17308363 [17308363 17309431 ] [17308950 17310225 ] ]
[(uc011agg.1) (chr22) (-) 17385314 17385395 17385314 17385314 [17385314 ] [17385395 ] ]
[(uc002zlw.2) (chr22) (-) 17442828 17489112 17443622 17489004 [17442828 17444614 17445655 17446067 17446989 17449187 17450832 17468849 17472762 17488830 ] [17443766 17444719 17445752 17446158 17447254 17449273 17451083 17469057 17473066 17489112 ] ]
[(uc010gqs.1) (chr22) (-) 17446991 17489112 17449218 17489004 [17446991 17449187 17468849 17472762 17488830 ] [17448371 17449273 17469006 17473066 17489112 ] ]
[(uc002zlx.1) (chr22) (+) 17517459 17539682 17517459 17517459 [17517459 17525762 17528213 17537962 ] [17518234 17525890 17528316 17539682 ] ]
[(uc002zly.2) (chr22) (+) 17565848 17591387 17565981 17590710 [17565848 17577951 17578686 17579664 17581244 17582885 17583028 17584383 17585615 17586480 17586742 17588616 17589196 ] [17566119 17577976 17578833 17579777 17581371 17582933 17583192 17584467 17585700 17586492 17586844 17588658 17591387 ] ]
[(uc010gqt.2) (chr22) (+) 17565848 17591387 17565981 17590710 [17565848 17577951 17578686 17579664 17581244 17582885 17583028 17584383 17585615 17589196 ] [17566119 17577976 17578833 17579777 17581371 17582933 17583192 17584467 17585700 17591387 ] ]
[(uc002zlz.1) (chr22) (+) 17593917 17596582 17593917 17593917 [17593917 ] [17596582 ] ]
[(uc002zmb.2) (chr22) (-) 17597188 17602213 17600280 17602017 [17597188 ] [17602213 ] ]
[(uc002zma.2) (chr22) (-) 17597188 17602257 17600280 17600952 [17597188 17602141 ] [17601033 17602257 ] ]
[(uc002zmc.2) (chr22) (+) 17602484 17612993 17602484 17602484 [17602484 17605544 17611251 17612504 ] [17602929 17605661 17611374 17612993 ] ]
[(uc002zmd.2) (chr22) (-) 17618410 17622467 17618910 17622286 [17618410 17619439 17621948 17622282 ] [17619247 17619628 17622123 17622467 ] ]
[(uc002zme.2) (chr22) (-) 17618410 17629450 17618910 17622070 [17618410 17619439 17621948 17625913 17629337 ] [17619247 17619628 17622123 17626007 17629450 ] ]
[(uc002zmf.2) (chr22) (-) 17618410 17640169 17618910 17640141 [17618410 17619439 17621948 17623987 17625913 17629337 17630431 17640015 ] [17619247 17619628 17622123 17624021 17626007 17629450 17630635 17640169 ] ]
[(uc002zmg.2) (chr22) (-) 17618410 17640169 17618910 17640141 [17618410 17621948 17623987 17640015 ] [17619247 17622123 17624021 17640169 ] ]
[(uc002zmh.2) (chr22) (-) 17618410 17646177 17618910 17646134 [17618410 17619439 17621948 17623987 17625913 17629337 17630431 17646098 ] [17619247 17619628 17622123 17624021 17626007 17629450 17630635 17646177 ] ]
] def
knownGene getMinChromStart 10 sub /minChromStart exch def
knownGene getMaxChromEnd 10 add /maxChromEnd exch def
systemdict /userChromStart known {
userChromStart toInteger /minChromStart exch def
} if
systemdict /userChromEnd known {
userChromEnd toInteger /maxChromEnd exch def
} if
screenHeight knownGene countVisibleGenes 1 add div 20.0 min /featureHeight exch def
0 0 screenWidth 1 sub screenHeight 1 sub box
0.95 setgray
fill
knownGene paintGenes
0 0 screenWidth screenHeight box
0 setgray
stroke
showpage

Open the PS file in ghostview, evince, ...

Zooming ? Yes we can.

Ghostscript (the interpreter behind ghostview) has an option -Sname=string:
       -Sname=string
       -sname=string
              Define  a  name  in  "systemdict"  with a given string as value.
              This is different from -d.

In my postscript file, the default values for minChromStart and maxChromEnd are overridden by the user's parameters:

systemdict /userChromStart known {
	userChromStart toInteger /minChromStart  exch def
	} if

systemdict /userChromEnd known {
	userChromEnd toInteger /maxChromEnd  exch def
	} if
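
The effect of these two overrides can be sketched in Python. It is only an illustration of the logic of toInteger, convertPos2pixel and isVisible, not a replacement for the PostScript; the function names and the example window are assumptions chosen for the demonstration.

```python
SCREEN_WIDTH = 1000  # same value as /screenWidth


def to_integer(s):
    """Digit string -> int, accumulating digits like the toInteger procedure."""
    n = 0
    for ch in s:
        n = n * 10 + (ord(ch) - 48)  # 48 = ascii('0')
    return n


def pos_to_pixel(pos, min_start, max_end):
    """Linear map of a genomic position onto [0, SCREEN_WIDTH]."""
    return (pos - min_start) / (max_end - min_start) * SCREEN_WIDTH


def is_visible(tx_start, tx_end, min_start, max_end):
    """A gene is drawn only if it overlaps the current window."""
    return not (min_start > tx_end or tx_start > max_end)


if __name__ == "__main__":
    # hypothetical window, as if given with
    # -suserChromStart=16150000 -suserChromEnd=16200000
    lo, hi = to_integer("16150000"), to_integer("16200000")
    # uc002zkr.3 (16124263-16193004) overlaps this window
    print(is_visible(16124263, 16193004, lo, hi))
    # uc002zkw.2 (16240244-16240281) lies outside of it
    print(is_visible(16240244, 16240281, lo, hi))
    print(pos_to_pixel(16175000, lo, hi))
```

Narrowing the window both hides the genes outside of it and stretches the remaining ones over the full page width, which is all the "zoom" amounts to.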
That's it,

Pierre

23 September 2011

Joining genomic annotations files with the tabix API.

Tabix is a tool that is part of the samtools package. After indexing a file, tabix can quickly retrieve the data lines overlapping a genomic region (see also my previous post about tabix). Here, I wrote a tool named jointabix that joins the rows of a (chrom/start/end) file with a file indexed with tabix. I've posted the code on github at: https://github.com/lindenb/samtools-utilities/blob/master/src/jointabix.c.

Usage


$ jointabix  -h

Usage: jointabix (options) {stdin|file|gzfiles}:

  -d   column delimiter. default: TAB
  -c   chromosome column (1).
  -s   start column (2).
  -e   end column (2).
  -i   ignore lines starting with ('#').
  -t   tabix file (required).
  +1 add 1 to the genomic coodinates.
  -1 remove 1 to the genomic coodinates.
 

Example:

In the following example, I'm going to join the SNPs from the 1000 Genomes Project with the "cytoBand" table of the UCSC genome browser.

##download and index UCSC-cytobands:
$ wget -O cytoBand.txt.gz "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz"
$ gunzip cytoBand.txt.gz
$ bgzip cytoBand.txt
$ tabix -s 1 -b 2 -e 3 cytoBand.txt.gz


$ curl -s  "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" |\
   gunzip -c |\
   sed 's/^\([^#]\)/chr\1/' |\
   cut -f 1-5 |\
   jointabix -c 1 -s 2 -e 2 -1 -f cytoBand.txt.gz |\
   grep -v "##"
 
#CHROM	POS	ID	REF	ALT
chr1	10327	rs112750067	T	C	chr1	0	2300000	p36.33	gneg
chr1	10469	rs117577454	C	G	chr1	0	2300000	p36.33	gneg
chr1	10492	rs55998931	C	T	chr1	0	2300000	p36.33	gneg
chr1	10583	rs58108140	G	A	chr1	0	2300000	p36.33	gneg
chr1	11508	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	11565	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	12783	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	13116	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	13327	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	13980	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	14699	.	C	G	chr1	0	2300000	p36.33	gneg
chr1	14930	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	14933	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	14948	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	15118	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	15211	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	15274	.	A	T	chr1	0	2300000	p36.33	gneg
chr1	15820	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	16206	.	T	A	chr1	0	2300000	p36.33	gneg
chr1	16257	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	16280	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	16298	.	C	T	chr1	0	2300000	p36.33	gneg
chr1	16378	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	16495	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	16534	.	C	T	chr1	0	2300000	p36.33	gneg
chr1	16841	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	28376	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	28563	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	30860	.	G	C	chr1	0	2300000	p36.33	gneg
chr1	30885	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	30923	.	G	T	chr1	0	2300000	p36.33	gneg
chr1	31295	.	A	C	chr1	0	2300000	p36.33	gneg
chr1	31467	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	31487	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	40261	.	C	A	chr1	0	2300000	p36.33	gneg
chr1	46633	.	T	A	chr1	0	2300000	p36.33	gneg
chr1	48183	.	C	A	chr1	0	2300000	p36.33	gneg
chr1	48186	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	49272	.	G	A	chr1	0	2300000	p36.33	gneg
chr1	49298	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	49554	.	A	G	chr1	0	2300000	p36.33	gneg
chr1	51479	rs116400033	T	A	chr1	0	2300000	p36.33	gneg
chr1	51673	.	T	C	chr1	0	2300000	p36.33	gneg
chr1	51803	rs62637812	T	C	chr1	0	2300000	p36.33	gneg
chr1	51898	rs76402894	C	A	chr1	0	2300000	p36.33	gneg
chr1	52058	rs62637813	G	C	chr1	0	2300000	p36.33	gneg
chr1	52238	.	T	G	chr1	0	2300000	p36.33	gneg
chr1	52727	.	C	G	chr1	0	2300000	p36.33	gneg
chr1	54353	.	C	A	chr1	0	2300000	p36.33	gneg
(...)
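
For readers who want to see the kind of join performed here without tabix, the following Python sketch does the same lookup in memory. It is not how jointabix is implemented (jointabix queries the tabix index); the function names are hypothetical, and it assumes non-overlapping, 0-based, half-open bands and a 1-based SNP position, which is why 1 is subtracted, as with the -1 option above.

```python
import bisect
from collections import defaultdict


def build_index(bands):
    """Group (chrom, start, end, name, stain) rows by chromosome, sorted by start."""
    by_chrom = defaultdict(list)
    for chrom, start, end, name, stain in bands:
        by_chrom[chrom].append((start, end, name, stain))
    index = {}
    for chrom, rows in by_chrom.items():
        rows.sort()
        index[chrom] = ([r[0] for r in rows], rows)
    return index


def join(snps, index):
    """Yield each SNP row extended with the band containing its position."""
    for chrom, pos, *rest in snps:
        starts, rows = index.get(chrom, ([], []))
        # rightmost band whose start is <= the 0-based position
        k = bisect.bisect_right(starts, pos - 1) - 1
        if k >= 0 and pos - 1 < rows[k][1]:
            yield (chrom, pos, *rest) + (chrom,) + rows[k]


if __name__ == "__main__":
    bands = [("chr1", 0, 2300000, "p36.33", "gneg")]
    snps = [("chr1", 10327, "rs112750067", "T", "C"),
            ("chr1", 10469, "rs117577454", "C", "G")]
    for row in join(snps, build_index(bands)):
        print("\t".join(str(x) for x in row))
```

An in-memory index like this one is fine for a small table such as cytoBand; the point of tabix is that the indexed file can stay compressed on disk and be much larger.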

That's it,

Pierre

11 September 2011

The Wikipedia Template:Infobox_biodatabase is now integrated in DBpedia

In January 2011, I started the project Template:Infobox_biodatabase. The goal of this project is to annotate the biological databases in Wikipedia using an infobox. The pages annotated with this template have now been integrated into DBpedia 3.7 and it is now possible to query the data through a SPARQL endpoint.
(Note: while I was writing the new pages in Wikipedia, a few articles were proposed for deletion for notability reasons: I didn't fight against the choice of the WP editors).

Articles in category: "Biological database"

SPARQL

List the biological databases.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>

SELECT   ?title ?uri WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  OPTIONAL {
  	?uri dbpedia:title ?title.
  	}
} ORDER BY ?uri
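
The same query can also be sent from a script. The sketch below is a minimal Python example under a few assumptions: the public endpoint is taken to be http://dbpedia.org/sparql, the "format" parameter is the one understood by DBpedia's server rather than part of the SPARQL protocol itself, and the helper names build_url and rows are made up for the illustration. The parsing follows the standard SPARQL JSON results layout (results / bindings / value).

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public DBpedia endpoint

QUERY = """
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>

SELECT ?title ?uri WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  OPTIONAL { ?uri dbpedia:title ?title . }
} ORDER BY ?uri
"""


def build_url(query, endpoint=ENDPOINT):
    """Return a GET URL asking the endpoint for SPARQL JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)


def rows(sparql_json):
    """Extract (title, uri) pairs from a SPARQL JSON results document."""
    doc = json.loads(sparql_json)
    for binding in doc["results"]["bindings"]:
        # ?title sits in an OPTIONAL clause, so it may be unbound
        title = binding["title"]["value"] if "title" in binding else None
        yield title, binding["uri"]["value"]


if __name__ == "__main__":
    print(build_url(QUERY))
```

The body returned by urllib.request.urlopen(build_url(QUERY)).read() can then be passed to rows(); databases without a title, such as BindingDB in the result below, come back with None.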

Result:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| title                                                                      | uri                                                                                                                     |
========================================================================================================================================================================================================
| "3did"@en                                                                  | <http://dbpedia.org/resource/3did>                                                                                      |
| "ABCdb"@en                                                                 | <http://dbpedia.org/resource/ABCdb>                                                                                     |
| "AREsite"@en                                                               | <http://dbpedia.org/resource/AREsite>                                                                                   |
| "AlloSteric Database"@en                                                   | <http://dbpedia.org/resource/ASD_%28database%29>                                                                        |
| "AgBase"@en                                                                | <http://dbpedia.org/resource/AgBase>                                                                                    |
| "Allele frequency net"@en                                                  | <http://dbpedia.org/resource/Allele_frequency_net_database>                                                             |
| "ASTD"@en                                                                  | <http://dbpedia.org/resource/Alternative_splicing_and_transcript_diversity_database>                                    |
| "ASAP"@en                                                                  | <http://dbpedia.org/resource/Alternative_splicing_annotation_project>                                                   |
| "AmoebaDB"@en                                                              | <http://dbpedia.org/resource/AmoebaDB>                                                                                  |
| "ArachnoServer"@en                                                         | <http://dbpedia.org/resource/ArachnoServer>                                                                             |
| "ArtadeDB"@en                                                              | <http://dbpedia.org/resource/Artade>                                                                                    |
| "ASPicDB"@en                                                               | <http://dbpedia.org/resource/AspicDB>                                                                                   |
| "The Autophagy Database"@en                                                | <http://dbpedia.org/resource/Autophagy_database>                                                                        |
| "BGMUT"@en                                                                 | <http://dbpedia.org/resource/BGMUT>                                                                                     |
| "BISC"@en                                                                  | <http://dbpedia.org/resource/BISC_%28database%29>                                                                       |
| "BRENDA"@en                                                                | <http://dbpedia.org/resource/BRENDA>                                                                                    |
| "The BRENDA Tissue Ontology (BTO)"@en                                      | <http://dbpedia.org/resource/BRENDA_tissue_ontology>                                                                    |
|                                                                            | <http://dbpedia.org/resource/BindingDB>                                                                                 |
| "Bio2RDF"@en                                                               | <http://dbpedia.org/resource/Bio2RDF>                                                                                   |
| "BioGRID"@en                                                               | <http://dbpedia.org/resource/BioGRID>                                                                                   |
| "BioModels Database"@en                                                    | <http://dbpedia.org/resource/BioModels_Database>                                                                        |
| "BSDB"@en                                                                  | <http://dbpedia.org/resource/Biomolecule_stretching_database>                                                           |
| "Bovine Genome Database"@en                                                | <http://dbpedia.org/resource/Bovine_genome_database>                                                                    |
| "BriX"@en                                                                  | <http://dbpedia.org/resource/Brix_%28database%29>                                                                       |
| "CADgene"@en                                                               | <http://dbpedia.org/resource/CADgene>                                                                                   |
| "CATH"@en                                                                  | <http://dbpedia.org/resource/CATH>                                                                                      |
| "CLIPZ:"@en                                                                | <http://dbpedia.org/resource/CLIPZ>                                                                                     |
| "COSMIC"@en                                                                | <http://dbpedia.org/resource/COSMIC_cancer_database>                                                                    |
| "CaSNP"@en                                                                 | <http://dbpedia.org/resource/CaSNP>                                                                                     |
| "CancerResource:"@en                                                       | <http://dbpedia.org/resource/CancerResource>                                                                            |
| "cBARBEL"@en                                                               | <http://dbpedia.org/resource/Catfish_genome_database>                                                                   |
| "CCDB"@en                                                                  | <http://dbpedia.org/resource/Cervical_cancer_gene_database>                                                             |
|                                                                            | <http://dbpedia.org/resource/ChEBI>                                                                                     |
|                                                                            | <http://dbpedia.org/resource/ChEMBL>                                                                                    |
| "ChemProt"@en                                                              | <http://dbpedia.org/resource/ChemProt>                                                                                  |
| "ChimerDB"@en                                                              | <http://dbpedia.org/resource/ChimerDB>                                                                                  |
| "MDS_IES_DB"@en                                                            | <http://dbpedia.org/resource/Ciliate_MDS/IES_database>                                                                  |
| "Ciona intestinalis protein database"@en                                   | <http://dbpedia.org/resource/Ciona_intestinalis_protein_database>                                                       |
| "ACLAME"@en                                                                | <http://dbpedia.org/resource/Classification_of_mobile_genetic_elements>                                                 |
| "COMBREX: COMputational BRidges to EXperiments"@en                         | <http://dbpedia.org/resource/Combrex>                                                                                   |
| "CAMERA"@en                                                                | <http://dbpedia.org/resource/Community_Cyberinfrastructure_for_Advanced_Marine_Microbial_Ecology_Research_and_Analysis> |
| "CORG"@en                                                                  | <http://dbpedia.org/resource/Comparative_regulatory_genomics_database>                                                  |
| "CPLA"@en                                                                  | <http://dbpedia.org/resource/Compendium_of_protein_lysine_acetylation>                                                  |
| "Conformational dynamics data bank"@en                                     | <http://dbpedia.org/resource/Conformational_dynamics_data_bank>                                                         |
| "ConsensusPathDB"@en                                                       | <http://dbpedia.org/resource/ConsensusPathDB>                                                                           |
| "CDD"@en                                                                   | <http://dbpedia.org/resource/Conserved_domain_database>                                                                 |
| "DAnCER"@en                                                                | <http://dbpedia.org/resource/DAnCER_%28database%29>                                                                     |
| "DBASS3 and DBASS5"@en                                                     | <http://dbpedia.org/resource/DBASS3/5>                                                                                  |
| "DIMA"@en                                                                  | <http://dbpedia.org/resource/DIMA_%28database%29>                                                                       |
| "DNA Data Bank of Japan"@en                                                | <http://dbpedia.org/resource/DNA_Data_Bank_of_Japan>                                                                    |
| "PCDB"@en                                                                  | <http://dbpedia.org/resource/Database_of_protein_conformational_diversity>                                              |
| "dbCRID"@en                                                                | <http://dbpedia.org/resource/DbCRID>                                                                                    |
| "dbDNV"@en                                                                 | <http://dbpedia.org/resource/DbDNV>                                                                                     |
| "dbSNP"@en                                                                 | <http://dbpedia.org/resource/DbSNP>                                                                                     |
| "DiProDB: a database for dinucleotide properties."@en                      | <http://dbpedia.org/resource/DiProDB>                                                                                   |
| "dictyBase"@en                                                             | <http://dbpedia.org/resource/DictyBase>                                                                                 |
| "DOMINE"@en                                                                | <http://dbpedia.org/resource/Domine_Database>                                                                           |
| "DroID"@en                                                                 | <http://dbpedia.org/resource/Droid_%28database%29>                                                                      |
| "ECRbase"@en                                                               | <http://dbpedia.org/resource/ECRbase>                                                                                   |
| "ECgene"@en                                                                | <http://dbpedia.org/resource/ECgene>                                                                                    |
| "EDAS."@en                                                                 | <http://dbpedia.org/resource/EDAS>                                                                                      |
| "EMAGE"@en                                                                 | <http://dbpedia.org/resource/EMAGE>                                                                                     |
| "EMDataBank.org"@en                                                        | <http://dbpedia.org/resource/EM_Data_Bank>                                                                              |
| "ENCODE"@en                                                                | <http://dbpedia.org/resource/ENCODE>                                                                                    |
| "EcoCyc"@en                                                                | <http://dbpedia.org/resource/EcoCyc>                                                                                    |
| "Effective-"@en                                                            | <http://dbpedia.org/resource/Effective_%28database%29>                                                                  |
| "The Ensembl genome database project."@en                                  | <http://dbpedia.org/resource/Ensembl>                                                                                   |
| "EID"@en                                                                   | <http://dbpedia.org/resource/Exon-intron_database>                                                                      |
| "ExtraTrain"@en                                                            | <http://dbpedia.org/resource/ExtraTrain>                                                                                |
| "FANTOM"@en                                                                | <http://dbpedia.org/resource/FANTOM>                                                                                    |
| "FINDbase"@en                                                              | <http://dbpedia.org/resource/FINDbase>                                                                                  |
| "FREP"@en                                                                  | <http://dbpedia.org/resource/FREP>                                                                                      |
| "FishBase"@en                                                              | <http://dbpedia.org/resource/FishBase>                                                                                  |
| "FlyFactorSurvey"@en                                                       | <http://dbpedia.org/resource/FlyFactorSurvey>                                                                           |
| "Full-parasites"@en                                                        | <http://dbpedia.org/resource/Full-parasites>                                                                            |
| "FESD"@en                                                                  | <http://dbpedia.org/resource/Functional_element_SNPs_database>                                                          |
| "FGDB"@en                                                                  | <http://dbpedia.org/resource/Fusarium_graminearum_genome_database>                                                      |
| "GISSD"@en                                                                 | <http://dbpedia.org/resource/GISSD>                                                                                     |
| "GPnotebook"@en                                                            | <http://dbpedia.org/resource/GPnotebook>                                                                                |
| "GPCRDB"@en                                                                | <http://dbpedia.org/resource/G_protein-coupled_receptors_database>                                                      |
| "GenBank"@en                                                               | <http://dbpedia.org/resource/GenBank>                                                                                   |
| "Genetic codes"@en                                                         | <http://dbpedia.org/resource/Genetic_codes_%28database%29>                                                              |
| "GlycomeDB"@en                                                             | <http://dbpedia.org/resource/GlycomeDB>                                                                                 |
| "GyDB of mobile genetic elements:"@en                                      | <http://dbpedia.org/resource/Gypsy_%28database%29>                                                                      |
| "The H-Invitational"@en                                                    | <http://dbpedia.org/resource/H-Invitational>                                                                            |
| "HGNC"@en                                                                  | <http://dbpedia.org/resource/HUGO_Gene_Nomenclature_Committee>                                                          |
| "HitPredict"@en                                                            | <http://dbpedia.org/resource/HitPredict>                                                                                |
| "HOLLYWOOD"@en                                                             | <http://dbpedia.org/resource/Hollywood_%28database%29>                                                                  |
| "HUMHOT"@en                                                                | <http://dbpedia.org/resource/HumHot>                                                                                    |
| "H-DBAS"@en                                                                | <http://dbpedia.org/resource/Human-transcriptome_database_for_alternative_splicing>                                     |
| "Hymenoptera Genome Database"@en                                           | <http://dbpedia.org/resource/Hymenoptera_genome_database>                                                               |
| "IGRhCellID"@en                                                            | <http://dbpedia.org/resource/IGRhCellID>                                                                                |
| "IUPHAR-DB."@en                                                            | <http://dbpedia.org/resource/IUPHAR_%28database%29>                                                                     |
| "InSatDb"@en                                                               | <http://dbpedia.org/resource/InSatDb>                                                                                   |
|                                                                            | <http://dbpedia.org/resource/Indian_Genetic_Disease_Database_%28IGDD%29>                                                |
| "InterPro"@en                                                              | <http://dbpedia.org/resource/InterPro>                                                                                  |
| "INTERFEROME"@en                                                           | <http://dbpedia.org/resource/Interferome>                                                                               |
| "IKMC: International Knockout Mouse Consortium"@en                         | <http://dbpedia.org/resource/International_Knockout_Mouse_Consortium>                                                   |
| "Intronerator"@en                                                          | <http://dbpedia.org/resource/Intronerator>                                                                              |
| "ISfinder"@en                                                              | <http://dbpedia.org/resource/Isfinder>                                                                                  |
| "Islander"@en                                                              | <http://dbpedia.org/resource/Islander_%28database%29>                                                                   |
| "IsoBase"@en                                                               | <http://dbpedia.org/resource/IsoBase>                                                                                   |
| "KEGG"@en                                                                  | <http://dbpedia.org/resource/KEGG>                                                                                      |
| "KUPS"@en                                                                  | <http://dbpedia.org/resource/KUPS_%28database%29>                                                                       |
| "KaPPA-View4"@en                                                           | <http://dbpedia.org/resource/KaPPA-View4>                                                                               |
| "L1Base"@en                                                                | <http://dbpedia.org/resource/L1Base>                                                                                    |
| "Laminin database"@en                                                      | <http://dbpedia.org/resource/Laminin_database>                                                                          |
| "LarvalBase"@en                                                            | <http://dbpedia.org/resource/LarvalBase>                                                                                |
| "lncRNAdb"@en                                                              | <http://dbpedia.org/resource/LncRNAdb>                                                                                  |
| "LocDB"@en                                                                 | <http://dbpedia.org/resource/LocDB>                                                                                     |
| "mESAdb"@en                                                                | <http://dbpedia.org/resource/MESAdb>                                                                                    |
| "MICdb"@en                                                                 | <http://dbpedia.org/resource/MICdb>                                                                                     |
| "MPromDb"@en                                                               | <http://dbpedia.org/resource/Mammalian_promoter_database>                                                               |
| "MatrixDB, the extracellular matrix interaction database."@en              | <http://dbpedia.org/resource/MatrixDB>                                                                                  |
| "MetaCyc"@en                                                               | <http://dbpedia.org/resource/MetaCyc>                                                                                   |
| "MethDB-"@en                                                               | <http://dbpedia.org/resource/MethDB>                                                                                    |
| "miRBase"@en                                                               | <http://dbpedia.org/resource/MiRBase>                                                                                   |
| "miRGator"@en                                                              | <http://dbpedia.org/resource/MiRGator>                                                                                  |
| "miRTarBase"@en                                                            | <http://dbpedia.org/resource/MiRTarBase>                                                                                |
| "ModBase"@en                                                               | <http://dbpedia.org/resource/ModBase>                                                                                   |
| "The Mouse Genome Database"@en                                             | <http://dbpedia.org/resource/Mouse_Genome_Database>                                                                     |
| "The mouse Gene Expression Database"@en                                    | <http://dbpedia.org/resource/Mouse_gene_expression_database>                                                            |
| "MIPS"@en                                                                  | <http://dbpedia.org/resource/Munich_Information_Center_for_Protein_Sequences>                                           |
| "NCBI Epigenomics"@en                                                      | <http://dbpedia.org/resource/NCBI_Epigenomics>                                                                          |
| "PID"@en                                                                   | <http://dbpedia.org/resource/NCI-Nature_Pathway_Interaction_Database>                                                   |
| "NGSmethDB"@en                                                             | <http://dbpedia.org/resource/NGSmethDB>                                                                                 |
| "neXtProt"@en                                                              | <http://dbpedia.org/resource/NeXtProt>                                                                                  |
| "NetPath"@en                                                               | <http://dbpedia.org/resource/Netpath>                                                                                   |
| "NeuroLex"@en                                                              | <http://dbpedia.org/resource/NeuroLex>                                                                                  |
| "Non-B DB"@en                                                              | <http://dbpedia.org/resource/Non-B_database>                                                                            |
| "NPRD"@en                                                                  | <http://dbpedia.org/resource/Nucleosome_positioning_region_database>                                                    |
| "OMPdb"@en                                                                 | <http://dbpedia.org/resource/OMPdb>                                                                                     |
| "TOPSAN"@en                                                                | <http://dbpedia.org/resource/Open_protein_structure_annotation_network>                                                 |
| "ODB"@en                                                                   | <http://dbpedia.org/resource/Operon_database>                                                                           |
| "OriDB"@en                                                                 | <http://dbpedia.org/resource/OriDB>                                                                                     |
| "Orientations of Proteins in Membranes"@en                                 | <http://dbpedia.org/resource/Orientations_of_Proteins_in_Membranes_database>                                            |
| "OrthoDB"@en                                                               | <http://dbpedia.org/resource/OrthoDB>                                                                                   |
| "OMA"@en                                                                   | <http://dbpedia.org/resource/Orthologous_MAtrix>                                                                        |
| "P2CS"@en                                                                  | <http://dbpedia.org/resource/P2CS>                                                                                      |
| "PANDIT"@en                                                                | <http://dbpedia.org/resource/PANDIT_%28database%29>                                                                     |
| "PCRPi-DB"@en                                                              | <http://dbpedia.org/resource/PCRPi-DB>                                                                                  |
| "PDBSum"@en                                                                | <http://dbpedia.org/resource/PDBsum>                                                                                    |
| "PROSITE"@en                                                               | <http://dbpedia.org/resource/PROSITE>                                                                                   |
| "PSORTdb"@en                                                               | <http://dbpedia.org/resource/PSORTdb>                                                                                   |
| "ParameciumDB"@en                                                          | <http://dbpedia.org/resource/ParameciumDB>                                                                              |
| "Pathway Commons"@en                                                       | <http://dbpedia.org/resource/Pathway_commons>                                                                           |
| "Patome"@en                                                                | <http://dbpedia.org/resource/Patome>                                                                                    |
| "PREX"@en                                                                  | <http://dbpedia.org/resource/Peroxiredoxin_classification_index>                                                        |
| "Pfam"@en                                                                  | <http://dbpedia.org/resource/Pfam>                                                                                      |
| "PhEVER"@en                                                                | <http://dbpedia.org/resource/PhEVER>                                                                                    |
| "PHOSIDA"@en                                                               | <http://dbpedia.org/resource/Phosida>                                                                                   |
| "Phospho.ELM"@en                                                           | <http://dbpedia.org/resource/Phospho.ELM>                                                                               |
| "Phospho3D"@en                                                             | <http://dbpedia.org/resource/Phospho3D>                                                                                 |
| "PhylomeDB"@en                                                             | <http://dbpedia.org/resource/PhylomeDB>                                                                                 |
| "PlasmoDB"@en                                                              | <http://dbpedia.org/resource/PlasmoDB>                                                                                  |
| "PmiRKB"@en                                                                | <http://dbpedia.org/resource/PmiRKB>                                                                                    |
| "PolyQ"@en                                                                 | <http://dbpedia.org/resource/PolyQ_%28database%29>                                                                      |
| "PolymiRTS"@en                                                             | <http://dbpedia.org/resource/PolymiRTS>                                                                                 |
| "PSSRdb"@en                                                                | <http://dbpedia.org/resource/Polymorphic_simple_sequence_repeats_database>                                              |
| "ProSAS"@en                                                                | <http://dbpedia.org/resource/ProSAS>                                                                                    |
| "ProtCID"@en                                                               | <http://dbpedia.org/resource/ProtCID>                                                                                   |
| "PRIDB"@en                                                                 | <http://dbpedia.org/resource/Protein-RNA_interface_database>                                                            |
| "The Protein Data Bank."@en                                                | <http://dbpedia.org/resource/Protein_Data_Bank>                                                                         |
| "PCDDB"@en                                                                 | <http://dbpedia.org/resource/Protein_circular_dichroism_data_bank>                                                      |
| "Pseudogene.org"@en                                                        | <http://dbpedia.org/resource/Pseudogene_%28database%29>                                                                 |
| "Pseudomonas Genome Database"@en                                           | <http://dbpedia.org/resource/Pseudomonas_genome_database>                                                               |
| "PubChem"@en                                                               | <http://dbpedia.org/resource/PubChem>                                                                                   |
| "PubMed"@en                                                                | <http://dbpedia.org/resource/PubMed>                                                                                    |
| "REDfly"@en                                                                | <http://dbpedia.org/resource/REDfly>                                                                                    |
| "REPAIRtoire"@en                                                           | <http://dbpedia.org/resource/REPAIRtoire>                                                                               |
| "RIKEN integrated database of mammals."@en                                 | <http://dbpedia.org/resource/RIKEN_integrated_database_of_mammals>                                                      |
| "RBPDB"@en                                                                 | <http://dbpedia.org/resource/RNA-binding_protein_database>                                                              |
| "RNA helicase database."@en                                                | <http://dbpedia.org/resource/RNA_helicase_database>                                                                     |
| "RNAMDB"@en                                                                | <http://dbpedia.org/resource/RNA_modification_database>                                                                 |
| "Reactome: a database of reactions, pathways and biological processes."@en | <http://dbpedia.org/resource/Reactome>                                                                                  |
| "REBASE"@en                                                                | <http://dbpedia.org/resource/Rebase_%28database%29>                                                                     |
| "RECODE"@en                                                                | <http://dbpedia.org/resource/Recode_%28database%29>                                                                     |
| "Refseq"@en                                                                | <http://dbpedia.org/resource/RefSeq>                                                                                    |
| "RegPhos"@en                                                               | <http://dbpedia.org/resource/RegPhos>                                                                                   |
| "RegTransBase"@en                                                          | <http://dbpedia.org/resource/RegTransBase>                                                                              |
| "RegulonDB"@en                                                             | <http://dbpedia.org/resource/RegulonDB>                                                                                 |
| "RepTar"@en                                                                | <http://dbpedia.org/resource/RepTar_%28database%29>                                                                     |
| "RetrOryza"@en                                                             | <http://dbpedia.org/resource/RetrOryza>                                                                                 |
| "Rfam"@en                                                                  | <http://dbpedia.org/resource/Rfam>                                                                                      |
| "S/MARt DB"@en                                                             | <http://dbpedia.org/resource/S/MARt>                                                                                    |
| "STRING"@en                                                                | <http://dbpedia.org/resource/STRING>                                                                                    |
| "SUPERFAMILY"@en                                                           | <http://dbpedia.org/resource/SUPERFAMILY>                                                                               |
| "SeaLifeBase"@en                                                           | <http://dbpedia.org/resource/SeaLifeBase>                                                                               |
| "SMART"@en                                                                 | <http://dbpedia.org/resource/Simple_Modular_Architecture_Research_Tool>                                                 |
| "SNPSTR"@en                                                                | <http://dbpedia.org/resource/Snptstr_%28database%29>                                                                    |
| "SPIKE"@en                                                                 | <http://dbpedia.org/resource/Spike_%28database%29>                                                                      |
| "SpliceInfo"@en                                                            | <http://dbpedia.org/resource/SpliceInfo>                                                                                |
| "StarBase"@en                                                              | <http://dbpedia.org/resource/StarBase_%28database%29>                                                                   |
| "SCLD"@en                                                                  | <http://dbpedia.org/resource/Stem_cell_lineage_database>                                                                |
| "STRBase"@en                                                               | <http://dbpedia.org/resource/Strbase>                                                                                   |
| "SAHG"@en                                                                  | <http://dbpedia.org/resource/Structure_atlas_of_human_genome>                                                           |
| "SuperSweet"@en                                                            | <http://dbpedia.org/resource/SuperSweet>                                                                                |
| "SGDB"@en                                                                  | <http://dbpedia.org/resource/Synthetic_gene_database>                                                                   |
| "TIARA"@en                                                                 | <http://dbpedia.org/resource/TIARA_%28database%29>                                                                      |
| "The TIGR Plant Repeat Databases"@en                                       | <http://dbpedia.org/resource/TIGR_plant_repeat_database>                                                                |
| "TIGR Plant Transcript Assemblies database."@en                            | <http://dbpedia.org/resource/TIGR_plant_transcript_assembly_database>                                                   |
| "TMPad"@en                                                                 | <http://dbpedia.org/resource/TMPad>                                                                                     |
| "tRNADB"@en                                                                | <http://dbpedia.org/resource/TRNADB>                                                                                    |
| "TRDB-"@en                                                                 | <http://dbpedia.org/resource/Tandem_repeats_database>                                                                   |
| "TassDB"@en                                                                | <http://dbpedia.org/resource/TassDB>                                                                                    |
| "TcoF-DB"@en                                                               | <http://dbpedia.org/resource/TcoF-DB>                                                                                   |
| "ThYme"@en                                                                 | <http://dbpedia.org/resource/ThYme_%28database%29>                                                                      |
| "TADB"@en                                                                  | <http://dbpedia.org/resource/Toxin-antitoxin_database>                                                                  |
| "TRIP"@en                                                                  | <http://dbpedia.org/resource/Transient_receptor_potential_channel-interacting_protein_database>                         |
| "TranspoGene and microTranspoGene"@en                                      | <http://dbpedia.org/resource/Transpogene>                                                                               |
| "TreeFam"@en                                                               | <http://dbpedia.org/resource/TreeFam>                                                                                   |
| "U12DB"@en                                                                 | <http://dbpedia.org/resource/U12_intron_database>                                                                       |
| "The UCSC Genome Browser"@en                                               | <http://dbpedia.org/resource/UCSC_Genome_Browser>                                                                       |
| "UCbase & miRfunc"@en                                                      | <http://dbpedia.org/resource/UCbase>                                                                                    |
| "UKPMC"@en                                                                 | <http://dbpedia.org/resource/UK_PubMed_Central>                                                                         |
| "UTRdb and UTRsite"@en                                                     | <http://dbpedia.org/resource/UTRdb>                                                                                     |
| "UTRome"@en                                                                | <http://dbpedia.org/resource/UTRome>                                                                                    |
| "UgMicroSatdb"@en                                                          | <http://dbpedia.org/resource/UgMicroSatdb>                                                                              |
| "UniGene"@en                                                               | <http://dbpedia.org/resource/UniGene>                                                                                   |
| "UniPROBE"@en                                                              | <http://dbpedia.org/resource/UniPROBE>                                                                                  |
| "UniProt"@en                                                               | <http://dbpedia.org/resource/UniProt>                                                                                   |
| "UniVec"@en                                                                | <http://dbpedia.org/resource/Univec>                                                                                    |
| "VISTA Enhancer Browser"@en                                                | <http://dbpedia.org/resource/VISTA_%28comparative_genomics%29>                                                          |
| "VnD"@en                                                                   | <http://dbpedia.org/resource/Variations_and_drugs_database>                                                             |
| "VectorDB"@en                                                              | <http://dbpedia.org/resource/VectorDB>                                                                                  |
| "ViralZon"@en                                                              | <http://dbpedia.org/resource/ViralZone>                                                                                 |
| "VKCDB"@en                                                                 | <http://dbpedia.org/resource/Voltage-gated_potassium_channel_database>                                                  |
| "WebGeSTer DB"@en                                                          | <http://dbpedia.org/resource/WebGeSTer>                                                                                 |
| "WormBase"@en                                                              | <http://dbpedia.org/resource/WormBase>                                                                                  |
| "YPA"@en                                                                   | <http://dbpedia.org/resource/Yeast_promoter_atlas>                                                                      |
| "YEASTRACT"@en                                                             | <http://dbpedia.org/resource/Yeastract>                                                                                 |
|                                                                            | <http://dbpedia.org/resource/ZINC_database>                                                                             |
| "ZFIN"@en                                                                  | <http://dbpedia.org/resource/Zebrafish_Information_Network>                                                             |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

List the biological databases in the category "Systems Biology"

SPARQL query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>

SELECT ?title ?description ?uri WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  ?uri <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Systems_biology> .
  OPTIONAL {
    ?uri <http://dbpedia.org/property/title> ?title .
  }
  OPTIONAL {
    ?uri <http://dbpedia.org/property/description> ?description .
  }
} ORDER BY ?title

Results:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| title                                       | description                                                                                                  | uri                                                                   |
======================================================================================================================================================================================================================================
| "BISC"@en                                   | "Protein–protein interaction database linking structural biology with functional genomics"@en                | <http://dbpedia.org/resource/BISC_%28database%29>                     |
| "BioGRID"@en                                | "interaction data."@en                                                                                       | <http://dbpedia.org/resource/BioGRID>                                 |
| "BioModels Database"@en                     | "A database for storing, exchanging and retrieving published quantitative models of biological interest."@en | <http://dbpedia.org/resource/BioModels_Database>                      |
| "ChemProt"@en                               | "disease chemical biology database."@en                                                                      | <http://dbpedia.org/resource/ChemProt>                                |
| "ConsensusPathDB"@en                        | "human functional interaction networks."@en                                                                  | <http://dbpedia.org/resource/ConsensusPathDB>                         |
| "DIMA"@en                                   | "predicted and known interactions between protein domains"@en                                                | <http://dbpedia.org/resource/DIMA_%28database%29>                     |
| "HitPredict"@en                             | "quality assessed protein-protein interactions in nine species."@en                                          | <http://dbpedia.org/resource/HitPredict>                              |
| "KEGG"@en                                   | "The KEGG resource for deciphering the genome."@en                                                           | <http://dbpedia.org/resource/KEGG>                                    |
| "KUPS"@en                                   | "datasets of interacting and non-interacting protein pairs with associated attributions."@en                 | <http://dbpedia.org/resource/KUPS_%28database%29>                     |
| "PID"@en                                    | "Pathway Interaction Database."@en                                                                           | <http://dbpedia.org/resource/NCI-Nature_Pathway_Interaction_Database> |
| "Pathway Commons"@en                        | "biological pathways."@en                                                                                    | <http://dbpedia.org/resource/Pathway_commons>                         |
| "ProtCID"@en                                | "interactions of homologous proteins in multiple crystal forms."@en                                          | <http://dbpedia.org/resource/ProtCID>                                 |
| "REPAIRtoire"@en                            | <http://dbpedia.org/resource/DNA_repair>                                                                     | <http://dbpedia.org/resource/REPAIRtoire>                             |
| "REPAIRtoire"@en                            | <http://dbpedia.org/resource/Systems_biology>                                                                | <http://dbpedia.org/resource/REPAIRtoire>                             |
| "SPIKE"@en                                  | "highly curated human signaling pathways."@en                                                                | <http://dbpedia.org/resource/Spike_%28database%29>                    |
| "STRING"@en                                 | "Search Tool for the Retrieval of Interacting Genes/Proteins"@en                                             | <http://dbpedia.org/resource/STRING>                                  |
| "3"^^<http://www.w3.org/2001/XMLSchema#int> | "identification and classification of domain-based interactions of known three-dimensional structure."@en    | <http://dbpedia.org/resource/3did>                                    |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
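
REPAIRtoire is listed twice in the table above because the query matched two values for its description, and each match produces its own row. Below is a minimal Python sketch of collapsing such rows per URI after the fact. It assumes the results were retrieved in the standard SPARQL JSON results format. The function name merge_rows and the hand-written sample are illustrative only, not part of any library.

```python
import json
from collections import OrderedDict

# Hand-written sample in the W3C "SPARQL Query Results JSON" layout,
# mirroring the two REPAIRtoire rows and the STRING row of the table above.
SAMPLE = """
{
  "head": {"vars": ["title", "description", "uri"]},
  "results": {"bindings": [
    {"title": {"type": "literal", "xml:lang": "en", "value": "REPAIRtoire"},
     "description": {"type": "uri", "value": "http://dbpedia.org/resource/DNA_repair"},
     "uri": {"type": "uri", "value": "http://dbpedia.org/resource/REPAIRtoire"}},
    {"title": {"type": "literal", "xml:lang": "en", "value": "REPAIRtoire"},
     "description": {"type": "uri", "value": "http://dbpedia.org/resource/Systems_biology"},
     "uri": {"type": "uri", "value": "http://dbpedia.org/resource/REPAIRtoire"}},
    {"title": {"type": "literal", "xml:lang": "en", "value": "STRING"},
     "description": {"type": "literal", "xml:lang": "en",
                     "value": "Search Tool for the Retrieval of Interacting Genes/Proteins"},
     "uri": {"type": "uri", "value": "http://dbpedia.org/resource/STRING"}}
  ]}
}
"""


def merge_rows(results):
    """Group bindings by ?uri, collecting distinct titles and descriptions."""
    merged = OrderedDict()
    for row in results["results"]["bindings"]:
        uri = row["uri"]["value"]
        entry = merged.setdefault(uri, {"titles": [], "descriptions": []})
        for var, key in (("title", "titles"), ("description", "descriptions")):
            # A variable bound inside OPTIONAL is absent from the row when unbound.
            if var in row and row[var]["value"] not in entry[key]:
                entry[key].append(row[var]["value"])
    return merged


if __name__ == "__main__":
    for uri, entry in merge_rows(json.loads(SAMPLE)).items():
        print(uri, entry["titles"], entry["descriptions"])
```

Merging on the client side keeps the query itself unchanged. The same effect could probably be obtained inside the query with an aggregate, if the endpoint supports it.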

List the databases available at the NCBI

SPARQL query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>

SELECT ?title ?description ?uri WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  ?uri <http://dbpedia.org/property/center> <http://dbpedia.org/resource/National_Center_for_Biotechnology_Information> .
  OPTIONAL {
    ?uri <http://dbpedia.org/property/title> ?title .
  }
  OPTIONAL {
    ?uri <http://dbpedia.org/property/description> ?description .
  }
} ORDER BY ?title

Result:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| title                 | description                                                                                                        | uri                                                     |
========================================================================================================================================================================================================
| "BGMUT"@en            | "database of variations in the genes that encode antigens of blood group systems"@en                               | <http://dbpedia.org/resource/BGMUT>                     |
| "CDD"@en              | "Conserved Domain Database for the functional annotation of proteins."@en                                          | <http://dbpedia.org/resource/Conserved_domain_database> |
| "GenBank"@en          | "Nucleotide sequences for more than 300 000 organisms with supporting bibliographic and biological annotation."@en | <http://dbpedia.org/resource/GenBank>                   |
| "NCBI Epigenomics"@en | "epigenomic data sets."@en                                                                                         | <http://dbpedia.org/resource/NCBI_Epigenomics>          |
| "Refseq"@en           | "curated non-redundant sequence database of genomes."@en                                                           | <http://dbpedia.org/resource/RefSeq>                    |
| "UniGene"@en          | <http://dbpedia.org/resource/Transcriptome>                                                                        | <http://dbpedia.org/resource/UniGene>                   |
| "dbSNP"@en            | <http://dbpedia.org/resource/Database>                                                                             | <http://dbpedia.org/resource/DbSNP>                     |
| "dbSNP"@en            | <http://dbpedia.org/resource/Single-nucleotide_polymorphism>                                                       | <http://dbpedia.org/resource/DbSNP>                     |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

List the biological databases having a SPARQL endpoint

SPARQL query

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?uri ?endpoint WHERE {
  ?uri a <http://dbpedia.org/ontology/BiologicalDatabase> .
  ?uri <http://dbpedia.org/property/sparql> ?endpoint .
}

Result:

------------------------------------------------------------------------------------
| uri                                  | endpoint                                  |
====================================================================================
| <http://dbpedia.org/resource/ChEBI>  | <http://chebi.bio2rdf.org>                |
| <http://dbpedia.org/resource/ChEMBL> | <http://rdf.farmbio.uu.se/chembl/snorql/> |
------------------------------------------------------------------------------------
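
Queries like the ones in this post can also be sent to an endpoint from a script. Here is a sketch in Python using only the standard library. It assumes the public DBpedia endpoint lives at http://dbpedia.org/sparql, that it accepts the SPARQL protocol's "query" parameter, and that it honours an Accept header asking for JSON results. The helper names build_query_url and run_query are hypothetical, and the network call is not exercised here.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint URL; adjust to whichever SPARQL service you query.
DBPEDIA_ENDPOINT = "http://dbpedia.org/sparql"

# The last query of this post, copied as-is.
QUERY = """
SELECT ?uri ?endpoint WHERE {
  ?uri a <http://dbpedia.org/ontology/BiologicalDatabase> .
  ?uri <http://dbpedia.org/property/sparql> ?endpoint .
}
"""


def build_query_url(endpoint, query):
    """Return a GET URL carrying the query in the 'query' parameter."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(endpoint, query):
    """Send the query and return the decoded JSON result document."""
    request = urllib.request.Request(
        build_query_url(endpoint, query),
        headers={"Accept": "application/sparql-results+json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call run_query() when online.
    print(build_query_url(DBPEDIA_ENDPOINT, QUERY))
```

Keeping the URL construction separate from the HTTP call makes it easy to check the encoded query before sending anything over the network.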

That's it, Pierre