01 December 2011
Suggest some new terms for the EDAM Ontology for Bioinformatics
EDAM is an ontology of general bioinformatics concepts, including topics and data types, formats, identifiers and operations.
Is your specific subject of research present in this ontology (e.g. "RNA-Seq")? Have a look at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM. If it is not, feel free to suggest a new term in the form below. Your term might be included in the next version of the ontology and might be offered as a possible choice in the Bioinformatics Career Survey 2011/2012.
That's it,
Pierre
20 November 2011
Processing json data with apache velocity.
I've written a tool, VelocityJson, which parses JSON data and processes it with Apache Velocity (a template engine). The (JavaCC) source code is available here:
https://github.com/lindenb/jsandbox/blob/master/src/sandbox/VelocityJson.jj
Example
Say you have defined some classes using JSON:
[
  {
    "type": "record",
    "name": "Exon",
    "fields" : [
      {"name": "start", "type": "int"},
      {"name": "end", "type": "int"}
    ]
  },
  {
    "type": "record",
    "name": "Gene",
    "fields" : [
      {"name": "chrom", "type": "string"},
      {"name": "name", "type": "string"},
      {"name": "txStart", "type": "int"},
      {"name": "txEnd", "type": "int"},
      {"name": "cdsStart", "type": "int"},
      {"name": "cdsEnd", "type": "int"},
      {"name": "exons", "type":{"type":"array","items":"Exon"}}
    ]
  }
]
and here is a Velocity template transforming this JSON structure to Java:
#macro(javaName $s)$s.substring(0,1).toUpperCase()$s.substring(1)#end
#macro(setter $s)set#javaName($s)#end
#macro(getter $s)get#javaName($s)#end
#macro(javaType $f)
#if($f.type.equals("string"))
java.lang.String#elseif($f.type.equals("boolean"))
boolean#elseif($f.type.equals("long"))
long#elseif($f.type.equals("float"))
float#elseif($f.type.equals("double"))
double#elseif($f.type.equals("int"))
int#elseif($f.items)
$f.items#elseif($f.type.type.equals("array"))
java.util.List<#javaType($f.type)>#else
$f.type
#end
#end
#foreach( $class in $avro)
class $class.name
{
#foreach( $field in $class.fields )
private #javaType($field) $field.name;
#end
public ${class.name}()
{
}
public ${class.name}(#foreach( $field in $class.fields )
#if($velocityCount>1),#end#javaType($field) $field.name
#end
)
{
#foreach( $field in $class.fields )
this.$field.name=$field.name;
#end
}
#foreach( $field in $class.fields )
public void #setter($field.name)(#javaType($field) $field.name)
{
this.$field.name=$field.name;
}
public #javaType($field) #getter($field.name)()
{
return this.$field.name;
}
#end
}
#end
The JSON file can be processed with Velocity using the following command line:
$ java -jar velocityjson.jar -f avro structure.json json2java.vm
Result
class Exon
{
    private int start;
    private int end;
    public Exon()
    {
    }
    public Exon(int start, int end)
    {
        this.start=start;
        this.end=end;
    }
    public void setStart(int start) { this.start=start; }
    public int getStart() { return this.start; }
    public void setEnd(int end) { this.end=end; }
    public int getEnd() { return this.end; }
}
class Gene
{
    private java.lang.String chrom;
    private java.lang.String name;
    private int txStart;
    private int txEnd;
    private int cdsStart;
    private int cdsEnd;
    private java.util.List<Exon> exons;
    public Gene()
    {
    }
    public Gene(java.lang.String chrom, java.lang.String name, int txStart, int txEnd, int cdsStart, int cdsEnd, java.util.List<Exon> exons)
    {
        this.chrom=chrom;
        this.name=name;
        this.txStart=txStart;
        this.txEnd=txEnd;
        this.cdsStart=cdsStart;
        this.cdsEnd=cdsEnd;
        this.exons=exons;
    }
    public void setChrom(java.lang.String chrom) { this.chrom=chrom; }
    public java.lang.String getChrom() { return this.chrom; }
    public void setName(java.lang.String name) { this.name=name; }
    public java.lang.String getName() { return this.name; }
    public void setTxStart(int txStart) { this.txStart=txStart; }
    public int getTxStart() { return this.txStart; }
    public void setTxEnd(int txEnd) { this.txEnd=txEnd; }
    public int getTxEnd() { return this.txEnd; }
    public void setCdsStart(int cdsStart) { this.cdsStart=cdsStart; }
    public int getCdsStart() { return this.cdsStart; }
    public void setCdsEnd(int cdsEnd) { this.cdsEnd=cdsEnd; }
    public int getCdsEnd() { return this.cdsEnd; }
    public void setExons(java.util.List<Exon> exons) { this.exons=exons; }
    public java.util.List<Exon> getExons() { return this.exons; }
}
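For readers who do not use Velocity, the type-mapping and accessor logic of the template can be sketched in a few lines of Python. This is only an illustration of the same idea, not part of the tool: the names PRIMITIVES, java_type, java_name and accessors are hypothetical and were chosen for this sketch.

```python
# Hypothetical Python sketch of the logic in the Velocity template above.
# None of these names come from VelocityJson itself.

PRIMITIVES = {
    "string": "java.lang.String",
    "boolean": "boolean",
    "long": "long",
    "float": "float",
    "double": "double",
    "int": "int",
}


def java_type(type_spec):
    """Map a schema type (a name, or a dict such as
    {"type": "array", "items": "Exon"}) to a Java type name."""
    if isinstance(type_spec, dict):
        if type_spec.get("type") == "array":
            return "java.util.List<%s>" % java_type(type_spec["items"])
        return java_type(type_spec["type"])
    # Unknown names are assumed to be record names (e.g. "Exon").
    return PRIMITIVES.get(type_spec, type_spec)


def java_name(name):
    """Upper-case the first letter, like the #javaName macro."""
    return name[:1].upper() + name[1:]


def accessors(field):
    """Return the setter and getter source lines for one field."""
    jtype = java_type(field["type"])
    name = field["name"]
    return [
        "public void set%s(%s %s) { this.%s=%s; }"
        % (java_name(name), jtype, name, name, name),
        "public %s get%s() { return this.%s; }"
        % (jtype, java_name(name), name),
    ]


if __name__ == "__main__":
    for line in accessors({"name": "exons",
                           "type": {"type": "array", "items": "Exon"}}):
        print(line)
```

As in the template, an array of a primitive type would be rendered as a list of that primitive, which is not valid Java; the sketch keeps that limitation rather than hiding it.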
That's it,
Pierre
16 November 2011
"VCF annotation" with the NHLBI GO Exome Sequencing Project (JAX-WS)
The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders."
In the current post, I'll show how I've used this web service to annotate a VCF file with this information.
The web service provided by the ESP is based on the SOAP protocol.
Here is an example of the XML response:
We can generate the Java classes for a client invoking this web service by using ${JAVA_HOME}/bin/wsimport:
$ wsimport -keep "http://evs.gs.washington.edu/wsEVS/EVSDataQueryService?wsdl"
parsing WSDL...
generating code...
compiling code...
Here is the Java code running this client. It scans the VCF, calls the web service for each variation and inserts the annotation as JSON in a new column.
... and the makefile:
Result (some columns have been cut)
curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz" |\ gunzip -c |\ java -jar evsclient.jar ##fileformat=VCFv4.0 ##filedat=20101112 ##datarelease=20100804 ##samples=629 ##description="Where BI calls are present, genotypes and alleles are from BI. In there absence, UM genotypes are used. If neither are available, no genotype information is present and the alleles are from the NCBI calls." (...) #CHROM POS ID EVS 1 10469 rs117577454 {"start":10469,"chromosome":"1","stop":10470,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]} 1 10583 rs58108140 {"start":10583,"chromosome":"1","stop":10584,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]} 1 11508 . {"start":11508,"chromosome":"1","stop":11509,"strand":" (...) 1 69511 . {"start":69511,"chromosome":"1","stop":69512,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"0.5","refAllele":"A","ancestralAllele":"G","filters":"PASS","clinicalLink":"unknown","positionString":"1:69511","chrPosition":69511,"alleles":"G/A","uaAlleleCounts":"1373/47","aaAlleleCounts":"880/600","totalAlleleCounts":"2253/647","uaAlleleAndCount":"G=1373/A=47","aaAlleleAndCount":"G=880/A=600","totalAlleleAndCount":"G=2253/A=647","uaMAF":3.3099,"aaMAF":40.5405,"totalMAF":22.3103,"avgSampleReadDepth":185,"geneList":"OR4F5","snpFunction":{"chromosome":"1","position":69511,"conservationScore":"1.0","conservationScoreGERP":"0.5","snpFxnList":[{"mrnaAccession":"NM_001005484","fxnClassGVS":"missense","aminoAcids":"THR,ALA","proteinPos":"141/306","cdnaPos":421,"pphPrediction":"benign","granthamScore":"58"}],"refAllele":"A","ancestralAllele":"G","firstRsId":75062661,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"G","hasAtLeastOneAccession":"true","rsIds":"rs75062661"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":69511,"avgSampleReadDepth":185.0,"totalSamplesCovered":1452,"eaSamplesCovered":712,"avgEa
SampleReadDepth":157.0,"aaSamplesCovered":740,"avgAaSampleReadDepth":211.0},{"chromosome":"1","position":69512,"avgSampleReadDepth":180.0,"totalSamplesCovered":1501,"eaSamplesCovered":739,"avgEaSampleReadDepth":153.0,"aaSamplesCovered":762,"avgAaSampleReadDepth":207.0}]} (...) 1 901923 . {"start":901923,"chromosome":"1","stop":901924,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"5.0","refAllele":"C","ancestralAllele":"C","filters":"PASS","clinicalLink":"unknown","positionString":"1:901923","chrPosition":901923,"alleles":"A/C","uaAlleleCounts":"2/2542","aaAlleleCounts":"52/1934","totalAlleleCounts":"54/4476","uaAlleleAndCount":"A=2/C=2542","aaAlleleAndCount":"A=52/C=1934","totalAlleleAndCount":"A=54/C=4476","uaMAF":0.0786,"aaMAF":2.6183,"totalMAF":1.1921,"avgSampleReadDepth":35,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":901923,"conservationScore":"1.0","conservationScoreGERP":"5.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"missense","aminoAcids":"SER,ARG","proteinPos":"4/612","cdnaPos":12,"pphPrediction":"probably-damaging","granthamScore":"110"}],"refAllele":"C","ancestralAllele":"C","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"A","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":901923,"avgSampleReadDepth":35.0,"totalSamplesCovered":2280,"eaSamplesCovered":1272,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1008,"avgAaSampleReadDepth":38.0},{"chromosome":"1","position":901924,"avgSampleReadDepth":35.0,"totalSamplesCovered":2283,"eaSamplesCovered":1273,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1010,"avgAaSampleReadDepth":38.0}]} 1 902069 rs116147894 
{"start":902069,"chromosome":"1","stop":902070,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"1.0","refAllele":"T","ancestralAllele":"T","filters":"PASS","clinicalLink":"unknown","positionString":"1:902069","chrPosition":902069,"alleles":"C/T","uaAlleleCounts":"2/320","aaAlleleCounts":"18/212","totalAlleleCounts":"20/532","uaAlleleAndCount":"C=2/T=320","aaAlleleAndCount":"C=18/T=212","totalAlleleAndCount":"C=20/T=532","uaMAF":0.6211,"aaMAF":7.8261,"totalMAF":3.6232,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902069,"conservationScore":"0.0","conservationScoreGERP":"1.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"intron","aminoAcids":"none","proteinPos":"NA","cdnaPos":-1,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"T","ancestralAllele":"T","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"C","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902069,"avgSampleReadDepth":13.0,"totalSamplesCovered":304,"eaSamplesCovered":169,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":135,"avgAaSampleReadDepth":12.0},{"chromosome":"1","position":902070,"avgSampleReadDepth":12.0,"totalSamplesCovered":338,"eaSamplesCovered":190,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":148,"avgAaSampleReadDepth":12.0}]} 1 902108 rs62639981 
{"start":902108,"chromosome":"1","stop":902109,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"-8.7","refAllele":"C","ancestralAllele":"unknown","filters":"PASS","clinicalLink":"unknown","positionString":"1:902108","chrPosition":902108,"alleles":"T/C","uaAlleleCounts":"5/333","aaAlleleCounts":"0/248","totalAlleleCounts":"5/581","uaAlleleAndCount":"T=5/C=333","aaAlleleAndCount":"T=0/C=248","totalAlleleAndCount":"T=5/C=581","uaMAF":1.4793,"aaMAF":0.0,"totalMAF":0.8532,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902108,"conservationScore":"0.0","conservationScoreGERP":"-8.7","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"coding-synonymous","aminoAcids":"none","proteinPos":"36/612","cdnaPos":108,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"C","ancestralAllele":"unknown","firstRsId":62639981,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"T","hasAtLeastOneAccession":"true","rsIds":"rs62639981"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902108,"avgSampleReadDepth":13.0,"totalSamplesCovered":294,"eaSamplesCovered":170,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":124,"avgAaSampleReadDepth":13.0},{"chromosome":"1","position":902109,"avgSampleReadDepth":13.0,"totalSamplesCovered":309,"eaSamplesCovered":177,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":132,"avgAaSampleReadDepth":13.0}]}
(...)
That's it,
Pierre
01 November 2011
The paper about BioStar has been published in "PLoS Computational Biology"
The article describing BioStar has been published in PLoS Computational Biology:
BioStar: An Online Question & Answer Resource for the Bioinformatics Community
Laurence D. Parnell, Pierre Lindenbaum, Khader Shameer, Giovanni Marco Dall'Olio, Daniel C. Swan, Lars Juhl Jensen, Simon J. Cockell, Brent S. Pedersen, Mary E. Mangan, Christopher A. Miller, Istvan Albert. 2011
PLoS Comput Biol 7(10)
http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1002216
Giovanni has already blogged about this paper here, and on my side, I've collected some tweets about this paper.
Many thanks to all the Biostar users and to the contributors of this paper.
That's it
Pierre
21 October 2011
A reference genome with or without the 'chr' prefix
The names of the chromosomes in the fasta files for the human genome are prefixed with 'chr':
$ grep ">" hg19.fa
>chr1
>chr2
>chr3
>chr4
>chr5
>chr6
(...)
The FAIDX index for this fasta file looks like this:
chr1 249250621 6 50 51
chr2 243199373 254235646 50 51
chr3 198022430 502299013 50 51
chr4 191154276 704281898 50 51
chr5 180915260 899259266 50 51
chr6 171115067 1083792838 50 51
(...)
Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header of those BAM files. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.fai
and to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
ln -s hg19.fa hg19_NOPREFIX.fa
The result:
ls -lah
-rw-r--r-- 1 root root 3.0G Jan 4 2011 hg19.fa
-rw-r--r-- 1 root root 788 Jan 27 2011 hg19.fa.fai
lrwxrwxrwx 1 root root 7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa
-rw-r--r-- 1 root root 713 Oct 20 16:12 hg19_NOPREFIX.fa.fai
This solution worked so far with samtools mpileup.
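For environments without sed, the same index rewrite can be sketched in Python. The function name strip_chr_prefix is hypothetical; the sketch only renames the sequences and leaves the length, offset and line-width columns untouched, which is why the copied index still points at the right bytes of the original fasta file.

```python
# Minimal sketch (hypothetical helper, not a samtools feature): remove a
# leading 'chr' from the sequence name of each .fai line, like
# sed 's/^chr//'.

def strip_chr_prefix(fai_lines):
    """Return a list of .fai lines with a leading 'chr' removed."""
    renamed = []
    for line in fai_lines:
        if line.startswith("chr"):
            line = line[3:]
        renamed.append(line)
    return renamed


if __name__ == "__main__":
    example = [
        "chr1\t249250621\t6\t50\t51\n",
        "chr2\t243199373\t254235646\t50\t51\n",
    ]
    for line in strip_chr_prefix(example):
        print(line, end="")
```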
That's it,
Pierre
07 October 2011
Knime4Bio: a set of custom nodes for the interpretation of NGS data with KNIME
Our paper has just been published in Bioinformatics :-)
http://bioinformatics.oxfordjournals.org/content/early/2011/10/07/bioinformatics.btr554.abstract
Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME.
Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon
Summary: Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings.
Availability: http://code.google.com/p/knime4bio/.
That's it,
Pierre
05 October 2011
Grouping mutations/Gene=f(sample)
GroupByGene is a small C++ tool that groups mutations by gene and counts them for each sample. It reads tabular data containing the following columns:
- CHROM
- POS
- REF
- GENE
- SAMPLE
Example:
$ cat input.tsv
#CHROM  POS  REF  ALT  GENE   SAMPLE
chr1    10   A    T    gene1  indi1
chr1    10   A    T    gene1  indi2
chr1    11   C    G    gene1  indi2
chr2    110  C    G    gene2  indi3
chr3    210  A    T    gene3  indi1
chr3    211  C    T    gene3  indi2
chr3    211  C    T    gene3  indi3
chr3    215  C    G    gene3  indi3
chr3    216  C    T    gene3  indi3
chr4    390  C    T    gene4  indi1
chr4    390  C    A    gene4  indi3
Calling "groupbygene":
$ groupbygene --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 < input.tsv
GENE | CHROM | START | END | count SAMPLES | distinct MUTATIONS | count(indi1) | count(indi2) | count(indi3) |
gene1 | chr1 | 10 | 11 | 2 | 2 | 1 | 2 | 0 |
gene2 | chr2 | 110 | 110 | 1 | 1 | 0 | 0 | 1 |
gene3 | chr3 | 210 | 216 | 3 | 4 | 1 | 1 | 3 |
gene4 | chr4 | 390 | 390 | 2 | 2 | 1 | 0 | 1 |
$ groupbygene --chrom 1 --pos 2 --ref 3 --alt 4 --sample 6 --gene 5 --norefalt < input.tsv
GENE | CHROM | START | END | count SAMPLES | distinct MUTATIONS | count(indi1) | count(indi2) | count(indi3) |
gene1 | chr1 | 10 | 11 | 2 | 2 | 1 | 2 | 0 |
gene2 | chr2 | 110 | 110 | 1 | 1 | 0 | 0 | 1 |
gene3 | chr3 | 210 | 216 | 3 | 4 | 1 | 1 | 3 |
gene4 | chr4 | 390 | 390 | 2 | 1 | 1 | 0 | 1 |
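The grouping shown in the two tables can be sketched in Python as follows. This is not the C++ implementation: the function name group_by_gene and its use_ref_alt flag are hypothetical, and two behaviours are assumptions read off the tables above, namely that count(sample) counts input rows and that --norefalt identifies a mutation by chromosome and position only.

```python
# Hypothetical sketch of the grouping; not the groupbygene source code.
from collections import OrderedDict


def group_by_gene(rows, use_ref_alt=True):
    """rows: iterable of (chrom, pos, ref, alt, gene, sample) tuples.
    Returns an ordered mapping gene -> summary dictionary."""
    genes = OrderedDict()
    for chrom, pos, ref, alt, gene, sample in rows:
        entry = genes.setdefault(gene, {
            "chrom": chrom, "start": pos, "end": pos,
            "mutations": set(), "per_sample": {},
        })
        entry["start"] = min(entry["start"], pos)
        entry["end"] = max(entry["end"], pos)
        # Assumption: without REF/ALT a mutation is just (chrom, pos).
        key = (chrom, pos, ref, alt) if use_ref_alt else (chrom, pos)
        entry["mutations"].add(key)
        # Assumption: the per-sample count is a count of input rows.
        entry["per_sample"][sample] = entry["per_sample"].get(sample, 0) + 1
    return genes


# The rows of input.tsv shown above.
ROWS = [
    ("chr1", 10, "A", "T", "gene1", "indi1"),
    ("chr1", 10, "A", "T", "gene1", "indi2"),
    ("chr1", 11, "C", "G", "gene1", "indi2"),
    ("chr2", 110, "C", "G", "gene2", "indi3"),
    ("chr3", 210, "A", "T", "gene3", "indi1"),
    ("chr3", 211, "C", "T", "gene3", "indi2"),
    ("chr3", 211, "C", "T", "gene3", "indi3"),
    ("chr3", 215, "C", "G", "gene3", "indi3"),
    ("chr3", 216, "C", "T", "gene3", "indi3"),
    ("chr4", 390, "C", "T", "gene4", "indi1"),
    ("chr4", 390, "C", "A", "gene4", "indi3"),
]

if __name__ == "__main__":
    for gene, e in group_by_gene(ROWS).items():
        print(gene, e["chrom"], e["start"], e["end"],
              len(e["per_sample"]), len(e["mutations"]))
```

The only row that differs between the two tables is gene4: its two variants share the position 390 but have different ALT alleles, so they collapse into one mutation once REF and ALT are ignored.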
That's it,
Pierre
Verticalize: printing the input stream vertically.
A useful tool: verticalize is a small C++ tool printing the input stream vertically. The source is available on github : https://github.com/lindenb/ccsandbox/blob/master/src/verticalize.cpp.
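The idea can be sketched in a few lines of Python. This is not the C++ source: the function name verticalize_lines is hypothetical, and the layout (a ">>> N" opening line, one "$index name value" line per column, a "<<< N" closing line, with the header counted as line 1) is an assumption modelled on the output shown in this post.

```python
# Hypothetical Python sketch of 'verticalize'; layout assumed from the
# example output in this post.

def verticalize_lines(lines, delimiter="\t"):
    """Yield one block per data row, using the first line as header."""
    header = None
    for line_number, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split(delimiter)
        if header is None:
            header = fields
            continue
        yield ">>> %d" % line_number
        for index, value in enumerate(fields):
            # Rows longer than the header get a placeholder name.
            name = header[index] if index < len(header) else "?"
            yield "$%d\t%s\t%s" % (index + 1, name, value)
        yield "<<< %d" % line_number


if __name__ == "__main__":
    table = ["#CHROM\tPOS\tID\n", "1\t10327\trs112750067\n"]
    for out in verticalize_lines(table):
        print(out)
```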
An Example with 1000genomes.org :
$ curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" |\
  gunzip -c | grep -v "##" |\
  verticalize | head -n 30
>>> 2
$1 #CHROM 1
$2 POS 10327
$3 ID rs112750067
$4 REF T
$5 ALT C
$6 QUAL .
$7 FILTER PASS
$8 INFO DP=65;AF=0.208;CB=BC,NCBI
<<< 2
>>> 3
$1 #CHROM 1
$2 POS 10469
$3 ID rs117577454
$4 REF C
$5 ALT G
$6 QUAL .
$7 FILTER PASS
$8 INFO DP=2055;AF=0.020;CB=UM,BC,NCBI
<<< 3
(...)
That's it,
Pierre
26 September 2011
PostScript as a Programming Language for Bioinformatics: mynotebook
"PostScript (PS) is an interpreted, stack-based programming language. It is best known for its use as a page description language in the electronic and desktop publishing areas."[wikipedia]. In this post, I'll show how I've used it to create a simple and lightweight view of the genome.
Introduction: just a simple postscript program
The following PS program fills a rectangular gray shape. You can display the result using ghostview, a2ps, etc.
%!PS
newpath
50 50 moveto
0 100 rlineto
100 0 rlineto
0 -100 rlineto
closepath
0.5 setgray
fill
showpage
Some global variables
The page width
/screenWidth 1000 def
The page height
/screenHeight 1000 def
The minimum 5' position
/minChromStart 1E9 def
The maximum 3' position
/maxChromEnd -1 def
The size of a genomic feature
/featureHeight 20 def
The distance between two 'ticks' for drawing the orientation
/ticksx 20 def
The font size
/theFontSize 9 def
The variable knownGene is a PS array of genes.
/knownGene [ [(uc002zkr.3) (chr22) (-) 161242... ...] ] def
Each gene is a PS array holding the structure of the UCSC knownGene table, that is to say: name, chromosome, strand, txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds:
[(uc002zmh.2) (chr22) (-) 17618410 17646177 17618910 17646134
  [17618410 17619439 17621948 17623987 17625913 17629337 17630431 17646098 ]
  [17619247 17619628 17622123 17624021 17626007 17629450 17630635 17646177 ]
]
A simple command line can be used to fetch those data:
$ curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/knownGene.txt.gz" |\
  gunzip -c | grep chr22 | head -n 20 |\
  awk '{printf("[(%s) (%s) (%s) %s %s %s %s [%s] [%s] ]\n",$1,$2,$3,$4,$5,$6,$7,$9,$10);}' |\
  tr "," " " > result.txt
Some utilities
Converting a PS object to a string
/toString { 20 string cvs } bind def
Converting a string to an integer (loop over each character and increase the current value)
/toInteger {
  3 dict begin
  /s exch def
  /i 0 def
  /n 0 def
  s {
    pop % discard the character code pushed by 'forall'; it is read below with 's i get'
    n 10 mul /n exch def
    s i get 48 sub n add /n exch def %48=ascii('0')
    i 1 add /i exch def
  } forall
  n % leave n on the stack
  end
} bind def
Convert a genomic position to an index on the page 'x' axis
/convertPos2pixel { minChromStart sub maxChromEnd minChromStart sub div screenWidth mul } bind def
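Because stack-based code is hard to read at a glance, here is the same linear mapping written in Python. The function name convert_pos_to_pixel and its keyword argument are hypothetical; the arithmetic follows the PostScript procedure operator by operator.

```python
# Hypothetical Python equivalent of /convertPos2pixel:
#   pos minChromStart sub  maxChromEnd minChromStart sub  div  screenWidth mul

def convert_pos_to_pixel(pos, min_chrom_start, max_chrom_end,
                         screen_width=1000):
    """Map a genomic position onto the page 'x' axis."""
    fraction = (pos - min_chrom_start) / float(max_chrom_end - min_chrom_start)
    return fraction * screen_width


if __name__ == "__main__":
    # A position halfway between the two bounds lands in the middle of
    # the page.
    print(convert_pos_to_pixel(150, 100, 200))
```

The window must not be empty: when the minimum and maximum positions are equal the division fails, in Python as in PostScript.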
Extract the chromosome (that is to say, extract the element at index 1, the 2nd element, of the current array on the stack)
/getChrom { 1 get } bind def
Create a hyperlink to the UCSC genome browser
/getHyperLink {
  3 dict begin
  /E exch def %% END
  /S exch def %% START
  /C exch def %% CHROMOSOME
  [ (http://genome.ucsc.edu/cgi-bin/hgTracks?position=) C (:) S toString (-) E toString (&) (&db=hg19) ] concatstringarray
  end
} bind def
Paint a rectangle
/box {
  4 dict begin
  /height exch def
  /width exch def
  /y exch def
  /x exch def
  x y moveto
  width 0 rlineto
  0 height rlineto
  width -1 mul 0 rlineto
  0 height -1 mul rlineto
  end
} bind def
Paint a gray gradient
/gradient {
  5 dict begin % x, y, width, height and i
  /height exch def
  /width exch def
  /y exch def
  /x exch def
  /i 0 def
  height 2 div /i exch def
  0 1 height 2 div {
    pop % discard the loop counter pushed by 'for'
    1 i height 2.0 div div sub setgray
    newpath
    x y height 2 div i sub add width i 2 mul box
    closepath
    fill
    i 1 sub /i exch def
  } for
  newpath
  0 setgray
  0.4 setlinewidth
  x y width height box
  closepath
  stroke
  end
} bind def
Methods extracting data about the current gene on the PS stack.
Extract the transcription start:
/getTxStart { 3 get } bind def
Extract the transcription end:
/getTxEnd { 4 get } bind def
Extract the CDS start:
/getCdsStart { 5 get } bind def
Extract the CDS end:
/getCdsEnd { 6 get } bind def
Extract the strand:
/getStrand { 2 get (+) eq {1} {-1} ifelse } bind def
Get the gene name
/getKgName { 0 get } bind def
Get the number of exons:
/getExonCount { 7 get length } bind def
Get the start position of the i-th exon:
/getExonStart { 2 dict begin /i exch def /gene exch def gene 7 get i get end } bind def
Get the end position of the i-th exon:
/getExonEnd { 2 dict begin /i exch def /gene exch def gene 8 get i get end } bind def
Should we draw this gene on the page ?
/isVisible {
  1 dict begin
  /gene exch def
  minChromStart gene getTxEnd gt
  { false }
  {
    gene getTxStart maxChromEnd gt
    { false }
    { true } ifelse
  } ifelse
  end
} bind def
Methods for an array of genes
Loop over the genes and extract the lowest 5' index:
/getMinChromStart {
  3 dict begin
  /genes exch def
  /pos 10E9 def
  /i 0 def
  genes length {
    genes i get getTxStart pos min /pos exch def
    i 1 add /i exch def
  } repeat
  pos
  end
} bind def
Loop over the genes and extract the highest 3' index:
/getMaxChromEnd {
  3 dict begin
  /genes exch def
  /pos -1E9 def
  /i 0 def
  genes length {
    genes i get getTxEnd pos max /pos exch def
    i 1 add /i exch def
  } repeat
  pos
  end
} bind def
Painting ONE Gene
/paintGene {
  5 dict begin
  /gene exch def %% the GENE argument
  /midy featureHeight 2.0 div def %the middle of the row
  /x0 gene getTxStart convertPos2pixel def % 5' side of the gene in pixel
  /x1 gene getTxEnd convertPos2pixel def % 3' side of the gene in pixel
  /i 0 def
  0.1 setlinewidth
  1 0 0 setrgbcolor
  newpath
  x0 midy moveto
  x1 midy lineto
  closepath
  stroke
  % paint ticks
  0 1 x1 x0 sub ticksx div {
    pop % discard the loop counter pushed by 'for'
    newpath
    gene getStrand 1 eq
    {
      x0 ticksHeight sub i add midy ticksHeight add moveto
      x0 i add midy lineto
      x0 ticksHeight sub i add midy ticksHeight sub lineto
    }
    %else
    {
      x0 ticksHeight add i add midy ticksHeight add moveto
      x0 i add midy lineto
      x0 ticksHeight add i add midy ticksHeight sub lineto
    } ifelse
    stroke
    i ticksx add /i exch def
  } for
  %paint CDS start-end
  0 0 1 setrgbcolor
  newpath
  gene getCdsStart convertPos2pixel
  midy cdsHeight 2 div sub
  gene getCdsEnd convertPos2pixel gene getCdsStart convertPos2pixel sub
  cdsHeight
  box
  closepath
  fill
  % loop over exons
  0 /i exch def
  gene getExonCount {
    gene i getExonStart convertPos2pixel
    midy exonHeight 2 div sub
    gene i getExonEnd convertPos2pixel gene i getExonStart convertPos2pixel sub
    exonHeight
    gradient
    i 1 add /i exch def
  } repeat
  0 setgray
  gene getTxEnd convertPos2pixel 10 add midy moveto
  gene getKgName show
  %URL
  [ /Rect [x0 0 x1 1 add featureHeight]
    /Border [1 0 0]
    /Color [1 0 0]
    /Action << /Subtype /URI /URI gene getChrom gene getTxStart gene getTxEnd getHyperLink >>
    /Subtype /Link
    /ANN pdfmark
  end
} bind def
Paint all Genes
/paintGenes {
  3 dict begin
  /genes exch def %the GENE argument (an array)
  /i 0 def % loop iterator
  /j 0 def % row iterator
  % draw the vertical lines
  0 /i exch def
  0 setgray
  0 1 10 {
    pop % discard the loop counter pushed by 'for'
    %draw a vertical line
    screenWidth 10 div i mul 0 moveto
    screenWidth 10 div i mul screenHeight lineto
    stroke
    % print the position at the top rotate by 90°
    screenWidth 10 div i mul 10 add screenHeight 5 sub moveto
    -90 rotate
    maxChromEnd minChromStart sub i 10 div mul minChromStart add toString show
    90 rotate
    i 1 add /i exch def
  } for
  0 /i exch def
  genes length {
    genes i get isVisible {
      gsave
      0 j featureHeight 2 add mul translate
      genes i get paintGene
      j 1 add /j exch def
      grestore
    } if
    i 1 add /i exch def
  } repeat
  end
} bind def
All in one: the postscript code
Open the PS file in ghostview, evince, ...
Zooming ? Yes we can.
Ghostscript (the interpreter behind Ghostview) has an option -Sname=string, also written -sname=string: "Define a name in "systemdict" with a given string as value. This is different from -d."
In my postscript file, the default values for minChromStart and maxChromEnd are overridden by the user's parameters:
systemdict /userChromStart known {
  userChromStart toInteger /minChromStart exch def
} if
systemdict /userChromEnd known {
  userChromEnd toInteger /maxChromEnd exch def
} if
That's it,
Pierre
23 September 2011
Joining genomic annotations files with the tabix API.
Tabix is a tool that is part of the samtools package.
After indexing a file, tabix is able to quickly retrieve data lines overlapping genomic regions (see also my previous post about tabix). Here, I wrote a tool named jointabix that joins the data of a (chrom/start/end) file with a file indexed with tabix. I've posted the code on github at: https://github.com/lindenb/samtools-utilities/blob/master/src/jointabix.c.
Usage
$ jointabix -h
Usage: jointabix (options) {stdin|file|gzfiles}:
  -d column delimiter. default: TAB
  -c chromosome column (1).
  -s start column (2).
  -e end column (2).
  -i ignore lines starting with ('#').
  -t tabix file (required).
  +1 add 1 to the genomic coodinates.
  -1 remove 1 to the genomic coodinates.
Example:
In the following example, I'm going to join the SNPs from the 1000 Genomes Project with the "cytoband" database of the UCSC.
##download and index UCSC-cytobands:
$ wget -O cytoBand.txt.gz "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/cytoBand.txt.gz"
$ gunzip cytoBand.txt.gz
$ bgzip cytoBand.txt
$ curl -s "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.sites.vcf.gz" |\
  gunzip -c |\
  sed 's/^\([^#]\)/chr\1/' |\
  cut -f 1-5 |\
  jointabix -c 1 -s 2 -e 2 -1 -t cytoBand.txt.gz |\
  grep -v "##"
#CHROM POS ID REF ALT
chr1 10327 rs112750067 T C chr1 0 2300000 p36.33 gneg
chr1 10469 rs117577454 C G chr1 0 2300000 p36.33 gneg
chr1 10492 rs55998931 C T chr1 0 2300000 p36.33 gneg
chr1 10583 rs58108140 G A chr1 0 2300000 p36.33 gneg
chr1 11508 . A G chr1 0 2300000 p36.33 gneg
chr1 11565 . G T chr1 0 2300000 p36.33 gneg
chr1 12783 . G A chr1 0 2300000 p36.33 gneg
chr1 13116 . T G chr1 0 2300000 p36.33 gneg
chr1 13327 . G C chr1 0 2300000 p36.33 gneg
chr1 13980 . T C chr1 0 2300000 p36.33 gneg
chr1 14699 . C G chr1 0 2300000 p36.33 gneg
chr1 14930 . A G chr1 0 2300000 p36.33 gneg
chr1 14933 . G A chr1 0 2300000 p36.33 gneg
chr1 14948 . G A chr1 0 2300000 p36.33 gneg
chr1 15118 . A G chr1 0 2300000 p36.33 gneg
chr1 15211 . T G chr1 0 2300000 p36.33 gneg
chr1 15274 . A T chr1 0 2300000 p36.33 gneg
chr1 15820 . G T chr1 0 2300000 p36.33 gneg
chr1 16206 . T A chr1 0 2300000 p36.33 gneg
chr1 16257 . G C chr1 0 2300000 p36.33 gneg
chr1 16280 . T C chr1 0 2300000 p36.33 gneg
chr1 16298 . C T chr1 0 2300000 p36.33 gneg
chr1 16378 . T C chr1 0 2300000 p36.33 gneg
chr1 16495 . G C chr1 0 2300000 p36.33 gneg
chr1 16534 . C T chr1 0 2300000 p36.33 gneg
chr1 16841 . G T chr1 0 2300000 p36.33 gneg
chr1 28376 . G A chr1 0 2300000 p36.33 gneg
chr1 28563 . A G chr1 0 2300000 p36.33 gneg
chr1 30860 . G C chr1 0 2300000 p36.33 gneg
chr1 30885 . T C chr1 0 2300000 p36.33 gneg
chr1 30923 . G T chr1 0 2300000 p36.33 gneg
chr1 31295 . A C chr1 0 2300000 p36.33 gneg
chr1 31467 . T C chr1 0 2300000 p36.33 gneg
chr1 31487 . G A chr1 0 2300000 p36.33 gneg
chr1 40261 . C A chr1 0 2300000 p36.33 gneg
chr1 46633 . T A chr1 0 2300000 p36.33 gneg
chr1 48183 . C A chr1 0 2300000 p36.33 gneg
chr1 48186 . T G chr1 0 2300000 p36.33 gneg
chr1 49272 . G A chr1 0 2300000 p36.33 gneg
chr1 49298 . T C chr1 0 2300000 p36.33 gneg
chr1 49554 . A G chr1 0 2300000 p36.33 gneg
chr1 51479 rs116400033 T A chr1 0 2300000 p36.33 gneg
chr1 51673 . T C chr1 0 2300000 p36.33 gneg
chr1 51803 rs62637812 T C chr1 0 2300000 p36.33 gneg
chr1 51898 rs76402894 C A chr1 0 2300000 p36.33 gneg
chr1 52058 rs62637813 G C chr1 0 2300000 p36.33 gneg
chr1 52238 . T G chr1 0 2300000 p36.33 gneg
chr1 52727 . C G chr1 0 2300000 p36.33 gneg
chr1 54353 . C A chr1 0 2300000 p36.33 gneg
(...)
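To make the join itself explicit, here is a naive Python sketch that does not use tabix at all: it scans an in-memory list of intervals instead of an index, so it only suits small inputs. The function name join_with_intervals and its shift parameter (standing in for the +1 / -1 options) are hypothetical, and treating the intervals as 0-based and half-open, as in the UCSC tables, is an assumption.

```python
# Naive sketch of the join (hypothetical names; no tabix index involved).

def join_with_intervals(positions, intervals, shift=0):
    """positions: iterable of (chrom, pos) tuples.
    intervals: iterable of (chrom, start, end, ...) tuples, assumed
    0-based and half-open.
    Yields one joined tuple per position and overlapping interval."""
    by_chrom = {}
    for interval in intervals:
        by_chrom.setdefault(interval[0], []).append(tuple(interval))
    for chrom, pos in positions:
        shifted = pos + shift
        for interval in by_chrom.get(chrom, []):
            if interval[1] <= shifted < interval[2]:
                yield (chrom, pos) + interval


if __name__ == "__main__":
    bands = [("chr1", 0, 2300000, "p36.33", "gneg")]
    snps = [("chr1", 10327), ("chr1", 10469)]
    for row in join_with_intervals(snps, bands, shift=-1):
        print("\t".join(str(x) for x in row))
```

A real index such as tabix avoids the linear scan over every interval of the chromosome, which is the whole point of the C tool.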
That's it,
Pierre
11 September 2011
The Wikipedia Template:Infobox_biodatabase is now integrated in DBPedia
In January 2011, I started the project Template:Infobox_biodatabase. The goal of this project is the annotation of the biological databases in Wikipedia using an infobox. The pages annotated with this template have now been integrated into DBpedia 3.7 and it is now possible to query the data through a SPARQL endpoint.
(Note: during the process of writing the new pages in Wikipedia, a few articles were proposed for deletion for notability reasons: I didn't fight against the choice of the WP editors).
Articles in category: "Biological database"
SPARQL
List the biological databases.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT ?title ?uri
WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  OPTIONAL { ?uri dbpedia:title ?title . }
}
ORDER BY ?uri
Result:
------------------------------------------------------------
| title | uri |
============================================================
| "3did"@en | <http://dbpedia.org/resource/3did> |
| "ABCdb"@en | <http://dbpedia.org/resource/ABCdb> |
| "AREsite"@en | <http://dbpedia.org/resource/AREsite> |
| "AlloSteric Database"@en | <http://dbpedia.org/resource/ASD_%28database%29> |
| "AgBase"@en | <http://dbpedia.org/resource/AgBase> |
| "Allele frequency net"@en | <http://dbpedia.org/resource/Allele_frequency_net_database> |
| "ASTD"@en | <http://dbpedia.org/resource/Alternative_splicing_and_transcript_diversity_database> |
| "ASAP"@en | <http://dbpedia.org/resource/Alternative_splicing_annotation_project> |
| "AmoebaDB"@en | <http://dbpedia.org/resource/AmoebaDB> |
| "ArachnoServer"@en | <http://dbpedia.org/resource/ArachnoServer> |
| "ArtadeDB"@en | <http://dbpedia.org/resource/Artade> |
| "ASPicDB"@en | <http://dbpedia.org/resource/AspicDB> |
| "The Autophagy Database"@en | <http://dbpedia.org/resource/Autophagy_database> |
| "BGMUT"@en | <http://dbpedia.org/resource/BGMUT> |
| "BISC"@en | <http://dbpedia.org/resource/BISC_%28database%29> |
| "BRENDA"@en | <http://dbpedia.org/resource/BRENDA> |
| "The BRENDA Tissue Ontology (BTO)"@en | <http://dbpedia.org/resource/BRENDA_tissue_ontology> |
| | <http://dbpedia.org/resource/BindingDB> |
| "Bio2RDF"@en | <http://dbpedia.org/resource/Bio2RDF> |
| "BioGRID"@en | <http://dbpedia.org/resource/BioGRID> |
| "BioModels Database"@en | <http://dbpedia.org/resource/BioModels_Database> |
| "BSDB"@en | <http://dbpedia.org/resource/Biomolecule_stretching_database> |
| "Bovine Genome Database"@en | <http://dbpedia.org/resource/Bovine_genome_database> |
| "BriX"@en | <http://dbpedia.org/resource/Brix_%28database%29> |
| "CADgene"@en | <http://dbpedia.org/resource/CADgene> |
| "CATH"@en | <http://dbpedia.org/resource/CATH> |
| "CLIPZ:"@en | <http://dbpedia.org/resource/CLIPZ> |
| "COSMIC"@en | <http://dbpedia.org/resource/COSMIC_cancer_database> |
| "CaSNP"@en | <http://dbpedia.org/resource/CaSNP> |
| "CancerResource:"@en | <http://dbpedia.org/resource/CancerResource> |
| "cBARBEL"@en | <http://dbpedia.org/resource/Catfish_genome_database> |
| "CCDB"@en | <http://dbpedia.org/resource/Cervical_cancer_gene_database> |
| | <http://dbpedia.org/resource/ChEBI> |
| | <http://dbpedia.org/resource/ChEMBL> |
| "ChemProt"@en | <http://dbpedia.org/resource/ChemProt> |
| "ChimerDB"@en | <http://dbpedia.org/resource/ChimerDB> |
| "MDS_IES_DB"@en | <http://dbpedia.org/resource/Ciliate_MDS/IES_database> |
| "Ciona intestinalis protein database"@en | <http://dbpedia.org/resource/Ciona_intestinalis_protein_database> |
| "ACLAME"@en | <http://dbpedia.org/resource/Classification_of_mobile_genetic_elements> |
| "COMBREX: COMputational BRidges to EXperiments"@en | <http://dbpedia.org/resource/Combrex> |
| "CAMERA"@en | <http://dbpedia.org/resource/Community_Cyberinfrastructure_for_Advanced_Marine_Microbial_Ecology_Research_and_Analysis> |
| "CORG"@en | <http://dbpedia.org/resource/Comparative_regulatory_genomics_database> |
| "CPLA"@en | <http://dbpedia.org/resource/Compendium_of_protein_lysine_acetylation> |
| "Conformational dynamics data bank"@en | <http://dbpedia.org/resource/Conformational_dynamics_data_bank> |
| "ConsensusPathDB"@en | <http://dbpedia.org/resource/ConsensusPathDB> |
| "CDD"@en | <http://dbpedia.org/resource/Conserved_domain_database> |
| "DAnCER"@en | <http://dbpedia.org/resource/DAnCER_%28database%29> |
| "DBASS3 and DBASS5"@en | <http://dbpedia.org/resource/DBASS3/5> |
| "DIMA"@en | <http://dbpedia.org/resource/DIMA_%28database%29> |
| "DNA Data Bank of Japan"@en | <http://dbpedia.org/resource/DNA_Data_Bank_of_Japan> |
| "PCDB"@en | <http://dbpedia.org/resource/Database_of_protein_conformational_diversity> |
| "dbCRID"@en | <http://dbpedia.org/resource/DbCRID> |
| "dbDNV"@en | <http://dbpedia.org/resource/DbDNV> |
| "dbSNP"@en | <http://dbpedia.org/resource/DbSNP> |
| "DiProDB: a database for dinucleotide properties."@en | <http://dbpedia.org/resource/DiProDB> |
| "dictyBase"@en | <http://dbpedia.org/resource/DictyBase> |
| "DOMINE"@en | <http://dbpedia.org/resource/Domine_Database> |
| "DroID"@en | <http://dbpedia.org/resource/Droid_%28database%29> |
| "ECRbase"@en | <http://dbpedia.org/resource/ECRbase> |
| "ECgene"@en | <http://dbpedia.org/resource/ECgene> |
| "EDAS."@en | <http://dbpedia.org/resource/EDAS> |
| "EMAGE"@en | <http://dbpedia.org/resource/EMAGE> |
| "EMDataBank.org"@en | <http://dbpedia.org/resource/EM_Data_Bank> |
| "ENCODE"@en | <http://dbpedia.org/resource/ENCODE> |
| "EcoCyc"@en | <http://dbpedia.org/resource/EcoCyc> |
| "Effective-"@en | <http://dbpedia.org/resource/Effective_%28database%29> |
| "The Ensembl genome database project."@en | <http://dbpedia.org/resource/Ensembl> |
| "EID"@en | <http://dbpedia.org/resource/Exon-intron_database> |
| "ExtraTrain"@en | <http://dbpedia.org/resource/ExtraTrain> |
| "FANTOM"@en | <http://dbpedia.org/resource/FANTOM> |
| "FINDbase"@en | <http://dbpedia.org/resource/FINDbase> |
| "FREP"@en | <http://dbpedia.org/resource/FREP> |
| "FishBase"@en | <http://dbpedia.org/resource/FishBase> |
| "FlyFactorSurvey"@en | <http://dbpedia.org/resource/FlyFactorSurvey> |
| "Full-parasites"@en | <http://dbpedia.org/resource/Full-parasites> |
| "FESD"@en | <http://dbpedia.org/resource/Functional_element_SNPs_database> |
| "FGDB"@en | <http://dbpedia.org/resource/Fusarium_graminearum_genome_database> |
| "GISSD"@en | <http://dbpedia.org/resource/GISSD> |
| "GPnotebook"@en | <http://dbpedia.org/resource/GPnotebook> |
| "GPCRDB"@en | <http://dbpedia.org/resource/G_protein-coupled_receptors_database> |
| "GenBank"@en | <http://dbpedia.org/resource/GenBank> |
| "Genetic codes"@en | <http://dbpedia.org/resource/Genetic_codes_%28database%29> |
| "GlycomeDB"@en | <http://dbpedia.org/resource/GlycomeDB> |
| "GyDB of mobile genetic elements:"@en | <http://dbpedia.org/resource/Gypsy_%28database%29> |
| "The H-Invitational"@en | <http://dbpedia.org/resource/H-Invitational> |
| "HGNC"@en | <http://dbpedia.org/resource/HUGO_Gene_Nomenclature_Committee> |
| "HitPredict"@en | <http://dbpedia.org/resource/HitPredict> |
| "HOLLYWOOD"@en | <http://dbpedia.org/resource/Hollywood_%28database%29> |
| "HUMHOT"@en | <http://dbpedia.org/resource/HumHot> |
| "H-DBAS"@en | <http://dbpedia.org/resource/Human-transcriptome_database_for_alternative_splicing> |
| "Hymenoptera Genome Database"@en | <http://dbpedia.org/resource/Hymenoptera_genome_database> |
| "IGRhCellID"@en | <http://dbpedia.org/resource/IGRhCellID> |
| "IUPHAR-DB."@en | <http://dbpedia.org/resource/IUPHAR_%28database%29> |
| "InSatDb"@en | <http://dbpedia.org/resource/InSatDb> |
| | <http://dbpedia.org/resource/Indian_Genetic_Disease_Database_%28IGDD%29> |
| "InterPro"@en | <http://dbpedia.org/resource/InterPro> |
| "INTERFEROME"@en | <http://dbpedia.org/resource/Interferome> |
| "IKMC: International Knockout Mouse Consortium"@en | <http://dbpedia.org/resource/International_Knockout_Mouse_Consortium> |
| "Intronerator"@en | <http://dbpedia.org/resource/Intronerator> |
| "ISfinder"@en | <http://dbpedia.org/resource/Isfinder> |
| "Islander"@en | <http://dbpedia.org/resource/Islander_%28database%29> |
| "IsoBase"@en | <http://dbpedia.org/resource/IsoBase> |
| "KEGG"@en | <http://dbpedia.org/resource/KEGG> |
| "KUPS"@en | <http://dbpedia.org/resource/KUPS_%28database%29> |
| "KaPPA-View4"@en | <http://dbpedia.org/resource/KaPPA-View4> |
| "L1Base"@en | <http://dbpedia.org/resource/L1Base> |
| "Laminin database"@en | <http://dbpedia.org/resource/Laminin_database> |
| "LarvalBase"@en | <http://dbpedia.org/resource/LarvalBase> |
| "lncRNAdb"@en | <http://dbpedia.org/resource/LncRNAdb> |
| "LocDB"@en | <http://dbpedia.org/resource/LocDB> |
| "mESAdb"@en | <http://dbpedia.org/resource/MESAdb> |
| "MICdb"@en | <http://dbpedia.org/resource/MICdb> |
| "MPromDb"@en | <http://dbpedia.org/resource/Mammalian_promoter_database> |
| "MatrixDB, the extracellular matrix interaction database."@en | <http://dbpedia.org/resource/MatrixDB> |
| "MetaCyc"@en | <http://dbpedia.org/resource/MetaCyc> |
| "MethDB-"@en | <http://dbpedia.org/resource/MethDB> |
| "miRBase"@en | <http://dbpedia.org/resource/MiRBase> |
| "miRGator"@en | <http://dbpedia.org/resource/MiRGator> |
| "miRTarBase"@en | <http://dbpedia.org/resource/MiRTarBase> |
| "ModBase"@en | <http://dbpedia.org/resource/ModBase> |
| "The Mouse Genome Database"@en | <http://dbpedia.org/resource/Mouse_Genome_Database> |
| "The mouse Gene Expression Database"@en | <http://dbpedia.org/resource/Mouse_gene_expression_database> |
| "MIPS"@en | <http://dbpedia.org/resource/Munich_Information_Center_for_Protein_Sequences> |
| "NCBI Epigenomics"@en | <http://dbpedia.org/resource/NCBI_Epigenomics> |
| "PID"@en | <http://dbpedia.org/resource/NCI-Nature_Pathway_Interaction_Database> |
| "NGSmethDB"@en | <http://dbpedia.org/resource/NGSmethDB> |
| "neXtProt"@en | <http://dbpedia.org/resource/NeXtProt> |
| "NetPath"@en | <http://dbpedia.org/resource/Netpath> |
| "NeuroLex"@en | <http://dbpedia.org/resource/NeuroLex> |
| "Non-B DB"@en | <http://dbpedia.org/resource/Non-B_database> |
| "NPRD"@en | <http://dbpedia.org/resource/Nucleosome_positioning_region_database> |
| "OMPdb"@en | <http://dbpedia.org/resource/OMPdb> |
| "TOPSAN"@en | <http://dbpedia.org/resource/Open_protein_structure_annotation_network> |
| "ODB"@en | <http://dbpedia.org/resource/Operon_database> |
| "OriDB"@en | <http://dbpedia.org/resource/OriDB> |
| "Orientations of Proteins in Membranes"@en | <http://dbpedia.org/resource/Orientations_of_Proteins_in_Membranes_database> |
| "OrthoDB"@en | <http://dbpedia.org/resource/OrthoDB> |
| "OMA"@en | <http://dbpedia.org/resource/Orthologous_MAtrix> |
| "P2CS"@en | <http://dbpedia.org/resource/P2CS> |
| "PANDIT"@en | <http://dbpedia.org/resource/PANDIT_%28database%29> |
| "PCRPi-DB"@en | <http://dbpedia.org/resource/PCRPi-DB> |
| "PDBSum"@en | <http://dbpedia.org/resource/PDBsum> |
| "PROSITE"@en | <http://dbpedia.org/resource/PROSITE> |
| "PSORTdb"@en | <http://dbpedia.org/resource/PSORTdb> |
| "ParameciumDB"@en | <http://dbpedia.org/resource/ParameciumDB> |
| "Pathway Commons"@en | <http://dbpedia.org/resource/Pathway_commons> |
| "Patome"@en | <http://dbpedia.org/resource/Patome> |
| "PREX"@en | <http://dbpedia.org/resource/Peroxiredoxin_classification_index> |
| "Pfam"@en | <http://dbpedia.org/resource/Pfam> |
| "PhEVER"@en | <http://dbpedia.org/resource/PhEVER> |
| "PHOSIDA"@en | <http://dbpedia.org/resource/Phosida> |
| "Phospho.ELM"@en | <http://dbpedia.org/resource/Phospho.ELM> |
| "Phospho3D"@en | <http://dbpedia.org/resource/Phospho3D> |
| "PhylomeDB"@en | <http://dbpedia.org/resource/PhylomeDB> |
| "PlasmoDB"@en | <http://dbpedia.org/resource/PlasmoDB> |
| "PmiRKB"@en | <http://dbpedia.org/resource/PmiRKB> |
| "PolyQ"@en | <http://dbpedia.org/resource/PolyQ_%28database%29> |
| "PolymiRTS"@en | <http://dbpedia.org/resource/PolymiRTS> |
| "PSSRdb"@en | <http://dbpedia.org/resource/Polymorphic_simple_sequence_repeats_database> |
| "ProSAS"@en | <http://dbpedia.org/resource/ProSAS> |
| "ProtCID"@en | <http://dbpedia.org/resource/ProtCID> |
| "PRIDB"@en | <http://dbpedia.org/resource/Protein-RNA_interface_database> |
| "The Protein Data Bank."@en | <http://dbpedia.org/resource/Protein_Data_Bank> |
| "PCDDB"@en | <http://dbpedia.org/resource/Protein_circular_dichroism_data_bank> |
| "Pseudogene.org"@en | <http://dbpedia.org/resource/Pseudogene_%28database%29> |
| "Pseudomonas Genome Database"@en | <http://dbpedia.org/resource/Pseudomonas_genome_database> |
| "PubChem"@en | <http://dbpedia.org/resource/PubChem> |
| "PubMed"@en | <http://dbpedia.org/resource/PubMed> |
| "REDfly"@en | <http://dbpedia.org/resource/REDfly> |
| "REPAIRtoire"@en | <http://dbpedia.org/resource/REPAIRtoire> |
| "RIKEN integrated database of mammals."@en | <http://dbpedia.org/resource/RIKEN_integrated_database_of_mammals> |
| "RBPDB"@en | <http://dbpedia.org/resource/RNA-binding_protein_database> |
| "RNA helicase database."@en | <http://dbpedia.org/resource/RNA_helicase_database> |
| "RNAMDB"@en | <http://dbpedia.org/resource/RNA_modification_database> |
| "Reactome: a database of reactions, pathways and biological processes."@en | <http://dbpedia.org/resource/Reactome> |
| "REBASE"@en | <http://dbpedia.org/resource/Rebase_%28database%29> |
| "RECODE"@en | <http://dbpedia.org/resource/Recode_%28database%29> |
| "Refseq"@en | <http://dbpedia.org/resource/RefSeq> |
| "RegPhos"@en | <http://dbpedia.org/resource/RegPhos> |
| "RegTransBase"@en | <http://dbpedia.org/resource/RegTransBase> |
| "RegulonDB"@en | <http://dbpedia.org/resource/RegulonDB> |
| "RepTar"@en | <http://dbpedia.org/resource/RepTar_%28database%29> |
| "RetrOryza"@en | <http://dbpedia.org/resource/RetrOryza> |
| "Rfam"@en | <http://dbpedia.org/resource/Rfam> |
| "S/MARt DB"@en | <http://dbpedia.org/resource/S/MARt> |
| "STRING"@en | <http://dbpedia.org/resource/STRING> |
| "SUPERFAMILY"@en | <http://dbpedia.org/resource/SUPERFAMILY> |
| "SeaLifeBase"@en | <http://dbpedia.org/resource/SeaLifeBase> |
| "SMART"@en | <http://dbpedia.org/resource/Simple_Modular_Architecture_Research_Tool> |
| "SNPSTR"@en | <http://dbpedia.org/resource/Snptstr_%28database%29> |
| "SPIKE"@en | <http://dbpedia.org/resource/Spike_%28database%29> |
| "SpliceInfo"@en | <http://dbpedia.org/resource/SpliceInfo> |
| "StarBase"@en | <http://dbpedia.org/resource/StarBase_%28database%29> |
| "SCLD"@en | <http://dbpedia.org/resource/Stem_cell_lineage_database> |
| "STRBase"@en | <http://dbpedia.org/resource/Strbase> |
| "SAHG"@en | <http://dbpedia.org/resource/Structure_atlas_of_human_genome> |
| "SuperSweet"@en | <http://dbpedia.org/resource/SuperSweet> |
| "SGDB"@en | <http://dbpedia.org/resource/Synthetic_gene_database> |
| "TIARA"@en | <http://dbpedia.org/resource/TIARA_%28database%29> |
| "The TIGR Plant Repeat Databases"@en | <http://dbpedia.org/resource/TIGR_plant_repeat_database> |
| "TIGR Plant Transcript Assemblies database."@en | <http://dbpedia.org/resource/TIGR_plant_transcript_assembly_database> |
| "TMPad"@en | <http://dbpedia.org/resource/TMPad> |
| "tRNADB"@en | <http://dbpedia.org/resource/TRNADB> |
| "TRDB-"@en | <http://dbpedia.org/resource/Tandem_repeats_database> |
| "TassDB"@en | <http://dbpedia.org/resource/TassDB> |
| "TcoF-DB"@en | <http://dbpedia.org/resource/TcoF-DB> |
| "ThYme"@en | <http://dbpedia.org/resource/ThYme_%28database%29> |
| "TADB"@en | <http://dbpedia.org/resource/Toxin-antitoxin_database> |
| "TRIP"@en | <http://dbpedia.org/resource/Transient_receptor_potential_channel-interacting_protein_database> |
| "TranspoGene and microTranspoGene"@en | <http://dbpedia.org/resource/Transpogene> |
| "TreeFam"@en | <http://dbpedia.org/resource/TreeFam> |
| "U12DB"@en | <http://dbpedia.org/resource/U12_intron_database> |
| "The UCSC Genome Browser"@en | <http://dbpedia.org/resource/UCSC_Genome_Browser> |
| "UCbase & miRfunc"@en | <http://dbpedia.org/resource/UCbase> |
| "UKPMC"@en | <http://dbpedia.org/resource/UK_PubMed_Central> |
| "UTRdb and UTRsite"@en | <http://dbpedia.org/resource/UTRdb> |
| "UTRome"@en | <http://dbpedia.org/resource/UTRome> |
| "UgMicroSatdb"@en | <http://dbpedia.org/resource/UgMicroSatdb> |
| "UniGene"@en | <http://dbpedia.org/resource/UniGene> |
| "UniPROBE"@en | <http://dbpedia.org/resource/UniPROBE> |
| "UniProt"@en | <http://dbpedia.org/resource/UniProt> |
| "UniVec"@en | <http://dbpedia.org/resource/Univec> |
| "VISTA Enhancer Browser"@en | <http://dbpedia.org/resource/VISTA_%28comparative_genomics%29> |
| "VnD"@en | <http://dbpedia.org/resource/Variations_and_drugs_database> |
| "VectorDB"@en | <http://dbpedia.org/resource/VectorDB> |
| "ViralZon"@en | <http://dbpedia.org/resource/ViralZone> |
| "VKCDB"@en | <http://dbpedia.org/resource/Voltage-gated_potassium_channel_database> |
| "WebGeSTer DB"@en | <http://dbpedia.org/resource/WebGeSTer> |
| "WormBase"@en | <http://dbpedia.org/resource/WormBase> |
| "YPA"@en | <http://dbpedia.org/resource/Yeast_promoter_atlas> |
| "YEASTRACT"@en | <http://dbpedia.org/resource/Yeastract> |
| | <http://dbpedia.org/resource/ZINC_database> |
| "ZFIN"@en | <http://dbpedia.org/resource/Zebrafish_Information_Network> |
------------------------------------------------------------
List the biological databases in the category "Systems Biology"
SPARQL
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT ?title ?description ?uri
WHERE {
  ?uri a dbpedia:BiologicalDatabase .
  ?uri <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:Systems_biology> .
  OPTIONAL { ?uri <http://dbpedia.org/property/title> ?title . }
  OPTIONAL { ?uri <http://dbpedia.org/property/description> ?description . }
}
ORDER BY ?title
Results:
------------------------------------------------------------
| title | description | uri |
============================================================
| "BISC"@en | "Protein–protein interaction database linking structural biology with functional genomics"@en | <http://dbpedia.org/resource/BISC_%28database%29> |
| "BioGRID"@en | "interaction data."@en | <http://dbpedia.org/resource/BioGRID> |
| "BioModels Database"@en | "A database for storing, exchanging and retrieving published quantitative models of biological interest."@en | <http://dbpedia.org/resource/BioModels_Database> |
| "ChemProt"@en | "disease chemical biology database."@en | <http://dbpedia.org/resource/ChemProt> |
| "ConsensusPathDB"@en | "human functional interaction networks."@en | <http://dbpedia.org/resource/ConsensusPathDB> |
| "DIMA"@en | "predicted and known interactions between protein domains"@en | <http://dbpedia.org/resource/DIMA_%28database%29> |
| "HitPredict"@en | "quality assessed protein-protein interactions in nine species."@en | <http://dbpedia.org/resource/HitPredict> |
| "KEGG"@en | "The KEGG resource for deciphering the genome."@en | <http://dbpedia.org/resource/KEGG> |
| "KUPS"@en | "datasets of interacting and non-interacting protein pairs with associated attributions."@en | <http://dbpedia.org/resource/KUPS_%28database%29> |
| "PID"@en | "Pathway Interaction Database."@en | <http://dbpedia.org/resource/NCI-Nature_Pathway_Interaction_Database> |
| "Pathway Commons"@en | "biological pathways."@en | <http://dbpedia.org/resource/Pathway_commons> |
| "ProtCID"@en | "interactions of homologous proteins in multiple crystal forms."@en | <http://dbpedia.org/resource/ProtCID> |
| "REPAIRtoire"@en | <http://dbpedia.org/resource/DNA_repair> | <http://dbpedia.org/resource/REPAIRtoire> |
| "REPAIRtoire"@en | <http://dbpedia.org/resource/Systems_biology> | <http://dbpedia.org/resource/REPAIRtoire> |
| "SPIKE"@en | "highly curated human signaling pathways."@en | <http://dbpedia.org/resource/Spike_%28database%29> |
| "STRING"@en | "Search Tool for the Retrieval of Interacting Genes/Proteins"@en | <http://dbpedia.org/resource/STRING> |
| "3"^^<http://www.w3.org/2001/XMLSchema#int> | "identification and classification of domain-based interactions of known three-dimensional structure."@en | <http://dbpedia.org/resource/3did> |
------------------------------------------------------------
List the databases available at the NCBI
SPARQL query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbpedia: <http://dbpedia.org/ontology/>
SELECT ?title ?description ?uri
WHERE {
  ?uri a <http://dbpedia.org/ontology/BiologicalDatabase> .
  ?uri <http://dbpedia.org/property/center> <http://dbpedia.org/resource/National_Center_for_Biotechnology_Information> .
  OPTIONAL { ?uri <http://dbpedia.org/property/title> ?title . }
  OPTIONAL { ?uri <http://dbpedia.org/property/description> ?description . }
}
ORDER BY ?title
Result:
------------------------------------------------------------
| title | description | uri |
============================================================
| "BGMUT"@en | "database of variations in the genes that encode antigens of blood group systems"@en | <http://dbpedia.org/resource/BGMUT> |
| "CDD"@en | "Conserved Domain Database for the functional annotation of proteins."@en | <http://dbpedia.org/resource/Conserved_domain_database> |
| "GenBank"@en | "Nucleotide sequences for more than 300 000 organisms with supporting bibliographic and biological annotation."@en | <http://dbpedia.org/resource/GenBank> |
| "NCBI Epigenomics"@en | "epigenomic data sets."@en | <http://dbpedia.org/resource/NCBI_Epigenomics> |
| "Refseq"@en | "curated non-redundant sequence database of genomes."@en | <http://dbpedia.org/resource/RefSeq> |
| "UniGene"@en | <http://dbpedia.org/resource/Transcriptome> | <http://dbpedia.org/resource/UniGene> |
| "dbSNP"@en | <http://dbpedia.org/resource/Database> | <http://dbpedia.org/resource/DbSNP> |
| "dbSNP"@en | <http://dbpedia.org/resource/Single-nucleotide_polymorphism> | <http://dbpedia.org/resource/DbSNP> |
------------------------------------------------------------
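In the NCBI result, dbSNP is listed twice (and REPAIRtoire is listed twice in the Systems Biology result): when a property such as the description has several values, a SPARQL SELECT returns one row per value. If you want one entry per database, a small post-processing step can collapse the rows. The sketch below is only an illustration; the function name `group_by_uri` is hypothetical and the rows are assumed to be already parsed into (title, description, uri) tuples.

```python
def group_by_uri(rows):
    """Collapse (title, description, uri) rows into one entry per uri."""
    grouped = {}
    for title, description, uri in rows:
        entry = grouped.setdefault(uri, {"title": title, "descriptions": []})
        # Keep every distinct description; skip unbound (None) values.
        if description is not None and description not in entry["descriptions"]:
            entry["descriptions"].append(description)
    return grouped


if __name__ == "__main__":
    # Three rows copied from the NCBI result.
    rows = [
        ("UniGene", "http://dbpedia.org/resource/Transcriptome",
         "http://dbpedia.org/resource/UniGene"),
        ("dbSNP", "http://dbpedia.org/resource/Database",
         "http://dbpedia.org/resource/DbSNP"),
        ("dbSNP", "http://dbpedia.org/resource/Single-nucleotide_polymorphism",
         "http://dbpedia.org/resource/DbSNP"),
    ]
    for uri, entry in group_by_uri(rows).items():
        print(uri, entry["title"], len(entry["descriptions"]))
```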
List the biological databases having a SPARQL endpoint
SPARQL query
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?uri ?endpoint
WHERE {
  ?uri a <http://dbpedia.org/ontology/BiologicalDatabase> .
  ?uri <http://dbpedia.org/property/sparql> ?endpoint .
}
Result:
------------------------------------------------------------
| uri | endpoint |
============================================================
| <http://dbpedia.org/resource/ChEBI> | <http://chebi.bio2rdf.org> |
| <http://dbpedia.org/resource/ChEMBL> | <http://rdf.farmbio.uu.se/chembl/snorql/> |
------------------------------------------------------------
That's it,
Pierre