EDAM is an ontology of general bioinformatics concepts, including topics and data types, formats, identifiers and operations.
Is your specific subject of research present in this ontology (e.g "RNA-Seq") ? go and have a look at http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM. If it is not, feel free to suggest a new term in the form below. Your term might be included in the next version of the ontology and it might be used as a possible choice for the Bioinformatics Career Survey 2011/2012.
I've written a tool named "apache velocity" which parse json data and processes it with "Apache velocity" (a template engine ). The (javacc) source code is available here:
The NHLBI Exome Sequencing Project (ESP) has released a web service to query their data. "The goal of the NHLBI GO Exome Sequencing Project (ESP) is to discover novel genes and mechanisms contributing to heart, lung and blood disorders by pioneering the application of next-generation sequencing of the protein coding regions of the human genome across diverse, richly-phenotyped populations and to share these datasets and findings with the scientific community to extend and enrich the diagnosis, management and treatment of heart, lung and blood disorders.". In the current post, I'll show how I've used this web service to annotate a VCF file with this information.
The web service provided by the ESP is based on the SOAP protocol. Here is an example of the XML response:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Here is the java code running this client. It scans the VCF, calls the webservice for each variation and insert the annotation as JSON in a new column .
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
curl -s "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/EUR.2of4intersection_allele_freq.20100804.sites.vcf.gz" |\
gunzip -c |\
java -jar evsclient.jar
##fileformat=VCFv4.0
##filedat=20101112
##datarelease=20100804
##samples=629
##description="Where BI calls are present, genotypes and alleles are from BI. In there absence, UM genotypes are used. If neither are available, no genotype information is present and the alleles are from the NCBI calls."
(...)
#CHROM POS ID EVS
1 10469 rs117577454 {"start":10469,"chromosome":"1","stop":10470,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 10583 rs58108140 {"start":10583,"chromosome":"1","stop":10584,"strand":"+","snpList":[],"setOfSiteCoverageInfo":[]}
1 11508 . {"start":11508,"chromosome":"1","stop":11509,"strand":"
(...)
1 69511 . {"start":69511,"chromosome":"1","stop":69512,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"0.5","refAllele":"A","ancestralAllele":"G","filters":"PASS","clinicalLink":"unknown","positionString":"1:69511","chrPosition":69511,"alleles":"G/A","uaAlleleCounts":"1373/47","aaAlleleCounts":"880/600","totalAlleleCounts":"2253/647","uaAlleleAndCount":"G=1373/A=47","aaAlleleAndCount":"G=880/A=600","totalAlleleAndCount":"G=2253/A=647","uaMAF":3.3099,"aaMAF":40.5405,"totalMAF":22.3103,"avgSampleReadDepth":185,"geneList":"OR4F5","snpFunction":{"chromosome":"1","position":69511,"conservationScore":"1.0","conservationScoreGERP":"0.5","snpFxnList":[{"mrnaAccession":"NM_001005484","fxnClassGVS":"missense","aminoAcids":"THR,ALA","proteinPos":"141/306","cdnaPos":421,"pphPrediction":"benign","granthamScore":"58"}],"refAllele":"A","ancestralAllele":"G","firstRsId":75062661,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"G","hasAtLeastOneAccession":"true","rsIds":"rs75062661"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":69511,"avgSampleReadDepth":185.0,"totalSamplesCovered":1452,"eaSamplesCovered":712,"avgEaSampleReadDepth":157.0,"aaSamplesCovered":740,"avgAaSampleReadDepth":211.0},{"chromosome":"1","position":69512,"avgSampleReadDepth":180.0,"totalSamplesCovered":1501,"eaSamplesCovered":739,"avgEaSampleReadDepth":153.0,"aaSamplesCovered":762,"avgAaSampleReadDepth":207.0}]}
(...)
1 901923 . {"start":901923,"chromosome":"1","stop":901924,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"1.0","conservationScoreGERP":"5.0","refAllele":"C","ancestralAllele":"C","filters":"PASS","clinicalLink":"unknown","positionString":"1:901923","chrPosition":901923,"alleles":"A/C","uaAlleleCounts":"2/2542","aaAlleleCounts":"52/1934","totalAlleleCounts":"54/4476","uaAlleleAndCount":"A=2/C=2542","aaAlleleAndCount":"A=52/C=1934","totalAlleleAndCount":"A=54/C=4476","uaMAF":0.0786,"aaMAF":2.6183,"totalMAF":1.1921,"avgSampleReadDepth":35,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":901923,"conservationScore":"1.0","conservationScoreGERP":"5.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"missense","aminoAcids":"SER,ARG","proteinPos":"4/612","cdnaPos":12,"pphPrediction":"probably-damaging","granthamScore":"110"}],"refAllele":"C","ancestralAllele":"C","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"A","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":901923,"avgSampleReadDepth":35.0,"totalSamplesCovered":2280,"eaSamplesCovered":1272,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1008,"avgAaSampleReadDepth":38.0},{"chromosome":"1","position":901924,"avgSampleReadDepth":35.0,"totalSamplesCovered":2283,"eaSamplesCovered":1273,"avgEaSampleReadDepth":32.0,"aaSamplesCovered":1010,"avgAaSampleReadDepth":38.0}]}
1 902069 rs116147894 {"start":902069,"chromosome":"1","stop":902070,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"1.0","refAllele":"T","ancestralAllele":"T","filters":"PASS","clinicalLink":"unknown","positionString":"1:902069","chrPosition":902069,"alleles":"C/T","uaAlleleCounts":"2/320","aaAlleleCounts":"18/212","totalAlleleCounts":"20/532","uaAlleleAndCount":"C=2/T=320","aaAlleleAndCount":"C=18/T=212","totalAlleleAndCount":"C=20/T=532","uaMAF":0.6211,"aaMAF":7.8261,"totalMAF":3.6232,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902069,"conservationScore":"0.0","conservationScoreGERP":"1.0","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"intron","aminoAcids":"none","proteinPos":"NA","cdnaPos":-1,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"T","ancestralAllele":"T","firstRsId":0,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"C","hasAtLeastOneAccession":"true","rsIds":"none"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902069,"avgSampleReadDepth":13.0,"totalSamplesCovered":304,"eaSamplesCovered":169,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":135,"avgAaSampleReadDepth":12.0},{"chromosome":"1","position":902070,"avgSampleReadDepth":12.0,"totalSamplesCovered":338,"eaSamplesCovered":190,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":148,"avgAaSampleReadDepth":12.0}]}
1 902108 rs62639981 {"start":902108,"chromosome":"1","stop":902109,"strand":"+","snpList":[{"chromosome":"1","conservationScore":"0.0","conservationScoreGERP":"-8.7","refAllele":"C","ancestralAllele":"unknown","filters":"PASS","clinicalLink":"unknown","positionString":"1:902108","chrPosition":902108,"alleles":"T/C","uaAlleleCounts":"5/333","aaAlleleCounts":"0/248","totalAlleleCounts":"5/581","uaAlleleAndCount":"T=5/C=333","aaAlleleAndCount":"T=0/C=248","totalAlleleAndCount":"T=5/C=581","uaMAF":1.4793,"aaMAF":0.0,"totalMAF":0.8532,"avgSampleReadDepth":13,"geneList":"PLEKHN1","snpFunction":{"chromosome":"1","position":902108,"conservationScore":"0.0","conservationScoreGERP":"-8.7","snpFxnList":[{"mrnaAccession":"NM_032129","fxnClassGVS":"coding-synonymous","aminoAcids":"none","proteinPos":"36/612","cdnaPos":108,"pphPrediction":"unknown","granthamScore":"NA"}],"refAllele":"C","ancestralAllele":"unknown","firstRsId":62639981,"secondRsId":0,"filters":"PASS","clinicalLink":"unknown"},"altAlleles":"T","hasAtLeastOneAccession":"true","rsIds":"rs62639981"}],"setOfSiteCoverageInfo":[{"chromosome":"1","position":902108,"avgSampleReadDepth":13.0,"totalSamplesCovered":294,"eaSamplesCovered":170,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":124,"avgAaSampleReadDepth":13.0},{"chromosome":"1","position":902109,"avgSampleReadDepth":13.0,"totalSamplesCovered":309,"eaSamplesCovered":177,"avgEaSampleReadDepth":13.0,"aaSamplesCovered":132,"avgAaSampleReadDepth":13.0}]}
(...)
.Today, I've been asked to call the variations for a set of BAM files mapped on a reference genome without this 'chr' prefix. One way to get around this problem is to change the header for those BAM. Another way is to create a copy of the faidx file where the 'chr' prefixes have been removed (the faidx is still valid as the positions in the chromosomes didn't change):
sed 's/^chr//' hg19.fa.fai > hg19_NOPREFIX.fa.fai
and to create a symbolic link named hg19_NOPREFIX.fa pointing to the original reference:
ln -s hg19.fa hg19_NOPREFIX.fa
.
The result:
ls -lah
-rw-r--r-- 1 root root 3.0G Jan 4 2011 hg19.fa
-rw-r--r-- 1 root root 788 Jan 27 2011 hg19.fa.fai
lrwxrwxrwx 1 root root 7 Oct 20 16:12 hg19_NOPREFIX.fa -> hg19.fa
-rw-r--r-- 1 root root 713 Oct 20 16:12 hg19_NOPREFIX.fa.fai
This solution worked so far with samtools mpileup.
Knime4Bio: a set of custom nodes for the interpretation of Next Generation Sequencing data with KNIME.
Pierre Lindenbaum, Solena Le Scouarnec, Vincent Portero and Richard Redon
Summary: Here, we describe Knime4Bio, a set of custom nodes for the KNIME (The Konstanz Information Miner) interactive graphical workbench, for the interpretation of large biological datasets. We demonstrate that this tool can be utilised to quickly retrieve previously published scientific findings. Availability: http://code.google.com/p/knime4bio/.
$ cat input.tsv
#CHROM POS REF ALT GENE SAMPLE
chr1 10 A T gene1 indi1
chr1 10 A T gene1 indi2
chr1 11 C G gene1 indi2
chr2 110 C G gene2 indi3
chr3 210 A T gene3 indi1
chr3 211 C T gene3 indi2
chr3 211 C T gene3 indi3
chr3 215 C G gene3 indi3
chr3 216 C T gene3 indi3
chr4 390 C T gene4 indi1
chr4 390 C A gene4 indi3
"PostScript (PS) is an interpreted, stack-based programming language. It is best known for its use as a page description language in the electronic and desktop publishing areas."[wikipedia]. In this post, I'll show how I've used to create a simple and lightweight view
of the genome.
Introduction: just a simple postscript program
The following PS program fills a rectangular gray shape; You can display the result using ghostview, a2ps, etc...
Each Gene is a PS array holding the structure of the UCSC knownGene table, that is to say: name , chromosome, txStart, txEnd, cdsStart, cdsEnd, exonStarts, exonEnds:
Converting a string to interger (loop over each character an increase the current value)
/toInteger
{
3 dict begin
/s exch def
/i 0 def
/n 0 def
s {
n 10 mul /n exch def
s i get 48 sub n add /n exch def %48=ascii('0')
i 1 add /i exch def
} forall
n % leave n on the stack
end
} bind def
Convert a genomic position to a index on the page 'x' axis
/convertPos2pixel
{
minChromStart sub maxChromEnd minChromStart sub div screenWidth mul
} bind def
Extract the chromosome (that is to say, extract the 1st element of the current array on the stack)
/getChrom
{
1 get
} bind def
Create a hyperlink to the UCSC genome browser
/getHyperLink
{
3 dict begin
/E exch def %% END
/S exch def %% START
/C exch def %% CHROMOSOME
[ (http://genome.ucsc.edu/cgi-bin/hgTracks?position=) C (:) S toString (-) E toString (&) (&db=hg19) ] concatstringarray
end
} bind def
Paint a rectangle
/box
{
4 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
x y moveto
width 0 rlineto
0 height rlineto
width -1 mul 0 rlineto
0 height -1 mul rlineto
end
} bind def
Paint a gray gradient
/gradient
{
4 dict begin
/height exch def
/width exch def
/y exch def
/x exch def
/i 0 def
height 2 div /i exch def
0 1 height 2 div {
1 i height 2.0 div div sub setgray
newpath
x
y height 2 div i sub add
width
i 2 mul
box
closepath
fill
i 1 sub /i exch def
}for
newpath
0 setgray
0.4 setlinewidth
x y width height box
closepath
stroke
end
} bind def
Methods extracting a data about the current gene on the PS stack.
Loop over the genes and extract the lowest 5' index:
/getMinChromStart
{
3 dict begin
/genes exch def
/pos 10E9 def
/i 0 def
genes length {
genes i get getTxStart pos min /pos exch def
i 1 add /i exch def
}repeat
pos
end
} bind def
Loop over the genes and extract the highest 3' index:
/getMaxChromEnd
{
3 dict begin
/genes exch def
/pos -1E9 def
/i 0 def
genes length {
genes i get getTxEnd pos max /pos exch def
i 1 add /i exch def
}repeat
pos
end
} bind def
Painting ONE Gene
/paintGene
{
5 dict begin
/gene exch def %% the GENE argument
/midy featureHeight 2.0 div def %the middle of the row
/x0 gene getTxStart convertPos2pixel def % 5' side of the gene in pixel
/x1 gene getTxEnd convertPos2pixel def % 3' side of the gene in pixel
/i 0 def
0.1 setlinewidth
1 0 0 setrgbcolor
newpath
x0 midy moveto
x1 midy lineto
closepath
stroke
% paint ticks
0 1 x1 x0 sub ticksx div{
newpath
gene getStrand 1 eq
{
x0 ticksHeight sub i add midy ticksHeight add moveto
x0 i add midy lineto
x0 ticksHeight sub i add midy ticksHeight sub lineto
}
%else
{
x0 ticksHeight add i add midy ticksHeight add moveto
x0 i add midy lineto
x0 ticksHeight add i add midy ticksHeight sub lineto
} ifelse
stroke
i ticksx add /i exch def
} for
%paint Transcript start-end
0 0 1 setrgbcolor
newpath
gene getCdsStart convertPos2pixel
midy cdsHeight 2 div sub
gene getCdsEnd convertPos2pixel gene getCdsStart convertPos2pixel sub
cdsHeight box
closepath
fill
% loop over exons
0 /i exch def
gene getExonCount
{
gene i getExonStart convertPos2pixel
midy exonHeight 2 div sub
gene i getExonEnd convertPos2pixel gene i getExonStart convertPos2pixel sub
exonHeight gradient
i 1 add /i exch def
} repeat
0 setgray
gene getTxEnd convertPos2pixel 10 add midy moveto
gene getKgName show
%URL
[ /Rect [x0 0 x1 1 add featureHeight]
/Border [1 0 0]
/Color [1 0 0]
/Action << /Subtype /URI /URI gene getChrom gene getTxStart gene getTxEnd getHyperLink >>
/Subtype /Link
/ANN pdfmark
end
} bind def
Paint all Genes
/paintGenes
{
3 dict begin
/genes exch def %the GENE argument (an array)
/i 0 def % loop iterator
/j 0 def % row iterator
% draw 10 vertical lines
i 0 /i exch def
0 setgray
0 1 10 {
%draw a vertical line
screenWidth 10 div i mul 0 moveto
screenWidth 10 div i mul screenHeight lineto
stroke
% print the position at the top rotate by 90°
screenWidth 10 div i mul 10 add screenHeight 5 sub moveto
-90 rotate
maxChromEnd minChromStart sub i 10 div mul minChromStart add toString show
90 rotate
i 1 add /i exch def
} for
0 /i exch def
genes length {
genes i get isVisible
{
gsave
0 j featureHeight 2 add mul translate
genes i get paintGene
j 1 add /j exch def
grestore
} if
i 1 add /i exch def
}repeat
end
} bind def
All in one: the postscript code
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Tabix is a software that is part of the samtools package.
After indexing a file, tabix is able to quickly retrieve data lines overlapping genomic regions (see also my previous post about tabix). Here, I wrote a tool named jointabix that joins the data of a (chrom/start/end) file with a file indexed with tabix. I've posted the code on github at: https://github.com/lindenb/samtools-utilities/blob/master/src/jointabix.c.
In January 2011, I started the project Template:Infobox_biodatabase. The goal of this project is the annotation of the biological databases in wikipedia using an infobox. The pages annotated with this template have now been integrated into DBpedia 3.7 and it is now possible to query the data through a SPARQL endpoint. (Note: during the process of writing the new pages in wikipedia, a few articles have been proposed for deletion for notability reasons: I din't fight against the choise of the WP editors).