18 July 2013

Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.

Let's annotate a VCF file with the data from the NCBI biosystem.

First the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi;biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene"

#!/bin/bash
CURLOPT=" "
LC_ALL=C join -t ' ' -1 1 -2 1 <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/biosystems_gene.gz" | gunzip -c | LC_ALL=C sort -t ' ' -k1,1 ) <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/bsid2info.gz" | gunzip -c | cut -d ' ' -f1,4 | LC_ALL=C sort -t ' ') | LC_ALL=C sort -t ' ' -k2,2 | LC_ALL=C join -t ' ' -1 4 -2 2 <(curl ${CURLOPT} -d "query=%3CQuery%20virtualSchemaName%3D%22default%22%20formatter%3D%22TSV%22%20header%3D%220%22%20uniqueRows%3D%221%22%20count%3D%22%22%20datasetConfigVersion%3D%220.6%22%3E%3CDataset%20name%3D%22hsapiens_gene_ensembl%22%20interface%3D%22default%22%3E%3CAttribute%20name%3D%22chromosome_name%22%2F%3E%3CAttribute%20name%3D%22start_position%22%2F%3E%3CAttribute%20name%3D%22end_position%22%2F%3E%3CAttribute%20name%3D%22entrezgene%22%2F%3E%3C%2FDataset%3E%3C%2FQuery%3E" "http://www.biomart.org/biomart/martservice/result" | awk -F ' ' '($4!="")' | LC_ALL=C sort -t ' ' -k4,4) - | awk -F ' ' '{OFS=" ";print $2,$3,$4,$1,$5,$6,$7;}' | tr " " "_" | LC_ALL=C sort -t ' ' -k1,1 -k2,2n -k3,3n -k4 | bgzip -c > ncbibiosystem.bed.gz && tabix -p bed -f ncbibiosystem.bed.gz


It produces a tabix-inded BED mapping the data of 'NCBI biosystem':
$ gunzip -c ncbibiosystem.bed.gz | head
1 69091 70008 79501 106356 30 Signaling_by_GPCR
1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway
1 69091 70008 79501 119548 40 GPCR_downstream_signaling
1 69091 70008 79501 477114 30 Signal_Transduction
1 69091 70008 79501 498 40 Olfactory_transduction
1 69091 70008 79501 83087 60 Olfactory_transduction
1 367640 368634 26683 106356 30 Signaling_by_GPCR
1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway
1 367640 368634 26683 119548 40 GPCR_downstream_signaling
1 367640 368634 26683 477114 30 Signal_Transduction

I wrote a tool named VCFBed: it takes as input a VCF and annotate it with the data of a tabix-ed BED. This tool is available on github at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz :
java -jar dist/vcfbed.jar \
   TABIXFILE=~/ncbibiosystem.bed.gz \
   TAG=NCBIBIOSYS \
   FMT='($4|$5|$6|$7)'|\
grep -E '(^#CHR|NCBI)'

##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40

That's it,

Pierre

No comments: