Inside the Variation toolkit: annotating a VCF with the data of NCBI biosystems mapped to BED.
Let's annotate a VCF file with the data from the NCBI biosystem.
First the 'NCBI biosystem' data are mapped to a BED file using the following script. It joins "ncbi;biosystem2gene", "ncbi:biosystem-label" and "biomart-ensembl:gene"
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/bin/bash | |
CURLOPT=" " | |
LC_ALL=C join -t ' ' -1 1 -2 1 <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/biosystems_gene.gz" | gunzip -c | LC_ALL=C sort -t ' ' -k1,1 ) <(curl ${CURLOPT} "ftp://ftp.ncbi.nih.gov/pub/biosystems/biosystems.20130711/bsid2info.gz" | gunzip -c | cut -d ' ' -f1,4 | LC_ALL=C sort -t ' ') | LC_ALL=C sort -t ' ' -k2,2 | LC_ALL=C join -t ' ' -1 4 -2 2 <(curl ${CURLOPT} -d "query=%3CQuery%20virtualSchemaName%3D%22default%22%20formatter%3D%22TSV%22%20header%3D%220%22%20uniqueRows%3D%221%22%20count%3D%22%22%20datasetConfigVersion%3D%220.6%22%3E%3CDataset%20name%3D%22hsapiens_gene_ensembl%22%20interface%3D%22default%22%3E%3CAttribute%20name%3D%22chromosome_name%22%2F%3E%3CAttribute%20name%3D%22start_position%22%2F%3E%3CAttribute%20name%3D%22end_position%22%2F%3E%3CAttribute%20name%3D%22entrezgene%22%2F%3E%3C%2FDataset%3E%3C%2FQuery%3E" "http://www.biomart.org/biomart/martservice/result" | awk -F ' ' '($4!="")' | LC_ALL=C sort -t ' ' -k4,4) - | awk -F ' ' '{OFS=" ";print $2,$3,$4,$1,$5,$6,$7;}' | tr " " "_" | LC_ALL=C sort -t ' ' -k1,1 -k2,2n -k3,3n -k4 | bgzip -c > ncbibiosystem.bed.gz && tabix -p bed -f ncbibiosystem.bed.gz |
It produces a tabix-inded BED mapping the data of 'NCBI biosystem':
$ gunzip -c ncbibiosystem.bed.gz | head 1 69091 70008 79501 106356 30 Signaling_by_GPCR 1 69091 70008 79501 106383 50 Olfactory_Signaling_Pathway 1 69091 70008 79501 119548 40 GPCR_downstream_signaling 1 69091 70008 79501 477114 30 Signal_Transduction 1 69091 70008 79501 498 40 Olfactory_transduction 1 69091 70008 79501 83087 60 Olfactory_transduction 1 367640 368634 26683 106356 30 Signaling_by_GPCR 1 367640 368634 26683 106383 50 Olfactory_Signaling_Pathway 1 367640 368634 26683 119548 40 GPCR_downstream_signaling 1 367640 368634 26683 477114 30 Signal_Transduction
I wrote a tool named VCFBed: it takes as input a VCF and annotate it with the data of a tabix-ed BED. This tool is available on github at https://github.com/lindenb/jvarkit#vcfbed. Let's annotate a remote VCF with ncbibiosystem.bed.gz :
java -jar dist/vcfbed.jar \ TABIXFILE=~/ncbibiosystem.bed.gz \ TAG=NCBIBIOSYS \ FMT='($4|$5|$6|$7)'|\ grep -E '(^#CHR|NCBI)'
##INFO=<ID=NCBIBIOSYS,Number=.,Type=String,Description="metadata added from /home/lindenb/ncbibiosystem.bed.gz . Format was ($4|$5|$6|$7)">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 1094PC0005 1094PC0009 1094PC0012 1094PC0013
1 69270 . A G 2694.18 . AC=40;AF=1.000;AN=40;DP=83;Dels=0.00;EFF=SYNONYMOUS_CODING(LOW|SILENT|tcA/tcG|S60|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=0.000;HRun=0;HaplotypeScore=0.0000;InbreedingCoeff=-0.0598;MQ=31.06;MQ0=0;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=32.86 GT:AD:DP:GQ:PL ./. ./. 1/1:0,3:3:9.03:106,9,0 1/1:0,6:6:18.05:203,18,0
1 69511 . A G 77777.27 . AC=49;AF=0.875;AN=56;BaseQRankSum=0.150;DP=2816;DS;Dels=0.00;EFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Aca/Gca|T141A|305|OR4F5|protein_coding|CODING|ENST00000335137|exon_1_69091_70008);FS=21.286;HRun=0;HaplotypeScore=3.8956;InbreedingCoeff=0.0604;MQ=32.32;MQ0=0;MQRankSum=1.653;NCBIBIOSYS=(79501|119548|40|GPCR_downstream_signaling),(79501|106356|30|Signaling_by_GPCR),(79501|498|40|Olfactory_transduction),(79501|83087|60|Olfactory_transduction),(79501|477114|30|Signal_Transduction),(79501|106383|50|Olfactory_Signaling_Pathway);QD=27.68;ReadPosRankSum=2.261 GT:AD:DP:GQ:PL ./. ./. 0/1:2,4:6:15.70:16,0,40 0/1:2,2:4:21.59:22,0,40
That's it,
Pierre
No comments:
Post a Comment