Using the Ensembl Regulatory Build to annotate some VCF files
via UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states".
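For example, looking at chr21:33037019-33037021 returns the following screen:
[Screenshot: UCSC Genome Browser view of chr21:33037019-33037021 with the Ensembl Regulatory Build hub]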
- This hub is defined at http://ngs.sanger.ac.uk/production/ensembl/regulation/hub.txt.
- It provides data for the genomes defined in http://ngs.sanger.ac.uk/production/ensembl/regulation/genomes.txt.
- For hg19, a set of BigBed and BigWig files is defined in http://ngs.sanger.ac.uk/production/ensembl/regulation/hg19/trackDb.txt:

    track BuildOverview
    bigDataUrl http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/overview/RegBuild.bb
    parent RegBuildOverview on
    shortLabel Ensembl Reg Build
    longLabel Ensembl Regulatory annotation of regional function
    type bigBed 9
    itemRgb on
    priority 1
    visibility dense

    track TFBS_Summary
    bigDataUrl http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/overview/all_tfbs.bw
    parent RegBuildOverview on
    shortLabel TFBS Summary
    longLabel Summary of Ensembl Transcription Factor Binding Site peaks from all cell types
    type bigWig
    autoscale on
    maxHeightPixels 128:64:16
    viewLimits 0:1
    color 179,123,0
    priority 2
    visibility dense
    (...)
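
A trackDb.txt file is a series of stanzas, one per track, each made of "setting value" lines. As a rough illustration of how the track names, types and URLs could be pulled out of it, here is a minimal Python sketch. The function name `parse_trackdb` and the embedded `EXAMPLE` text (a shortened copy of the two stanzas above) are my own illustrative choices, and the parser ignores trackDb features such as includes or line continuations.

```python
def parse_trackdb(text):
    """Split trackDb text into a list of {setting: value} dicts, one per stanza."""
    stanzas = []
    current = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        if not line:
            # a blank line closes the current stanza
            if current:
                stanzas.append(current)
                current = {}
            continue
        parts = line.split(None, 1)  # setting name, then the rest of the line
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "track" and current:
            # tolerate stanzas that are not separated by a blank line
            stanzas.append(current)
            current = {}
        current[key] = value
    if current:
        stanzas.append(current)
    return stanzas


# Shortened copy of the two stanzas quoted in the post (illustration only).
EXAMPLE = """\
track BuildOverview
bigDataUrl http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/overview/RegBuild.bb
parent RegBuildOverview on
shortLabel Ensembl Reg Build
type bigBed 9

track TFBS_Summary
bigDataUrl http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/overview/all_tfbs.bw
parent RegBuildOverview on
shortLabel TFBS Summary
type bigWig
"""

tracks = parse_trackdb(EXAMPLE)
for t in tracks:
    kind = t["type"].split()[0]  # "bigBed 9" -> "bigBed"
    print(t["track"], kind, t["bigDataUrl"])
```

The first word of the `type` setting is enough to tell apart interval annotations (bigBed) from numeric signals (bigWig).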
The remote files can be queried directly with the UCSC command-line tools, for example bigWigSummary:

    $ bigWigSummary -type=mean -udcDir=. \
        "http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/segmentation_summaries/Segway_17/1.bw" \
        chr1 1 110301 1
    1.23587

I wrote a Java tool to annotate VCFs with those files. This tool uses the BigWig library for Java ( https://code.google.com/p/bigwig/ ) and is available at: https://github.com/lindenb/jvarkit/wiki/VcfEnsemblReg.
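
To give an idea of what this kind of annotation involves, here is a conceptual Python sketch of the overlap step. It is not the code of VcfEnsemblReg: the `Feature` tuple, the helper names and the feature coordinates are hypothetical, and a real tool would read the intervals from the remote BigBed file rather than from a list in memory. The one point it tries to show is the coordinate conversion: BED-style intervals are 0-based and end-exclusive, whereas the VCF POS column is 1-based.

```python
from collections import namedtuple

# BED-style interval: 0-based start, end-exclusive (as stored in BigBed files).
Feature = namedtuple("Feature", "chrom start end name")


def overlapping(features, chrom, pos):
    """Return the features covering a 1-based VCF position."""
    zero_based = pos - 1  # convert VCF POS to the 0-based BED coordinate
    return [f for f in features
            if f.chrom == chrom and f.start <= zero_based < f.end]


def annotate(info, track, features, chrom, pos):
    """Return a copy of the INFO dict with 'track' set when a feature overlaps."""
    hits = overlapping(features, chrom, pos)
    if not hits:
        return info
    annotated = dict(info)
    annotated[track] = ",".join(f.name for f in hits)
    return annotated


# Hypothetical interval: the name is taken from the post's output,
# but the start/end coordinates are invented for this illustration.
features = [Feature("chr21", 33037000, 33037100, "ctcf_45704|CTCFBindingSite")]

info = annotate({}, "BuildOverview", features, "chr21", 33037029)
print(info)
```

For BigWig tracks the same idea applies, except that the value attached to the variant is the numeric signal at that position instead of a feature name.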
Here is an example with the following VCF:

    ##fileformat=VCFv4.1
    (...)
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
    chr21 33037029 . C T 6.20 . . GT:PL:DP:GQ 1/1:35,3,0:1:4

VcfEnsemblReg is invoked:

    $ java -jar dist/vcfensemblreg.jar in.vcf > out.vcf

Here is the content of out.vcf:

    ##fileformat=VCFv4.1
    ##INFO=<ID=AP2ALPHA,Number=1,Type=Float,Description="Overlap summary of AP2ALPHA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2ALPHA.bw">
    ##INFO=<ID=AP2GAMMA,Number=1,Type=Float,Description="Overlap summary of AP2GAMMA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2GAMMA.bw">
    ##INFO=<ID=ATF3,Number=1,Type=Float,Description="Overlap summary of ATF3 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/ATF3.bw">
    ##INFO=<ID=BAF155,Number=1,Type=Float,Description="Overlap summary of BAF155 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF155.bw">
    ##INFO=<ID=BAF170,Number=1,Type=Float,Description="Overlap summary of BAF170 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF170.bw">
    (...)
    #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample
    chr21 33037029 . C T 6.20 . BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;Segway_17_14=7.0;Segway_17_24=3.0;Segway_17_6=1.0;Segway_17_7=2.0;Segway_17_8=1.0;Segway_17_A549_projected=ctcf_45704|InactiveRegions;Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;Segway_17_DND41_projected=ctcf_45704|InactiveRegions;Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;Segway_17_GM12878_projected=ctcf_45704|InactiveRegions;Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite;Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF;Segway_17_HELAS3_projected=ctcf_45704|InactiveRegions;Segway_17_HELAS3_segments=6_distal_76536|DistalEnhancer;Segway_17_HEPG2_projected=ctcf_45704|InactiveRegions;Segway_17_HEPG2_segments=1_distal_21535|DistalEnhancer;Segway_17_HMEC_projected=ctcf_45704|InactiveRegions;Segway_17_HMEC_segments=14_gene_44998|TranscriptionAssociated;Segway_17_HSMMT_projected=ctcf_45704|InactiveRegions;Segway_17_HSMMT_segments=24_gene_70780|TranscriptionAssociated;Segway_17_HSMM_projected=ctcf_45704|InactiveRegions;Segway_17_HSMM_segments=24_gene_80902|TranscriptionAssociated;Segway_17_HUVEC_projected=ctcf_45704|InactiveRegions;Segway_17_K562_projected=ctcf_45704|InactiveRegions;Segway_17_K562_segments=14_gene_68692|TranscriptionAssociated;Segway_17_MONO_projected=ctcf_45704|InactiveRegions;Segway_17_MONO_segments=14_gene_35200|TranscriptionAssociated;Segway_17_NHA_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_segments=14_gene_57366|TranscriptionAssociated;Segway_17_NHEK_projected=ctcf_45704|InactiveRegions;Segway_17_NHEK_segments=24_gene_95458|TranscriptionAssociated;Segway_17_NHLF_projected=ctcf_45704|InactiveRegions;Segway_17_NHLF_segments=14_gene_59524|TranscriptionAssociated;Segway_17_OSTEO_projected=ctcf_45704|InactiveRegions;Segway_17_OSTEO_segments=14_gene_61575|TranscriptionAssociated GT:PL:DP:GQ 1/1:35,3,0:1:4

Here are the new fields in the INFO column:

    Segway_17_1                  3.0
    Segway_17_14                 7.0
    Segway_17_24                 3.0
    Segway_17_6                  1.0
    Segway_17_7                  2.0
    Segway_17_8                  1.0
    Segway_17_A549_projected     ctcf_45704|InactiveRegions
    Segway_17_A549_segments      14_gene_79558|TranscriptionAssociated
    Segway_17_DND41_projected    ctcf_45704|InactiveRegions
    Segway_17_DND41_segments     1_distal_17115|DistalEnhancer
    Segway_17_GM12878_projected  ctcf_45704|InactiveRegions
    Segway_17_GM12878_segments   1_distal_29075|DistalEnhancer
    Segway_17_H1HESC_projected   ctcf_45704|ActiveCTCFBindingSite
    Segway_17_H1HESC_segments    8_ctcf_27831|DistalCTF
    Segway_17_HELAS3_projected   ctcf_45704|InactiveRegions
    Segway_17_HELAS3_segments    6_distal_76536|DistalEnhancer
    Segway_17_HEPG2_projected    ctcf_45704|InactiveRegions
    Segway_17_HEPG2_segments     1_distal_21535|DistalEnhancer
    Segway_17_HMEC_projected     ctcf_45704|InactiveRegions
    Segway_17_HMEC_segments      14_gene_44998|TranscriptionAssociated
    Segway_17_HSMMT_projected    ctcf_45704|InactiveRegions
    Segway_17_HSMMT_segments     24_gene_70780|TranscriptionAssociated
    Segway_17_HSMM_projected     ctcf_45704|InactiveRegions
    Segway_17_HSMM_segments      24_gene_80902|TranscriptionAssociated
    Segway_17_HUVEC_projected    ctcf_45704|InactiveRegions
    Segway_17_K562_projected     ctcf_45704|InactiveRegions
    Segway_17_K562_segments      14_gene_68692|TranscriptionAssociated
    Segway_17_MONO_projected     ctcf_45704|InactiveRegions
    Segway_17_MONO_segments      14_gene_35200|TranscriptionAssociated
    Segway_17_NHA_projected      ctcf_45704|InactiveRegions
    Segway_17_NHDFAD_projected   ctcf_45704|InactiveRegions
    Segway_17_NHDFAD_segments    14_gene_57366|TranscriptionAssociated
    Segway_17_NHEK_projected     ctcf_45704|InactiveRegions
    Segway_17_NHEK_segments      24_gene_95458|TranscriptionAssociated
    Segway_17_NHLF_projected     ctcf_45704|InactiveRegions
    Segway_17_NHLF_segments      14_gene_59524|TranscriptionAssociated
    Segway_17_OSTEO_projected    ctcf_45704|InactiveRegions
    Segway_17_OSTEO_segments     14_gene_61575|TranscriptionAssociated
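
Since the INFO column is just a list of `key=value` pairs separated by semicolons, the fields listed above are easy to get back programmatically. The following Python sketch splits an INFO string into a dictionary and then tallies the per-cell-type segment labels. The helper names are hypothetical, the `INFO` constant is a shortened copy of the record above, and reading the part after the `|` as a label is my interpretation of the output rather than something documented here.

```python
from collections import Counter


def parse_info(info):
    """Turn a VCF INFO string into a {key: value} dict (flags map to True)."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def segment_labels(fields):
    """Map each cell type to the label of its Segway_17_<CELL>_segments field."""
    prefix, suffix = "Segway_17_", "_segments"
    labels = {}
    for key, value in fields.items():
        if key.startswith(prefix) and key.endswith(suffix) and isinstance(value, str):
            cell = key[len(prefix):-len(suffix)]
            # keep what follows the first '|', e.g. "DistalEnhancer"
            labels[cell] = value.split("|", 1)[-1]
    return labels


# Shortened copy of the INFO column shown above (illustration only).
INFO = ("BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;"
        "Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;"
        "Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;"
        "Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;"
        "Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF")

fields = parse_info(INFO)
labels = segment_labels(fields)
print(fields["BuildOverview"])
print(Counter(labels.values()).most_common())
```

Counting the labels across cell types is one simple way to see whether a variant sits in a region that is annotated consistently or differently from one cell line to the next.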
OK, now I've got a VCF containing those 'Ensembl Regulatory' annotations. What can I do with this? I currently have no idea :-)
That's it,
Pierre