30 September 2014

Using the Ensembl Regulatory Build to annotate some VCF files

via UCSC Genome Browser project announcements: "Data from the Ensembl Regulatory Build are now available in the UCSC Genome Browser as a public track hub for both hg19 and hg38. This track hub contains promoters and their flanking regions, enhancers, and many other regulatory features predicted across a number of cell lines using annotated segmentation states".
For example looking at chr21:33037019-33037021 returns the following screen:

Those new annotations are deployed by the Sanger Institute as a UCSC track hub. By the way, those file can be directly handled using the UCSC standalone tools:
$ bigWigSummary -type=mean -udcDir=.  \
  "http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/segmentation_summaries/Segway_17/1.bw" \
  chr1 1  110301 1

I wrote a java tool for the annotation of VCFs with those files. This tool uses the BigWig library for java ( https://code.google.com/p/bigwig/ ) and is available at: https://github.com/lindenb/jvarkit/wiki/VcfEnsemblReg.
Here is an example with the following VCF:
chr21 33037029 . C T 6.20 . . GT:PL:DP:GQ 1/1:35,3,0:1:4
VcfEnsemblReg is invoked:
$  java -jar dist/vcfensemblreg.jar in.vcf > out.vcf
Here is the content of out.vcf:
##INFO=<ID=AP2ALPHA,Number=1,Type=Float,Description="Overlap summary of AP2ALPHA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2ALPHA.bw">
##INFO=<ID=AP2GAMMA,Number=1,Type=Float,Description="Overlap summary of AP2GAMMA ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/AP2GAMMA.bw">
##INFO=<ID=ATF3,Number=1,Type=Float,Description="Overlap summary of ATF3 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/ATF3.bw">
##INFO=<ID=BAF155,Number=1,Type=Float,Description="Overlap summary of BAF155 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF155.bw">
##INFO=<ID=BAF170,Number=1,Type=Float,Description="Overlap summary of BAF170 ChipSeq binding peaks across available datasets http://ngs.sanger.ac.uk/production/ensembl/regulation//hg19/tfbs/BAF170.bw">
chr21 33037029 . C T 6.20 . BuildOverview=ctcf_45704|CTCFBindingSite;Segway_17_1=3.0;Segway_17_14=7.0;Segway_17_24=3.0;Segway_17_6=1.0;Segway_17_7=2.0;Segway_17_8=1.0;Segway_17_A549_projected=ctcf_45704|InactiveRegions;Segway_17_A549_segments=14_gene_79558|TranscriptionAssociated;Segway_17_DND41_projected=ctcf_45704|InactiveRegions;Segway_17_DND41_segments=1_distal_17115|DistalEnhancer;Segway_17_GM12878_projected=ctcf_45704|InactiveRegions;Segway_17_GM12878_segments=1_distal_29075|DistalEnhancer;Segway_17_H1HESC_projected=ctcf_45704|ActiveCTCFBindingSite;Segway_17_H1HESC_segments=8_ctcf_27831|DistalCTF;Segway_17_HELAS3_projected=ctcf_45704|InactiveRegions;Segway_17_HELAS3_segments=6_distal_76536|DistalEnhancer;Segway_17_HEPG2_projected=ctcf_45704|InactiveRegions;Segway_17_HEPG2_segments=1_distal_21535|DistalEnhancer;Segway_17_HMEC_projected=ctcf_45704|InactiveRegions;Segway_17_HMEC_segments=14_gene_44998|TranscriptionAssociated;Segway_17_HSMMT_projected=ctcf_45704|InactiveRegions;Segway_17_HSMMT_segments=24_gene_70780|TranscriptionAssociated;Segway_17_HSMM_projected=ctcf_45704|InactiveRegions;Segway_17_HSMM_segments=24_gene_80902|TranscriptionAssociated;Segway_17_HUVEC_projected=ctcf_45704|InactiveRegions;Segway_17_K562_projected=ctcf_45704|InactiveRegions;Segway_17_K562_segments=14_gene_68692|TranscriptionAssociated;Segway_17_MONO_projected=ctcf_45704|InactiveRegions;Segway_17_MONO_segments=14_gene_35200|TranscriptionAssociated;Segway_17_NHA_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_projected=ctcf_45704|InactiveRegions;Segway_17_NHDFAD_segments=14_gene_57366|TranscriptionAssociated;Segway_17_NHEK_projected=ctcf_45704|InactiveRegions;Segway_17_NHEK_segments=24_gene_95458|TranscriptionAssociated;Segway_17_NHLF_projected=ctcf_45704|InactiveRegions;Segway_17_NHLF_segments=14_gene_59524|TranscriptionAssociated;Segway_17_OSTEO_projected=ctcf_45704|InactiveRegions;Segway_17_OSTEO_segments=14_gene_61575|TranscriptionAssociated GT:PL:DP:GQ 1/1:35,3,0:1:4
Here are the new fields in the INFO column:
Segway_17_1 3.0
Segway_17_14 7.0
Segway_17_24 3.0
Segway_17_6 1.0
Segway_17_7 2.0
Segway_17_8 1.0
Segway_17_A549_projected ctcf_45704|InactiveRegions
Segway_17_A549_segments 14_gene_79558|TranscriptionAssociated
Segway_17_DND41_projected ctcf_45704|InactiveRegions
Segway_17_DND41_segments 1_distal_17115|DistalEnhancer
Segway_17_GM12878_projected ctcf_45704|InactiveRegions
Segway_17_GM12878_segments 1_distal_29075|DistalEnhancer
Segway_17_H1HESC_projected ctcf_45704|ActiveCTCFBindingSite
Segway_17_H1HESC_segments 8_ctcf_27831|DistalCTF
Segway_17_HELAS3_projected ctcf_45704|InactiveRegions
Segway_17_HELAS3_segments 6_distal_76536|DistalEnhancer
Segway_17_HEPG2_projected ctcf_45704|InactiveRegions
Segway_17_HEPG2_segments 1_distal_21535|DistalEnhancer
Segway_17_HMEC_projected ctcf_45704|InactiveRegions
Segway_17_HMEC_segments 14_gene_44998|TranscriptionAssociated
Segway_17_HSMMT_projected ctcf_45704|InactiveRegions
Segway_17_HSMMT_segments 24_gene_70780|TranscriptionAssociated
Segway_17_HSMM_projected ctcf_45704|InactiveRegions
Segway_17_HSMM_segments 24_gene_80902|TranscriptionAssociated
Segway_17_HUVEC_projected ctcf_45704|InactiveRegions
Segway_17_K562_projected ctcf_45704|InactiveRegions
Segway_17_K562_segments 14_gene_68692|TranscriptionAssociated
Segway_17_MONO_projected ctcf_45704|InactiveRegions
Segway_17_MONO_segments 14_gene_35200|TranscriptionAssociated
Segway_17_NHA_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_projected ctcf_45704|InactiveRegions
Segway_17_NHDFAD_segments 14_gene_57366|TranscriptionAssociated
Segway_17_NHEK_projected ctcf_45704|InactiveRegions
Segway_17_NHEK_segments 24_gene_95458|TranscriptionAssociated
Segway_17_NHLF_projected ctcf_45704|InactiveRegions
Segway_17_NHLF_segments 14_gene_59524|TranscriptionAssociated
Segway_17_OSTEO_projected ctcf_45704|InactiveRegions
Segway_17_OSTEO_segments 14_gene_61575|TranscriptionAssociated

OK, now I've got a VCF containing those 'Ensembl Regulatory' annotations. What can I do with this ? I've currently no idea :-)

That's it,

08 September 2014

It's a kind of magic(5)

The following question was recently asked on biostars.org : Tool that detects data types:"More specifically I am interested for detecting between SAM, BAM, FASTA, FASTQ (if possible descriminating between these types), BED, BED12, BED15, GFF, GFF2, GFF3). The detection should be performed without just looking at the extension of the file....":
I suggested to create a new magic file for the linux file command. As an example, a BAM file starts (position=0) with the magic bytes BAM\1. A magic file definition 'bam' for BAM would be:

0 string BAM\1 BAM file v1.0
Compile the new magic file:
file -C -m bam
Now use this magic:
file -z -m bam.mgc file.bam
file.bam: BAM file v1.0 (data)

Now, I've started a github repo containing some 'magic' patterns for bioinformatics (Fastq, blast, bam, bigwig, etc... ): .
(My current problem is to prioritize some results like differentiating a 'Nucleotide' and a 'Protein' Fasta sequence ( http://unix.stackexchange.com/questions/154096/file1-and-magic5-prioritizing-a-result).)

That's it,

03 September 2014

Parallelizing GNU #Make 4 in a #SLURM infrastructure/cluster

SLURM (https://computing.llnl.gov/linux/slurm/slurm.html) is The Simple Linux Utility for Resource Management . It's an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.

A patch for GNU Make version 3.81 is available as part of the SLURM distribution in https://github.com/SchedMD/slurm/blob/master/contribs/make.slurm.patch This patch will use SLURM to launch tasks across a job's current resource allocation.

The patch for Make 'just' wraps the command into srun: ( "srun is the command sending a parallel job on cluster managed by SLURM. )

Index: job.c
--- job.c (revision 321)
+++ job.c (working copy)
@@ -1959,6 +1959,22 @@
child_execute_job (int stdin_fd, int stdout_fd, char **argv, char **envp)
+ if (getenv("SLURM_JOB_ID")) {
+ int i;
+ static char *argx[128];
+ argx[0] = "srun";
+ argx[1] = "-N1";
+ argx[2] = "-n1";
+ for (i=0; ((i<124)&&(argv[i])); i++) {
+ argx[i+3] = argv[i];
+ }
+ if (i<124) {
+ argx[i+3] = NULL;
+ argv = argx;
+ }
+ }
if (stdin_fd != 0)
(void) dup2 (stdin_fd, 0);
if (stdout_fd != 1)

GNU-Make version 4 was recently released. This new version comes with a number of improvements like GNU Guile integration, Loadable objects (see http://plindenbaum.blogspot.fr/2014/08/a-gnu-make-plug-in-for-illumina-fastqs.html ). It also allows to specify the default shell to be invoked (see http://plindenbaum.blogspot.fr/2014/01/parallelizing-rstats-using-make.html )

http://www.gnu.org/software/make/manual/make.html : The program used as the shell is taken from the variable SHELL. If this variable is not set in your makefile, the program /bin/sh is used as the shell. The argument(s) passed to the shell are taken from the variable .SHELLFLAGS. The default value of .SHELLFLAGS is -c normally, or -ec in POSIX-conforming mode.

So, if you want to parallelize GNU-Make with SLURM you can wrap the shell into srun using SHELL and .SHELLFLAGS. Here is an example, creating and concatenating 100 files containing the hostname:

.SHELLFLAGS= -N1 -n1  bash -c 
NUMBERS=$(shell seq 1 100 )
TARGETS=  $(addsuffix .test,${NUMBERS} )

.PHONY:  all clean

define TEST

$(addsuffix .test,$(1)) : 
        echo -n  $(1) " " > $$@ && hostname >> $$@
        @sleep 5


all: ${TARGETS}
        cat $^

$(foreach N,$(NUMBERS), $(eval $(call TEST,$(N) ) ) )

        rm -f ${TARGETS}

now invoke Make with SLURM and the option -j ( Allow -j N jobs at once ):

$ make -j 10
echo -n  1  " " > 1.test && hostname >> 1.test
echo -n  2  " " > 2.test && hostname >> 2.test
echo -n  3  " " > 3.test && hostname >> 3.test
echo -n  4  " " > 4.test && hostname >> 4.test
echo -n  5  " " > 5.test && hostname >> 5.test
echo -n  6  " " > 6.test && hostname >> 6.test
echo -n  7  " " > 7.test && hostname >> 7.test
echo -n  100  " " > 100.test && hostname >> 100.test
cat 1.test 2.test 3.test 4.test 5.test 6.test 7.test 8.test (...)
1  node004
2  node003
3  node001
4  node002
5  node002
6  node001
7  node002
8  node001
9  node001
10  node002
92  node004
93  node001
94  node001
95  node001
96  node001
97  node002
98  node001
99  node001
100  node001
That's it,