YOKOFAKUN: How I start a bioinformatics project

15 May 2014

Phil Ashton tweeted a link to a paper about how to set up a bioinformatics project file hierarchy: "A Quick Guide to Organizing Computational Biology Projects".

Nick Loman posted his version yesterday: "How I start a bioinformatics project" (http://nickloman.github.io/2014/05/14/how-i-start-a-bioinformatics-project/).

Here is mine (simplified):

I create a JSON-based description of my data, including the paths to the tools and to the reference sequences.
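A minimal sketch of what such a description might contain, written in Python so that it can be dumped as JSON. The keys are inferred from the fields that the make.vm template reads (`config.reference.fasta`, `config.tools.*`, `config.samples` with `name` and `fastq`); the real config.json may hold more, and every path is a placeholder.

```python
import json

# Hypothetical config: the key names mirror what make.vm reads,
# they are not copied from the real config.json.
config = {
    "reference": {"fasta": "/path/to/ref.fasta"},
    "tools": {
        "samtools": "/path/to/samtools",
        "bcftools": "/path/to/bcftools",
        "bwa": "/path/to/bwa",
    },
    "samples": [
        {
            "name": name,
            # Paired-end reads: fastq[0] is read 1, fastq[1] is read 2.
            "fastq": [
                f"path/to/{name}/{name}_1.fq.gz",
                f"path/to/{name}/{name}_2.fq.gz",
            ],
        }
        for name in ("Sample1", "Sample2", "Sample3")
    ],
}

# Serialize the description; redirect the output to config.json to use it.
config_json = json.dumps(config, indent=2)
print(config_json)
```

Keeping tool paths next to the sample list means a project can be moved to another machine by editing a single file.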


I create a git submodule for a project hosting an Apache Velocity template that generates a Makefile from config.json:

	REF=${config.reference.fasta}
	.PHONY:all

	all: align/variants.vcf

	align/variants.vcf: #foreach($sample in ${config.samples}) align/${sample.name}_sorted.bam #end

		${config.tools.samtools} mpileup -uf ${REF} $^ | \
		${config.tools.bcftools} view -vcg - >$@

	#foreach($sample in ${config.samples})

	align/${sample.name}_sorted.bam : ${sample.fastq[0]} ${sample.fastq[1]}
		mkdir -p $(dir $@) && \
		${config.tools.bwa} mem -R '@RG\tID:${sample.getId()}\tSM:${sample.name}' ${REF} $^ | \
		${config.tools.samtools} view -b -S - | \
		${config.tools.samtools} sort - $(basename $@) && \
		${config.tools.samtools} index $@

	#end


The Makefile is generated using jsvelocity:

	java -jar jsvelocity.jar -f config config.json make.vm > Makefile


It produces the following Makefile:

	REF=/path/to/ref.fasta
	.PHONY:all

	all: align/variants.vcf

	align/variants.vcf: align/Sample1_sorted.bam align/Sample2_sorted.bam align/Sample3_sorted.bam
		/path/to/samtools mpileup -uf ${REF} $^ | \
		/path/to/bcftools view -vcg - >$@


	align/Sample1_sorted.bam : path/to/Sample1/Sample1_1.fq.gz path/to/Sample1/Sample1_2.fq.gz
		mkdir -p $(dir $@) && \
		/path/to/bwa mem -R '@RG\tID:id10\tSM:Sample1' ${REF} $^ | \
		/path/to/samtools view -b -S - | \
		/path/to/samtools sort - $(basename $@) && \
		/path/to/samtools index $@


	align/Sample2_sorted.bam : path/to/Sample2/Sample2_1.fq.gz path/to/Sample2/Sample2_2.fq.gz
		mkdir -p $(dir $@) && \
		/path/to/bwa mem -R '@RG\tID:id15\tSM:Sample2' ${REF} $^ | \
		/path/to/samtools view -b -S - | \
		/path/to/samtools sort - $(basename $@) && \
		/path/to/samtools index $@


	align/Sample3_sorted.bam : path/to/Sample3/Sample3_1.fq.gz path/to/Sample3/Sample3_2.fq.gz
		mkdir -p $(dir $@) && \
		/path/to/bwa mem -R '@RG\tID:id20\tSM:Sample3' ${REF} $^ | \
		/path/to/samtools view -b -S - | \
		/path/to/samtools sort - $(basename $@) && \
		/path/to/samtools index $@
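To make the expansion concrete, here is a rough stand-in in plain Python: one variant-calling rule that depends on every BAM, followed by one alignment rule per sample. It uses string formatting instead of Velocity and is not how jsvelocity works internally. The function `render_makefile` and the inline `config` are hypothetical, and the read-group ID is simply the sample's position in the list, which is an assumption standing in for `sample.getId()`.

```python
def render_makefile(config):
    """Return Makefile text with the same shape as the output of make.vm."""
    tools = config["tools"]
    bams = [f"align/{s['name']}_sorted.bam" for s in config["samples"]]
    lines = [
        f"REF={config['reference']['fasta']}",
        ".PHONY:all",
        "",
        "all: align/variants.vcf",
        "",
        # The VCF depends on every sorted BAM (the inline #foreach).
        "align/variants.vcf: " + " ".join(bams),
        f"\t{tools['samtools']} mpileup -uf ${{REF}} $^ | \\",
        f"\t{tools['bcftools']} view -vcg - >$@",
    ]
    for index, sample in enumerate(config["samples"], start=1):
        # Assumption: the list position replaces jsvelocity's sample.getId().
        read_group = f"@RG\\tID:id{index}\\tSM:{sample['name']}"
        lines += [
            "",
            f"align/{sample['name']}_sorted.bam : "
            f"{sample['fastq'][0]} {sample['fastq'][1]}",
            "\tmkdir -p $(dir $@) && \\",
            f"\t{tools['bwa']} mem -R '{read_group}' ${{REF}} $^ | \\",
            f"\t{tools['samtools']} view -b -S - | \\",
            f"\t{tools['samtools']} sort - $(basename $@) && \\",
            f"\t{tools['samtools']} index $@",
        ]
    return "\n".join(lines) + "\n"


# Placeholder configuration with the same layout the template expects.
config = {
    "reference": {"fasta": "/path/to/ref.fasta"},
    "tools": {
        "samtools": "/path/to/samtools",
        "bcftools": "/path/to/bcftools",
        "bwa": "/path/to/bwa",
    },
    "samples": [
        {"name": n, "fastq": [f"path/to/{n}/{n}_1.fq.gz", f"path/to/{n}/{n}_2.fq.gz"]}
        for n in ("Sample1", "Sample2", "Sample3")
    ],
}

makefile_text = render_makefile(config)
print(makefile_text)
```

Adding a sample to the list yields one more BAM rule and one more prerequisite of align/variants.vcf, with nothing else to edit.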


The Makefile is invoked with the option -j N (allow N jobs at once) using GNU Make or QMake (a distributed parallel make, scheduled by Sun Grid Engine).

That's it,

Pierre
