15 May 2014

How I start a bioinformatics project

Phil Ashton tweeted a link to a paper about how to set up a bioinformatics project file hierarchy: "A Quick Guide to Organizing Computational Biology Projects".

Nick Loman posted his version yesterday: "How I start a bioinformatics project" (http://nickloman.github.io/2014/05/14/how-i-start-a-bioinformatics-project/).

Here is mine (simplified):

  • I start by creating a directory managed by git
  • I create a JSON-based description of my data, including the paths to the software tools and to the reference:
    {
      "reference": {
        "name": "ref",
        "fasta": "/path/to/ref.fasta"
      },
      "samples": [
        {
          "fastq": [
            "path/to/Sample1/Sample1_1.fq.gz",
            "path/to/Sample1/Sample1_2.fq.gz"
          ],
          "name": "Sample1"
        },
        {
          "fastq": [
            "path/to/Sample2/Sample2_1.fq.gz",
            "path/to/Sample2/Sample2_2.fq.gz"
          ],
          "name": "Sample2"
        },
        {
          "fastq": [
            "path/to/Sample3/Sample3_1.fq.gz",
            "path/to/Sample3/Sample3_2.fq.gz"
          ],
          "name": "Sample3"
        }
      ],
      "tools": {
        "bcftools": "/path/to/bcftools",
        "bwa": "/path/to/bwa",
        "samtools": "/path/to/samtools"
      }
    }
  • I create a git submodule for a project hosting an Apache Velocity template that generates a Makefile from config.json:
    REF=${config.reference.fasta}
    .PHONY:all
    all: align/variants.vcf
    align/variants.vcf: #foreach($sample in ${config.samples}) align/${sample.name}_sorted.bam #end
    	${config.tools.samtools} mpileup -uf ${REF} $^ |\
    	${config.tools.bcftools} view -vcg - >$@
    #foreach($sample in ${config.samples})
    align/${sample.name}_sorted.bam : ${sample.fastq[0]} ${sample.fastq[1]}
    	mkdir -p $(dir $@) && \
    	${config.tools.bwa} mem -R '@RG\tID:${sample.getId()}\tSM:${sample.name}' ${REF} $^ |\
    	${config.tools.samtools} view -b -S - |\
    	${config.tools.samtools} sort - $(basename $@) && \
    	${config.tools.samtools} index $@
    #end
  • The Makefile is generated using jsvelocity:
    java -jar jsvelocity.jar -f config config.json make.vm > Makefile
    It produces the following Makefile:
    REF=/path/to/ref.fasta
    .PHONY:all
    all: align/variants.vcf
    align/variants.vcf: align/Sample1_sorted.bam align/Sample2_sorted.bam align/Sample3_sorted.bam
    	/path/to/samtools mpileup -uf ${REF} $^ |\
    	/path/to/bcftools view -vcg - >$@
    align/Sample1_sorted.bam : path/to/Sample1/Sample1_1.fq.gz path/to/Sample1/Sample1_2.fq.gz
    	mkdir -p $(dir $@) && \
    	/path/to/bwa mem -R '@RG\tID:id10\tSM:Sample1' ${REF} $^ |\
    	/path/to/samtools view -b -S - |\
    	/path/to/samtools sort - $(basename $@) && \
    	/path/to/samtools index $@
    align/Sample2_sorted.bam : path/to/Sample2/Sample2_1.fq.gz path/to/Sample2/Sample2_2.fq.gz
    	mkdir -p $(dir $@) && \
    	/path/to/bwa mem -R '@RG\tID:id15\tSM:Sample2' ${REF} $^ |\
    	/path/to/samtools view -b -S - |\
    	/path/to/samtools sort - $(basename $@) && \
    	/path/to/samtools index $@
    align/Sample3_sorted.bam : path/to/Sample3/Sample3_1.fq.gz path/to/Sample3/Sample3_2.fq.gz
    	mkdir -p $(dir $@) && \
    	/path/to/bwa mem -R '@RG\tID:id20\tSM:Sample3' ${REF} $^ |\
    	/path/to/samtools view -b -S - |\
    	/path/to/samtools sort - $(basename $@) && \
    	/path/to/samtools index $@
  • The Makefile is invoked with the option -j N (allow N jobs at once) using GNU Make or QMake (distributed parallel make, scheduled by Sun Grid Engine).
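
For readers without Velocity at hand, the config-to-Makefile step can be approximated in a few lines of Python. This is only a rough sketch of the idea, not what jsvelocity does internally: the function name render_makefile and the two-sample EXAMPLE_CONFIG are made up for illustration, and the read-group ID simply reuses the sample name, whereas the template above calls sample.getId().

```python
import json

# Hypothetical two-sample config with the same shape as config.json in the post.
EXAMPLE_CONFIG = json.loads("""
{
  "reference": {"name": "ref", "fasta": "/path/to/ref.fasta"},
  "samples": [
    {"name": "Sample1",
     "fastq": ["path/to/Sample1/Sample1_1.fq.gz", "path/to/Sample1/Sample1_2.fq.gz"]},
    {"name": "Sample2",
     "fastq": ["path/to/Sample2/Sample2_1.fq.gz", "path/to/Sample2/Sample2_2.fq.gz"]}
  ],
  "tools": {
    "bcftools": "/path/to/bcftools",
    "bwa": "/path/to/bwa",
    "samtools": "/path/to/samtools"
  }
}
""")


def render_makefile(config: dict) -> str:
    """Return Makefile text laid out like the generated Makefile in the post."""
    tools = config["tools"]
    samples = config["samples"]
    # One sorted BAM per sample; these are the prerequisites of the VCF target.
    bams = [f"align/{s['name']}_sorted.bam" for s in samples]
    lines = [
        f"REF={config['reference']['fasta']}",
        ".PHONY:all",
        "all: align/variants.vcf",
        "align/variants.vcf: " + " ".join(bams),
        # Recipe lines start with a tab, as make requires.
        f"\t{tools['samtools']} mpileup -uf ${{REF}} $^ |\\",
        f"\t{tools['bcftools']} view -vcg - >$@",
    ]
    for sample, bam in zip(samples, bams):
        name = sample["name"]
        # Assumption: the sample name doubles as the read-group ID.
        read_group = f"@RG\\tID:{name}\\tSM:{name}"
        lines += [
            f"{bam} : {' '.join(sample['fastq'])}",
            "\tmkdir -p $(dir $@) && \\",
            f"\t{tools['bwa']} mem -R '{read_group}' ${{REF}} $^ |\\",
            f"\t{tools['samtools']} view -b -S - |\\",
            f"\t{tools['samtools']} sort - $(basename $@) && \\",
            f"\t{tools['samtools']} index $@",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_makefile(EXAMPLE_CONFIG), end="")
```

Redirecting the printed text to a file named Makefile gives something that can be run with make -j N just like the Velocity-generated version. The template approach keeps the pipeline logic in a reusable submodule, which is why I prefer it.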

That's it,

Pierre
